v0.4.9 #345

Merged
D4Vinci merged 25 commits into main from dev on Jun 7, 2026
@D4Vinci (Owner) commented Jun 7, 2026

A maintenance update packed with community-reported fixes 🛠️

🚀 New Stuff and quality of life changes

  • Updated all browsers and fingerprints. Run scrapling install --force after updating to refresh them.
  • Added a --version flag to the CLI by @ETM-Code in #303 (Solves #299)

🐛 Bug Fixes

  • Fixed the session-level proxy argument being silently ignored in HTTP sessions, which could leak your real IP (Solves #295). Note that mixing a session-level proxy with a per-request proxies argument (or vice versa) now raises an error instead of one being silently dropped.
  • Fixed browser navigations failing when combining init_script with user_data_dir (Solves #294).
  • Fixed encoding detection when websites quote the charset value in the Content-Type header by @Bortlesboat in #323.
  • Fixed an IndexError in adaptive element relocation when auto_save is enabled by @Mubashirrrr in #340.
  • Fixed spiders' checkpoint and cache saving crashing on Windows by @MrStarkEG in #344.
  • Fixed incorrect similarity scoring in find_similar for elements with mismatched attribute counts (Solves #322).

Docs

  • Clarified that the default installation includes the parser engine only, and the fetchers/spiders need the extras (Solves #343).
  • Fixed the Docker image name in the remaining examples by @evanclan in #315.
  • Fixed a broken link in the contribution guide by @Bortlesboat in #320.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

ETM-Code and others added 25 commits May 30, 2026 11:37
- Add a `--version` flag to the main CLI group using click's `version_option`
- Prints `Scrapling, version <version>` and exits, sourcing the version from `scrapling.__version__`
- Add a CLI test asserting the flag's output

Closes #299

PR #283 fixed the bare `scrapling` image name in docs/ai/mcp-server.md;
the agent-skill MCP reference and CLI extract docs still used the
unqualified name, which Docker resolves against the official library
namespace and fails with pull access denied.

`ResponseFactory.__extract_browser_encoding` matched the charset with
`charset=([\w-]+)`, which stops at a quote character. RFC 7231 permits the
charset value to be a quoted-string (e.g. `content-type: text/html;
charset="ISO-8859-1"`), so for any quoted charset the regex failed to match
and the function silently fell back to the `utf-8` default. A page served as
quoted ISO-8859-1 / windows-1252 / Shift_JIS would then be decoded as UTF-8,
producing mojibake.

Allow an optional surrounding quote in the pattern (`charset=["']?([\w-]+)`)
so the value is captured without the quote. Unquoted headers are unaffected.

The existing `content_type_map` fixture in tests/fetchers/test_utils.py was
unused; add focused tests covering unquoted, quoted, and missing charsets.
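
The difference between the two patterns quoted above can be checked in isolation. This is a minimal sketch, not the body of `ResponseFactory.__extract_browser_encoding`: the helper name `extract_charset` and its `default` parameter are assumptions made for illustration, and only the two regular expressions come from the commit message.

```python
import re

# Patterns as quoted in the commit message.
OLD_PATTERN = re.compile(r"""charset=([\w-]+)""")
NEW_PATTERN = re.compile(r"""charset=["']?([\w-]+)""")


def extract_charset(content_type: str, pattern: re.Pattern, default: str = "utf-8") -> str:
    """Hypothetical helper: return the charset found by `pattern`, else `default`."""
    match = pattern.search(content_type)
    return match.group(1) if match else default


quoted = 'text/html; charset="ISO-8859-1"'
unquoted = "text/html; charset=ISO-8859-1"

# The old pattern cannot step over the opening quote, so it falls back to the default.
print(extract_charset(quoted, OLD_PATTERN))
# The new pattern consumes the optional quote and captures the bare value.
print(extract_charset(quoted, NEW_PATTERN))
# Unquoted headers behave the same under both patterns.
print(extract_charset(unquoted, OLD_PATTERN), extract_charset(unquoted, NEW_PATTERN))
```

The closing quote needs no special handling because `[\w-]+` stops at the first character outside the class.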
When `css()`/`xpath()` are called with both `adaptive=True` and
`auto_save=True`, the relocation branch guarded the re-save with
`if elements is not None`. However `relocate()` returns an empty
list (never `None`) when no candidate clears the `percentage`
threshold, so the guard always passed and `self.save(elements[0], ...)`
raised `IndexError: list index out of range`.

This crashes exactly when adaptive resilience is needed most: the page
structure changed enough that nothing matches above the threshold.

Fix: use a truthiness check (`if elements and auto_save`) so the
re-save is skipped when relocation yields nothing. The successful
relocation path (which re-saves the relocated element) is unchanged.

Added a regression test that fails before the fix (IndexError) and
passes after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
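
The guard change described above can be reproduced without the library. The sketch below is an assumption-laden stand-in: `relocate`, `resave_buggy`, and `resave_fixed` are hypothetical functions that only mimic the behavior the commit message describes (an empty list rather than `None`, followed by an `elements[0]` access), not Scrapling's actual code.

```python
def relocate(candidates, percentage):
    """Hypothetical stand-in: keep candidates scoring at or above the threshold.

    Like the real method described above, it returns an empty list (never None)
    when nothing clears the threshold.
    """
    return [name for name, score in candidates if score >= percentage]


def resave_buggy(elements, auto_save, saved):
    # An empty list is not None, so this guard passes and elements[0] raises.
    if elements is not None and auto_save:
        saved.append(elements[0])


def resave_fixed(elements, auto_save, saved):
    # Truthiness check: an empty list skips the re-save entirely.
    if elements and auto_save:
        saved.append(elements[0])


saved = []
nothing = relocate([("div.card", 40)], percentage=50)  # no candidate clears 50
resave_fixed(nothing, True, saved)                      # skipped, no exception
found = relocate([("div.card", 80)], percentage=50)
resave_fixed(found, True, saved)                        # re-saves the relocated element
print(saved)
```

Calling `resave_buggy(nothing, True, saved)` instead raises `IndexError: list index out of range`, which is the crash the commit fixes.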
…ndows

CheckpointManager.save() and ResponseCacheManager.put() write to a temp
file and then move it into place with Path.rename(). On Windows, os.rename
cannot overwrite an existing destination and raises FileExistsError
(WinError 183), so every write after the first one fails: checkpoint
saving raises and breaks resume, while the development response cache
swallows the error and keeps returning the stale entry.

Path.replace() (os.replace) overwrites the destination atomically on every
platform and behaves identically to rename() on POSIX, so this is a no-op
on Linux and macOS and only fixes the broken overwrite on Windows.

Add a regression test for the cache overwrite path; the checkpoint
overwrite is already covered by test_multiple_saves_overwrite.
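
The write-then-move pattern described above looks roughly like the following. This is a sketch under stated assumptions: `atomic_write` and the `.tmp` naming are hypothetical and are not taken from `CheckpointManager.save()` or `ResponseCacheManager.put()`; only the use of `Path.replace()` in place of `Path.rename()` reflects the commit.

```python
import tempfile
from pathlib import Path


def atomic_write(destination: Path, data: bytes) -> None:
    """Hypothetical helper: write to a sibling temp file, then move it into place."""
    tmp = destination.with_name(destination.name + ".tmp")
    tmp.write_bytes(data)
    # Path.replace() overwrites an existing destination on every platform.
    # Path.rename() would raise FileExistsError on Windows for the second write.
    tmp.replace(destination)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        target = Path(directory) / "checkpoint.bin"
        atomic_write(target, b"first")
        atomic_write(target, b"second")  # overwrites instead of failing
        print(target.read_bytes())
```

Keeping the temp file in the same directory as the destination keeps the move on one filesystem, which is what allows it to be a single rename operation.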
…enominator

Candidates with fewer attributes than the original got inflated scores because the denominator counted candidate attributes only, while the extra-attributes penalty direction worked as intended. Using `max()` on both counts fixes the inflation while keeping the penalty.

Closes #322
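
The effect of the denominator change can be illustrated with a toy score. The real `find_similar` scoring is not shown in this PR, so everything here is an assumption: `score_old` and `score_new` are hypothetical functions that count exact attribute matches and differ only in the denominator, as the commit message describes.

```python
def count_matches(original: dict, candidate: dict) -> int:
    """Number of candidate attributes whose value equals the original's."""
    return sum(1 for key, value in candidate.items() if original.get(key) == value)


def score_old(original: dict, candidate: dict) -> float:
    # Assumed pre-fix behavior: divide by the candidate's attribute count only.
    return count_matches(original, candidate) / len(candidate) if candidate else 0.0


def score_new(original: dict, candidate: dict) -> float:
    # Assumed post-fix behavior: divide by the larger of the two counts.
    denominator = max(len(original), len(candidate))
    return count_matches(original, candidate) / denominator if denominator else 0.0


original = {"class": "card", "id": "a", "data-x": "1", "role": "item"}
fewer = {"class": "card"}                       # one attribute, and it matches
print(score_old(original, fewer), score_new(original, fewer))

small = {"class": "card", "id": "a"}
extra = {"class": "card", "id": "a", "x": "1", "y": "2"}  # two extra attributes
print(score_old(small, extra), score_new(small, extra))
```

In this toy version a one-attribute candidate no longer gets a perfect score against a four-attribute original, while a candidate with extra attributes is penalized exactly as before.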
The quickstart examples import from `scrapling.fetchers`, which raises `ModuleNotFoundError` on a bare install since those dependencies live in the fetchers extra. Make the consequence explicit in the installation section across the docs and all README translations.

Closes #343

The per-request proxy resolution never fell back to the session default, so FetcherSession(proxy=...) was silently ignored, and requests went direct. Same fix in the sync and async paths, with regression tests asserting on the proxy that reaches curl_cffi.

Closes #295
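
The fallback described above amounts to "use the per-request value if given, otherwise the session default". The sketch below is not `FetcherSession`: the class `SketchSession` and its `resolve_proxy` method are hypothetical names, and the error raised when a session-level proxy is mixed with a per-request `proxies` argument is deliberately left out.

```python
from typing import Optional


class SketchSession:
    """Hypothetical session that stores a default proxy for its requests."""

    def __init__(self, proxy: Optional[str] = None) -> None:
        self.proxy = proxy

    def resolve_proxy(self, proxy: Optional[str] = None) -> Optional[str]:
        # Per-request value wins; otherwise fall back to the session default.
        # The bug described above corresponds to skipping this fallback.
        return proxy if proxy is not None else self.proxy


session = SketchSession(proxy="http://127.0.0.1:8080")
print(session.resolve_proxy())                          # session default is used
print(session.resolve_proxy("http://127.0.0.1:9090"))   # per-request override
```

Asserting on the resolved value, rather than on the session attribute, mirrors the regression tests mentioned above, which check the proxy that actually reaches the HTTP client.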
@D4Vinci D4Vinci merged commit 1490506 into main Jun 7, 2026
12 checks passed

6 participants