fix: handle compound extensions in LinkExtractor #351

Open
axelray-dev wants to merge 1 commit into D4Vinci:dev from axelray-dev:fix/349-compound-extension-filter

@axelray-dev

Proposed change

_url_extension() returns only the last suffix of a URL path segment (e.g. "gz" for dataset.tar.gz). This means compound extensions like "tar.gz" that are already listed in IGNORED_EXTENSIONS never match during deny_extensions filtering, so archive files such as .tar.gz and .tar.bz2 pass through the extractor undetected.
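
The mismatch can be reproduced with the standard library alone. The snippet below is only an illustration: `last_suffix` is a hypothetical stand-in that mimics a last-suffix lookup, not the project's actual `_url_extension()`, and `denied` is a made-up subset of a deny list.

```python
import posixpath
from urllib.parse import urlparse


def last_suffix(url: str) -> str:
    """Hypothetical stand-in: return only the final dot-suffix of the URL path."""
    path = urlparse(url).path
    # splitext splits at the LAST dot, so "dataset.tar.gz" yields ".gz"
    return posixpath.splitext(path)[1].lstrip(".").lower()


denied = {"tar.gz", "zip"}  # illustrative deny list containing a compound entry
ext = last_suffix("https://example.com/files/dataset.tar.gz")
print(ext)            # gz
print(ext in denied)  # False: the compound entry "tar.gz" can never match
```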

This PR adds a _url_extensions() helper that returns all dot-suffixes for the last path segment (e.g. {"gz", "tar.gz"} for dataset.tar.gz) and uses it in _url_passes() so compound extensions are correctly rejected. The existing _url_extension() function is preserved unchanged for backward compatibility.
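
A minimal sketch of the idea, assuming the helper works on the URL path only and lowercases suffixes. The names `all_suffixes` and `passes` are hypothetical and chosen for illustration; they are not the code in this PR's diff, and the real `_url_passes()` may apply further checks.

```python
from __future__ import annotations

from urllib.parse import urlparse


def all_suffixes(url: str) -> set[str]:
    """Return every dot-suffix of the last path segment, lowercased."""
    segment = urlparse(url).path.rsplit("/", 1)[-1]
    parts = segment.lower().split(".")
    # parts[0] is the base name; each trailing slice is one candidate suffix.
    # Slices containing an empty piece (e.g. from "a..gz" or "file.") are skipped.
    return {".".join(parts[i:]) for i in range(1, len(parts)) if all(parts[i:])}


def passes(url: str, deny_extensions: set[str]) -> bool:
    """Keep the URL only if none of its suffixes appears in the deny set."""
    return all_suffixes(url).isdisjoint(deny_extensions)


print(sorted(all_suffixes("https://example.com/files/dataset.tar.gz")))  # ['gz', 'tar.gz']
print(passes("https://example.com/files/dataset.tar.gz", {"tar.gz"}))    # False
print(passes("https://example.com/logs/app.log.gz", {"tar.gz"}))         # True
```

Checking set disjointness means a deny list containing only `"tar.gz"` rejects `dataset.tar.gz` while still letting a plain `.gz` file through.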

Type of change:

  • Bugfix (non-breaking change which fixes an issue)
  • Dependency upgrade
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests
  • Add or change doctests?
  • Documentation change?

Additional information

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doc-strings.

_url_extension() returns only the last suffix (e.g. 'gz' for
'dataset.tar.gz'), so compound extensions like 'tar.gz' listed in
IGNORED_EXTENSIONS never match during deny_extensions filtering.

Add _url_extensions() helper returning all dot-suffixes for the last
path segment and use it in _url_passes() so that compound extensions
are correctly rejected.

Fixes D4Vinci#349
@yetval

yetval commented Jun 12, 2026

Contributor

Confirmed this fixes it. .tar.gz gets dropped by default now, and explicit deny_extensions={"tar.gz"} works too while plain .gz is still kept. Tests cover the cases well.

Only nit: _url_extension (singular) has no callers after this, so it could be dropped, but no big deal. Thanks for the fix!
