refactor: unify input URL fetching with the link-checker's HostPool #2100

Merged
mre merged 3 commits into master from refactor/unify-input-resolver-client
Mar 26, 2026

Member

@mre mre commented Mar 25, 2026

This is the follow-up to #2099, which took a conservative approach to fixing #1886.

Previously, the reqwest::Client used by UrlContentResolver (which fetches the body of remote CLI input URLs) was built without a user-agent, rate limiting, retries, TLS settings, or per-host configuration. This meant that passing a URL directly as a CLI argument silently diverged from how link checking works. For example, Wikipedia returns a 403 with no user-agent set, so lychee https://en.wikipedia.org/wiki/... would find zero links and report success.

The fix in #2099 was intentionally minimal: store the configured user-agent on the Collector and use it when building the resolver's reqwest::Client. It fixes the immediate issue but treats the two code paths separately, which isn't great.

This PR takes the approach I described in #2099 as the "alternative": instead of the Collector maintaining its own reqwest::Client, it now shares the same Arc<HostPool> that the link checker uses. The lychee_lib::Client is built before the Collector in main.rs, and its pool is handed to the Collector via the new .host_pool() builder method. Both input fetching and link checking now go through the same pool, so all configuration (user-agent, custom headers, per-host headers, TLS, cookies, rate limiting, retries) is applied consistently to both paths.
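
To make the ownership change concrete, here is a minimal, std-only sketch of the sharing pattern. The `HostPool`, `Client`, and `Collector` types below are simplified stand-ins written for illustration, not lychee's real definitions, and `build_shared` is a hypothetical helper. Only the idea of building the client first and handing its `Arc<HostPool>` to the collector through a `.host_pool()` builder method comes from this PR.

```rust
use std::sync::Arc;

/// Illustrative stand-in for the shared pool; the real HostPool carries far more state.
#[derive(Debug, Default)]
struct HostPool {
    user_agent: String,
}

/// Stand-in for the link-checking client, which owns the pool.
struct Client {
    pool: Arc<HostPool>,
}

impl Client {
    fn new(user_agent: &str) -> Self {
        let pool = HostPool { user_agent: user_agent.to_string() };
        Self { pool: Arc::new(pool) }
    }

    /// Hypothetical accessor: hands out another handle to the same pool.
    fn host_pool(&self) -> Arc<HostPool> {
        Arc::clone(&self.pool)
    }
}

/// Stand-in for the collector that fetches remote inputs.
#[derive(Default)]
struct Collector {
    pool: Arc<HostPool>,
}

impl Collector {
    /// Builder-style setter, mirroring the `.host_pool()` method this PR adds.
    fn host_pool(mut self, pool: Arc<HostPool>) -> Self {
        self.pool = pool;
        self
    }
}

/// Build the client first, then pass its pool on to the collector.
fn build_shared(user_agent: &str) -> (Client, Collector) {
    let client = Client::new(user_agent);
    let collector = Collector::default().host_pool(client.host_pool());
    (client, collector)
}
```

Because both structs hold a clone of the same `Arc`, any setting stored in the pool is seen by input fetching and link checking alike, with no second client to keep in sync.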

As a side effect, fetching a remote input document now counts against the per-host rate limit bucket for that host. This is actually the correct behavior since we want lychee to be a good web citizen regardless of whether a request is for input fetching or link checking. =)
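
A rough sketch of why this follows from sharing the pool: if the per-host state lives in the pool, every caller holding the pool draws from the same bucket. The `HostPool` below is a simplified stand-in with a fixed request budget per host and no refill over time; it is not lychee's actual rate limiter, and `fetch_input` and `check_link` are hypothetical names for the two code paths.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified stand-in: one fixed request budget per host, created lazily.
struct HostPool {
    budget_per_host: u32,
    remaining: Mutex<HashMap<String, u32>>,
}

impl HostPool {
    fn new(budget_per_host: u32) -> Self {
        Self { budget_per_host, remaining: Mutex::new(HashMap::new()) }
    }

    /// Takes one token from the host's bucket; returns false once it is empty.
    fn try_acquire(&self, host: &str) -> bool {
        let mut remaining = self.remaining.lock().unwrap();
        let tokens = remaining
            .entry(host.to_string())
            .or_insert(self.budget_per_host);
        if *tokens == 0 {
            return false;
        }
        *tokens -= 1;
        true
    }
}

/// Input fetching path: spends from the shared per-host budget.
fn fetch_input(pool: &Arc<HostPool>, host: &str) -> bool {
    pool.try_acquire(host)
}

/// Link checking path: spends from the very same budget.
fn check_link(pool: &Arc<HostPool>, host: &str) -> bool {
    pool.try_acquire(host)
}
```

With a budget of two, one input fetch plus one link check against the same host exhausts the bucket, while a different host still has its full budget.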

One tradeoff worth noting: Collector::default() and Collector::new() (which are used in tests without a full ClientBuilder setup) now fall back to HostPool::default() instead of reqwest::Client::new(). HostPool::default() is equally lightweight because it just wraps a default reqwest::Client with lazy host creation, so this should not be a big deal in practice, but it's worth mentioning.
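
As an illustration of the "lazy host creation" point, the following std-only sketch shows a default pool that allocates no per-host state until the first request. The types are simplified stand-ins rather than lychee's real `HostPool` and `Collector`, and `record_request` and `host_count` are hypothetical helpers added for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified stand-in: per-host state is only created on first use.
#[derive(Default)]
struct HostPool {
    hosts: Mutex<HashMap<String, u64>>, // host -> number of requests seen
}

impl HostPool {
    /// Records a request, creating the host entry lazily.
    fn record_request(&self, host: &str) {
        let mut hosts = self.hosts.lock().unwrap();
        *hosts.entry(host.to_string()).or_insert(0) += 1;
    }

    /// Number of hosts that have state allocated so far.
    fn host_count(&self) -> usize {
        self.hosts.lock().unwrap().len()
    }
}

/// Stand-in collector whose default falls back to a default pool.
#[derive(Default)]
struct Collector {
    pool: Arc<HostPool>,
}

impl Collector {
    /// Hypothetical constructor that takes no pool and uses the default one.
    fn new() -> Self {
        Self::default()
    }
}
```

In this sketch, constructing the collector in a test creates only an empty map; per-host entries appear when a request for that host is first recorded.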

I believe now that this is the superior approach to resolve the issue. wdyt?

Fixes #1886
Fixes #1673

@katrinafyi @thomas-zahner @cristiklein feedback welcome!

Previously, the UrlContentResolver used its own bare reqwest::Client
to fetch remote input documents (e.g. `lychee https://example.com`).
This separate code path silently missed several important features
compared to the link-checking path:

- No user-agent was set (#1886)
- Custom headers were forwarded but per-host headers were not
- No rate limiting, retries, or backoff
- No cookie jar, TLS settings, or redirect policy

This commit replaces the bare reqwest::Client in UrlContentResolver
with the same Arc<HostPool> used by the link checker.

In main.rs, the lychee Client is now built before the Collector so its
HostPool can be shared with the Collector via the new .host_pool()
builder method. Both the Collector (for input fetching) and the
WebsiteChecker (for link checking) now use the same HostPool instance,
so all configuration is automatically applied to both paths.

As a side effect, fetching a remote input document now counts against
the per-host rate limit bucket for that host. This is intentional: we
want to be a good citizen of the web regardless of whether a request
is for input fetching or link checking.

The Collector::default() and Collector::new() cases (used in tests and
library code) fall back to HostPool::default(), which is a lightweight
default-configured pool -- no heavier than the previous bare
reqwest::Client::new().
Contributor

@cristiklein cristiklein left a comment

Hi @mre. Thanks for involving me.

Overall, I like the idea of lychee having a single, shared HostPool which controls all host-related parameters and applies them uniformly to both fetching input URLs and collected URLs.

I have two comments:

  1. I noticed that #2099 contains a few tests. It would be great to add them to this PR to show that something which was previously broken is now fixed.
  2. I'm surprised by the need to go from pub(crate) to pub. Is that really necessary?

Also make CacheableResponse and execute_request crate-private.
Member Author

mre commented Mar 25, 2026

> I'm surprised by the need to go from pub(crate) to pub. Is that really necessary?

Oh, you're right. It was necessary, but thanks to some refactoring it's not necessary anymore. 😄 👍 Done.

> I noticed that #2099 contains a few tests. It would be great to add them to this PR to show that something which was previously broken is now fixed.

Makes sense. Brought them over.

Member

@katrinafyi katrinafyi left a comment

I'm quite happy with this. It is also a happy surprise how small the change is. I was definitely expecting something way bigger.

But we may not always get so lucky, and there's definitely a discussion to be had about refactors or big changes. There are now 3-ish big-ticket items in my mind (recursion, base url, now status enum) which will need extensive changes. Maybe it's worth discussing in a dedicated issue.

Member Author

mre commented Mar 26, 2026

Cool. My vote goes to merging this and in turn closing #2099. This will resolve a few issues and it's a relatively straightforward change. It shouldn't paint us into a corner when tackling the bigger, architectural issues.

Contributor

@cristiklein cristiklein left a comment

With the tests in place and seeing how many bugs are resolved by this PR, my vote also goes to merging this and closing #2099.

And I agree with @katrinafyi. It turned out rather small in comparison to the effect it has.

(Note that I'm not a lychee maintainer, so my approval doesn't really count. 😄)

Member

@thomas-zahner thomas-zahner left a comment

Thank you @mre, this is definitely the way to go. I've had something like this in the back of my mind too.

Member

thomas-zahner commented Mar 26, 2026

@mre I've created a54f81d to address my comments. You can of course amend or force-push if you disagree with something.

Member Author

mre commented Mar 26, 2026

Yes! Thanks for contributing @thomas-zahner. All great changes.

Member Author

@mre mre left a comment

This seems fine to merge. Thanks everyone for the review and to @thomas-zahner for finalizing the PR. 👍

@mre mre merged commit 48663cb into master Mar 26, 2026
8 checks passed
@mre mre deleted the refactor/unify-input-resolver-client branch March 26, 2026 22:18
@mre mre mentioned this pull request Mar 26, 2026
Development

Successfully merging this pull request may close these issues.

- User agent not set when fetching command-line URLs
- --insecure flag does not work for URL args

4 participants