
Conversation

addie9800 (Collaborator)

This pull request introduces functionality to filter publishers based on their policies regarding AI training and enhances error handling in the crawling process.

AI Training Filtering Enhancements:

  • README and Documentation Updates: Added a disclaimer in README.md and detailed instructions in docs/5_advanced_topics.md about verifying publisher policies before using their content for AI training. Introduced a new section, "Filtering publishers for AI training," explaining the skip_publishers_disallowing_training parameter. [1] [2] [3]
  • robots.txt Parsing: Enhanced the CustomRobotFileParser class in src/fundus/publishers/base_objects.py to detect AI-training disallowance keywords in robots.txt comments and added a disallow_training attribute to indicate such restrictions (see the sketch after this list). The Robots class was also modified to ensure the readiness of the robots file parser itself, instead of requiring callers to check whether the parser is ready and call read() if necessary.
  • Publisher Object Updates: Added a disallows_training attribute to the Publisher class so that known disallowances can be hardcoded into the Publisher setup. [1] [2]
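 
The comment-scanning idea mentioned in the robots.txt bullet could look roughly like the following. This is a minimal sketch, assuming comments are matched against a small keyword list; the class name, keyword list, and attribute handling are illustrative and not the PR's actual CustomRobotFileParser code.

from urllib.robotparser import RobotFileParser

# Assumed keyword list; the actual keywords used in the PR may differ.
AI_TRAINING_KEYWORDS = ("ai training", "machine learning", "llm")


class CommentAwareRobotFileParser(RobotFileParser):
    """Sketch: a parser that also inspects robots.txt comments for AI-training notices."""

    def __init__(self, url: str = ""):
        super().__init__(url)
        self.disallow_training: bool = False

    def parse(self, lines) -> None:
        lines = list(lines)  # allow a second pass after scanning the comments
        for line in lines:
            comment = line.partition("#")[2].lower()
            if comment and any(keyword in comment for keyword in AI_TRAINING_KEYWORDS):
                self.disallow_training = True
        super().parse(lines)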

Crawling Logic Improvements:

  • Skipping Restricted Publishers: Updated the crawl() and _fetch_articles() methods in src/fundus/scraping/crawler.py to support the skip_publishers_disallowing_training parameter, ensuring that publishers with AI training restrictions are excluded (a usage sketch follows the summary below). [1] [2] [3]
  • Error Handling for Interruptions: A previous change resolved the deadlock issue by calling task_done(), but only by causing the thread to crash. While this had the intended effect, it was a messy solution. The new approach introduces a RequestInterruptedError exception in src/fundus/scraping/session.py and updates the crawling logic to handle interrupted requests gracefully. [1] [2] [3]

These updates provide better compliance with publisher policies and improve the robustness of the crawling process.
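 
For reference, this is roughly how the new parameter would be used from the caller's side. A minimal usage sketch, assuming skip_publishers_disallowing_training lands on crawl() as described above; the publisher collection and max_articles are just example arguments.

from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)

# Publishers with a hardcoded disallows_training flag or a restrictive robots.txt are skipped.
for article in crawler.crawl(max_articles=10, skip_publishers_disallowing_training=True):
    print(article.title)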

Comment on lines 495 to 496
# register default events
__EVENTS__.register_event("stop")
Collaborator

I have to admit, I'm not entirely comfortable with moving the event register here. It feels somewhat disconnected from the rest of the code. While I understand the necessity behind this choice, it does strike me as a potential design flaw that we should aim to address.

More broadly, enforcing strict access to events, requiring them to be registered before they can be checked, might not be the best approach. The original intention was to introduce a level of control, ensuring that accessing an unregistered event would result in an error. However, I'm not sure whether this tradeoff is worth it. What do you think?

Collaborator Author

Can you perhaps give me an example of where accessing unregistered events might cause problems? Especially with regard to the stop event, I feel the process of registering is devalued by the fact that we do it by default for all threads when crawling. I could imagine it becoming more important for future events that might be added. So I would also tend towards a solution that automatically registers the stop event, because we will likely need it in most situations.
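 
To make the "auto-register the stop event" idea concrete, here is a minimal sketch, assuming events are plain threading.Event objects. The EventRegistry class and its methods are hypothetical and do not mirror the actual __EVENTS__ implementation.

import threading
from typing import Dict


class EventRegistry:
    """Hypothetical sketch; not the actual __EVENTS__ object."""

    _DEFAULT_EVENTS = ("stop",)  # events that virtually every crawl needs

    def __init__(self) -> None:
        self._events: Dict[str, threading.Event] = {}
        for name in self._DEFAULT_EVENTS:
            self._events[name] = threading.Event()

    def register_event(self, name: str) -> None:
        self._events.setdefault(name, threading.Event())

    def get(self, name: str) -> threading.Event:
        # Non-default events must still be registered first, keeping some access control.
        if name not in self._events:
            raise KeyError(f"event {name!r} was never registered")
        return self._events[name]


__EVENTS__ = EventRegistry()     # "stop" is usable right away
__EVENTS__.get("stop").is_set()  # no explicit register_event("stop") required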

Comment on lines 636 to 643
if skip_publishers_disallowing_training:
    publishers = tuple(
        [
            publisher
            for publisher in publishers
            if not (publisher.disallows_training or publisher.robots.disallows_training())
        ]
    )
Collaborator

Performing this check for every WARC path introduces unnecessary overhead. It would be far more efficient to verify the publishers once during initialization and then propagate the result. However, this should be implemented in a way that doesn't significantly slow down the CCNewsCrawler's initialization (i.e., not running sequentially) and doesn't depend on the availability of each publisher's robots.txt.

Collaborator Author

Did you already have an idea of how we would implement that? If we want it to be non-blocking, I have one idea for now: during initialization of the CCNewsCrawler, we start fetching the publishers' preferences in the background. We can then start crawling as usual, especially since we have to stream the entire WARC file anyway. We can already start parsing articles for the supported publishers and either cache them until we have the results from the permission check or start yielding/discarding them once we have the results. I don't really see a way to do that without crawling the robots.txt. I guess caching the preferences might also be an option, but I don't know if that makes much sense, since we'll have to crawl it at least once anyway, and if people want to comply with the preferences, I think we should ensure they are up to date.
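 
A rough sketch of what the background check could look like, assuming concurrent.futures is acceptable. TrainingPreferenceChecker and fetch_training_preference are hypothetical names; only the disallows_training expression mirrors the diff above.

from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, Iterable


def fetch_training_preference(publisher) -> bool:
    # Mirrors the check from the diff above; may hit the network for robots.txt.
    return publisher.disallows_training or publisher.robots.disallows_training()


class TrainingPreferenceChecker:
    """Starts the robots.txt checks at crawler initialization and resolves them lazily."""

    def __init__(self, publishers: Iterable, max_workers: int = 8):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: Dict[object, Future] = {
            publisher: self._executor.submit(fetch_training_preference, publisher)
            for publisher in publishers
        }

    def disallows_training(self, publisher) -> bool:
        # Blocks only if this publisher's check hasn't finished yet;
        # parsed articles could be cached until then, as described above.
        return self._futures[publisher].result()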

Comment on lines 65 to 67
if isinstance(response, Exception):
    raise response
return response
Collaborator

What's the reason to add that to the try-except block?

Collaborator Author

It was probably included there by accident.

Comment on lines 18 to 21
class RequestInterruptedError(Exception):
    """Raised when a request is interrupted by a stop event."""

    pass
Collaborator

I use this part of the code to formulate my general concerns.

  1. Thanks a lot for the stability improvements. Getting threading under control seems to be a real hassle.
  2. I would appreciate implementing a best practice to keep pull requests functionally isolated whenever possible. Combining new crawler functionality with stability improvements (like threading changes) makes the review process significantly harder, and now the features depend on each other.
  3. In general, I prefer your solution of introducing a custom error for the interrupt and catching it where possible to control the code flow, rather than intentionally crashing threads with uncaught errors. But I see a problem here, as it would require us to handle interruptions at every potential iteration point.

For example, take this part of the BaseScraper:

for html in source.fetch(url_filter=url_filter):
    parser = self.parser_mapping[html.source_info.publisher]

To prevent it from looping over every source (which it would do now), we would have to check for the stop event here as well, introducing the flag into the scope of the CCNewsCrawler, which I would like to avoid.
With this in mind, I would prefer crashing the thread intentionally by raising uncaught RequestInterruptedErrors, but what's your opinion on this?

Collaborator Author

You are definitely correct, sorry for the extra confusion with the combined PR. I would also very much prefer to keep them separate at the moment.

While I see the issue you raised, I'm not sure I prefer the alternative of purposely crashing the thread. Maybe there is a way to handle it a bit more gracefully? What is your opinion on a solution similar to the queue setting, where we pass the exception back through the queue? We could do something similar for the iterator. Then we would avoid reusing the flag out of scope, and we could cleanly exit the loop with a break instead of crashing out.
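 
To illustrate the "pass the exception back through the iterator" idea: a minimal sketch where the generator yields the RequestInterruptedError instead of raising it, so the consumer can break cleanly. fetch_htmls, consume, and the stop_event parameter are stand-ins, not the actual session/source code.

from threading import Event
from typing import Iterator, Union


class RequestInterruptedError(Exception):
    """Raised (or here: yielded) when a request is interrupted by a stop event."""


def fetch_htmls(urls, stop_event: Event) -> Iterator[Union[str, RequestInterruptedError]]:
    for url in urls:
        if stop_event.is_set():
            # Yield the error instead of raising, then stop the generator.
            yield RequestInterruptedError(f"interrupted before fetching {url}")
            return
        yield f"<html for {url}>"  # placeholder for the real download


def consume(urls, stop_event: Event) -> None:
    for html in fetch_htmls(urls, stop_event):
        if isinstance(html, RequestInterruptedError):
            break  # clean exit: no crashed thread, no stop flag leaking into the crawler scope
        ...  # parse html as usual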

@addie9800 (Collaborator Author)

Realization Ideas:

  • Register Default Events like 'stop'
  • CrashThread Exception
  • Move publisher robots testing away from _fetch_articles in CCNewsCrawler, ideally in background

@addie9800 addie9800 requested a review from MaxDall August 30, 2025 09:23