
Conversation

addie9800 (Collaborator)

This pull request introduces functionality to filter publishers based on their policies regarding AI training and enhances error handling in the crawling process.

AI Training Filtering Enhancements:

  • README and Documentation Updates: Added a disclaimer in README.md and detailed instructions in docs/5_advanced_topics.md about verifying publisher policies before using their content for AI training. Introduced a new section, "Filtering publishers for AI training," explaining the skip_publishers_disallowing_training parameter. [1] [2] [3]
  • robots.txt Parsing: Enhanced the CustomRobotFileParser class in src/fundus/publishers/base_objects.py to detect AI-training disallowance keywords in robots.txt comments and added a disallow_training attribute to indicate such restrictions (see the sketch after this list). The Robots class was also modified to ensure the readiness of the robots file parser itself, instead of requiring callers to check whether the parser is ready and call read() if necessary.
  • Publisher Object Updates: Added a disallows_training attribute to the Publisher class so that known disallowances can be hardcoded into the Publisher setup. [1] [2]
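 
The comment-scanning idea mentioned in the robots.txt bullet could look roughly like the following. This is a minimal sketch, assuming comments are matched against a small keyword list; the class name, keyword list, and attribute handling are illustrative and not the PR's actual CustomRobotFileParser code.

from urllib.robotparser import RobotFileParser

# Assumed keyword list; the actual keywords used in the PR may differ.
AI_TRAINING_KEYWORDS = ("ai training", "machine learning", "llm")


class CommentAwareRobotFileParser(RobotFileParser):
    """Sketch: a parser that also inspects robots.txt comments for AI-training notices."""

    def __init__(self, url: str = ""):
        super().__init__(url)
        self.disallow_training: bool = False

    def parse(self, lines) -> None:
        lines = list(lines)  # allow a second pass after scanning the comments
        for line in lines:
            comment = line.partition("#")[2].lower()
            if comment and any(keyword in comment for keyword in AI_TRAINING_KEYWORDS):
                self.disallow_training = True
        super().parse(lines)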

Crawling Logic Improvements:

  • Skipping Restricted Publishers: Updated the crawl() and _fetch_articles() methods in src/fundus/scraping/crawler.py to support the skip_publishers_disallowing_training parameter, ensuring that publishers with AI training restrictions are excluded (a usage sketch follows the summary below). [1] [2] [3]
  • Error Handling for Interruptions: A previous change resolved the deadlock issue by calling task_done(), but only by causing the thread to crash. While this had the intended effect, it was a messy solution. The new approach introduces a RequestInterruptedError exception in src/fundus/scraping/session.py and updates the crawling logic to handle interrupted requests gracefully. [1] [2] [3]

These updates provide better compliance with publisher policies and improve the robustness of the crawling process.
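 
For reference, this is roughly how the new parameter would be used from the caller's side. A minimal usage sketch, assuming skip_publishers_disallowing_training lands on crawl() as described above; the publisher collection and max_articles are just example arguments.

from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)

# Publishers with a hardcoded disallows_training flag or a restrictive robots.txt are skipped.
for article in crawler.crawl(max_articles=10, skip_publishers_disallowing_training=True):
    print(article.title)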

Comment on lines 495 to 496
# register default events
__EVENTS__.register_event("stop")
Collaborator

I have to admit, I'm not entirely comfortable with moving the event register here. It feels somewhat disconnected from the rest of the code. While I understand the necessity behind this choice, it does strike me as a potential design flaw that we should aim to address.

More broadly, enforcing strict access to events, requiring them to be registered before they can be checked, might not be the best approach. The original intention was to introduce a level of control, ensuring that accessing an unregistered event would result in an error. However, I'm not sure whether this tradeoff is worth it. What do you think?

Collaborator Author

Can you perhaps give me an example of where accessing unregistered events might cause problems? Especially with regard to the stop event, I feel the process of registering is devalued by the fact that we do it by default for all threads when crawling. I could imagine it becoming more important for future events that might be added. So I would also tend towards a solution that automatically registers the stop event, because we will likely need it in most situations.
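 
To make the "auto-register the stop event" idea concrete, here is a minimal sketch, assuming events are plain threading.Event objects. The EventRegistry class and its methods are hypothetical and do not mirror the actual __EVENTS__ implementation.

import threading
from typing import Dict


class EventRegistry:
    """Hypothetical sketch; not the actual __EVENTS__ object."""

    _DEFAULT_EVENTS = ("stop",)  # events that virtually every crawl needs

    def __init__(self) -> None:
        self._events: Dict[str, threading.Event] = {}
        for name in self._DEFAULT_EVENTS:
            self._events[name] = threading.Event()

    def register_event(self, name: str) -> None:
        self._events.setdefault(name, threading.Event())

    def get(self, name: str) -> threading.Event:
        # Non-default events must still be registered first, keeping some access control.
        if name not in self._events:
            raise KeyError(f"event {name!r} was never registered")
        return self._events[name]


__EVENTS__ = EventRegistry()     # "stop" is usable right away
__EVENTS__.get("stop").is_set()  # no explicit register_event("stop") required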

Comment on lines 636 to 643
if skip_publishers_disallowing_training:
    publishers = tuple(
        [
            publisher
            for publisher in publishers
            if not (publisher.disallows_training or publisher.robots.disallows_training())
        ]
    )
Collaborator

Performing this check for every WARC path introduces unnecessary overhead. It would be far more efficient to verify the publishers once during initialization and then propagate the result. However, this should be implemented in a way that doesn't significantly slow down the CCNewsCrawler's initialization (i.e., not running sequentially) and doesn't depend on the availability of each publisher's robots.txt.

Collaborator Author

Did you already have an idea of how we would implement that? If we want it to be non-blocking, I have one idea for now: during initialization of the CCNewsCrawler, we start fetching the publishers' preferences in the background. We can then start crawling as usual, especially since we have to stream the entire WARC file anyway. We can already start parsing articles for the supported publishers and either cache them until we have the results from the permission check or start yielding/discarding them once we have the results. I don't really see a way to do that without crawling the robots.txt. I guess caching the preferences might also be an option, but I don't know if that makes much sense, since we'll have to crawl it at least once anyway, and if people want to comply with the preferences, I think we should ensure they are up to date.
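 
A rough sketch of what the background check could look like, assuming concurrent.futures is acceptable. TrainingPreferenceChecker and fetch_training_preference are hypothetical names; only the disallows_training expression mirrors the diff above.

from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, Iterable


def fetch_training_preference(publisher) -> bool:
    # Mirrors the check from the diff above; may hit the network for robots.txt.
    return publisher.disallows_training or publisher.robots.disallows_training()


class TrainingPreferenceChecker:
    """Starts the robots.txt checks at crawler initialization and resolves them lazily."""

    def __init__(self, publishers: Iterable, max_workers: int = 8):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: Dict[object, Future] = {
            publisher: self._executor.submit(fetch_training_preference, publisher)
            for publisher in publishers
        }

    def disallows_training(self, publisher) -> bool:
        # Blocks only if this publisher's check hasn't finished yet;
        # parsed articles could be cached until then, as described above.
        return self._futures[publisher].result()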

Comment on lines 65 to 67
if isinstance(response, Exception):
    raise response
return response
Collaborator

What's the reason to add that to the try-except block?

Collaborator Author

It was probably included there by accident.

Comment on lines 18 to 21
class RequestInterruptedError(Exception):
    """Raised when a request is interrupted by a stop event."""

    pass
Collaborator

I use this part of the code to formulate my general concerns.

  1. Thanks a lot for the stability improvements. Getting threading under control seems to be a real hassle.
  2. I would appreciate implementing a best practice to keep pull requests functionally isolated whenever possible. Combining new crawler functionality with stability improvements (like threading changes) makes the review process significantly harder, and now the features depend on each other.
  3. In general, I prefer your solution of introducing a custom error for the interrupt and catching it where possible to control the code flow, rather than intentionally crashing threads with uncaught errors. But I see a problem here, as it would require us to handle interruptions at every potential iteration point.

For example, take this part of the BaseScraper:

for html in source.fetch(url_filter=url_filter):
    parser = self.parser_mapping[html.source_info.publisher]

To prevent it from looping over every source (which it would do now), we would have to check for the stop event here as well, introducing the flag into the scope of the CCNewsCrawler, which I would like to avoid.
With this in mind, I would prefer crashing the thread intentionally by raising uncaught RequestInterruptedErrors, but what's your opinion on this?

Collaborator Author

You are definitely correct, sorry for the extra confusion with the combined PR. I would also very much prefer to keep them separate at the moment.

While I see the issue you raised, I'm not sure I prefer the alternative of purposely crashing the thread. Maybe there is a way to handle it a bit more gracefully? What is your opinion on a solution similar to the queue setting, where we pass the exception back through the queue? We could do something similar for the iterator. Then we would avoid reusing the flag out of scope, and we could cleanly exit the loop with a break instead of crashing out.
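 
To illustrate the "pass the exception back through the iterator" idea: a minimal sketch where the generator yields the RequestInterruptedError instead of raising it, so the consumer can break cleanly. fetch_htmls, consume, and the stop_event parameter are stand-ins, not the actual session/source code.

from threading import Event
from typing import Iterator, Union


class RequestInterruptedError(Exception):
    """Raised (or here: yielded) when a request is interrupted by a stop event."""


def fetch_htmls(urls, stop_event: Event) -> Iterator[Union[str, RequestInterruptedError]]:
    for url in urls:
        if stop_event.is_set():
            # Yield the error instead of raising, then stop the generator.
            yield RequestInterruptedError(f"interrupted before fetching {url}")
            return
        yield f"<html for {url}>"  # placeholder for the real download


def consume(urls, stop_event: Event) -> None:
    for html in fetch_htmls(urls, stop_event):
        if isinstance(html, RequestInterruptedError):
            break  # clean exit: no crashed thread, no stop flag leaking into the crawler scope
        ...  # parse html as usual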

@addie9800 (Collaborator Author)

Realization Ideas:

  • Register Default Events like 'stop'
  • CrashThread Exception
  • Move publisher robots testing away from _fetch_articles in CCNewsCrawler, ideally in background

@addie9800 addie9800 requested a review from MaxDall August 30, 2025 09:23