Add skip_publishers_disallowing_training
#772
base: master
Conversation
src/fundus/scraping/crawler.py
Outdated
# register default events
__EVENTS__.register_event("stop")
I have to admit, I'm not entirely comfortable with moving the event register here. It feels somewhat disconnected from the rest of the code. While I understand the necessity behind this choice, it does strike me as a potential design flaw that we should aim to address.
More broadly, enforcing strict access to events, requiring them to be registered before they can be checked, might not be the best approach. The original intention was to introduce a level of control, ensuring that accessing an unregistered event would result in an error. However, I'm not sure whether this tradeoff is worth it. What do you think?
Can you perhaps give me an example where accessing unregistered events might cause problems? Especially with regard to the stop event, I feel the registration step is devalued by the fact that we do it by default for all threads when crawling. I could imagine it becoming more important for future events that might be added. So I would also lean towards a solution that automatically registers the stop event, since we will likely need it in most situations.
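For illustration, a minimal sketch of a registry that auto-registers the stop event while still failing loudly for unknown events; EventRegistry and its methods are hypothetical names, not the actual __EVENTS__ implementation:

import threading
from typing import Dict


class EventRegistry:
    DEFAULT_EVENTS = ("stop",)

    def __init__(self) -> None:
        self._events: Dict[str, threading.Event] = {}
        # default events are always available without an explicit register call
        for name in self.DEFAULT_EVENTS:
            self.register_event(name)

    def register_event(self, name: str) -> None:
        self._events.setdefault(name, threading.Event())

    def __getitem__(self, name: str) -> threading.Event:
        # any non-default event still has to be registered before it can be accessed
        if name not in self._events:
            raise KeyError(f"event {name!r} was never registered")
        return self._events[name]


events = EventRegistry()
events["stop"].set()  # no register call needed at the call site
assert events["stop"].is_set()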
src/fundus/scraping/crawler.py
Outdated
if skip_publishers_disallowing_training:
    publishers = tuple(
        [
            publisher
            for publisher in publishers
            if not (publisher.disallows_training or publisher.robots.disallows_training())
        ]
    )
Performing this check for every WARC path introduces unnecessary overhead. It would be far more efficient to verify the publishers once during initialization and then propagate the result. However, this should be implemented in a way that doesn't significantly slow down the CCNewsCrawler's initialization (i.e., not running sequentially) and doesn't depend on the availability of each publisher's robots.txt.
Did you already have an idea of how we would implement that? If we want it to be non-blocking, I have one idea for now: during initialization of the CCNewsCrawler, we start fetching the publishers' preferences in the background. We can then start crawling as usual, especially since we have to stream the entire WARC file anyway, and already begin parsing articles for the supported publishers, either caching them until we have the results of the permission check or yielding/discarding them once the results are in. I don't really see a way to do this without crawling the robots.txt. Caching the preferences might also be an option, but I'm not sure that makes much sense, since we have to crawl it at least once anyway, and if people want to comply with the preferences, I think we should ensure they are up to date.
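If it helps, here is a rough sketch of that non-blocking idea; fetch_training_preferences, publisher.name, and article.publisher are assumed names for the sake of the example, while the disallows_training checks mirror the diff above:

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Iterator


def fetch_training_preferences(publishers) -> Dict[str, bool]:
    # assumed helper: True means the publisher disallows AI training
    return {
        publisher.name: bool(
            publisher.disallows_training or publisher.robots.disallows_training()
        )
        for publisher in publishers
    }


def crawl_with_deferred_check(publishers, parsed_articles: Iterable) -> Iterator:
    # start the robots.txt check in the background, buffer parsed articles
    # until the result is in, then yield only the permitted ones
    with ThreadPoolExecutor(max_workers=1) as executor:
        preferences = executor.submit(fetch_training_preferences, publishers)
        buffered = []
        for article in parsed_articles:  # parsing continues while the check runs
            if not preferences.done():
                buffered.append(article)
                continue
            disallowed = preferences.result()
            for pending in (*buffered, article):
                if not disallowed.get(pending.publisher, False):
                    yield pending
            buffered.clear()
        # flush anything still buffered if the iterator ended before the check finished
        disallowed = preferences.result()
        yield from (a for a in buffered if not disallowed.get(a.publisher, False))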
src/fundus/scraping/session.py
Outdated
if isinstance(response, Exception):
    raise response
return response
What's the reason to add that to the try-except block?
It was probably included there by accident.
src/fundus/scraping/session.py
Outdated
class RequestInterruptedError(Exception):
    """Raised when a request is interrupted by a stop event."""

    pass
I'm using this part of the code to formulate my general concerns.
- Thanks a lot for the stability improvements. Getting threading under control seems to be a real hassle.
- I would appreciate adopting the best practice of keeping pull requests functionally isolated whenever possible. Combining new crawler functionality with stability improvements (like the threading changes) makes the review process significantly harder, and now the features depend on each other.
- In general, I prefer your solution of introducing a custom error for the interrupt and catching it where possible to control the code flow, rather than intentionally crashing threads with uncaught errors. But I see a problem here, as it would require us to handle interruptions at every potential iteration point.
For example, take this part of the BaseScraper:
for html in source.fetch(url_filter=url_filter):
    parser = self.parser_mapping[html.source_info.publisher]
To prevent it from looping over every source (which it currently would), we would have to check for the stop event here as well, introducing the flag into the scope of the CCNewsCrawler, which I would like to avoid.
With this in mind, I would prefer crashing the thread intentionally by raising uncaught RequestInterruptedErrors, but what's your opinion on this?
You are definitely correct, sorry for the extra confusion with the combined PR. I would also very much appreciate having them separate at this point.
While I see the issue you raised, I'm not sure I prefer the alternative of purposely crashing the thread. Maybe there is a way to handle it a bit more gracefully? What is your opinion on a solution similar to the queue setting, where we pass the exception back through the queue? We could do something similar for the iterator. Then we would avoid reusing the flag out of scope, but we could still cleanly exit the loop with a break instead of crashing out.
Realization Ideas:
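For illustration, a minimal sketch of the pass-the-interrupt-through-the-iterator idea; the fetch generator, its arguments, and the consumer loop below are assumptions for the sake of the example, not the actual Fundus code:

import threading
from typing import Iterator, Union


class RequestInterruptedError(Exception):
    """Raised when a request is interrupted by a stop event."""


def fetch(urls, stop_event: threading.Event) -> Iterator[Union[str, RequestInterruptedError]]:
    # yield the interrupt as a value instead of raising it inside the thread
    for url in urls:
        if stop_event.is_set():
            yield RequestInterruptedError(f"stopped before fetching {url}")
            return
        yield f"<html for {url}>"  # stand-in for the real response


# consumer side, roughly where the BaseScraper loop sits
stop_event = threading.Event()
for item in fetch(["https://example.com/a", "https://example.com/b"], stop_event):
    if isinstance(item, RequestInterruptedError):
        break  # exit the loop cleanly instead of crashing the thread
    print(item)  # ... parse the HTML as usual ...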
This pull request introduces functionality to filter publishers based on their policies regarding AI training and enhances error handling in the crawling process.
AI Training Filtering Enhancements:
- Updated README.md and added detailed instructions in docs/5_advanced_topics.md about verifying publisher policies before using their content for AI training. Introduced a new section, "Filtering publishers for AI training," explaining the skip_publishers_disallowing_training parameter. [1] [2] [3]
- robots.txt parsing: Enhanced the CustomRobotFileParser class in src/fundus/publishers/base_objects.py to detect AI training disallowance keywords in comments and added the disallow_training attribute to indicate such restrictions. The Robots class was also modified to self-ensure the readiness of the Robots File Parser instead of requiring us to check whether the parser is ready and call the read() function if necessary.
- Added the disallows_training attribute to the Publisher class so that known disallowances can be hardcoded into the Publisher setup. [1] [2]
Crawling Logic Improvements:
- Updated the crawl() and _fetch_articles() methods in src/fundus/scraping/crawler.py to support the skip_publishers_disallowing_training parameter, ensuring publishers with AI training restrictions are excluded. [1] [2] [3]
- The earlier approach of using task_done() technically solved the problem, but only by causing the thread to crash. While this did have the intended effect, it was a messy solution. The new solution introduces a RequestInterruptedError exception in src/fundus/scraping/session.py and updates the crawling logic to handle interrupted requests gracefully. [1] [2] [3]
These updates provide better compliance with publisher policies and improve the robustness of the crawling process.
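For reference, a hedged usage sketch of the new parameter; the CCNewsCrawler import and PublisherCollection.us follow the existing Fundus API, while exposing the flag on crawl() is an assumption based on the description above:

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection.us)

# skip publishers whose setup or robots.txt disallows AI training
# (parameter name from this PR; its exact placement may differ)
for article in crawler.crawl(max_articles=10, skip_publishers_disallowing_training=True):
    print(article.title)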
and updates the crawling logic to handle interrupted requests gracefully. [1] [2] [3]These updates provide better compliance with publisher policies and improve the robustness of the crawling process.