Hi,
Thanks for Colly! :)
I have a task with hundreds of thousands of pages, so obviously I am using Colly's caching, but it's basically 'too much' for the filesystem (wasted disk space, a pain to manage, slow to back up, etc.).
I'd like to propose a pluggable cache system, similar to how you've made other Colly components pluggable.
Perhaps with an API like this:
    type Cache interface {
        Init() error
        Get(url string) (*colly.Response, error)
        Put(url string, r colly.Response) error
        Remove(url string) error
    }
...or...
    type Cache interface {
        Init() error
        Get(url string) ([]byte, error)
        Put(url string, data []byte) error
        Remove(url string) error
    }
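The first one won't be possible if you then wish to implement FileSystemCache in a subpackage of Colly, though: the subpackage would have to import colly for the Response type while colly imports the subpackage, and Go rejects that as an import cycle.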
The reason I also need a Remove method is that one project has a site that sometimes serves maintenance pages. I can detect these, but Colly currently has no way of saying "stop", or of rejecting a page after processing. Obviously, the last thing I want partway through a crawl is to have my cache poisoned. That's probably a separate issue, though, and one I can live with if I can remove the bad pages myself.
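A rough sketch of how Remove would be used for that, assuming the []byte variant of the interface. isMaintenancePage, evictIfMaintenance and mapCache are all hypothetical names, and the marker string is a made-up example; a real detector would be site-specific.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// Cache is the byte-oriented variant of the proposed interface.
type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}

// mapCache is a throwaway in-memory Cache, here only to make the sketch runnable.
type mapCache map[string][]byte

func (m mapCache) Init() error { return nil }
func (m mapCache) Get(url string) ([]byte, error) {
	data, ok := m[url]
	if !ok {
		return nil, errors.New("cache miss")
	}
	return data, nil
}
func (m mapCache) Put(url string, data []byte) error { m[url] = data; return nil }
func (m mapCache) Remove(url string) error           { delete(m, url); return nil }

// isMaintenancePage is a hypothetical detector using a made-up marker string.
func isMaintenancePage(body []byte) bool {
	return bytes.Contains(body, []byte("Down for maintenance"))
}

// evictIfMaintenance removes url from the cache when its cached body looks
// like a maintenance page, and reports whether it evicted anything.
func evictIfMaintenance(c Cache, url string) (bool, error) {
	body, err := c.Get(url)
	if err != nil {
		return false, err
	}
	if !isMaintenancePage(body) {
		return false, nil
	}
	return true, c.Remove(url)
}

func main() {
	c := mapCache{}
	_ = c.Put("https://example.com/a", []byte("<h1>Real content</h1>"))
	_ = c.Put("https://example.com/b", []byte("<h1>Down for maintenance</h1>"))
	for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
		evicted, err := evictIfMaintenance(c, u)
		fmt.Println(u, "evicted:", evicted, "err:", err)
	}
	fmt.Println("entries left:", len(c))
}
```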
If pluggable caches were to be implemented, I have existing code from another project that I would happily port over: a cache built using SQLite as a key-value store, compressing/decompressing the data with Zstandard (it's both surprisingly quick and super efficient on disk space). This can either become part of Colly, or a separate thing on my own GitHub.
I did start implementing this myself, but ran into a problem with how I went about it. I followed the pattern you have established of having the separate components as subpackages, and I was trying to preserve existing behaviour / API compatibility. I then got bitten because my FileSystemCache couldn't easily reach into the Collector to get the CacheDir. Maybe that's not an issue. Maybe these bits shouldn't be subpackages. Once I started thinking things like that, I figured it was best to discuss before (or instead of) proceeding any further.
Your thoughts?