Pluggable cache system? #103

@jimsmart

Hi,

Thanks for Colly! :)

I have a task involving hundreds of thousands of pages, so obviously I am using Colly's caching, but that many files is basically 'too much' for the filesystem (wasted disk space, a pain to manage, slow to back up, etc.).

I'd like to propose a pluggable cache system, similar to how you've made other Colly components.

Perhaps with an API like this:

type Cache interface {
	Init() error
	Get(url string) (*colly.Response, error)
	Put(url string, r colly.Response) error
	Remove(url string) error
}

...or...

type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}

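The first one won't be possible, though, if you then wish to implement FileSystemCache in a subpackage of Colly: the subpackage would have to import colly for the Response type, and colly would presumably import the subpackage for its default cache, which Go rejects as an import cycle.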
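As a rough sketch of how the second (byte-based) variant might look in practice, here is a minimal in-memory implementation. `MemoryCache` and `ErrNotFound` are hypothetical names used only for illustration; neither is part of Colly, and the proposal above does not say how a miss should be reported.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Cache mirrors the second (byte-based) interface proposed above.
type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}

// ErrNotFound is a hypothetical sentinel for a cache miss (an assumption,
// not part of the proposal).
var ErrNotFound = errors.New("cache: entry not found")

// MemoryCache is a hypothetical in-memory implementation, for illustration only.
// Init must be called before any other method.
type MemoryCache struct {
	mu      sync.RWMutex
	entries map[string][]byte
}

func (c *MemoryCache) Init() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = make(map[string][]byte)
	return nil
}

func (c *MemoryCache) Get(url string) ([]byte, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	data, ok := c.entries[url]
	if !ok {
		return nil, ErrNotFound
	}
	// Return a copy so callers cannot mutate the cached bytes.
	return append([]byte(nil), data...), nil
}

func (c *MemoryCache) Put(url string, data []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	// Store a copy so later changes to data do not affect the cache.
	c.entries[url] = append([]byte(nil), data...)
	return nil
}

func (c *MemoryCache) Remove(url string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, url)
	return nil
}

func main() {
	var c Cache = &MemoryCache{}
	if err := c.Init(); err != nil {
		panic(err)
	}
	_ = c.Put("https://example.com/", []byte("<html>hello</html>"))
	body, err := c.Get("https://example.com/")
	fmt.Println(string(body), err)
	_ = c.Remove("https://example.com/")
	_, err = c.Get("https://example.com/")
	fmt.Println(errors.Is(err, ErrNotFound))
}
```

Because this variant only deals in strings and byte slices, an implementation like this would not need to import colly at all.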

The reason I also need a Remove method is that one project crawls a site that sometimes serves maintenance pages. Whilst I can detect these, Colly currently has no way of saying 'stop', or of rejecting a page after processing. Obviously, the last thing I want partway through a crawl is to have my cache poisoned. That's probably a separate issue, though, and one I can live with if I can remove bad pages myself.
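To illustrate the Remove use case, here is a small sketch of evicting a page once it has been recognised as a maintenance page. Everything in it is hypothetical: `mapCache` stands in for a real cache, `isMaintenancePage` is a placeholder for site-specific detection, and `rejectIfBad` is just one way the check could be wired up.

```go
package main

import (
	"bytes"
	"fmt"
)

// Remover is the subset of the proposed Cache interface needed here.
type Remover interface {
	Remove(url string) error
}

// mapCache is a hypothetical stand-in cache backed by a map.
type mapCache map[string][]byte

func (m mapCache) Remove(url string) error {
	delete(m, url)
	return nil
}

// isMaintenancePage is a hypothetical detector; real detection is site-specific.
func isMaintenancePage(body []byte) bool {
	return bytes.Contains(body, []byte("Site under maintenance"))
}

// rejectIfBad evicts url from the cache when body is a maintenance page
// and reports whether the page was rejected.
func rejectIfBad(c Remover, url string, body []byte) (bool, error) {
	if !isMaintenancePage(body) {
		return false, nil
	}
	if err := c.Remove(url); err != nil {
		return true, fmt.Errorf("evicting %s: %w", url, err)
	}
	return true, nil
}

func main() {
	cache := mapCache{
		"https://example.com/a": []byte("<h1>Site under maintenance</h1>"),
		"https://example.com/b": []byte("<h1>Real content</h1>"),
	}
	// Deleting entries while ranging over a map is safe in Go.
	for url, body := range cache {
		rejected, err := rejectIfBad(cache, url, body)
		fmt.Println(url, rejected, err)
	}
	fmt.Println("entries left:", len(cache))
}
```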

If pluggable caches were to be implemented, I have existing code from another project that I would happily port over: a cache built using SQLite as a key-value store, compressing/decompressing the data with Zstandard (it's both surprisingly quick and super efficient on disk space). This could either become part of Colly or live separately on my own GitHub.
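The compress-on-write, decompress-on-read pattern can be sketched with the standard library alone. Note the substitutions: this uses gzip in place of Zstandard (which is not in Go's standard library) and an in-memory map in place of SQLite, so it shows only the shape of the wrapper, not the actual code mentioned above. `Store`, `memStore`, and `CompressedCache` are hypothetical names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// Store is a hypothetical key-value backend; a SQLite table could sit behind it.
type Store interface {
	Get(key string) ([]byte, error)
	Put(key string, value []byte) error
}

// memStore is an in-memory stand-in for the SQLite store.
type memStore map[string][]byte

func (m memStore) Get(key string) ([]byte, error) {
	v, ok := m[key]
	if !ok {
		return nil, errors.New("not found")
	}
	return v, nil
}

func (m memStore) Put(key string, value []byte) error {
	m[key] = value
	return nil
}

// CompressedCache compresses on Put and decompresses on Get.
// gzip is used here as a stand-in for Zstandard.
type CompressedCache struct {
	Store Store
}

func (c CompressedCache) Put(url string, data []byte) error {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return err
	}
	// Close flushes the remaining compressed bytes and the gzip footer.
	if err := zw.Close(); err != nil {
		return err
	}
	return c.Store.Put(url, buf.Bytes())
}

func (c CompressedCache) Get(url string) ([]byte, error) {
	packed, err := c.Store.Get(url)
	if err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(bytes.NewReader(packed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	store := memStore{}
	cache := CompressedCache{Store: store}
	page := bytes.Repeat([]byte("<p>repetitive markup</p>"), 200)
	if err := cache.Put("https://example.com/", page); err != nil {
		panic(err)
	}
	got, err := cache.Get("https://example.com/")
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", bytes.Equal(got, page))
	fmt.Println("raw bytes:", len(page), "stored bytes:", len(store["https://example.com/"]))
}
```

Keeping compression in a wrapper like this means the storage backend only ever sees opaque bytes, so the codec and the store can be swapped independently.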

I did start implementing this myself, but ran into a problem with how I went about it. I followed the pattern you have established of having the separate components as subpackages, and then got bitten because my FileSystemCache couldn't easily reach into the Collector to get the CacheDir (I was trying to preserve existing behaviour and API compatibility). Maybe that's not an issue. Maybe these bits shouldn't be subpackages. Once I started thinking along those lines, I figured it was best to discuss before, or instead of, proceeding any further.
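One possible way around the CacheDir problem, sketched under assumptions: the filesystem cache is handed its directory at construction time, and the collector falls back to building one from CacheDir when no cache has been set explicitly. `Collector` here is a hypothetical stand-in, not colly.Collector, and `resolveCache`, the `Cache` field, and the SHA-256 file-naming scheme are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// Cache is the byte-based interface proposed earlier in this issue.
type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}

// FileSystemCache is given its directory up front, so it never has to
// reach back into the collector.
type FileSystemCache struct {
	Dir string
}

func (f *FileSystemCache) Init() error { return os.MkdirAll(f.Dir, 0o750) }

// path maps a URL to a file name via its SHA-256 hex digest (an assumed scheme).
func (f *FileSystemCache) path(url string) string {
	sum := sha256.Sum256([]byte(url))
	return filepath.Join(f.Dir, hex.EncodeToString(sum[:]))
}

func (f *FileSystemCache) Get(url string) ([]byte, error) { return os.ReadFile(f.path(url)) }

func (f *FileSystemCache) Put(url string, data []byte) error {
	return os.WriteFile(f.path(url), data, 0o640)
}

func (f *FileSystemCache) Remove(url string) error { return os.Remove(f.path(url)) }

// Collector is a hypothetical stand-in for colly.Collector.
type Collector struct {
	CacheDir string // existing-style setting, kept for compatibility
	Cache    Cache  // hypothetical pluggable setting
}

// resolveCache prefers an explicit Cache and falls back to CacheDir.
func (c *Collector) resolveCache() Cache {
	if c.Cache != nil {
		return c.Cache
	}
	if c.CacheDir != "" {
		return &FileSystemCache{Dir: c.CacheDir}
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "cache-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	c := &Collector{CacheDir: dir} // old-style configuration still works
	cache := c.resolveCache()
	if err := cache.Init(); err != nil {
		panic(err)
	}
	if err := cache.Put("https://example.com/", []byte("hello")); err != nil {
		panic(err)
	}
	body, err := cache.Get("https://example.com/")
	fmt.Println(string(body), err)
}
```

With the byte-based interface, the cache type has no need to import the collector's package, so it could live in a subpackage without creating an import cycle.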

Your thoughts?
