Common Crawl Foundation

73 repositories

cc-crawl-statistics
Public
Statistics of Common Crawl monthly archives mined from URL index files
commoncrawl common-crawl statistics
Python
•
Apache License 2.0
•13•192•1•1•Updated Sep 2, 2025
web-languages
Public
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code
crawling language-detection dataset
69•56•2•5•Updated Aug 30, 2025
cc-pyspark
Public
Process Common Crawl data with Python and Spark
pyspark sparksql wet commoncrawl common-crawl spark warc-files wat-files
Python
•
MIT License
•90•440•4•2•Updated Aug 29, 2025
ia-hadoop-tools
Public
Web archiving tools on Hadoop
Java
•28•4•2•1•Updated Aug 28, 2025
ia-web-commons
Public
Web archiving utility library
cdx-files warc-files wat-files
Java
•
Apache License 2.0
•74•11•4•1•Updated Aug 28, 2025
nutch
Public
Common Crawl fork of Apache Nutch
big-data hadoop web-crawler commoncrawl java
Java
•
Apache License 2.0
•1.3k•37•6•0•Updated Aug 27, 2025
cc-webgraph
Public
Tools to construct and process Common Crawl webgraphs
pagerank webgraph commoncrawl common-crawl centrality-measures webgraph-framework
Java
•
Apache License 2.0
•5•96•2•0•Updated Aug 26, 2025
cc-webgraph-statistics
Public
Statistics of Common Crawl monthly Web Graphs
Python
•
Apache License 2.0
•1•5•0•0•Updated Aug 21, 2025
cc-downloader
Public
A polite and user-friendly downloader for Common Crawl data
rust downloader commoncrawl
Rust
•
Apache License 2.0
•4•57•1•2•Updated Aug 17, 2025
cc-citations
Public
Scientific articles using or citing Common Crawl data
bibtex bibliography opendata
Jupyter Notebook
•3•25•0•0•Updated Aug 15, 2025
cc-host-index
Public
Tools for working with the host index
Python
•2•7•1•0•Updated Aug 11, 2025
cc-index-table
Public
Index Common Crawl archives in tabular format
columnar-storage aws-athena apache-parquet commoncrawl sql spark
Java
•
Apache License 2.0
•14•122•7•1•Updated Jul 31, 2025
cc-nutch-example
Public
Apache Nutch example project to archive content in WARC files
Shell
•
Apache License 2.0
•2•3•0•0•Updated Jul 30, 2025
whirlwind-python
Public
A whirlwind tour of Common Crawl's data using Python
python tutorial archive warc
Python
•
Apache License 2.0
•6•26•0•0•Updated Jul 29, 2025
crawler-commons
Public
A set of reusable Java components that implement functionality common to any web crawler
Java
•
Apache License 2.0
•86•2•0•0•Updated Jul 29, 2025
web-languages-code
Public
The code used to generate templates for the web-languages repo https://github.com/commoncrawl/web-languages
Python
•
Apache License 2.0
•2•2•0•2•Updated Jul 11, 2025
warcio-s3
Public
Streaming WARC/ARC library for fast web archive IO
Python
•
Apache License 2.0
•65•0•0•0•Updated Jul 6, 2025
cc-warcinfo-index-builder
Public
Code to build an index that maps warcinfo-id to crawl / warc
Python
•0•0•0•0•Updated May 27, 2025
cc-index-annotations
Public
Example code to join an annotation to a host or URL index
Python
•0•1•0•0•Updated May 25, 2025
cc-host-index-media
Public
Media files used in the README.md of cc-host-index
HTML
•0•0•0•0•Updated Apr 20, 2025
wac2025-cc-annotator-poster
Public
A proof of concept pipeline for WARC annotation
Rust
•
Apache License 2.0
•0•1•0•0•Updated Apr 10, 2025
wac2025-webgraph-workshop
Public
Introduction to WebGraphs - Workshop at the IIPC Web Archiving Conference 2025
Shell
•
MIT License
•0•3•0•0•Updated Apr 10, 2025
arc2warc-conversion
Public
Experiences converting Common Crawl's ARC files from the 2008 - 2012 crawls to the WARC format
arc warc webarchive webarchiving warc-files warc-format arc-files
0•0•0•0•Updated Apr 3, 2025
cc-notebooks
Public
Various Jupyter notebooks about Common Crawl data
jupyter-notebook aws-athena commoncrawl common-crawl webarchiving webgraph-framework
Jupyter Notebook
•
Apache License 2.0
•11•57•0•0•Updated Apr 1, 2025
robotstxt-experiments
Public
How is the Robots Exclusion Protocol (robots.txt) used on the WWW? This project tries to gain insights by mining Common Crawl's robots.txt captures from the years 2016 – 2024.
robots-txt robotstxt
Jupyter Notebook
•
MIT License
•0•0•0•0•Updated Mar 27, 2025
cc-index-server
Public
Common Crawl Index Server
cc-index
HTML
•28•70•6•1•Updated Feb 28, 2025
warcio
Public
Streaming WARC/ARC library for fast web archive IO
Python
•
Apache License 2.0
•65•1•0•0•Updated Feb 27, 2025
news-crawl
Public
News crawling with StormCrawler - stores content as WARC
news apache-storm warc commoncrawl common-crawl storm-crawler crawler web-crawler
Java
•
Apache License 2.0
•39•355•15•1•Updated Feb 19, 2025
webarchive-indexing
Public
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
cc-index
Python
•
MIT License
•11•6•0•2•Updated Jan 27, 2025
uap-core
Public
The regex file necessary to build language ports of Browserscope's user agent parser.
JavaScript
•
Other
•459•0•0•0•Updated Jan 17, 2025