Large-Scale-Web-Crawl-Analysis

Description

This repository contains the resources and findings of the "Big Data Environmental Terms Analysis" project, carried out as coursework at Radboud University. The project uses big data technologies to measure how often environmental terms appear across web pages stored in WARC web-archive files, focusing on keywords associated with climate change, pollution, and renewable energy. The results are intended to help environmental advocacy groups, policymakers, and marketers gauge public engagement with environmental issues.

Objective

The main objective of this project is to identify and count specific environmental terms in archived web pages and to analyze how frequently they occur. Term frequency is used as an indicator of public interest in and awareness of key environmental challenges such as climate change, sustainability, and renewable energy.
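To make the counting step concrete, here is a minimal sketch in plain Scala with no Spark dependency. It assumes counting means case-insensitive, non-overlapping substring matches. The object name `TermCounter`, its methods, and the term list are hypothetical illustrations, not the project's actual code or keyword set.

```scala
import scala.annotation.tailrec

object TermCounter {
  // Hypothetical term list; the project's real keywords are not listed in this README.
  val terms: Seq[String] = Seq("climate change", "pollution", "renewable energy")

  /** Count case-insensitive, non-overlapping occurrences of `term` in `text`. */
  def countOccurrences(text: String, term: String): Int = {
    val haystack = text.toLowerCase
    val needle = term.toLowerCase
    if (needle.isEmpty) 0
    else {
      @tailrec
      def loop(from: Int, acc: Int): Int = {
        val idx = haystack.indexOf(needle, from)
        // Stop when no further match exists; otherwise skip past this match.
        if (idx < 0) acc else loop(idx + needle.length, acc + 1)
      }
      loop(0, 0)
    }
  }

  /** Sum the per-term counts over a collection of page texts. */
  def termFrequencies(pages: Seq[String]): Map[String, Int] =
    terms.map(t => t -> pages.map(p => countOccurrences(p, t)).sum).toMap
}
```

Calling `TermCounter.termFrequencies` on a sequence of extracted page texts returns a map from each term to its total count. In a distributed setting the same per-page function could be applied to each record before the counts are summed.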

Technologies Used

Apache Spark: Used for large-scale data processing.
Scala: Programming language for application development.
Jsoup: For parsing HTML content from web pages.
Zeppelin Notebook: For interactive data exploration and visualization.
WARC Files: Web ARChive files used as the data source.
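The following is a rough sketch of how these pieces could fit together, assuming Spark and Jsoup are on the classpath. It is not the notebook's actual code: the object name `EnvTermsSketch`, the method `pagesContaining`, and the term list are hypothetical, and the input is a small in-memory collection of HTML strings rather than real WARC records, since this README does not say how the WARC files are read. Note that this variant counts the number of pages that mention each term, not the total number of occurrences.

```scala
import org.apache.spark.sql.SparkSession
import org.jsoup.Jsoup

object EnvTermsSketch {
  // Hypothetical term list, for illustration only.
  val terms: Seq[String] = Seq("climate change", "pollution", "renewable energy")

  /** For each term, count how many HTML pages mention it at least once. */
  def pagesContaining(spark: SparkSession, htmlPages: Seq[String]): Map[String, Int] = {
    val localTerms = terms // local copy so the closure captures only the list
    spark.sparkContext
      .parallelize(htmlPages)
      // Jsoup strips the markup and returns the visible text of the page.
      .map(html => Jsoup.parse(html).text().toLowerCase)
      // Emit (term, 1) for every term found in the page text.
      .flatMap(text => localTerms.filter(t => text.contains(t)).map(t => (t, 1)))
      .reduceByKey(_ + _)
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // Local mode is assumed here so the sketch runs on a single machine.
    val spark = SparkSession.builder()
      .appName("env-terms-sketch")
      .master("local[*]")
      .getOrCreate()

    val sample = Seq(
      "<html><body><h1>Climate change</h1><p>Renewable energy cuts pollution.</p></body></html>",
      "<html><body><p>Pollution levels fell.</p></body></html>"
    )
    pagesContaining(spark, sample).foreach { case (term, n) => println(s"$term: $n") }
    spark.stop()
  }
}
```

In a Zeppelin notebook a Spark session is typically already provided, so the builder call in `main` would likely be unnecessary there.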

Installation

git clone https://github.com/laurian19/Big-Data-Environmental-Terms-Analysis
cd Big-Data-Environmental-Terms-Analysis

Usage

See the Zeppelin notebook included in this repository for detailed steps on running the analysis, and follow the instructions in the notebook to reproduce it.

Authors

Laurian Duma - Initial work - Radboud University
