Description
This repository contains the resources and findings of the "Big Data Environmental Terms Analysis" project, completed as coursework at Radboud University. The project uses big data technologies to measure how often environmental terms appear across web pages, focusing on keywords associated with climate change, pollution, and renewable energy. The results are intended to help environmental advocacy groups, policymakers, and marketers understand public engagement with environmental issues.
Objective
The main objective of this project is to count how often specific environmental terms appear on web pages and to analyze those frequencies. The counts serve as a measure of public interest in and awareness of key environmental challenges such as climate change, sustainability, and renewable energy.
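The counting step can be illustrated with a minimal sketch in plain Scala. This is not the notebook's actual code: the object name `TermFrequency`, the function `countTerms`, and the keyword list are hypothetical, and the matching rule (case-insensitive, non-overlapping substring matches) is an assumption.

```scala
object TermFrequency {
  // Hypothetical keyword list; the terms used in the project may differ.
  val terms: Seq[String] = Seq("climate change", "pollution", "renewable energy")

  /** Count case-insensitive, non-overlapping occurrences of each term in `text`.
    * Matching is by substring, so "pollution" also matches inside "pollutions".
    */
  def countTerms(text: String, terms: Seq[String]): Map[String, Int] = {
    val lower = text.toLowerCase
    terms.map { term =>
      val needle = term.toLowerCase

      // Walk through the text, jumping past each match that is found.
      @annotation.tailrec
      def loop(from: Int, acc: Int): Int = {
        val idx = lower.indexOf(needle, from)
        if (idx < 0) acc else loop(idx + needle.length, acc + 1)
      }

      // An empty term would match everywhere, so it is counted as zero.
      term -> (if (needle.isEmpty) 0 else loop(0, 0))
    }.toMap
  }
}
```

For example, `TermFrequency.countTerms("Climate change drives pollution; pollution harms health.", TermFrequency.terms)` yields a count of 1 for "climate change", 2 for "pollution", and 0 for "renewable energy". At scale, per-page maps like this one could be summed across all pages, for instance with Spark.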
Technologies Used
Apache Spark: Used for large-scale data processing.
Scala: Programming language for application development.
Jsoup: For parsing HTML content from web pages.
Zeppelin Notebook: For interactive data exploration and visualization.
WARC Files: Web ARChive files used as the data source.
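As a rough illustration of where Jsoup fits, the sketch below turns an HTML string into plain text that could then be searched for terms. It assumes the Jsoup library is on the classpath; the object name `HtmlText` and the function `visibleText` are hypothetical and are not taken from the notebook.

```scala
import org.jsoup.Jsoup

object HtmlText {
  /** Parse an HTML string with Jsoup and return its text content without tags. */
  def visibleText(html: String): String =
    Jsoup.parse(html).text()
}
```

A call such as `HtmlText.visibleText("<html><body><h1>Renewable energy</h1><p>Solar and wind.</p></body></html>")` returns the heading and paragraph text with the markup removed. In a Spark job, a function like this might be applied to the HTML payload of each WARC record before counting terms.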
Installation
git clone https://github.com/laurian19/Big-Data-Environmental-Terms-Analysis
cd Big-Data-Environmental-Terms-Analysis
Usage
Open the Zeppelin Notebook included in this repository and follow its step-by-step instructions to run the analysis.
Authors
Laurian Duma - Initial work - Radboud University