This project is a Python-based script to scrape product and location data from different restaurant websites. It supports parallel execution for faster processing and makes it easy to add new parsers.
- Scrapes data from specified restaurants.
- Saves the scraped data to `.hlds` files for further use.
- Supports parallel and sequential execution modes, configurable via a boolean flag.
- Designed to be extendable with new restaurant parsers.
- Python 3.x installed on your system.
- Install the required dependencies (e.g., `requests`, `BeautifulSoup`).
- **Clone the Repository:**
  ```bash
  git clone https://github.com/ZeusWPI/haldis_een_prijsje.git
  cd haldis_een_prijsje
  ```
- **Install Dependencies:**
  ```bash
  pip install -r requirements.txt
  ```
- **Run the Script:** To scrape data for a specific restaurant:
  ```bash
  python main.py
  ```
  - Set `restaurant_name` to the desired restaurant (e.g., `"simpizza"`) in the script.
  - Enable or disable parallelism by toggling the `use_parallelism` flag (`True` or `False`).
- **Run for All Restaurants:** Set `run_everything` to `True` in the script to scrape data from all available restaurants.
- `restaurant_name`: Set this to the name of the restaurant you want to scrape.
- `use_parallelism`: Set to `True` for parallel execution or `False` for sequential execution.
- `run_everything`: Set to `True` to scrape all restaurants; otherwise, leave it as `False`.
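The flag names above come straight from the script; as a minimal sketch, they might sit near the top of `main.py` like this (exact placement and defaults in the real script may differ):

```python
# Configuration flags in main.py (illustrative placement only)
restaurant_name = "simpizza"   # which restaurant to scrape
use_parallelism = True         # True: run scrapers in parallel; False: sequentially
run_everything = False         # True: scrape every supported restaurant
```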
To add support for a new restaurant scraper:
- **Check for Open Issues:**
  - Navigate to the Issues section of this repository.
  - Look for an unassigned issue related to the new parser.
  - Assign the issue to yourself.
- **Implement the Parser:**
  - Create a new scraper file under `scrapers/` (e.g., `newrestaurant_scraper.py`).
  - Implement a `get_prices()` method in the new scraper (see the interface in `scrapers/scraper.py` and the sketch below), returning:
    - A list of products.
    - Location information.
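  A minimal, illustrative sketch of such a scraper. The base interface in `scrapers/scraper.py` is not reproduced here, so the class layout, product representation, URL, and HTML markup below are assumptions:

  ```python
  # scrapers/newrestaurant_scraper.py -- illustrative sketch only; check
  # scrapers/scraper.py for the actual interface and return types.
  import requests
  from bs4 import BeautifulSoup


  class NewRestaurantScraper:
      @staticmethod
      def get_prices():
          # Hypothetical menu URL and markup for the new restaurant.
          response = requests.get("https://newrestaurant.example/menu")
          soup = BeautifulSoup(response.text, "html.parser")

          products = []
          for item in soup.find_all("div", class_="menu-item"):
              name = item.find("h2").get_text(strip=True)
              price = item.find("span", class_="price").get_text(strip=True)
              # The real project may use a dedicated product type instead of dicts.
              products.append({"name": name, "price": price})

          location = "NewRestaurant, Example Street 1, Ghent"  # placeholder location info
          return products, location
  ```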
- **Add the Parser to the Main Script:**
  - Define a new function in `main.py` (e.g., `run_newrestaurant`):
    ```python
    def run_newrestaurant():
        newrestaurant_products, newrestaurant_location = NewRestaurantScraper.get_prices()
        with open("hlds_files/newrestaurant.hlds", "w", encoding="utf-8") as file:
            file.write(str(newrestaurant_location) + "\n")
            file.write(translate_products_to_text(newrestaurant_products))
        print("newrestaurant done")
    ```
  - Add the function conditionally to the `tasks` list in `main.py`:
    ```python
    if restaurant_name.lower() == "newrestaurant" or run_everything:
        tasks.append(run_newrestaurant)
    ```
- **Test Your Parser:**
  - Run the script to ensure your parser works as expected.
  - Fix any bugs or errors.
- **Submit Your Work:**
  - Mark the issue as resolved and create a pull request to merge your changes.
- Always assign yourself an open issue before starting work.
- Follow the project structure and coding conventions.
- Test your changes thoroughly before submitting a pull request.
- Ensure your code is well-documented.
This document provides an overview of all utility functions included in the script.
Fetches and parses the HTML content from a given URL.
- Parameters:
  - `url` (str): The URL to fetch.
- Returns: A `BeautifulSoup` object containing the parsed HTML, or `None` if fetching fails.
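The function name is not shown in this overview; as a hedged sketch, such a helper is typically built from `requests` and `BeautifulSoup` like this (the name `fetch_and_parse` is hypothetical):

```python
import requests
from bs4 import BeautifulSoup


def fetch_and_parse(url):  # hypothetical name for the utility described above
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None  # fetching failed
    return BeautifulSoup(response.text, "html.parser")
```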
Filters out non-UTF-8 characters from the given text.
- Parameters:
  - `text` (str): The input text to filter.
- Returns: A string containing only UTF-8 characters.
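A minimal sketch of one common way to implement such a filter (the actual utility may work differently; the name is hypothetical):

```python
def filter_non_utf8(text):  # hypothetical name
    # Encode with errors="ignore" to drop characters that cannot be
    # represented in UTF-8, then decode back to a clean string.
    return text.encode("utf-8", errors="ignore").decode("utf-8")
```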
Performs a GET request to the given URL and handles `ConnectionError`.
- Parameters:
  - `link` (str): The URL to fetch.
- Returns: A `requests.Response` object if successful, or an empty string if a `ConnectionError` occurs.
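A minimal sketch of such a wrapper, assuming the standard `requests` API (the function name is hypothetical):

```python
import requests


def safe_get(link):  # hypothetical name
    try:
        return requests.get(link)
    except requests.exceptions.ConnectionError:
        print(f"Could not connect to {link}")
        return ""  # matches the documented empty-string fallback
```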
Extracts all non-empty text content from `<span>` elements within a given `<div>`.
- Parameters:
  - `div`: A BeautifulSoup `<div>` element.
- Returns: A list of strings containing the extracted text.
Filters `<div>` elements from parsed HTML based on a user-defined condition.
- Parameters:
  - `soup`: A `BeautifulSoup` object containing the HTML content.
  - `class_name` (str): The class name of the `<div>` elements to filter.
  - `condition` (callable): A function that takes a `<div>` element and returns `True` if the `<div>` matches the condition.
- Returns: A list of `<div>` elements that match the condition.
Generates a condition function to check if a `<div>` contains an `<h2>` tag with the specified text.
- Parameters:
  - `text_to_search` (str): The text to search for within an `<h2>` tag.
- Returns: A function that takes a `<div>` and returns `True` if it contains an `<h2>` tag with the specified text.
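To show how the two helpers above work together, here is a self-contained sketch with hypothetical names (`filter_divs` and `h2_contains`) standing in for the actual utilities:

```python
from bs4 import BeautifulSoup

html = """
<div class="menu-card"><h2>Pizza Margherita</h2><span>8,50</span></div>
<div class="menu-card"><h2>Pasta Bolognese</h2><span>9,00</span></div>
"""
soup = BeautifulSoup(html, "html.parser")


def h2_contains(text_to_search):
    # Returns a condition usable with the div filter below.
    def condition(div):
        h2 = div.find("h2")
        return h2 is not None and text_to_search in h2.get_text()
    return condition


def filter_divs(soup, class_name, condition):
    return [div for div in soup.find_all("div", class_=class_name) if condition(div)]


matching = filter_divs(soup, "menu-card", h2_contains("Pizza"))
print(len(matching))  # 1
```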
Downloads a PDF from a given URL and saves it locally.
- Parameters:
  - `url` (str): The URL of the PDF to download.
  - `save_path` (str): The local path to save the downloaded PDF.
- Returns: None. Prints a success or failure message.
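A minimal sketch of such a downloader using `requests` (the function name is hypothetical):

```python
import requests


def download_pdf(url, save_path):  # hypothetical name
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, "wb") as file:
            file.write(response.content)
        print(f"PDF saved to {save_path}")
    else:
        print(f"Failed to download PDF from {url}")
```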
Extracts text from a PDF file. Optionally extracts text from a specified rectangular region.
- Parameters:
  - `file_path` (str): Path to the PDF file.
  - `coords` (tuple, optional): A tuple defining the rectangle `(x0, top, x1, bottom)`. Defaults to `None` for full-page extraction.
- Returns: Extracted text as a string.
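The `(x0, top, x1, bottom)` rectangle matches the convention used by `pdfplumber`, so the sketch below assumes that library; whether the project actually uses `pdfplumber`, and the function name, are assumptions:

```python
import pdfplumber  # assumed library


def extract_pdf_text(file_path, coords=None):  # hypothetical name
    with pdfplumber.open(file_path) as pdf:
        page = pdf.pages[0]  # this sketch handles the first page only
        if coords is not None:
            # coords = (x0, top, x1, bottom) in PDF points
            page = page.crop(coords)
        return page.extract_text() or ""
```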
Retrieves the dimensions of a specified page in a PDF.
- Parameters:
  - `file_path` (str): Path to the PDF file.
  - `page_number` (int): The 1-based index of the page. Defaults to 1.
- Returns: A tuple containing the width and height of the page in points.
Converts a string representation of a number with a comma as a decimal separator to a float.
- Parameters:
  - `inp` (str): The input string containing the number (e.g., `"1,23"`).
- Returns: A `float` where commas in the input string are replaced with dots to adhere to standard decimal notation (e.g., `1.23`).
- Example:
  ```python
  number = comma_float("1,23")
  print(number)  # Output: 1.23
  ```