A Node.js scraper for Public.gr that auto-crawls categories under `/cat` (no sitemap) and extracts product data (title, price, availability, specs, image, link) from list pages into JSON and CSV.
- 🧭 BFS crawling of subcategories up to `MAX_DEPTH` (sketched below)
- 🧰 Helpers for blocking overlays, handling cookies, and other page guards
- 🧾 Export to `data/products_all.json` & `data/products_all.csv`
- 🧠 “Smart” target selection: full `/cat`, a single list page, or only a specific subtree
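Roughly, the BFS crawl works like the sketch below. This is illustrative only, not the script's actual code; the `getSubcategoryLinks` helper and its `a[href*="/cat/"]` selector are assumptions.

```js
// Sketch of breadth-first crawling of subcategories up to MAX_DEPTH.
// getSubcategoryLinks and its selector are illustrative, not the script's actual logic.
async function getSubcategoryLinks(browser, url) {
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  // Collect in-page links that stay under /cat (i.e. likely subcategories).
  const links = await page.$$eval('a[href*="/cat/"]', anchors =>
    [...new Set(anchors.map(a => a.href))]
  );
  await page.close();
  return links;
}

async function crawlCategories(browser, rootUrl, maxDepth) {
  const visited = new Set([rootUrl]);
  let frontier = [rootUrl];
  const listPages = [];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      listPages.push(url); // every visited category is a candidate product-list page
      for (const link of await getSubcategoryLinks(browser, url)) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next; // move one level deeper
  }
  return listPages;
}
```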
```text
├─ helper/
│  └─ helpers.js       # helper functions
├─ utils/
│  └─ export.js        # exportToCSV(...)
├─ scrapePublic.js     # main script
├─ data/               # output folder (ignored by git)
├─ package.json
└─ README.md
```
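For orientation, `utils/export.js` might look roughly like the minimal sketch below; the actual module may handle fields and escaping differently.

```js
// utils/export.js — minimal sketch (hypothetical; the real module may differ)
const fs = require('fs');
const path = require('path');

// Escape a value for CSV: wrap in quotes and double any embedded quotes.
function csvEscape(value) {
  const s = value == null ? '' : String(value);
  return `"${s.replace(/"/g, '""')}"`;
}

// Write an array of flat product objects to a CSV file.
function exportToCSV(products, filePath) {
  if (!products.length) return;
  const headers = Object.keys(products[0]);
  const rows = products.map(p => headers.map(h => csvEscape(p[h])).join(','));
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  fs.writeFileSync(filePath, [headers.join(','), ...rows].join('\n'), 'utf8');
}

module.exports = { exportToCSV };
```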
In `scrapePublic.js` you import the helpers like this:

```js
const {
  sleep,
  toHttps,
  installSearchGuards,
  dismissSearchOverlay,
  autoScroll,
  acceptCookiesIfAny,
  isRootCat,
  pageHasProductList
} = require('./helper/helpers');
```
- Node.js v18+
- Google Chrome installed (the profile path example below assumes Windows)
- Puppeteer (installed as a dependency)
```bash
# Clone the repository
git clone https://github.com/StathisP-s/public-scraper.git
cd public-scraper

# Install dependencies
npm install
```
Make sure your `package.json` includes:

```json
{
  "type": "commonjs",
  "scripts": {
    "start": "node scrapePublic.js"
  },
  "dependencies": {
    "puppeteer": "^22.0.0"
  }
}
```
- `ROOT_ALL_CATEGORIES`: the root category, or `/cat` for a full crawl (see the example block after this list)
- `MAX_DEPTH`: BFS depth (e.g. `2`)
- `USER_DATA_DIR`: your Chrome profile path on Windows, e.g. `const USER_DATA_DIR = 'C:\\Users\\<User>\\AppData\\Local\\Google\\Chrome\\User Data\\Default';`
- UA / headers: set for the Greek locale
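For illustration, the constants at the top of `scrapePublic.js` could look like this; the URL and values here are assumptions, so adjust them to your crawl target and machine.

```js
// Illustrative configuration (values are assumptions; adjust as needed).
const ROOT_ALL_CATEGORIES = 'https://www.public.gr/cat';  // or a specific subtree
const MAX_DEPTH = 2;                                      // BFS depth for subcategories
const USER_DATA_DIR =
  'C:\\Users\\<User>\\AppData\\Local\\Google\\Chrome\\User Data\\Default';
```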
```bash
npm start
# or
node scrapePublic.js
```
During execution:
- Crawls subcategories based on settings
- On each list page, clicks “See more” until all products are loaded (see the sketch after this list)
- Extracts for each card: Code, Title, Price, Availability, Specs, Image, Link
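The “See more” loop is roughly as sketched below; the button selector is an assumption, and `sleep` is the helper imported from `helper/helpers.js`.

```js
// Keep clicking “See more” until the button is no longer present.
// The selector below is an assumption; `sleep` comes from helper/helpers.js.
async function loadAllProducts(page) {
  while (true) {
    const seeMore = await page.$('button.load-more'); // hypothetical selector
    if (!seeMore) break;                              // nothing left to load
    await seeMore.click();
    await sleep(1500);                                // let the next batch render
  }
}
```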
Output:

- `data/products_all.json`
- `data/products_all.csv`

If `data/` does not exist, the script will create it automatically.
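For reference, each record in `data/products_all.json` carries the fields listed above. Here is a purely illustrative example with invented values:

```json
[
  {
    "Code": "1234567",
    "Title": "Example Laptop 15\" 16GB/512GB",
    "Price": "699,00 €",
    "Availability": "Available",
    "Specs": "Intel i5, 16GB RAM, 512GB SSD",
    "Image": "https://www.public.gr/images/example.jpg",
    "Link": "https://www.public.gr/product/example"
  }
]
```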
- `installSearchGuards(page)`: blocks search overlays and shortcut triggers before site scripts run
- `dismissSearchOverlay(page)`: manually clears overlays and modals
- `acceptCookiesIfAny(page)`: clicks the OneTrust cookie banner
- `autoScroll(page)`: scrolls to load lazy content
- `isRootCat(url)`, `toHttps(url)`: URL utilities
- `pageHasProductList(browser, url)`: detects whether a page is a product list
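As an illustration, a minimal `autoScroll` could be implemented like this; the actual code in `helper/helpers.js` may differ.

```js
// Minimal autoScroll sketch: scroll down in steps until the full page height is covered.
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise(resolve => {
      let scrolled = 0;
      const step = 400;
      const timer = setInterval(() => {
        window.scrollBy(0, step);
        scrolled += step;
        if (scrolled >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 200);
    });
  });
}
```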
```
node_modules/
data/
*.csv
*.json
```
- Cannot find module './helper/helpers'
  ➜ Ensure the file is at `helper/helpers.js` and that the import path matches.
- Empty availability for some cards
  ➜ The script scrolls each card into view before reading availability (see the sketch after this list). The selector used is:
  `card.querySelector('.availability-container strong') || card.querySelector('app-product-list-availability strong')`
- Slow “See more”
  ➜ Reduce the `sleep` delays or limit how many clicks are made per page.
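For context, the availability read roughly follows the pattern below. This is a sketch only; the `app-product-list` card selector and the per-card wait are assumptions.

```js
// Scroll each card into view, wait briefly, then read its availability text.
// 'app-product-list' as the card selector is an assumption for illustration.
const availability = await page.evaluate(async () => {
  const results = [];
  for (const card of document.querySelectorAll('app-product-list')) {
    card.scrollIntoView({ block: 'center' });
    await new Promise(r => setTimeout(r, 150)); // give lazy content a moment
    const el =
      card.querySelector('.availability-container strong') ||
      card.querySelector('app-product-list-availability strong');
    results.push(el ? el.textContent.trim() : '');
  }
  return results;
});
```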
This project is intended for educational use. Respect robots.txt, Public.gr's terms of service, and local laws regarding web scraping.
- Optional detail-page fetch for products missing availability/specs (with small concurrency)
- CLI flags (`--depth`, `--root`, `--headless`); see the sketch below for one possible approach
- Playwright implementation
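If the CLI flags get implemented, one possible wiring uses Node's built-in `util.parseArgs` (Node 18.3+); this is a sketch of a planned feature, not existing behavior.

```js
// One possible wiring for the planned flags (not implemented yet).
const { parseArgs } = require('node:util');

const { values } = parseArgs({
  options: {
    depth:    { type: 'string'  }, // e.g. --depth 2
    root:     { type: 'string'  }, // e.g. --root https://www.public.gr/cat
    headless: { type: 'boolean' }, // e.g. --headless
  },
});

const MAX_DEPTH = values.depth ? Number(values.depth) : 2;
const ROOT_ALL_CATEGORIES = values.root || 'https://www.public.gr/cat';
const HEADLESS = Boolean(values.headless);
```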