Description
Hi Everyone!
I think I found a bug with the `exclude` option in the `config.ts` file (or maybe I'm just using it wrong, haha). Here's my current setup.

My current `config.ts`:
```ts
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://cloud.google.com/chronicle/docs",
  match: "https://cloud.google.com/chronicle/docs/**",
  exclude: [
    "https://cloud.google.com/chronicle/docs/**hl=**",
    "https://cloud.google.com/chronicle/docs/soar/**",
    "https://cloud.google.com/chronicle/docs/ingestion/parser-list/*-changelog",
  ],
  selector: `.devsite-article-body`,
  maxPagesToCrawl: 50000,
  outputFileName: "ChronicleDocsAll.json",
  maxTokens: 500000,
};
```
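To sanity-check what my first `exclude` pattern should catch, I wrote a tiny glob tester. This is just my own sketch of common glob semantics (`**` crosses separators, `*` stops at `/`), not the crawler's actual matcher, so it may not reflect exactly what the crawler does internally:

```typescript
// Minimal glob-to-RegExp sketch (an assumption, NOT gpt-crawler's real matcher):
// "**" matches anything including "/", "*" matches anything except "/".
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\?]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")            // placeholder so "*" pass skips "**"
    .replace(/\*/g, "[^/]*")               // "*" does not cross "/"
    .replace(/\u0000/g, ".*");             // "**" crosses "/"
  return new RegExp(`^${escaped}$`);
}

const pattern = "https://cloud.google.com/chronicle/docs/**hl=**";
const sampleUrl =
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de";
console.log(globToRegExp(pattern).test(sampleUrl)); // true
```

Under these semantics the pattern *does* match the language-variant URLs, which makes me suspect the crawler's matcher treats `?` or `=` differently, or applies `exclude` at a different stage.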
Problem
When I run this configuration, URLs containing query parameters like `hl=`, which should be excluded, are still being crawled. Here are some examples of URLs that shouldn't appear:

```
INFO PlaywrightCrawler: Crawling: Page 10 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de...
INFO PlaywrightCrawler: Crawling: Page 11 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=es-419...
INFO PlaywrightCrawler: Crawling: Page 12 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=fr...
INFO PlaywrightCrawler: Crawling: Page 13 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=id...
```
I've tried modifying the `exclude` array with different patterns, like `"**hl\=**"`. However, some URLs still make it through (but hey, so much better! So that's a W in my book, haha):

```
INFO PlaywrightCrawler: Crawling: Page 926 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=de...
INFO PlaywrightCrawler: Crawling: Page 927 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=id...
INFO PlaywrightCrawler: Crawling: Page 928 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-gcp-loadbalancing-logs?hl=id...
```
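As a workaround idea (my own assumption, not something I know the crawler config supports out of the box), these translation pages could be filtered by parsing the query string instead of relying on glob syntax, since every unwanted URL carries an `hl` parameter:

```typescript
// Sketch: decide whether to skip a URL by parsing it rather than glob-matching.
// Illustration only — this would have to be wired into whatever link-filtering
// hook the crawler exposes; it is not part of gpt-crawler's documented API.
function isLanguageVariant(rawUrl: string): boolean {
  const url = new URL(rawUrl);
  return url.searchParams.has("hl"); // any ?hl=... translated page
}

console.log(isLanguageVariant(
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de",
)); // true — would be skipped
console.log(isLanguageVariant(
  "https://cloud.google.com/chronicle/docs/investigation/udm-search",
)); // false — canonical English page, kept
```

This sidesteps glob-escaping questions entirely, which is why I mention it.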
Environment
- Operating Systems Tested: macOS, Windows, Linux (Ubuntu and Debian, in WSL2 and on a VM)
- Crawler version: v1.5.0
Steps to Reproduce:
1. `git clone https://github.com/builderio/gpt-crawler`
2. `npm i`
3. Update `config.ts` as shown above.
4. Run the crawl and observe the URLs being crawled.
Expected Behavior
URLs matching the `exclude` patterns, especially those containing `hl=`, should not be crawled.
Actual Behavior
URLs with `hl=` are still being crawled despite being covered by the `exclude` patterns (with varying degrees of success depending on the config).
Additional Context
I've tried various `exclude` patterns, but nothing fully excludes these URLs. Has anyone encountered a similar issue, or does anyone have suggestions on how to resolve it?
Thanks in advance for any help!