exclude Pattern in config.ts Not Working as Expected #179

@ChronicleCoder

Description

Hi Everyone!

I think I found a bug with the exclude option in the config.ts file (or maybe I'm just using it wrong haha). Here's my current setup:

My Current config.ts:
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://cloud.google.com/chronicle/docs",
  match: "https://cloud.google.com/chronicle/docs/**",
  exclude: [
    "https://cloud.google.com/chronicle/docs/**hl=**",
    "https://cloud.google.com/chronicle/docs/soar/**",
    "https://cloud.google.com/chronicle/docs/ingestion/parser-list/*-changelog",
  ],
  selector: `.devsite-article-body`,
  maxPagesToCrawl: 50000,
  outputFileName: "ChronicleDocsAll.json",
  maxTokens: 500000,
};

Problem

When I run this configuration, URLs containing query parameters like hl=, which should be excluded, are still being crawled. Here's an example of some of the URLs that shouldn't appear:

INFO  PlaywrightCrawler: Crawling: Page 10 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de...
INFO  PlaywrightCrawler: Crawling: Page 11 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=es-419...
INFO  PlaywrightCrawler: Crawling: Page 12 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=fr...
INFO  PlaywrightCrawler: Crawling: Page 13 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=id...

I've tried modifying the exclude array with different patterns, like:

"**hl\=**"

However, some URLs still make it through, like the ones below (far fewer than before, though, so that's a W in my book haha):

INFO  PlaywrightCrawler: Crawling: Page 926 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=de...
INFO  PlaywrightCrawler: Crawling: Page 927 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=id...
INFO  PlaywrightCrawler: Crawling: Page 928 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-gcp-loadbalancing-logs?hl=id...
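One possible explanation (an assumption on my part, worth verifying against the crawler's source): the `PlaywrightCrawler` log lines suggest crawlee under the hood, and crawlee matches globs with minimatch-style semantics, where `**` only spans `/` when it is a whole path segment on its own. Mixed with other text, as in `**hl=**`, it degrades to a plain `*`, which stops at `/` — so the pattern can never reach a `?hl=` sitting several path segments deeper. The helper below is my own rough approximation of those semantics, not the crawler's actual code, just to illustrate the effect:

```typescript
// Rough approximation of minimatch-style glob matching (assumption: the
// crawler's exclude globs follow these rules). Key rule: "**" only spans
// "/" when it is a whole segment; mixed with text ("**hl=**") it degrades
// to "*", which stops at "/".
function globToRegExp(glob: string): RegExp {
  const GLOBSTAR = "\u0000"; // placeholder for a lone "**" segment
  const body = glob
    .split("/")
    .map((seg) =>
      seg === "**"
        ? GLOBSTAR
        : seg
            .replace(/[.+^${}()|[\]\\]/g, "\\$&") // escape regex specials
            .replace(/\*+/g, "[^/]*") // '*' (and mixed '**') stays in-segment
            .replace(/\?/g, "[^/]"), // glob '?' matches one non-slash char
    )
    .join("/")
    .replace(/\u0000\//g, "(?:[^/]*/)*") // lone '**/': zero or more segments
    .replace(/\u0000/g, ".*"); // trailing lone '**': match anything
  return new RegExp(`^${body}$`);
}

const url =
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de";

// The original pattern: '**hl=**' is one mixed segment, so it cannot
// cross the extra '/investigation/' level -> no match, URL gets crawled.
console.log(
  globToRegExp("https://cloud.google.com/chronicle/docs/**hl=**").test(url),
); // false

// A lone '**' segment followed by '*hl=*' does span directories -> match.
console.log(
  globToRegExp("https://cloud.google.com/chronicle/docs/**/*hl=*").test(url),
); // true
```

If this reading is right, an exclude entry like `"https://cloud.google.com/chronicle/docs/**/*hl=*"` might behave better than the patterns above — but again, that depends on the crawler actually using minimatch semantics.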

Environment

  • Operating systems tested: macOS, Windows, Linux (Ubuntu and Debian, in WSL2 and on a VM)
  • Crawler version: v1.5.0

Steps to Reproduce:

  1. git clone https://github.com/builderio/gpt-crawler

  2. npm i

  3. Update config.ts with the configuration above.

  4. Run the crawl and observe the URLs being crawled.

Expected Behavior

URLs matching the exclude patterns, especially those with hl=, should not be crawled.

Actual Behavior

URLs with hl= are still being crawled despite being listed in the exclude patterns (with varying degrees of success depending on which pattern is used).

Additional Context

I've tried various exclude patterns, but nothing seems to fully exclude these URLs. Has anyone encountered a similar issue or have suggestions on how to resolve this?
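In the meantime, one workaround I'm considering (a sketch of my own, not something gpt-crawler supports out of the box, as far as I can tell): instead of fighting glob edge cases, test the query string directly with the WHATWG URL API that's built into Node. The `isLocalizedDuplicate` helper below is hypothetical; it flags any URL carrying Google's `hl=` language-selector parameter so those translated duplicates can be filtered out, e.g. in a post-crawl pass or a small fork:

```typescript
// Hypothetical helper: flag URLs that carry the 'hl' query parameter,
// i.e. translated duplicates of the default-language docs page.
function isLocalizedDuplicate(rawUrl: string): boolean {
  try {
    // WHATWG URL parsing is built into Node; no globbing involved.
    return new URL(rawUrl).searchParams.has("hl");
  } catch {
    return false; // not a parseable absolute URL; let the crawler decide
  }
}

const urls = [
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de",
  "https://cloud.google.com/chronicle/docs/investigation/udm-search",
];

// Keep only the default-language URL.
console.log(urls.filter((u) => !isLocalizedDuplicate(u)));
```

This sidesteps the pattern-matching question entirely, since the language parameter is parsed rather than glob-matched, but it obviously doesn't fix the exclude option itself.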

Thanks in advance for any help!
