Description
Hi Everyone!
I think I found a bug with the `exclude` option in the `config.ts` file (or maybe I'm just using it wrong, haha). Here's my current setup.

My current `config.ts`:
```ts
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://cloud.google.com/chronicle/docs",
  match: "https://cloud.google.com/chronicle/docs/**",
  exclude: [
    "https://cloud.google.com/chronicle/docs/**hl=**",
    "https://cloud.google.com/chronicle/docs/soar/**",
    "https://cloud.google.com/chronicle/docs/ingestion/parser-list/*-changelog",
  ],
  selector: `.devsite-article-body`,
  maxPagesToCrawl: 50000,
  outputFileName: "ChronicleDocsAll.json",
  maxTokens: 500000,
};
```
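To sanity-check what my first `exclude` pattern should catch, I wrote a tiny glob tester. This is just my own sketch of common glob semantics (`**` crosses separators, `*` stops at `/`), not the crawler's actual matcher, so it may not reflect exactly what the crawler does internally:

```typescript
// Minimal glob-to-RegExp sketch (an assumption, NOT gpt-crawler's real matcher):
// "**" matches anything including "/", "*" matches anything except "/".
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\?]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")            // placeholder so "*" pass skips "**"
    .replace(/\*/g, "[^/]*")               // "*" does not cross "/"
    .replace(/\u0000/g, ".*");             // "**" crosses "/"
  return new RegExp(`^${escaped}$`);
}

const pattern = "https://cloud.google.com/chronicle/docs/**hl=**";
const sampleUrl =
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de";
console.log(globToRegExp(pattern).test(sampleUrl)); // true
```

Under these semantics the pattern *does* match the language-variant URLs, which makes me suspect the crawler's matcher treats `?` or `=` differently, or applies `exclude` at a different stage.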
Problem
When I run this configuration, URLs containing query parameters like `hl=`, which should be excluded, are still being crawled. Here are some examples of URLs that shouldn't appear:

```
INFO PlaywrightCrawler: Crawling: Page 10 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de...
INFO PlaywrightCrawler: Crawling: Page 11 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=es-419...
INFO PlaywrightCrawler: Crawling: Page 12 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=fr...
INFO PlaywrightCrawler: Crawling: Page 13 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=id...
```
I've tried modifying the `exclude` array with different patterns, like `"**hl\=**"`. However, some URLs still make it through (but hey, so much better! So that's a W in my book, haha):

```
INFO PlaywrightCrawler: Crawling: Page 926 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=de...
INFO PlaywrightCrawler: Crawling: Page 927 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=id...
INFO PlaywrightCrawler: Crawling: Page 928 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-gcp-loadbalancing-logs?hl=id...
```
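As a workaround idea (my own assumption, not something I know the crawler config supports out of the box), these translation pages could be filtered by parsing the query string instead of relying on glob syntax, since every unwanted URL carries an `hl` parameter:

```typescript
// Sketch: decide whether to skip a URL by parsing it rather than glob-matching.
// Illustration only — this would have to be wired into whatever link-filtering
// hook the crawler exposes; it is not part of gpt-crawler's documented API.
function isLanguageVariant(rawUrl: string): boolean {
  const url = new URL(rawUrl);
  return url.searchParams.has("hl"); // any ?hl=... translated page
}

console.log(isLanguageVariant(
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de",
)); // true — would be skipped
console.log(isLanguageVariant(
  "https://cloud.google.com/chronicle/docs/investigation/udm-search",
)); // false — canonical English page, kept
```

This sidesteps glob-escaping questions entirely, which is why I mention it.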
Environment
- Operating Systems Tested: macOS, Windows, Linux (Ubuntu and Debian, in WSL2 and on a VM)
- Crawler version: v1.5.0
Steps to Reproduce:
1. `git clone https://github.com/builderio/gpt-crawler`
2. `npm i`
3. Update `config.ts` as shown above.
4. Run the crawl and observe the URLs being crawled.
Expected Behavior
URLs matching the `exclude` patterns, especially those containing `hl=`, should not be crawled.
Actual Behavior
URLs with `hl=` are still being crawled despite being covered by the `exclude` patterns (with varying degrees of success depending on the config).
Additional Context
I've tried various `exclude` patterns, but nothing fully excludes these URLs. Has anyone encountered a similar issue, or does anyone have suggestions on how to resolve it?
Thanks in advance for any help!