Feat deduplication by JESREAL1JDL7LUSTRE · Pull Request #20 · hdmGOAT/veent-event-scraper

JESREAL1JDL7LUSTRE · 2026-06-18T01:59:14Z

read docs\deduplication

Summary by CodeRabbit

New Features
- Added automatic post-save deduplication for events, venues, and organizers to merge duplicate records and clean up overlaps.
- Introduced a standalone deduplication tool with --dry-run, per-entity execution, and verbose output/summary.
- Added a backend trigger for deduplication and a frontend “Deduplicate” button with typed results.
Bug Fixes
- Improved Windows compatibility for running the backend development server.
Documentation
- Added end-to-end deduplication docs (overview, protocols, script usage, and API reference).
Tests
- Added coverage for normalization, duplicate detection/merging, and dedup trigger behavior.

coderabbitai · 2026-06-18T01:59:26Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ddbbc77c-90cd-45bb-8a9a-e5273c538505

📥 Commits

Reviewing files that changed from the base of the PR and between aeded92 and f079d7b.

📒 Files selected for processing (1)

apps/backend/events/views.py

📝 Walkthrough

Walkthrough

Introduces a two-layer cross-source deduplication system for Event, Venue, and Organizer records. A new scripts/dedup.py module provides normalization helpers, two-pass duplicate finders with entity-specific guards, winner selection, and raw SQL merge functions. An inline _dedup_after_save hook is wired into the scraper base to run after persistence. A standalone deduplicate.py CLI script handles on-demand full dedup. A new API endpoint and frontend UI trigger dedup from the web interface. Tests and comprehensive documentation are included.

Changes

Cross-source deduplication system

Layer / File(s)	Summary
`dedup.py` — normalization, grouping, finders, and merge mechanics `apps/backend/scripts/dedup.py`	New standalone psycopg2 module with normalization helpers for names (punctuation/diacritic removal, lowercasing), URLs (scheme-stripped, UTM-removed, query-sorted, fragment-preserved), dates (UTC conversion), and cities (whitespace-lowercased). Adds richness-based winner selection with `created_at` tiebreak, order-preserving group union, two-pass duplicate finders for events/venues/organizers (with `place_id` conflict guards and name-word overlap guards), and merge functions that backfill winner fields from losers, remap `venue_id`/`organizer_ref_id` FKs, and hard-delete losers.
Inline post-save dedup hook wired into `BaseScraper` `apps/backend/events/scrapers/base.py`	Adds `_dedup_after_save` dispatcher and per-entity dedup functions (`_dedup_events_by_url`, `_dedup_venues_by_name_city`, `_dedup_organizers_by_website`), plus helpers for normalization, grouping, field-filling with protected-field exclusion, and group application/merging. Wires calls into `save_events`, `save_organizers`, and `save_venues` to dedup just-saved IDs, swallowing all errors as warnings.
`deduplicate.py` — standalone CLI script `apps/backend/scripts/deduplicate.py`	New CLI script that loads `DATABASE_URL` from env/`.env`, connects via psycopg2, dispatches to `dedup.py` finder/merger pairs per entity, supports `--dry-run`/`--verbose`/`--entity` CLI options, runs per-group transactions with isolated commit/rollback, and prints a Groups/Merged/Deleted summary table.
API endpoint and frontend UI for triggering dedup `apps/backend/events/urls.py`, `apps/backend/events/views.py`, `apps/frontend/src/lib/api.ts`, `apps/frontend/src/lib/types.ts`, `apps/frontend/src/routes/scrapers/+page.svelte`	Backend adds `api/scrapers/dedup/` URL route and `api_dedup_trigger` POST endpoint (CSRF-exempt) that validates `entity` parameter against an allowlist, spawns `deduplicate.py` with 120s timeout, and returns `{output, entity}` on success or JSON error for invalid input/timeout/script failure. Frontend adds `DedupResult` type, `api.deduplicate()` method, dedup-specific state holders, `handleDedup` async handler with in-flight serialization, and a "Deduplicate" button with conditional error/output rendering alongside existing "Run All" controls.
Deduplication test suite `apps/backend/events/tests.py`	Adds `NormalizationTests` (name/URL/date/city rules), `FindDuplicatesTests` (event/venue/organizer duplicate discovery per pass with winner richness ordering), `MergeTests` (field backfilling, protected-field preservation per entity, FK remapping, hard-delete loser behavior), and `DedupCommandTests` (`_dedup_after_save` hook semantics including noop for empty IDs and graceful unknown-entity handling).
Docs, protocols, and execution plan `docs/README.md`, `docs/deduplication/overview.md`, `docs/deduplication/api-reference.md`, `docs/deduplication/running-the-script.md`, `docs/deduplication/protocols.md`, `docs/deduplication/README.md`, `process/general-plans/active/deduplication_PLAN_18-06-26.md`, `apps/backend/package.json`	Adds system overview (two-layer architecture, per-entity matching passes, normalization rules, winner selection, merge semantics), API reference for `dedup.py` public interface (normalization helpers, duplicate finders, merge functions), running guide (prerequisites, dry-run/live examples, production workflow, troubleshooting), scraping protocols (identity-field requirements, per-entity scraper rules, source uniqueness), dedup README with quick-start and baseline results, and the implementation plan detailing Steps 1–4 (dedup module, CLI script, scraper hook integration, tests). Updates `docs/README.md` index and `package.json` dev script to Windows venv path (`..\\..\\\.venv\\Scripts\\python`).

Sequence Diagram(s)

sequenceDiagram
    rect rgba(100, 149, 237, 0.5)
        Note over BaseScraper,Event: Layer 1 — Inline post-save dedup
    end
    participant BaseScraper
    participant _dedup_after_save
    participant _dedup_events_by_url
    participant Event

    BaseScraper->>_dedup_after_save: save_events(...) → _dedup_after_save("events", event_ids)
    _dedup_after_save->>_dedup_events_by_url: event_ids
    _dedup_events_by_url->>Event: Query by normalized URL
    Event-->>_dedup_events_by_url: duplicate groups [[winner_id, loser_id, ...], ...]
    _dedup_events_by_url->>_dedup_events_by_url: _apply_groups(Event, groups, protected)
    _dedup_events_by_url->>Event: _fill_missing(winner, losers, protected)
    _dedup_events_by_url->>Event: winner.save()
    _dedup_events_by_url->>Event: loser_rows.delete()
    _dedup_after_save-->>BaseScraper: (errors swallowed as warnings)

sequenceDiagram
    rect rgba(60, 179, 113, 0.5)
        Note over main,Postgres: Layer 2 — On-demand standalone script
    end
    participant main as main()
    participant run_entity
    participant dedup as dedup.py
    participant Postgres

    main->>run_entity: entity="venues", dry_run=false, verbose=true
    run_entity->>dedup: find_venue_duplicates(cursor)
    dedup->>Postgres: SELECT all venues
    Postgres-->>dedup: all rows with normalized keys
    dedup-->>run_entity: [[winner_id, loser_id, ...], ...]
    loop per duplicate group
        run_entity->>dedup: merge_venues(cursor, winner_id, loser_ids)
        dedup->>Postgres: UPDATE events SET venue_id=winner WHERE venue_id IN (losers)
        dedup->>Postgres: DELETE FROM venues WHERE id IN (losers)
        Postgres-->>dedup: rows affected
        alt on error
            run_entity->>Postgres: ROLLBACK
        else success
            run_entity->>Postgres: COMMIT
        end
    end
    run_entity-->>main: {groups, merged, deleted, failures}
    main->>main: _print_summary(summaries)

sequenceDiagram
    rect rgba(218, 112, 214, 0.5)
        Note over Frontend,Postgres: API trigger for on-demand dedup
    end
    participant Frontend
    participant api_dedup_trigger as api_dedup_trigger()
    participant subprocess as deduplicate.py
    participant Postgres

    Frontend->>api_dedup_trigger: POST /scrapers/dedup/, entity="venues"
    api_dedup_trigger->>api_dedup_trigger: validate entity in allowlist
    api_dedup_trigger->>subprocess: subprocess.run(..., timeout=120s)
    subprocess->>Postgres: find_venue_duplicates(cursor)
    subprocess->>Postgres: merge_venues(...) per group, commit/rollback
    subprocess-->>api_dedup_trigger: stdout/stderr, exit code
    alt exit code = 0
        api_dedup_trigger-->>Frontend: {output: "...", entity: "venues"}
    else error
        api_dedup_trigger-->>Frontend: {error: "...", status: 400/500/504}
    end
    Frontend->>Frontend: render output or error UI

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hdmGOAT/veent-event-scraper#8: Both PRs modify the same BaseScraper post-save pipeline, with this PR adding the _dedup_after_save deduplication hook and the other adding a classification hook (_categorize_after_save) at the same lifecycle point.

Suggested reviewers

potakaaa
hdmGOAT

🐇 Hop hop, duplicates be gone!
Two sources, one winner, all night long,
URLs sorted, names trimmed tight,
Guards and groups merge them right.
Layer one hooks and layer two's might,
Dedup magic sets things right! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Feat deduplication' is vague and generic, using only the convention prefix 'Feat' followed by a broad feature name without conveying meaningful specifics about the changeset.	Consider a more descriptive title that highlights the main change, such as 'Add post-save deduplication pipeline for events, venues, and organizers' or similar to clarify what deduplication approach was implemented.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/deduplication

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (12)

docs/deduplication/protocols.md (1)
163-163: ⚡ Quick win

Add language specifier to flow diagram code block.

Line 163 contains a flow/diagram block without language specification. Adding text specifier improves linting compliance.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deduplication/protocols.md` at line 163, The code block at line 163 in
the flow diagram section is missing a language specifier on the opening triple
backticks. Add `text` as the language specifier to the code fence (change ``` to
```text) to improve markdown linting compliance and properly indicate the code
block language type.
Source: Linters/SAST tools
docs/deduplication/overview.md (1)
23-23: ⚡ Quick win

Consider adding language specifiers to fenced code blocks.

Lines 23, 41, 105, and 156 contain code/text blocks without language specifications. While not required for these diagram/example blocks, adding text or bash specifiers improves linting compliance and clarity.

Also applies to: 41-41, 105-105, 156-156
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deduplication/overview.md` at line 23, The fenced code blocks at lines
23, 41, 105, and 156 in the deduplication overview markdown file are missing
language specifiers after the opening triple backticks. Add an appropriate
language identifier (such as `text` or `bash`) immediately after the opening
triple backticks for each of these code blocks to improve linting compliance and
clarity.
Source: Linters/SAST tools
docs/deduplication/running-the-script.md (1)
31-31: ⚡ Quick win

Add language specifiers to output code blocks.

Lines 31 and 97 contain sample output blocks without language specifications. Adding text specifier improves linting compliance and clarity.

Also applies to: 97-97
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deduplication/running-the-script.md` at line 31, The sample output code
blocks at lines 31 and 97 in the markdown file are missing language specifiers.
To fix this, locate both code blocks (the opening triple backticks that precede
sample output sections) and add the text language specifier by changing the
opening backticks from ``` to ```text. This will improve markdown linting
compliance and make the output format explicit to readers.
Source: Linters/SAST tools
docs/deduplication/api-reference.md (1)
170-170: ⚡ Quick win

Add SQL language specifiers to fenced code blocks.

Lines 170, 188, and 210 contain SQL code blocks that are missing the sql language identifier. This improves readability and allows Markdown linters to validate syntax.
Proposed fix
-**FK remapping:**
-
-```sql
+**FK remapping:**
+
+```sql
 UPDATE events_event SET venue_id = winner_id WHERE venue_id = ANY(loser_ids)
Apply the same change (add `sql` language spec) to lines 188 and 210.

</details>


Also applies to: 188-188, 210-210

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/deduplication/api-reference.md at line 170, The SQL code blocks in the
markdown file are missing the language identifier in their fenced code block
declarations. Add the sql language specifier to the opening fence of the three
SQL code blocks located around lines 170, 188, and 210. Change each opening
fence from three backticks to three backticks followed by sql (e.g., sql instead of before the UPDATE statement blocks). This will improve
readability and allow Markdown linters to properly validate the SQL syntax.
</details>



_Source: Linters/SAST tools_

</blockquote></details>
<details>
<summary>process/general-plans/active/deduplication_PLAN_18-06-26.md (1)</summary><blockquote>

`203-203`: _⚡ Quick win_

**Add language specifiers to code example blocks.**

Lines 203, 212, and 240 contain code blocks without language specifications. Adding `python` (for lines 203, 240) and `text` (for line 212) improves linting compliance and syntax highlighting.






Also applies to: 212-212, 240-240

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @process/general-plans/active/deduplication_PLAN_18-06-26.md at line 203,
Locate the three code blocks in this markdown document that currently lack
language specifiers (marked with just without a language identifier) and add the appropriate language specifier to each: change to python for two of the code blocks and to ```text for one code block, as indicated in the
review comment to improve linting compliance and enable proper syntax
highlighting.
</details>



_Source: Linters/SAST tools_

</blockquote></details>
<details>
<summary>apps/backend/events/scrapers/base.py (2)</summary><blockquote>

`205-220`: _💤 Low value_

**Loop variable shadows `dataclasses.field` import.**

The `field` loop variable on line 208 shadows the `field` import from `dataclasses` on line 10. While this won't cause runtime issues (the import is only used in dataclass definitions), it's a code smell that static analysis correctly flags.




<details>
<summary>♻️ Suggested rename</summary>

```diff
-    for field in winner._meta.concrete_fields:
-        name = field.name
-        if name in skip or field.is_relation:
+    for model_field in winner._meta.concrete_fields:
+        name = model_field.name
+        if name in skip or model_field.is_relation:
             continue
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/scrapers/base.py` around lines 205 - 220, The loop
variable `field` in the `_fill_missing` function shadows the `field` import from
the `dataclasses` module. Rename the loop variable from `field` to something
else (such as `model_field`) in the for loop that iterates over
`winner._meta.concrete_fields`, and update all references to this variable
within the loop body where `field.name` and `field.is_relation` are accessed.
```

</details>



_Source: Linters/SAST tools_

---

`321-338`: _⚡ Quick win_

**Consider wrapping merge operations in a transaction for atomicity.**

`_apply_groups` performs multiple database operations per group (save winner, update FKs, delete losers). If a failure occurs mid-sequence, the database could be left in an inconsistent state—for example, winner enriched but losers still present as orphans.

The standalone script uses per-group transactions with rollback on failure. While the inline dedup is best-effort, wrapping in `transaction.atomic()` would ensure all-or-nothing semantics per group.



<details>
<summary>♻️ Suggested transaction wrapping</summary>

```diff
+from django.db import transaction
+
 def _apply_groups(model, groups, protected, venue_fk=False, organizer_fk=False):
     """Merge each group's losers into the winner and hard-delete the losers."""
     for group in groups:
         winner_id, loser_ids = group[0], group[1:]
         if not loser_ids:
             continue
-        winner = model.objects.get(pk=winner_id)
-        losers = list(model.objects.filter(pk__in=loser_ids))
-        _fill_missing(winner, losers, protected)
-        winner.save()
-        if venue_fk:
-            Event.objects.filter(venue_id__in=loser_ids).update(venue_id=winner_id)
-        if organizer_fk:
-            Event.objects.filter(organizer_ref_id__in=loser_ids).update(
-                organizer_ref_id=winner_id
-            )
-        model.objects.filter(pk__in=loser_ids).delete()
+        with transaction.atomic():
+            winner = model.objects.get(pk=winner_id)
+            losers = list(model.objects.filter(pk__in=loser_ids))
+            _fill_missing(winner, losers, protected)
+            winner.save()
+            if venue_fk:
+                Event.objects.filter(venue_id__in=loser_ids).update(venue_id=winner_id)
+            if organizer_fk:
+                Event.objects.filter(organizer_ref_id__in=loser_ids).update(
+                    organizer_ref_id=winner_id
+                )
+            model.objects.filter(pk__in=loser_ids).delete()
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/scrapers/base.py` around lines 321 - 338, The
_apply_groups function performs multiple database operations per group
(winner.save(), Event.objects.filter().update() calls for venue and organizer,
and model.objects.filter().delete()) without transactional protection, which
could leave the database in an inconsistent state if any operation fails
mid-sequence. Wrap all the database operations for each group iteration
(starting from winner.save() through the final model.objects.filter().delete()
call) inside a transaction.atomic() context manager to ensure all-or-nothing
semantics per group. Import transaction from django.db if not already imported.
```

</details>



</blockquote></details>
<details>
<summary>apps/backend/scripts/dedup.py (1)</summary><blockquote>

`141-155`: _💤 Low value_

**Static analysis SQL injection warnings are false positives here.**

The f-string SQL construction is flagged by static analysis, but in this context:
- `table` is always a hardcoded internal table name (`"events_event"`, `"events_venue"`, `"events_organizer"`) passed by callers
- Column names in `set_clause` derive from database schema via `cursor.fetchone()`, not user input

The pattern is safe for this internal-only module, though adding a comment noting the constraint would help future maintainers.




<details>
<summary>♻️ Optional: Minor style improvement for list concatenation (line 370)</summary>

```diff
-        params = list(updates.values()) + [winner_id]
+        params = [*updates.values(), winner_id]
```
</details>


Also applies to: 346-373

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/scripts/dedup.py` around lines 141 - 155, Add a clarifying
comment in the _select_winner function before the SQL query execution to
document that the f-string construction is intentional and safe because the
table parameter is restricted to hardcoded internal table names (not user
input), and the values passed to the query are properly parameterized. This will
help suppress static analysis warnings and inform future maintainers of the
design constraint.
```

</details>



_Source: Linters/SAST tools_

</blockquote></details>
<details>
<summary>apps/backend/events/tests.py (2)</summary><blockquote>

`1240-1242`: _💤 Low value_

**Consider using a direct import instead of `__import__`.**

The inline `__import__("datetime")` works but is unconventional. A direct import would be clearer.




<details>
<summary>♻️ Suggested change</summary>

```diff
+from datetime import timezone as dt_tz
+
 class NormalizationTests(TestCase):
     # ...
     def test_normalize_date_datetime(self):
         dt = timezone.now()
-        self.assertEqual(dedup_utils.normalize_date(dt), dt.astimezone(__import__("datetime").timezone.utc).date())
+        self.assertEqual(dedup_utils.normalize_date(dt), dt.astimezone(dt_tz.utc).date())
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/tests.py` around lines 1240 - 1242, In the
test_normalize_date_datetime method, replace the inline __import__("datetime")
call with a direct import of the datetime module at the top of the test file.
Change the assertion to use the directly imported datetime.timezone.utc instead
of accessing it through __import__. This will improve code readability and
follow Python conventions.
```

</details>



---

`1482-1486`: _💤 Low value_

**Missing blank lines before the next class.**

PEP 8 recommends two blank lines between top-level class definitions. The new `DedupCommandTests` class ends without blank lines before `OrganizerExportTests`.




<details>
<summary>♻️ Add blank lines after test class</summary>

```diff
     def test_dedup_after_save_never_raises(self):
         from events.scrapers.base import _dedup_after_save

         # Unknown entity is silently ignored (no dispatch), never raises.
         _dedup_after_save("nonexistent", [1, 2, 3])
+
+
 class OrganizerExportTests(TestCase):
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/tests.py` around lines 1482 - 1486, The test method
test_dedup_after_save_never_raises is part of the DedupCommandTests class which
is missing required blank lines before the next class definition. Add two blank
lines after the DedupCommandTests class definition ends (after its last method)
and before the OrganizerExportTests class begins to comply with PEP 8 style
guidelines for top-level class definitions.
```

</details>



</blockquote></details>
<details>
<summary>apps/backend/scripts/deduplicate.py (2)</summary><blockquote>

`125-147`: _💤 Low value_

**`autocommit` is set to `False` but never restored.**

After processing, `conn.autocommit` remains `False`. While this doesn't cause issues in the current code path (connection is closed after all entities are processed), it could lead to unexpected behavior if `run_entity` is used in other contexts or if the connection is reused.

Consider resetting `autocommit` in a `finally` block or documenting that the caller is responsible for connection state.




<details>
<summary>♻️ Optional: Reset autocommit after processing</summary>

```diff
     # Each group is committed independently so one failure cannot corrupt or
     # roll back the others.
     conn.autocommit = False
-    for group in groups:
-        winner, losers = group[0], group[1:]
-        if not losers:
-            continue
-        try:
-            with conn.cursor() as cur:
-                merge_fn(cur, winner, losers)
-            conn.commit()
-            summary["merged"] += 1
-            summary["deleted"] += len(losers)
-            if verbose:
-                print(f"  [{label}] merged losers {losers} -> winner {winner}")
-        except Exception as exc:  # noqa: BLE001 — isolate per-group failures
-            conn.rollback()
-            summary["errors"] += 1
-            print(
-                f"  [{label}] ERROR merging {group}: {exc}", file=sys.stderr
-            )
-
-    return summary
+    try:
+        for group in groups:
+            winner, losers = group[0], group[1:]
+            if not losers:
+                continue
+            try:
+                with conn.cursor() as cur:
+                    merge_fn(cur, winner, losers)
+                conn.commit()
+                summary["merged"] += 1
+                summary["deleted"] += len(losers)
+                if verbose:
+                    print(f"  [{label}] merged losers {losers} -> winner {winner}")
+            except Exception as exc:  # noqa: BLE001 — isolate per-group failures
+                conn.rollback()
+                summary["errors"] += 1
+                print(
+                    f"  [{label}] ERROR merging {group}: {exc}", file=sys.stderr
+                )
+    finally:
+        conn.autocommit = True
+
+    return summary
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/scripts/deduplicate.py` around lines 125 - 147, The autocommit
property of the database connection is set to False at the beginning of the loop
but is never restored to its original state. Save the original autocommit value
before the for loop that iterates through groups, then restore it after
processing is complete either by adding a finally block after the try/except
that resets conn.autocommit to the saved original value, or by restoring it at
the end of the function before returning the summary. This ensures that if the
connection is reused elsewhere, it maintains the expected autocommit state.
```

</details>



---

`50-63`: _💤 Low value_

**Custom .env parser has edge cases with quoted values containing equals signs.**

The `line.partition("=")` approach splits on the first `=`, which is correct. However, the double `.strip('"').strip("'")` only removes quotes at the very edges after stripping whitespace — a value like `"foo=bar"` would become `foo=bar` correctly, but `'foo\'s value'` or escaped quotes won't be handled. This is acceptable for typical DATABASE_URL values, but consider using `python-dotenv` if more complex .env files are expected in the future.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/scripts/deduplicate.py` around lines 50 - 63, The custom
quote-stripping logic in the `_load_dotenv` function using `line.partition("=")`
and chained `.strip('"').strip("'")` calls handles simple cases like
DATABASE_URL values but doesn't properly handle escaped quotes or complex quote
scenarios. If the application needs to support more complex .env files in the
future, replace the custom parsing logic in the `_load_dotenv` function with the
`python-dotenv` library instead, which provides robust handling of edge cases
like escaped quotes, quoted values with special characters, and other .env file
complexities.
```

</details>



</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @apps/backend/events/scrapers/base.py:

Around line 165-188: The _dedup_normalize_url function is dropping URL
fragments when reconstructing the normalized URL in its return statement, but
the standalone dedup script in scripts/dedup.py preserves fragments. This
causes inconsistency where distinct myruntime events with different fragments
(e.g., #/gala-night vs #/concert-series) incorrectly merge into the same
event. Extract the fragment from the parsed URL object (available from the
urlparse result) and include it in the urlunparse call at the return
statement instead of passing an empty string for the fragment parameter,
ensuring fragments are preserved to match the standalone dedup behavior.

In @apps/backend/package.json:

Line 5: The dev script in package.json uses Windows-specific backslash path
syntax and the Windows-specific Scripts/ directory structure, which prevents
Unix/macOS developers from running the development server. Replace the Windows
path ..\\..\\.venv\\Scripts\\python manage.py runserver with a cross-platform
compatible path using forward slashes (e.g., ../../.venv/bin/python manage.py runserver or .venv/bin/python manage.py runserver), or alternatively use a
package like cross-env or create separate platform-specific script entries
(dev:win and dev:unix) to handle both Windows and Unix environments. Ensure
the updated path matches the actual project structure and venv location
described in the README setup instructions.

Nitpick comments:
In @apps/backend/events/scrapers/base.py:

Around line 205-220: The loop variable field in the _fill_missing function
shadows the field import from the dataclasses module. Rename the loop
variable from field to something else (such as model_field) in the for loop
that iterates over winner._meta.concrete_fields, and update all references to
this variable within the loop body where field.name and field.is_relation
are accessed.

Around line 321-338: The _apply_groups function performs multiple database
operations per group (winner.save(), Event.objects.filter().update() calls for
venue and organizer, and model.objects.filter().delete()) without transactional
protection, which could leave the database in an inconsistent state if any
operation fails mid-sequence. Wrap all the database operations for each group
iteration (starting from winner.save() through the final
model.objects.filter().delete() call) inside a transaction.atomic() context
manager to ensure all-or-nothing semantics per group. Import transaction from
django.db if not already imported.

In @apps/backend/events/tests.py:

Around line 1240-1242: In the test_normalize_date_datetime method, replace the
inline import("datetime") call with a direct import of the datetime module
at the top of the test file. Change the assertion to use the directly imported
datetime.timezone.utc instead of accessing it through import. This will
improve code readability and follow Python conventions.

Around line 1482-1486: The test method test_dedup_after_save_never_raises is
part of the DedupCommandTests class which is missing required blank lines before
the next class definition. Add two blank lines after the DedupCommandTests class
definition ends (after its last method) and before the OrganizerExportTests
class begins to comply with PEP 8 style guidelines for top-level class
definitions.

In @apps/backend/scripts/dedup.py:

Around line 141-155: Add a clarifying comment in the _select_winner function
before the SQL query execution to document that the f-string construction is
intentional and safe because the table parameter is restricted to hardcoded
internal table names (not user input), and the values passed to the query are
properly parameterized. This will help suppress static analysis warnings and
inform future maintainers of the design constraint.

In @apps/backend/scripts/deduplicate.py:

Around line 125-147: The autocommit property of the database connection is set
to False at the beginning of the loop but is never restored to its original
state. Save the original autocommit value before the for loop that iterates
through groups, then restore it after processing is complete either by adding a
finally block after the try/except that resets conn.autocommit to the saved
original value, or by restoring it at the end of the function before returning
the summary. This ensures that if the connection is reused elsewhere, it
maintains the expected autocommit state.

Around line 50-63: The custom quote-stripping logic in the _load_dotenv
function using line.partition("=") and chained .strip('"').strip("'") calls
handles simple cases like DATABASE_URL values but doesn't properly handle
escaped quotes or complex quote scenarios. If the application needs to support
more complex .env files in the future, replace the custom parsing logic in the
_load_dotenv function with the python-dotenv library instead, which provides
robust handling of edge cases like escaped quotes, quoted values with special
characters, and other .env file complexities.

In @docs/deduplication/api-reference.md:

Line 170: The SQL code blocks in the markdown file are missing the language
identifier in their fenced code block declarations. Add the sql language
specifier to the opening fence of the three SQL code blocks located around lines
170, 188, and 210. Change each opening fence from three backticks to three
backticks followed by sql (e.g., sql instead of before the UPDATE
statement blocks). This will improve readability and allow Markdown linters to
properly validate the SQL syntax.

In @docs/deduplication/overview.md:

Line 23: The fenced code blocks at lines 23, 41, 105, and 156 in the
deduplication overview markdown file are missing language specifiers after the
opening triple backticks. Add an appropriate language identifier (such as text
or bash) immediately after the opening triple backticks for each of these code
blocks to improve linting compliance and clarity.

In @docs/deduplication/protocols.md:

Line 163: The code block at line 163 in the flow diagram section is missing a
language specifier on the opening triple backticks. Add text as the language
specifier to the code fence (change totext) to improve markdown linting
compliance and properly indicate the code block language type.

In @docs/deduplication/running-the-script.md:

Line 31: The sample output code blocks at lines 31 and 97 in the markdown file
are missing language specifiers. To fix this, locate both code blocks (the
opening triple backticks that precede sample output sections) and add the text
language specifier by changing the opening backticks from totext. This
will improve markdown linting compliance and make the output format explicit to
readers.

In @process/general-plans/active/deduplication_PLAN_18-06-26.md:

Line 203: Locate the three code blocks in this markdown document that
currently lack language specifiers (marked with just without a language identifier) and add the appropriate language specifier to each: change to
python for two of the code blocks and to ```text for one code block, as
indicated in the review comment to improve linting compliance and enable proper
syntax highlighting.
</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ]  Push a commit to this branch (recommended)
- [ ]  Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Repository UI

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `871cdfff-e30a-4476-900a-50a24437b20f`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 00e70d0305b4c8b220eeaf62c48dfe0ed68ed9b7 and c428a2026ca30589506a3ab36d20940fe6e857dd.

</details>

<details>
<summary>📒 Files selected for processing (12)</summary>

* `apps/backend/events/scrapers/base.py`
* `apps/backend/events/tests.py`
* `apps/backend/package.json`
* `apps/backend/scripts/dedup.py`
* `apps/backend/scripts/deduplicate.py`
* `docs/README.md`
* `docs/deduplication/README.md`
* `docs/deduplication/api-reference.md`
* `docs/deduplication/overview.md`
* `docs/deduplication/protocols.md`
* `docs/deduplication/running-the-script.md`
* `process/general-plans/active/deduplication_PLAN_18-06-26.md`

</details>

</details>

coderabbitai · 2026-06-18T02:07:16Z

+def _dedup_normalize_url(url: str | None) -> str:
+    """Scheme-less, UTM-stripped, query-sorted, slash-trimmed URL key.
+
+    Unlike ``_normalize_url`` (which preserves the scheme for organizer
+    resolution), the dedup path drops the scheme so ``http://`` and ``https://``
+    collapse together — matching scripts/dedup.normalize_url.
+    """
+    from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse
+
+    if not url:
+        return ""
+    parsed = urlparse(str(url).strip().lower())
+    netloc, path = parsed.netloc, parsed.path
+    if not netloc and path:
+        netloc, _, rest = path.partition("/")
+        path = "/" + rest if rest else ""
+    path = path.rstrip("/")
+    pairs = [
+        (k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True)
+        if not k.startswith("utm_")
+    ]
+    pairs.sort()
+    return urlunparse(("", netloc, path, "", urlencode(pairs), ""))
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

URL fragment dropped—inconsistent with standalone dedup and may cause incorrect merges.

The inline _dedup_normalize_url drops URL fragments (line 187 returns empty fragment), but scripts/dedup.py:normalize_url preserves them. Per the comment in dedup.py lines 68-69, myruntime uses #/event-slug to differentiate events on the same base URL.

With the current implementation, two distinct myruntime events like:

http://site.com/events#/gala-night

http://site.com/events#/concert-series

Would both normalize to //site.com/events and be incorrectly merged by the inline dedup.

🐛 Proposed fix: preserve fragment to match standalone behavior

- return urlunparse(("", netloc, path, "", urlencode(pairs), "")) + return urlunparse(("", netloc, path, "", urlencode(pairs), parsed.fragment))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _dedup_normalize_url(url: str | None) -> str:

"""Scheme-less, UTM-stripped, query-sorted, slash-trimmed URL key.

Unlike ``_normalize_url`` (which preserves the scheme for organizer

resolution), the dedup path drops the scheme so ``http://`` and ``https://``

collapse together — matching scripts/dedup.normalize_url.

"""

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

if not url:

return ""

parsed = urlparse(str(url).strip().lower())

netloc, path = parsed.netloc, parsed.path

if not netloc and path:

netloc, _, rest = path.partition("/")

path = "/" + rest if rest else ""

path = path.rstrip("/")

pairs = [

(k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True)

if not k.startswith("utm_")

]

pairs.sort()

return urlunparse(("", netloc, path, "", urlencode(pairs), ""))

def _dedup_normalize_url(url: str | None) -> str:

"""Scheme-less, UTM-stripped, query-sorted, slash-trimmed URL key.

Unlike ``_normalize_url`` (which preserves the scheme for organizer

resolution), the dedup path drops the scheme so ``http://`` and ``https://``

collapse together — matching scripts/dedup.normalize_url.

"""

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

if not url:

return ""

parsed = urlparse(str(url).strip().lower())

netloc, path = parsed.netloc, parsed.path

if not netloc and path:

netloc, _, rest = path.partition("/")

path = "/" + rest if rest else ""

path = path.rstrip("/")

pairs = [

(k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True)

if not k.startswith("utm_")

]

pairs.sort()

return urlunparse(("", netloc, path, "", urlencode(pairs), parsed.fragment))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/scrapers/base.py` around lines 165 - 188, The `_dedup_normalize_url` function is dropping URL fragments when reconstructing the normalized URL in its return statement, but the standalone dedup script in `scripts/dedup.py` preserves fragments. This causes inconsistency where distinct myruntime events with different fragments (e.g., `#/gala-night` vs `#/concert-series`) incorrectly merge into the same event. Extract the fragment from the parsed URL object (available from the `urlparse` result) and include it in the `urlunparse` call at the return statement instead of passing an empty string for the fragment parameter, ensuring fragments are preserved to match the standalone dedup behavior.

coderabbitai · 2026-06-18T02:07:16Z

  "private": true,
  "scripts": {
-    "dev": "venv/bin/python manage.py runserver",
+    "dev": "..\\..\\.venv\\Scripts\\python manage.py runserver",


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Get repository structure overview cd /tmp && find . -maxdepth 3 -type d -name ".venv" -o -type d -name "venv" 2>/dev/null | head -20

Repository: hdmGOAT/veent-event-scraper

Length of output: 53

🏁 Script executed:

# Check the full package.json file at apps/backend cat -n apps/backend/package.json

Repository: hdmGOAT/veent-event-scraper

Length of output: 337

🏁 Script executed:

# Look for any venv or Python environment setup git ls-files | grep -E "(venv|\.venv|requirements\.txt|pyproject\.toml|Pipfile)" | head -20

Repository: hdmGOAT/veent-event-scraper

Length of output: 101

🏁 Script executed:

# Check for any platform-specific handling in scripts or config rg "cross-env|process\.platform|os\.(win32|unix)" --type json --type js --type ts

Repository: hdmGOAT/veent-event-scraper

Length of output: 53

🏁 Script executed:

# Check repository root structure to understand venv location relative to apps/backend ls -la | head -20 ls -la apps/ 2>/dev/null | head -20

Repository: hdmGOAT/veent-event-scraper

Length of output: 1445

🏁 Script executed:

# Check if .venv exists at project root or elsewhere ls -la | grep -i venv

Repository: hdmGOAT/veent-event-scraper

Length of output: 53

🏁 Script executed:

# Check README for platform requirements and setup instructions cat README.md

Repository: hdmGOAT/veent-event-scraper

Length of output: 1934

🏁 Script executed:

# Check git history to see what the original dev script was git log --oneline -10 apps/backend/package.json git show HEAD:apps/backend/package.json 2>/dev/null || git show HEAD~1:apps/backend/package.json 2>/dev/null

Repository: hdmGOAT/veent-event-scraper

Length of output: 301

Dev script uses Windows-only path syntax; breaks cross-platform development.

The dev script at line 5 uses Windows-specific backslashes and Scripts/ subdirectory (..\\..\\.venv\\Scripts\\python manage.py runserver), which will fail on Unix/macOS systems. This conflicts with the README setup instructions, which use Unix conventions (source venv/bin/activate with forward slashes).

Fix by:

Using forward slashes that work on all platforms: .venv/bin/python manage.py runserver or ../../.venv/bin/python manage.py runserver

Or adopting a cross-platform approach with cross-env or platform-specific scripts (dev:win, dev:unix)

Ensure the venv path matches the actual project structure and documented setup process.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/package.json` at line 5, The `dev` script in package.json uses Windows-specific backslash path syntax and the Windows-specific `Scripts/` directory structure, which prevents Unix/macOS developers from running the development server. Replace the Windows path `..\\..\\.venv\\Scripts\\python manage.py runserver` with a cross-platform compatible path using forward slashes (e.g., `../../.venv/bin/python manage.py runserver` or `.venv/bin/python manage.py runserver`), or alternatively use a package like `cross-env` or create separate platform-specific script entries (`dev:win` and `dev:unix`) to handle both Windows and Unix environments. Ensure the updated path matches the actual project structure and venv location described in the README setup instructions.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

apps/frontend/src/lib/types.ts (1)

152-155: ⚡ Quick win

Narrow DedupResult.entity to known literals.

Line 154 should mirror backend constraints to keep client-side state type-safe.

Suggested type refinement

 export interface DedupResult {
 	output: string;
-	entity: string;
+	entity: 'events' | 'venues' | 'organizers' | 'all';
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/frontend/src/lib/types.ts` around lines 152 - 155, The `entity` property
in the `DedupResult` interface is currently typed as a generic string, but it
should be narrowed to a union of specific literal string values to match backend
constraints and ensure type safety. Update the `entity` field in the
`DedupResult` interface to use literal union types (like "entityType1" |
"entityType2") instead of the broad string type, making sure the allowed values
correspond to what the backend supports.

apps/backend/events/views.py (1)

629-635: Move dedup execution off the request thread.

Running a potentially long subprocess inline (up to 120s) ties up web workers. Queueing this as a background job (and returning a job id) will improve resilience under concurrent admin actions.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/views.py` around lines 629 - 635, The subprocess.run call
that executes deduplicate.py is running synchronously on the request thread and
blocking for up to 120 seconds, which can starve web workers. Instead of
executing the subprocess inline, queue this operation as a background job (using
your application's job queue system like Celery) and return a job identifier to
the client immediately. Move the subprocess.run call with its parameters (the
scripts_dir path, deduplicate.py script, entity argument, and 120-second
timeout) into a separate background task, and have the view method return the
job id or status response without waiting for completion.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/backend/events/views.py`:
- Around line 615-646: The api_dedup_trigger endpoint is missing authentication
protection and exposes internal error details. First, add an authentication
check at the beginning of the function that validates the X-Scraper-Key header
against _WEBHOOK_SECRET, similar to how scraper_webhook and
ingest_events_webhook protect their endpoints. Second, replace the two error
returns that currently expose internal details (line 640 returning stderr and
line 645 returning str(exc)) with generic error messages like "Dedup operation
failed" or "An error occurred". For both cases, capture the actual error details
and log them server-side using proper logging instead of exposing them in the
JsonResponse.
- Around line 629-636: The subprocess.run call for deduplicate.py in the
api_dedup_trigger view lacks synchronization, allowing concurrent requests to
execute overlapping dedup operations that can corrupt data integrity. Wrap the
entire subprocess.run block (including the call with timeout=120 and cwd
parameters) with a process-level lock mechanism using either a file-based lock
or a threading.Lock/multiprocessing.Lock to ensure only one dedup execution runs
at a time. Acquire the lock before spawning the subprocess and ensure it is
released after the process completes, handling both success and error cases
appropriately.

---

Nitpick comments:
In `@apps/backend/events/views.py`:
- Around line 629-635: The subprocess.run call that executes deduplicate.py is
running synchronously on the request thread and blocking for up to 120 seconds,
which can starve web workers. Instead of executing the subprocess inline, queue
this operation as a background job (using your application's job queue system
like Celery) and return a job identifier to the client immediately. Move the
subprocess.run call with its parameters (the scripts_dir path, deduplicate.py
script, entity argument, and 120-second timeout) into a separate background
task, and have the view method return the job id or status response without
waiting for completion.

In `@apps/frontend/src/lib/types.ts`:
- Around line 152-155: The `entity` property in the `DedupResult` interface is
currently typed as a generic string, but it should be narrowed to a union of
specific literal string values to match backend constraints and ensure type
safety. Update the `entity` field in the `DedupResult` interface to use literal
union types (like "entityType1" | "entityType2") instead of the broad string
type, making sure the allowed values correspond to what the backend supports.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5a2ec474-d055-4f17-97d1-4f3c8d967ef6

📥 Commits

Reviewing files that changed from the base of the PR and between c428a20 and 1c27a10.

📒 Files selected for processing (5)

apps/backend/events/urls.py
apps/backend/events/views.py
apps/frontend/src/lib/api.ts
apps/frontend/src/lib/types.ts
apps/frontend/src/routes/scrapers/+page.svelte

…row DedupResult entity type Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

apps/backend/events/views.py (1)

670-695: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Apply consistent auth and error handling to api_script_trigger.

This endpoint lacks the auth check and error sanitization that were just added to api_dedup_trigger. Line 695 returns str(exc) which can leak internal paths or stack details. For consistency:

Suggested fix

 `@csrf_exempt`
 `@require_POST`
 def api_script_trigger(request, script_name: str):
+    key = request.headers.get("X-Scraper-Key", "")
+    if not _WEBHOOK_SECRET or key != _WEBHOOK_SECRET:
+        return JsonResponse({"error": "unauthorized"}, status=401)
+
     ...
     try:
         process = subprocess.Popen(...)
         return JsonResponse({"started": True, "script": script_name, "pid": process.pid})
-    except Exception as exc:  # noqa: BLE001
-        return JsonResponse({"error": str(exc)}, status=500)
+    except Exception:  # noqa: BLE001
+        logger.exception("script trigger error (script=%s)", script_name)
+        return JsonResponse({"error": "Failed to start script"}, status=500)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/views.py` around lines 670 - 695, The api_script_trigger
function is missing the authentication check and error sanitization that were
applied to api_dedup_trigger. Add the same auth check decorator or guard that
protects api_dedup_trigger at the start of api_script_trigger, and replace the
raw str(exc) error response in the except block with a sanitized error message
that does not leak internal paths or stack details, consistent with the error
handling approach used in api_dedup_trigger.

🧹 Nitpick comments (1)

apps/backend/events/views.py (1)
637-659: ⚖️ Poor tradeoff

Single-flight lock is process-local; consider multi-worker implications.

The threading.Lock correctly prevents concurrent dedup within a single process. However, production deployments with multiple workers (gunicorn, uWSGI) each have their own lock instance, so concurrent requests to different workers could still overlap.

If overlapping dedup runs are dangerous (they likely are given the merge/delete operations), consider:

A database-level advisory lock (SELECT pg_advisory_lock(...))

A file-based lock (fcntl.flock)

A Redis-based distributed lock

That said, the 120s timeout + typical request patterns may make collisions rare in practice.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/views.py` around lines 637 - 659, The `_DEDUP_LOCK` used
in the dedup trigger view is a process-local threading.Lock, which only prevents
concurrent execution within a single worker process. In multi-worker deployments
(gunicorn, uWSGI), each worker has its own lock instance, allowing concurrent
dedup runs across different workers which could cause data corruption. Replace
the threading.Lock mechanism with a distributed locking solution such as a
database-level advisory lock using PostgreSQL's pg_advisory_lock, a file-based
lock using fcntl.flock, or a Redis-based distributed lock, depending on your
infrastructure constraints and preferences.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/backend/events/views.py`:
- Around line 626-628: The authentication check at the beginning of this view
requires the X-Scraper-Key header which the frontend's generic post() helper
cannot provide. To fix this, you need to implement a different authentication
mechanism for frontend calls while preserving the webhook authentication for
n8n. Create a separate endpoint (or modify this one to handle both cases) that
uses Django's `@staff_member_required` decorator for the frontend UI calls, while
keeping the X-Scraper-Key header validation for n8n webhook calls. You can
determine which authentication path to use by checking the request origin (e.g.,
if the header is present use webhook auth, otherwise use session/staff auth).
This allows the frontend to authenticate via session while keeping webhook calls
secure with the secret header.

---

Outside diff comments:
In `@apps/backend/events/views.py`:
- Around line 670-695: The api_script_trigger function is missing the
authentication check and error sanitization that were applied to
api_dedup_trigger. Add the same auth check decorator or guard that protects
api_dedup_trigger at the start of api_script_trigger, and replace the raw
str(exc) error response in the except block with a sanitized error message that
does not leak internal paths or stack details, consistent with the error
handling approach used in api_dedup_trigger.

---

Nitpick comments:
In `@apps/backend/events/views.py`:
- Around line 637-659: The `_DEDUP_LOCK` used in the dedup trigger view is a
process-local threading.Lock, which only prevents concurrent execution within a
single worker process. In multi-worker deployments (gunicorn, uWSGI), each
worker has its own lock instance, allowing concurrent dedup runs across
different workers which could cause data corruption. Replace the threading.Lock
mechanism with a distributed locking solution such as a database-level advisory
lock using PostgreSQL's pg_advisory_lock, a file-based lock using fcntl.flock,
or a Redis-based distributed lock, depending on your infrastructure constraints
and preferences.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: edb1a7e5-ad42-44f6-8c2f-c1b5208fd78a

📥 Commits

Reviewing files that changed from the base of the PR and between 73fca24 and aeded92.

📒 Files selected for processing (2)

apps/backend/events/views.py
apps/frontend/src/lib/types.ts

🚧 Files skipped from review as they are similar to previous changes (1)

apps/frontend/src/lib/types.ts

JESREAL1JDL7LUSTRE added 3 commits June 18, 2026 09:46

scripts and docs

b34d301

fix on scripts and docs

3ae5ad8

fix merge conflict

c428a20

JESREAL1JDL7LUSTRE requested review from hdmGOAT and potakaaa and removed request for potakaaa June 18, 2026 01:59

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

added ui button for deduplication

1c27a10

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread apps/backend/events/views.py

Comment thread apps/backend/events/views.py

JESREAL1JDL7LUSTRE and others added 2 commits June 18, 2026 10:28

added button for scripts in the scraper center

73fca24

fix(dedup): add auth, concurrency lock, sanitize error responses; nar…

aeded92

…row DedupResult entity type Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread apps/backend/events/views.py Outdated

fix

f079d7b

hdmGOAT merged commit 0031782 into development Jun 18, 2026
3 checks passed

This was referenced Jun 18, 2026

Feat/meetup eventalways #23

Merged

Update main with development #24

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat deduplication#20

Feat deduplication#20
hdmGOAT merged 7 commits into
developmentfrom
feat/deduplication

JESREAL1JDL7LUSTRE commented Jun 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Uh oh!

coderabbitai Bot Jun 18, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JESREAL1JDL7LUSTRE commented Jun 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JESREAL1JDL7LUSTRE commented Jun 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading