Feat deduplication #20

Merged
hdmGOAT merged 7 commits into development from feat/deduplication
Jun 18, 2026

Conversation

@JESREAL1JDL7LUSTRE

@JESREAL1JDL7LUSTRE JESREAL1JDL7LUSTRE commented Jun 18, 2026

Collaborator

read docs\deduplication

Summary by CodeRabbit

  • New Features
    • Added automatic post-save deduplication for events, venues, and organizers to merge duplicate records and clean up overlaps.
    • Introduced a standalone deduplication tool with --dry-run, per-entity execution, and verbose output/summary.
    • Added a backend trigger for deduplication and a frontend “Deduplicate” button with typed results.
  • Bug Fixes
    • Improved Windows compatibility for running the backend development server.
  • Documentation
    • Added end-to-end deduplication docs (overview, protocols, script usage, and API reference).
  • Tests
    • Added coverage for normalization, duplicate detection/merging, and dedup trigger behavior.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ddbbc77c-90cd-45bb-8a9a-e5273c538505

📥 Commits

Reviewing files that changed from the base of the PR and between aeded92 and f079d7b.

📒 Files selected for processing (1)
  • apps/backend/events/views.py

📝 Walkthrough

Introduces a two-layer cross-source deduplication system for Event, Venue, and Organizer records. A new scripts/dedup.py module provides normalization helpers, two-pass duplicate finders with entity-specific guards, winner selection, and raw SQL merge functions. An inline _dedup_after_save hook is wired into the scraper base to run after persistence. A standalone deduplicate.py CLI script handles on-demand full dedup. A new API endpoint and frontend UI trigger dedup from the web interface. Tests and comprehensive documentation are included.
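
As a rough illustration of the winner selection and group union described above, here is a minimal sketch. It does not reproduce the code in scripts/dedup.py: the function names `richness`, `select_winner`, and `union_groups` are hypothetical, rows are plain dicts rather than database rows, and the tiebreak direction (oldest `created_at` wins) is an assumption, since the walkthrough only says a `created_at` tiebreak exists.

```python
from datetime import datetime


def richness(row: dict, ignore: frozenset = frozenset({"id", "created_at"})) -> int:
    """Count the populated (non-empty) fields of a row."""
    return sum(
        1
        for key, value in row.items()
        if key not in ignore and value not in (None, "", [], {})
    )


def select_winner(rows: list[dict]) -> list[int]:
    """Return row IDs ordered winner-first.

    The richest row wins; ties go to the oldest created_at
    (the tiebreak direction is an assumption of this sketch).
    """
    ordered = sorted(rows, key=lambda r: (-richness(r), r["created_at"]))
    return [r["id"] for r in ordered]


def union_groups(*passes: list[list[int]]) -> list[list[int]]:
    """Merge groups that share any ID, keeping first-seen order."""
    merged: list[list[int]] = []
    for groups in passes:
        for group in groups:
            if not group:
                continue
            overlapping = [g for g in merged if any(i in g for i in group)]
            if overlapping:
                # Fold every overlapping group into the first one found.
                target = overlapping[0]
                for other in overlapping[1:]:
                    merged.remove(other)
            else:
                target = []
                merged.append(target)
            for source in (*overlapping[1:], group):
                for item in source:
                    if item not in target:
                        target.append(item)
    return merged


rows = [
    {"id": 1, "name": "A", "website": None, "created_at": datetime(2026, 1, 2)},
    {"id": 2, "name": "A", "website": "x", "created_at": datetime(2026, 1, 3)},
    {"id": 3, "name": "A", "website": "y", "created_at": datetime(2026, 1, 1)},
]
print(select_winner(rows))
print(union_groups([[1, 2], [3, 4]], [[2, 3], [5, 6]]))
```

Keeping the first-seen order matters because each group is later read as `[winner_id, loser_id, ...]`; a real implementation would presumably re-run winner selection after groups from the two passes are unioned.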

Changes

Cross-source deduplication system

Layer / File(s) Summary
dedup.py — normalization, grouping, finders, and merge mechanics
apps/backend/scripts/dedup.py
New standalone psycopg2 module with normalization helpers for names (punctuation/diacritic removal, lowercasing), URLs (scheme-stripped, UTM-removed, query-sorted, fragment-preserved), dates (UTC conversion), and cities (whitespace-lowercased). Adds richness-based winner selection with created_at tiebreak, order-preserving group union, two-pass duplicate finders for events/venues/organizers (with place_id conflict guards and name-word overlap guards), and merge functions that backfill winner fields from losers, remap venue_id/organizer_ref_id FKs, and hard-delete losers.
Inline post-save dedup hook wired into BaseScraper
apps/backend/events/scrapers/base.py
Adds _dedup_after_save dispatcher and per-entity dedup functions (_dedup_events_by_url, _dedup_venues_by_name_city, _dedup_organizers_by_website), plus helpers for normalization, grouping, field-filling with protected-field exclusion, and group application/merging. Wires calls into save_events, save_organizers, and save_venues to dedup just-saved IDs, swallowing all errors as warnings.
deduplicate.py — standalone CLI script
apps/backend/scripts/deduplicate.py
New CLI script that loads DATABASE_URL from env/.env, connects via psycopg2, dispatches to dedup.py finder/merger pairs per entity, supports --dry-run/--verbose/--entity CLI options, runs per-group transactions with isolated commit/rollback, and prints a Groups/Merged/Deleted summary table.
API endpoint and frontend UI for triggering dedup
apps/backend/events/urls.py, apps/backend/events/views.py, apps/frontend/src/lib/api.ts, apps/frontend/src/lib/types.ts, apps/frontend/src/routes/scrapers/+page.svelte
Backend adds api/scrapers/dedup/ URL route and api_dedup_trigger POST endpoint (CSRF-exempt) that validates entity parameter against an allowlist, spawns deduplicate.py with 120s timeout, and returns {output, entity} on success or JSON error for invalid input/timeout/script failure. Frontend adds DedupResult type, api.deduplicate() method, dedup-specific state holders, handleDedup async handler with in-flight serialization, and a "Deduplicate" button with conditional error/output rendering alongside existing "Run All" controls.
Deduplication test suite
apps/backend/events/tests.py
Adds NormalizationTests (name/URL/date/city rules), FindDuplicatesTests (event/venue/organizer duplicate discovery per pass with winner richness ordering), MergeTests (field backfilling, protected-field preservation per entity, FK remapping, hard-delete loser behavior), and DedupCommandTests (_dedup_after_save hook semantics including noop for empty IDs and graceful unknown-entity handling).
Docs, protocols, and execution plan
docs/README.md, docs/deduplication/overview.md, docs/deduplication/api-reference.md, docs/deduplication/running-the-script.md, docs/deduplication/protocols.md, docs/deduplication/README.md, process/general-plans/active/deduplication_PLAN_18-06-26.md, apps/backend/package.json
Adds system overview (two-layer architecture, per-entity matching passes, normalization rules, winner selection, merge semantics), API reference for dedup.py public interface (normalization helpers, duplicate finders, merge functions), running guide (prerequisites, dry-run/live examples, production workflow, troubleshooting), scraping protocols (identity-field requirements, per-entity scraper rules, source uniqueness), dedup README with quick-start and baseline results, and the implementation plan detailing Steps 1–4 (dedup module, CLI script, scraper hook integration, tests). Updates docs/README.md index and package.json dev script to Windows venv path (..\\..\\.venv\\Scripts\\python).
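
The name, city, and date normalization rules listed for dedup.py in the table above could look roughly like the sketch below. This is an illustration under assumptions, not the module's actual code: replacing punctuation with a space (rather than deleting it), collapsing repeated whitespace, and treating naive datetimes as UTC are all choices made here for the example.

```python
import re
import unicodedata
from datetime import date, datetime, timedelta, timezone


def normalize_name(name: str) -> str:
    """Lowercase a name and strip diacritics and punctuation."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: punctuation becomes a space so "del-Mar" and "del Mar" match.
    no_punct = re.sub(r"[^\w\s]", " ", no_marks)
    return " ".join(no_punct.lower().split())


def normalize_city(city: str) -> str:
    """Lowercase a city and collapse surrounding/internal whitespace."""
    return " ".join(city.lower().split())


def normalize_date(value):
    """Convert a datetime to its UTC calendar date; pass plain dates through."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: naive datetimes are already in UTC.
            value = value.replace(tzinfo=timezone.utc)
        return value.astimezone(timezone.utc).date()
    return value


manila = timezone(timedelta(hours=8))
print(normalize_name("Café del-Mar, Inc."))
print(normalize_city("  Cebu   City "))
print(normalize_date(datetime(2026, 6, 18, 1, 0, tzinfo=manila)))
```

The date example shows why the UTC conversion matters for matching: 01:00 on June 18 at UTC+8 falls on June 17 in UTC, so two sources reporting the same event in different time zones end up with the same key.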

Sequence Diagram(s)

sequenceDiagram
    rect rgba(100, 149, 237, 0.5)
        Note over BaseScraper,Event: Layer 1 — Inline post-save dedup
    end
    participant BaseScraper
    participant _dedup_after_save
    participant _dedup_events_by_url
    participant Event

    BaseScraper->>_dedup_after_save: save_events(...) → _dedup_after_save("events", event_ids)
    _dedup_after_save->>_dedup_events_by_url: event_ids
    _dedup_events_by_url->>Event: Query by normalized URL
    Event-->>_dedup_events_by_url: duplicate groups [[winner_id, loser_id, ...], ...]
    _dedup_events_by_url->>_dedup_events_by_url: _apply_groups(Event, groups, protected)
    _dedup_events_by_url->>Event: _fill_missing(winner, losers, protected)
    _dedup_events_by_url->>Event: winner.save()
    _dedup_events_by_url->>Event: loser_rows.delete()
    _dedup_after_save-->>BaseScraper: (errors swallowed as warnings)
sequenceDiagram
    rect rgba(60, 179, 113, 0.5)
        Note over main,Postgres: Layer 2 — On-demand standalone script
    end
    participant main as main()
    participant run_entity
    participant dedup as dedup.py
    participant Postgres

    main->>run_entity: entity="venues", dry_run=false, verbose=true
    run_entity->>dedup: find_venue_duplicates(cursor)
    dedup->>Postgres: SELECT all venues
    Postgres-->>dedup: all rows with normalized keys
    dedup-->>run_entity: [[winner_id, loser_id, ...], ...]
    loop per duplicate group
        run_entity->>dedup: merge_venues(cursor, winner_id, loser_ids)
        dedup->>Postgres: UPDATE events SET venue_id=winner WHERE venue_id IN (losers)
        dedup->>Postgres: DELETE FROM venues WHERE id IN (losers)
        Postgres-->>dedup: rows affected
        alt on error
            run_entity->>Postgres: ROLLBACK
        else success
            run_entity->>Postgres: COMMIT
        end
    end
    run_entity-->>main: {groups, merged, deleted, failures}
    main->>main: _print_summary(summaries)
sequenceDiagram
    rect rgba(218, 112, 214, 0.5)
        Note over Frontend,Postgres: API trigger for on-demand dedup
    end
    participant Frontend
    participant api_dedup_trigger as api_dedup_trigger()
    participant subprocess as deduplicate.py
    participant Postgres

    Frontend->>api_dedup_trigger: POST /scrapers/dedup/, entity="venues"
    api_dedup_trigger->>api_dedup_trigger: validate entity in allowlist
    api_dedup_trigger->>subprocess: subprocess.run(..., timeout=120s)
    subprocess->>Postgres: find_venue_duplicates(cursor)
    subprocess->>Postgres: merge_venues(...) per group, commit/rollback
    subprocess-->>api_dedup_trigger: stdout/stderr, exit code
    alt exit code = 0
        api_dedup_trigger-->>Frontend: {output: "...", entity: "venues"}
    else error
        api_dedup_trigger-->>Frontend: {error: "...", status: 400/500/504}
    end
    Frontend->>Frontend: render output or error UI
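
The API trigger flow in the diagram above (allowlist check, subprocess with a timeout, status mapping) can be sketched without Django as follows. The helper name `run_dedup`, the contents of `ALLOWED_ENTITIES`, and the exact pairing of failure modes to the 400/500/504 codes are assumptions for illustration; the real `api_dedup_trigger` view is not shown in this conversation.

```python
import subprocess
import sys

# Assumed allowlist; the real view may accept a different set of values.
ALLOWED_ENTITIES = frozenset({"all", "events", "venues", "organizers"})


def run_dedup(entity: str, command: list[str], timeout: float = 120.0) -> tuple[int, dict]:
    """Validate the entity, run the command, and map the result to (status, body)."""
    if entity not in ALLOWED_ENTITIES:
        return 400, {"error": f"invalid entity: {entity!r}"}
    try:
        # The command is passed as a list, so no shell is involved.
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 504, {"error": "dedup timed out"}
    if proc.returncode != 0:
        return 500, {"error": proc.stderr.strip() or "dedup script failed"}
    return 200, {"output": proc.stdout, "entity": entity}


# Stand-in command so the sketch runs without the real script.
print(run_dedup("venues", [sys.executable, "-c", "print('0 groups merged')"]))
print(run_dedup("users", [sys.executable, "-c", "print('never runs')"]))
```

In the real endpoint the command would presumably be something like `[sys.executable, "scripts/deduplicate.py", "--entity", entity]` (the path is hypothetical). Validating against the allowlist before building the command keeps user input out of the argument list entirely.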

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hdmGOAT/veent-event-scraper#8: Both PRs modify the same BaseScraper post-save pipeline, with this PR adding the _dedup_after_save deduplication hook and the other adding a classification hook (_categorize_after_save) at the same lifecycle point.

Suggested reviewers

  • potakaaa
  • hdmGOAT

🐇 Hop hop, duplicates be gone!
Two sources, one winner, all night long,
URLs sorted, names trimmed tight,
Guards and groups merge them right.
Layer one hooks and layer two's might,
Dedup magic sets things right!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Feat deduplication' is vague and generic, using only the convention prefix 'Feat' followed by a broad feature name without conveying meaningful specifics about the changeset. | Consider a more descriptive title that highlights the main change, such as 'Add post-save deduplication pipeline for events, venues, and organizers' or similar to clarify what deduplication approach was implemented. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@JESREAL1JDL7LUSTRE JESREAL1JDL7LUSTRE requested review from hdmGOAT and potakaaa and removed request for potakaaa June 18, 2026 01:59

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (12)
docs/deduplication/protocols.md (1)

163-163: ⚡ Quick win

Add language specifier to flow diagram code block.

Line 163 contains a flow/diagram block without language specification. Adding text specifier improves linting compliance.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deduplication/protocols.md` at line 163, The code block at line 163 in
the flow diagram section is missing a language specifier on the opening triple
backticks. Add `text` as the language specifier to the code fence (change ``` to
```text) to improve markdown linting compliance and properly indicate the code
block language type.

Source: Linters/SAST tools

docs/deduplication/overview.md (1)

23-23: ⚡ Quick win

Consider adding language specifiers to fenced code blocks.

Lines 23, 41, 105, and 156 contain code/text blocks without language specifications. While not required for these diagram/example blocks, adding text or bash specifiers improves linting compliance and clarity.

Also applies to: 41-41, 105-105, 156-156

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deduplication/overview.md` at line 23, The fenced code blocks at lines
23, 41, 105, and 156 in the deduplication overview markdown file are missing
language specifiers after the opening triple backticks. Add an appropriate
language identifier (such as `text` or `bash`) immediately after the opening
triple backticks for each of these code blocks to improve linting compliance and
clarity.

Source: Linters/SAST tools

docs/deduplication/running-the-script.md (1)

31-31: ⚡ Quick win

Add language specifiers to output code blocks.

Lines 31 and 97 contain sample output blocks without language specifications. Adding text specifier improves linting compliance and clarity.

Also applies to: 97-97

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deduplication/running-the-script.md` at line 31, The sample output code
blocks at lines 31 and 97 in the markdown file are missing language specifiers.
To fix this, locate both code blocks (the opening triple backticks that precede
sample output sections) and add the text language specifier by changing the
opening backticks from ``` to ```text. This will improve markdown linting
compliance and make the output format explicit to readers.

Source: Linters/SAST tools

docs/deduplication/api-reference.md (1)

170-170: ⚡ Quick win

Add SQL language specifiers to fenced code blocks.

Lines 170, 188, and 210 contain SQL code blocks that are missing the sql language identifier. This improves readability and allows Markdown linters to validate syntax.

Proposed fix
-**FK remapping:**
-
-```
+**FK remapping:**
+
+```sql
 UPDATE events_event SET venue_id = winner_id WHERE venue_id = ANY(loser_ids)

Apply the same change (add `sql` language spec) to lines 188 and 210.

</details>


Also applies to: 188-188, 210-210

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/deduplication/api-reference.md at line 170, The SQL code blocks in the
markdown file are missing the language identifier in their fenced code block
declarations. Add the sql language specifier to the opening fence of the three
SQL code blocks located around lines 170, 188, and 210. Change each opening
fence from three backticks to three backticks followed by sql (e.g., ```sql instead of ``` before the UPDATE statement blocks). This will improve
readability and allow Markdown linters to properly validate the SQL syntax.


</details>


_Source: Linters/SAST tools_

</blockquote></details>
<details>
<summary>process/general-plans/active/deduplication_PLAN_18-06-26.md (1)</summary><blockquote>

`203-203`: _⚡ Quick win_

**Add language specifiers to code example blocks.**

Lines 203, 212, and 240 contain code blocks without language specifications. Adding `python` (for lines 203, 240) and `text` (for line 212) improves linting compliance and syntax highlighting.






Also applies to: 212-212, 240-240

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @process/general-plans/active/deduplication_PLAN_18-06-26.md at line 203,
Locate the three code blocks in this markdown document that currently lack
language specifiers (marked with just ``` without a language identifier) and add the appropriate language specifier to each: change ``` to ```python for two of the code blocks and to ```text for one code block, as indicated in the
review comment to improve linting compliance and enable proper syntax
highlighting.


</details>


_Source: Linters/SAST tools_

</blockquote></details>
<details>
<summary>apps/backend/events/scrapers/base.py (2)</summary><blockquote>

`205-220`: _💤 Low value_

**Loop variable shadows `dataclasses.field` import.**

The `field` loop variable on line 208 shadows the `field` import from `dataclasses` on line 10. While this won't cause runtime issues (the import is only used in dataclass definitions), it's a code smell that static analysis correctly flags.




<details>
<summary>♻️ Suggested rename</summary>

```diff
-    for field in winner._meta.concrete_fields:
-        name = field.name
-        if name in skip or field.is_relation:
+    for model_field in winner._meta.concrete_fields:
+        name = model_field.name
+        if name in skip or model_field.is_relation:
             continue
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/scrapers/base.py` around lines 205 - 220, The loop
variable `field` in the `_fill_missing` function shadows the `field` import from
the `dataclasses` module. Rename the loop variable from `field` to something
else (such as `model_field`) in the for loop that iterates over
`winner._meta.concrete_fields`, and update all references to this variable
within the loop body where `field.name` and `field.is_relation` are accessed.
```

</details>


_Source: Linters/SAST tools_

---

`321-338`: _⚡ Quick win_

**Consider wrapping merge operations in a transaction for atomicity.**

`_apply_groups` performs multiple database operations per group (save winner, update FKs, delete losers). If a failure occurs mid-sequence, the database could be left in an inconsistent state—for example, winner enriched but losers still present as orphans.

The standalone script uses per-group transactions with rollback on failure. While the inline dedup is best-effort, wrapping in `transaction.atomic()` would ensure all-or-nothing semantics per group.



<details>
<summary>♻️ Suggested transaction wrapping</summary>

```diff
+from django.db import transaction
+
 def _apply_groups(model, groups, protected, venue_fk=False, organizer_fk=False):
     """Merge each group's losers into the winner and hard-delete the losers."""
     for group in groups:
         winner_id, loser_ids = group[0], group[1:]
         if not loser_ids:
             continue
-        winner = model.objects.get(pk=winner_id)
-        losers = list(model.objects.filter(pk__in=loser_ids))
-        _fill_missing(winner, losers, protected)
-        winner.save()
-        if venue_fk:
-            Event.objects.filter(venue_id__in=loser_ids).update(venue_id=winner_id)
-        if organizer_fk:
-            Event.objects.filter(organizer_ref_id__in=loser_ids).update(
-                organizer_ref_id=winner_id
-            )
-        model.objects.filter(pk__in=loser_ids).delete()
+        with transaction.atomic():
+            winner = model.objects.get(pk=winner_id)
+            losers = list(model.objects.filter(pk__in=loser_ids))
+            _fill_missing(winner, losers, protected)
+            winner.save()
+            if venue_fk:
+                Event.objects.filter(venue_id__in=loser_ids).update(venue_id=winner_id)
+            if organizer_fk:
+                Event.objects.filter(organizer_ref_id__in=loser_ids).update(
+                    organizer_ref_id=winner_id
+                )
+            model.objects.filter(pk__in=loser_ids).delete()
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/scrapers/base.py` around lines 321 - 338, The
_apply_groups function performs multiple database operations per group
(winner.save(), Event.objects.filter().update() calls for venue and organizer,
and model.objects.filter().delete()) without transactional protection, which
could leave the database in an inconsistent state if any operation fails
mid-sequence. Wrap all the database operations for each group iteration
(starting from winner.save() through the final model.objects.filter().delete()
call) inside a transaction.atomic() context manager to ensure all-or-nothing
semantics per group. Import transaction from django.db if not already imported.
```

</details>


</blockquote></details>
<details>
<summary>apps/backend/scripts/dedup.py (1)</summary><blockquote>

`141-155`: _💤 Low value_

**Static analysis SQL injection warnings are false positives here.**

The f-string SQL construction is flagged by static analysis, but in this context:
- `table` is always a hardcoded internal table name (`"events_event"`, `"events_venue"`, `"events_organizer"`) passed by callers
- Column names in `set_clause` derive from database schema via `cursor.fetchone()`, not user input

The pattern is safe for this internal-only module, though adding a comment noting the constraint would help future maintainers.
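
One way to make that constraint explicit in code, rather than only in a comment, is to check identifiers against an allowlist before interpolating them. The sketch below is an assumption-laden illustration, not the module's code: `build_update_sql` and `ALLOWED_TABLES` are hypothetical names, and only the three table names come from the comment above. Values still travel as `%s` parameters, which is the placeholder style psycopg2 uses.

```python
# Table names taken from the review comment; the constant itself is hypothetical.
ALLOWED_TABLES = frozenset({"events_event", "events_venue", "events_organizer"})


def build_update_sql(table: str, columns: list[str]) -> str:
    """Build a parameterized UPDATE statement for an allowlisted table."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    if not columns:
        raise ValueError("at least one column is required")
    for column in columns:
        # Reject anything that is not a plain identifier (no spaces, quotes, etc.).
        if not column.isidentifier():
            raise ValueError(f"unexpected column name: {column!r}")
    set_clause = ", ".join(f"{column} = %s" for column in columns)
    # Only vetted identifiers are interpolated; values are bound by the driver.
    return f"UPDATE {table} SET {set_clause} WHERE id = %s"


print(build_update_sql("events_venue", ["name", "city"]))
```

A guard like this turns the "callers only pass hardcoded names" convention into a runtime check, so a future caller that forwards user input fails loudly instead of silently widening the attack surface.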




<details>
<summary>♻️ Optional: Minor style improvement for list concatenation (line 370)</summary>

```diff
-        params = list(updates.values()) + [winner_id]
+        params = [*updates.values(), winner_id]
```
</details>


Also applies to: 346-373

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/scripts/dedup.py` around lines 141 - 155, Add a clarifying
comment in the _select_winner function before the SQL query execution to
document that the f-string construction is intentional and safe because the
table parameter is restricted to hardcoded internal table names (not user
input), and the values passed to the query are properly parameterized. This will
help suppress static analysis warnings and inform future maintainers of the
design constraint.
```

</details>


_Source: Linters/SAST tools_

</blockquote></details>
<details>
<summary>apps/backend/events/tests.py (2)</summary><blockquote>

`1240-1242`: _💤 Low value_

**Consider using a direct import instead of `__import__`.**

The inline `__import__("datetime")` works but is unconventional. A direct import would be clearer.




<details>
<summary>♻️ Suggested change</summary>

```diff
+from datetime import timezone as dt_tz
+
 class NormalizationTests(TestCase):
     # ...
     def test_normalize_date_datetime(self):
         dt = timezone.now()
-        self.assertEqual(dedup_utils.normalize_date(dt), dt.astimezone(__import__("datetime").timezone.utc).date())
+        self.assertEqual(dedup_utils.normalize_date(dt), dt.astimezone(dt_tz.utc).date())
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/tests.py` around lines 1240 - 1242, In the
test_normalize_date_datetime method, replace the inline __import__("datetime")
call with a direct import of the datetime module at the top of the test file.
Change the assertion to use the directly imported datetime.timezone.utc instead
of accessing it through __import__. This will improve code readability and
follow Python conventions.
```

</details>


---

`1482-1486`: _💤 Low value_

**Missing blank lines before the next class.**

PEP 8 recommends two blank lines between top-level class definitions. The new `DedupCommandTests` class ends without blank lines before `OrganizerExportTests`.




<details>
<summary>♻️ Add blank lines after test class</summary>

```diff
     def test_dedup_after_save_never_raises(self):
         from events.scrapers.base import _dedup_after_save

         # Unknown entity is silently ignored (no dispatch), never raises.
         _dedup_after_save("nonexistent", [1, 2, 3])
+
+
 class OrganizerExportTests(TestCase):
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/tests.py` around lines 1482 - 1486, The test method
test_dedup_after_save_never_raises is part of the DedupCommandTests class which
is missing required blank lines before the next class definition. Add two blank
lines after the DedupCommandTests class definition ends (after its last method)
and before the OrganizerExportTests class begins to comply with PEP 8 style
guidelines for top-level class definitions.
```

</details>


</blockquote></details>
<details>
<summary>apps/backend/scripts/deduplicate.py (2)</summary><blockquote>

`125-147`: _💤 Low value_

**`autocommit` is set to `False` but never restored.**

After processing, `conn.autocommit` remains `False`. While this doesn't cause issues in the current code path (connection is closed after all entities are processed), it could lead to unexpected behavior if `run_entity` is used in other contexts or if the connection is reused.

Consider resetting `autocommit` in a `finally` block or documenting that the caller is responsible for connection state.
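
A variant that restores whatever value the caller had, instead of hard-coding `True`, is a small context manager. This is a sketch only: `manual_commit` is a hypothetical helper, `FakeConnection` stands in for a psycopg2 connection so the example runs without a database, and it assumes no transaction is open when the block exits (in `run_entity` every group ends in a commit or rollback).

```python
from contextlib import contextmanager


@contextmanager
def manual_commit(conn):
    """Temporarily disable autocommit and restore the previous setting."""
    previous = conn.autocommit
    conn.autocommit = False
    try:
        yield conn
    finally:
        # Restore the caller's setting even if the body raised.
        conn.autocommit = previous


class FakeConnection:
    """Minimal stand-in exposing only the attribute this sketch touches."""

    autocommit = True


conn = FakeConnection()
with manual_commit(conn):
    print("inside:", conn.autocommit)
print("after:", conn.autocommit)
```

Saving and restoring the prior value avoids surprising a caller that had deliberately left autocommit off before calling in.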




<details>
<summary>♻️ Optional: Reset autocommit after processing</summary>

```diff
     # Each group is committed independently so one failure cannot corrupt or
     # roll back the others.
     conn.autocommit = False
-    for group in groups:
-        winner, losers = group[0], group[1:]
-        if not losers:
-            continue
-        try:
-            with conn.cursor() as cur:
-                merge_fn(cur, winner, losers)
-            conn.commit()
-            summary["merged"] += 1
-            summary["deleted"] += len(losers)
-            if verbose:
-                print(f"  [{label}] merged losers {losers} -> winner {winner}")
-        except Exception as exc:  # noqa: BLE001 — isolate per-group failures
-            conn.rollback()
-            summary["errors"] += 1
-            print(
-                f"  [{label}] ERROR merging {group}: {exc}", file=sys.stderr
-            )
-
-    return summary
+    try:
+        for group in groups:
+            winner, losers = group[0], group[1:]
+            if not losers:
+                continue
+            try:
+                with conn.cursor() as cur:
+                    merge_fn(cur, winner, losers)
+                conn.commit()
+                summary["merged"] += 1
+                summary["deleted"] += len(losers)
+                if verbose:
+                    print(f"  [{label}] merged losers {losers} -> winner {winner}")
+            except Exception as exc:  # noqa: BLE001 — isolate per-group failures
+                conn.rollback()
+                summary["errors"] += 1
+                print(
+                    f"  [{label}] ERROR merging {group}: {exc}", file=sys.stderr
+                )
+    finally:
+        conn.autocommit = True
+
+    return summary
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/scripts/deduplicate.py` around lines 125 - 147, The autocommit
property of the database connection is set to False at the beginning of the loop
but is never restored to its original state. Save the original autocommit value
before the for loop that iterates through groups, then restore it after
processing is complete either by adding a finally block after the try/except
that resets conn.autocommit to the saved original value, or by restoring it at
the end of the function before returning the summary. This ensures that if the
connection is reused elsewhere, it maintains the expected autocommit state.
```

</details>


---

`50-63`: _💤 Low value_

**Custom .env parser has edge cases with quoted values containing equals signs.**

The `line.partition("=")` approach splits on the first `=`, which is correct. However, the double `.strip('"').strip("'")` only removes quotes at the very edges after stripping whitespace — a value like `"foo=bar"` would become `foo=bar` correctly, but `'foo\'s value'` or escaped quotes won't be handled. This is acceptable for typical DATABASE_URL values, but consider using `python-dotenv` if more complex .env files are expected in the future.
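
To make the edge case concrete, here is a sketch of the parsing approach as the comment describes it. The actual `_load_dotenv` body is not shown in this review, so `parse_env_line` is a hypothetical reconstruction from that description (first-`=` partition, then chained quote stripping), and the comment and blank-line handling is an assumption.

```python
def parse_env_line(line: str):
    """Parse one KEY=VALUE line; return None for blanks, comments, or junk."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None
    # partition splits on the first "=", so later "=" stay in the value.
    key, _, value = line.partition("=")
    value = value.strip().strip('"').strip("'")
    return key.strip(), value


# A typical DATABASE_URL with "=" inside a quoted value parses cleanly.
print(parse_env_line('DATABASE_URL="postgres://u:p@host/db?sslmode=require"'))
# Escaped quotes are not interpreted: the backslash stays in the value.
print(parse_env_line("GREETING='it\\'s fine'"))
```

The second call shows the limitation the comment points at: only the outermost quotes are removed, and escape sequences inside the value are left untouched.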

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/scripts/deduplicate.py` around lines 50 - 63, The custom
quote-stripping logic in the `_load_dotenv` function using `line.partition("=")`
and chained `.strip('"').strip("'")` calls handles simple cases like
DATABASE_URL values but doesn't properly handle escaped quotes or complex quote
scenarios. If the application needs to support more complex .env files in the
future, replace the custom parsing logic in the `_load_dotenv` function with the
`python-dotenv` library instead, which provides robust handling of edge cases
like escaped quotes, quoted values with special characters, and other .env file
complexities.
```

</details>


</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @apps/backend/events/scrapers/base.py:

  • Around line 165-188: The _dedup_normalize_url function is dropping URL
    fragments when reconstructing the normalized URL in its return statement, but
    the standalone dedup script in scripts/dedup.py preserves fragments. This
    causes inconsistency where distinct myruntime events with different fragments
    (e.g., #/gala-night vs #/concert-series) incorrectly merge into the same
    event. Extract the fragment from the parsed URL object (available from the
    urlparse result) and include it in the urlunparse call at the return
    statement instead of passing an empty string for the fragment parameter,
    ensuring fragments are preserved to match the standalone dedup behavior.
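
A minimal sketch of the difference, assuming the four URL rules listed in the walkthrough (scheme stripped, UTM parameters removed, query sorted, fragment preserved). `normalize_url` and its `keep_fragment` flag are hypothetical, written only to contrast the two behaviors; neither `_dedup_normalize_url` nor the scripts/dedup.py helper is reproduced here, and the example.com URLs are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlparse


def normalize_url(url: str, keep_fragment: bool = True) -> str:
    """Build a comparison key: no scheme, no utm_* params, sorted query."""
    parsed = urlparse(url.strip())
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    )
    key = f"{parsed.netloc}{parsed.path}"
    if query:
        key += "?" + urlencode(query)
    # Hash-routed pages keep their identity in the fragment.
    if keep_fragment and parsed.fragment:
        key += "#" + parsed.fragment
    return key


gala = "https://example.com/events?utm_source=fb&b=2&a=1#/gala-night"
concert = "http://example.com/events?a=1&b=2#/concert-series"

print(normalize_url(gala))
print(normalize_url(gala) == normalize_url(concert))
print(normalize_url(gala, False) == normalize_url(concert, False))
```

With fragments kept, the two events stay distinct; with fragments dropped, both collapse to the same key, which is the incorrect merge described above.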

In @apps/backend/package.json:

  • Line 5: The dev script in package.json uses Windows-specific backslash path
    syntax and the Windows-specific Scripts/ directory structure, which prevents
    Unix/macOS developers from running the development server. Replace the Windows
    path ..\\..\\.venv\\Scripts\\python manage.py runserver with a cross-platform
    compatible path using forward slashes (e.g., ../../.venv/bin/python manage.py runserver or .venv/bin/python manage.py runserver), or alternatively use a
    package like cross-env or create separate platform-specific script entries
    (dev:win and dev:unix) to handle both Windows and Unix environments. Ensure
    the updated path matches the actual project structure and venv location
    described in the README setup instructions.

Nitpick comments:
In @apps/backend/events/scrapers/base.py:

  • Around line 205-220: The loop variable field in the _fill_missing function
    shadows the field import from the dataclasses module. Rename the loop
    variable from field to something else (such as model_field) in the for loop
    that iterates over winner._meta.concrete_fields, and update all references to
    this variable within the loop body where field.name and field.is_relation
    are accessed.
  • Around line 321-338: The _apply_groups function performs multiple database
    operations per group (winner.save(), Event.objects.filter().update() calls for
    venue and organizer, and model.objects.filter().delete()) without transactional
    protection, which could leave the database in an inconsistent state if any
    operation fails mid-sequence. Wrap all the database operations for each group
    iteration (starting from winner.save() through the final
    model.objects.filter().delete() call) inside a transaction.atomic() context
    manager to ensure all-or-nothing semantics per group. Import transaction from
    django.db if not already imported.
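
The all-or-nothing behavior requested for each group can be sketched without Django. This is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for the ORM, and the `event`/`venue` tables, `merge_group`, and `build_demo_db` are hypothetical names, not the project's code. In the real `_apply_groups`, `transaction.atomic()` would play the role of `with conn:` here.

```python
import sqlite3


def merge_group(conn: sqlite3.Connection, winner_id: int, loser_ids: list[int],
                fail_midway: bool = False) -> None:
    """Repoint events at the winning venue and delete the losers, all or nothing."""
    # `with conn:` commits if the block succeeds and rolls back if it raises.
    with conn:
        for loser_id in loser_ids:
            conn.execute(
                "UPDATE event SET venue_id = ? WHERE venue_id = ?",
                (winner_id, loser_id),
            )
        if fail_midway:  # simulated failure between the update and the delete
            raise RuntimeError("simulated failure")
        conn.executemany("DELETE FROM venue WHERE id = ?", [(i,) for i in loser_ids])


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory schema with one duplicate venue (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE venue (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, venue_id INTEGER)")
    conn.executemany("INSERT INTO venue VALUES (?, ?)", [(1, "Hall"), (2, "hall ")])
    conn.execute("INSERT INTO event VALUES (10, 2)")
    conn.commit()
    return conn
```

If the merge fails midway, the event still points at its original venue and both venue rows survive, which is the property the nitpick asks for.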

In @apps/backend/events/tests.py:

  • Around line 1240-1242: In the test_normalize_date_datetime method, replace the
    inline __import__("datetime") call with a direct import of the datetime module
    at the top of the test file. Change the assertion to use the directly imported
    datetime.timezone.utc instead of accessing it through __import__. This will
    improve code readability and follow Python conventions.
  • Around line 1482-1486: The test method test_dedup_after_save_never_raises is
    part of the DedupCommandTests class which is missing required blank lines before
    the next class definition. Add two blank lines after the DedupCommandTests class
    definition ends (after its last method) and before the OrganizerExportTests
    class begins to comply with PEP 8 style guidelines for top-level class
    definitions.

In @apps/backend/scripts/dedup.py:

  • Around line 141-155: Add a clarifying comment in the _select_winner function
    before the SQL query execution to document that the f-string construction is
    intentional and safe because the table parameter is restricted to hardcoded
    internal table names (not user input), and the values passed to the query are
    properly parameterized. This will help suppress static analysis warnings and
    inform future maintainers of the design constraint.
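
The design constraint described above (an f-string for the table name, parameters for the values) can also be enforced rather than only documented. The following is a minimal sketch, assuming `sqlite3` as a stand-in driver and hypothetical table names; `ALLOWED_TABLES` and `count_by_name` are illustrative and do not appear in the PR.

```python
import sqlite3

# Hypothetical allowlist; the real script's table names may differ.
ALLOWED_TABLES = frozenset({"events_event", "events_venue", "events_organizer"})


def count_by_name(conn: sqlite3.Connection, table: str, name: str) -> int:
    """Count rows matching `name` in an allowlisted table."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    # Safe f-string: `table` was checked against a hardcoded allowlist above,
    # and the user-influenced value is passed as a bound parameter.
    row = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE name = ?", (name,)
    ).fetchone()
    return row[0]
```

With the check in place, the comment the reviewer asks for describes a guarantee the code actually enforces.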

In @apps/backend/scripts/deduplicate.py:

  • Around line 125-147: The autocommit property of the database connection is set
    to False at the beginning of the loop but is never restored to its original
    state. Save the original autocommit value before the for loop that iterates
    through groups, then restore it after processing is complete either by adding a
    finally block after the try/except that resets conn.autocommit to the saved
    original value, or by restoring it at the end of the function before returning
    the summary. This ensures that if the connection is reused elsewhere, it
    maintains the expected autocommit state.
  • Around line 50-63: The custom quote-stripping logic in the _load_dotenv
    function using line.partition("=") and chained .strip('"').strip("'") calls
    handles simple cases like DATABASE_URL values but doesn't properly handle
    escaped quotes or complex quote scenarios. If the application needs to support
    more complex .env files in the future, replace the custom parsing logic in the
    _load_dotenv function with the python-dotenv library instead, which provides
    robust handling of edge cases like escaped quotes, quoted values with special
    characters, and other .env file complexities.
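
One way to guarantee that the autocommit state is restored, as the first bullet above requests, is a small context manager. This is a sketch under the assumption that the connection exposes a writable `autocommit` attribute (as psycopg2-style connections do); `autocommit_disabled` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def autocommit_disabled(conn):
    """Temporarily set conn.autocommit to False, restoring the prior value on exit."""
    original = conn.autocommit
    conn.autocommit = False
    try:
        yield conn
    finally:
        # Runs on success and on error, so a reused connection keeps its old state.
        conn.autocommit = original
```

Wrapping the loop over groups in `with autocommit_disabled(conn):` would cover both the normal return path and the exception path without a separate `finally` block in the calling function.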

In @docs/deduplication/api-reference.md:

  • Line 170: The SQL code blocks in the markdown file are missing the language
    identifier in their fenced code block declarations. Add the sql language
    specifier to the opening fence of the three SQL code blocks located around lines
    170, 188, and 210. Change each opening fence from three bare backticks to three
    backticks followed by sql, so that the fence before each UPDATE statement block
    is tagged. This will enable SQL syntax highlighting and satisfy Markdown
    linters that require a fence language.

In @docs/deduplication/overview.md:

  • Line 23: The fenced code blocks at lines 23, 41, 105, and 156 in the
    deduplication overview markdown file are missing language specifiers after the
    opening triple backticks. Add an appropriate language identifier (such as text
    or bash) immediately after the opening triple backticks for each of these code
    blocks to improve linting compliance and clarity.

In @docs/deduplication/protocols.md:

  • Line 163: The code block at line 163 in the flow diagram section is missing a
    language specifier on the opening triple backticks. Add text as the language
    specifier to the code fence (change the bare opening fence to one tagged text)
    to improve markdown linting compliance and properly indicate the code block
    language type.

In @docs/deduplication/running-the-script.md:

  • Line 31: The sample output code blocks at lines 31 and 97 in the markdown file
    are missing language specifiers. To fix this, locate both code blocks (the
    opening triple backticks that precede sample output sections) and add the text
    language specifier by changing each bare opening fence to one tagged text. This
    will improve markdown linting compliance and make the output format explicit to
    readers.

In @process/general-plans/active/deduplication_PLAN_18-06-26.md:

  • Line 203: Locate the three code blocks in this markdown document that
    currently lack language specifiers (opened with bare triple backticks and no
    language identifier) and add the appropriate language specifier to each: tag
    two of the code blocks as python and one as text, as indicated in the review
    comment, to improve linting compliance and enable proper syntax highlighting.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended)
- [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Repository UI

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `871cdfff-e30a-4476-900a-50a24437b20f`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 00e70d0305b4c8b220eeaf62c48dfe0ed68ed9b7 and c428a2026ca30589506a3ab36d20940fe6e857dd.

</details>

<details>
<summary>📒 Files selected for processing (12)</summary>

* `apps/backend/events/scrapers/base.py`
* `apps/backend/events/tests.py`
* `apps/backend/package.json`
* `apps/backend/scripts/dedup.py`
* `apps/backend/scripts/deduplicate.py`
* `docs/README.md`
* `docs/deduplication/README.md`
* `docs/deduplication/api-reference.md`
* `docs/deduplication/overview.md`
* `docs/deduplication/protocols.md`
* `docs/deduplication/running-the-script.md`
* `process/general-plans/active/deduplication_PLAN_18-06-26.md`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines +165 to +188
def _dedup_normalize_url(url: str | None) -> str:
    """Scheme-less, UTM-stripped, query-sorted, slash-trimmed URL key.

    Unlike ``_normalize_url`` (which preserves the scheme for organizer
    resolution), the dedup path drops the scheme so ``http://`` and ``https://``
    collapse together, matching scripts/dedup.normalize_url.
    """
    from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

    if not url:
        return ""
    parsed = urlparse(str(url).strip().lower())
    netloc, path = parsed.netloc, parsed.path
    if not netloc and path:
        netloc, _, rest = path.partition("/")
        path = "/" + rest if rest else ""
    path = path.rstrip("/")
    pairs = [
        (k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True)
        if not k.startswith("utm_")
    ]
    pairs.sort()
    return urlunparse(("", netloc, path, "", urlencode(pairs), ""))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

URL fragment dropped—inconsistent with standalone dedup and may cause incorrect merges.

The inline _dedup_normalize_url drops URL fragments (line 187 returns empty fragment), but scripts/dedup.py:normalize_url preserves them. Per the comment in dedup.py lines 68-69, myruntime uses #/event-slug to differentiate events on the same base URL.

With the current implementation, two distinct myruntime events like:

  • http://site.com/events#/gala-night
  • http://site.com/events#/concert-series

Would both normalize to //site.com/events and be incorrectly merged by the inline dedup.

🐛 Proposed fix: preserve fragment to match standalone behavior
-    return urlunparse(("", netloc, path, "", urlencode(pairs), ""))
+    return urlunparse(("", netloc, path, "", urlencode(pairs), parsed.fragment))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _dedup_normalize_url(url: str | None) -> str:
    """Scheme-less, UTM-stripped, query-sorted, slash-trimmed URL key.

    Unlike ``_normalize_url`` (which preserves the scheme for organizer
    resolution), the dedup path drops the scheme so ``http://`` and ``https://``
    collapse together, matching scripts/dedup.normalize_url.
    """
    from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

    if not url:
        return ""
    parsed = urlparse(str(url).strip().lower())
    netloc, path = parsed.netloc, parsed.path
    if not netloc and path:
        netloc, _, rest = path.partition("/")
        path = "/" + rest if rest else ""
    path = path.rstrip("/")
    pairs = [
        (k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True)
        if not k.startswith("utm_")
    ]
    pairs.sort()
    return urlunparse(("", netloc, path, "", urlencode(pairs), parsed.fragment))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/scrapers/base.py` around lines 165 - 188, The
`_dedup_normalize_url` function is dropping URL fragments when reconstructing
the normalized URL in its return statement, but the standalone dedup script in
`scripts/dedup.py` preserves fragments. This causes inconsistency where distinct
myruntime events with different fragments (e.g., `#/gala-night` vs
`#/concert-series`) incorrectly merge into the same event. Extract the fragment
from the parsed URL object (available from the `urlparse` result) and include it
in the `urlunparse` call at the return statement instead of passing an empty
string for the fragment parameter, ensuring fragments are preserved to match the
standalone dedup behavior.

Comment thread apps/backend/package.json
 "private": true,
 "scripts": {
-    "dev": "venv/bin/python manage.py runserver",
+    "dev": "..\\..\\.venv\\Scripts\\python manage.py runserver",


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Get repository structure overview
cd /tmp && find . -maxdepth 3 -type d -name ".venv" -o -type d -name "venv" 2>/dev/null | head -20

Repository: hdmGOAT/veent-event-scraper

Length of output: 53


🏁 Script executed:

# Check the full package.json file at apps/backend
cat -n apps/backend/package.json

Repository: hdmGOAT/veent-event-scraper

Length of output: 337


🏁 Script executed:

# Look for any venv or Python environment setup
git ls-files | grep -E "(venv|\.venv|requirements\.txt|pyproject\.toml|Pipfile)" | head -20

Repository: hdmGOAT/veent-event-scraper

Length of output: 101


🏁 Script executed:

# Check for any platform-specific handling in scripts or config
rg "cross-env|process\.platform|os\.(win32|unix)" --type json --type js --type ts

Repository: hdmGOAT/veent-event-scraper

Length of output: 53


🏁 Script executed:

# Check repository root structure to understand venv location relative to apps/backend
ls -la | head -20
ls -la apps/ 2>/dev/null | head -20

Repository: hdmGOAT/veent-event-scraper

Length of output: 1445


🏁 Script executed:

# Check if .venv exists at project root or elsewhere
ls -la | grep -i venv

Repository: hdmGOAT/veent-event-scraper

Length of output: 53


🏁 Script executed:

# Check README for platform requirements and setup instructions
cat README.md

Repository: hdmGOAT/veent-event-scraper

Length of output: 1934


🏁 Script executed:

# Check git history to see what the original dev script was
git log --oneline -10 apps/backend/package.json
git show HEAD:apps/backend/package.json 2>/dev/null || git show HEAD~1:apps/backend/package.json 2>/dev/null

Repository: hdmGOAT/veent-event-scraper

Length of output: 301


Dev script uses Windows-only path syntax; breaks cross-platform development.

The dev script at line 5 uses Windows-specific backslashes and Scripts/ subdirectory (..\\..\\.venv\\Scripts\\python manage.py runserver), which will fail on Unix/macOS systems. This conflicts with the README setup instructions, which use Unix conventions (source venv/bin/activate with forward slashes).

Fix by:

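  1. Using the POSIX venv layout with forward slashes: .venv/bin/python manage.py runserver or ../../.venv/bin/python manage.py runserver (note that a venv created on Windows places the interpreter under Scripts rather than bin, so this path alone does not serve Windows developers)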
  2. Or adopting a cross-platform approach with cross-env or platform-specific scripts (dev:win, dev:unix)

Ensure the venv path matches the actual project structure and documented setup process.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/package.json` at line 5, The `dev` script in package.json uses
Windows-specific backslash path syntax and the Windows-specific `Scripts/`
directory structure, which prevents Unix/macOS developers from running the
development server. Replace the Windows path `..\\..\\.venv\\Scripts\\python
manage.py runserver` with a cross-platform compatible path using forward slashes
(e.g., `../../.venv/bin/python manage.py runserver` or `.venv/bin/python
manage.py runserver`), or alternatively use a package like `cross-env` or create
separate platform-specific script entries (`dev:win` and `dev:unix`) to handle
both Windows and Unix environments. Ensure the updated path matches the actual
project structure and venv location described in the README setup instructions.
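
Because a virtual environment keeps its interpreter under `Scripts` on Windows and under `bin` elsewhere, one option is to resolve the path at run time instead of hardcoding either layout. The helper below is a hypothetical sketch, not part of the PR; `venv_python` and the `.venv` location are assumptions for illustration.

```python
import os
from pathlib import Path


def venv_python(venv_dir: str, os_name: str = os.name) -> Path:
    """Return the interpreter path inside a virtual environment for the given OS."""
    venv = Path(venv_dir)
    if os_name == "nt":
        # Windows venvs place executables under Scripts\.
        return venv / "Scripts" / "python.exe"
    # POSIX venvs (Linux, macOS) place executables under bin/.
    return venv / "bin" / "python"
```

A small launcher script could then call `subprocess.run([str(venv_python(".venv")), "manage.py", "runserver"])`, and the `dev` entry in package.json would invoke that launcher on every platform.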

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
apps/frontend/src/lib/types.ts (1)

152-155: ⚡ Quick win

Narrow DedupResult.entity to known literals.

Line 154 should mirror backend constraints to keep client-side state type-safe.

Suggested type refinement
 export interface DedupResult {
 	output: string;
-	entity: string;
+	entity: 'events' | 'venues' | 'organizers' | 'all';
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/frontend/src/lib/types.ts` around lines 152 - 155, The `entity` property
in the `DedupResult` interface is currently typed as a generic string, but it
should be narrowed to a union of specific literal string values to match backend
constraints and ensure type safety. Update the `entity` field in the
`DedupResult` interface to the literal union 'events' | 'venues' | 'organizers'
| 'all' instead of the broad string type, making sure the allowed values
correspond to what the backend supports.
apps/backend/events/views.py (1)

629-635: Move dedup execution off the request thread.

Running a potentially long subprocess inline (up to 120s) ties up web workers. Queueing this as a background job (and returning a job id) will improve resilience under concurrent admin actions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/views.py` around lines 629 - 635, The subprocess.run call
that executes deduplicate.py is running synchronously on the request thread and
blocking for up to 120 seconds, which can starve web workers. Instead of
executing the subprocess inline, queue this operation as a background job (using
your application's job queue system like Celery) and return a job identifier to
the client immediately. Move the subprocess.run call with its parameters (the
scripts_dir path, deduplicate.py script, entity argument, and 120-second
timeout) into a separate background task, and have the view method return the
job id or status response without waiting for completion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/backend/events/views.py`:
- Around line 615-646: The api_dedup_trigger endpoint is missing authentication
protection and exposes internal error details. First, add an authentication
check at the beginning of the function that validates the X-Scraper-Key header
against _WEBHOOK_SECRET, similar to how scraper_webhook and
ingest_events_webhook protect their endpoints. Second, replace the two error
returns that currently expose internal details (line 640 returning stderr and
line 645 returning str(exc)) with generic error messages like "Dedup operation
failed" or "An error occurred". For both cases, capture the actual error details
and log them server-side using proper logging instead of exposing them in the
JsonResponse.
- Around line 629-636: The subprocess.run call for deduplicate.py in the
api_dedup_trigger view lacks synchronization, allowing concurrent requests to
execute overlapping dedup operations that can corrupt data integrity. Wrap the
entire subprocess.run block (including the call with timeout=120 and cwd
parameters) with a process-level lock mechanism using either a file-based lock
or a threading.Lock/multiprocessing.Lock to ensure only one dedup execution runs
at a time. Acquire the lock before spawning the subprocess and ensure it is
released after the process completes, handling both success and error cases
appropriately.

---

Nitpick comments:
In `@apps/backend/events/views.py`:
- Around line 629-635: The subprocess.run call that executes deduplicate.py is
running synchronously on the request thread and blocking for up to 120 seconds,
which can starve web workers. Instead of executing the subprocess inline, queue
this operation as a background job (using your application's job queue system
like Celery) and return a job identifier to the client immediately. Move the
subprocess.run call with its parameters (the scripts_dir path, deduplicate.py
script, entity argument, and 120-second timeout) into a separate background
task, and have the view method return the job id or status response without
waiting for completion.

In `@apps/frontend/src/lib/types.ts`:
- Around line 152-155: The `entity` property in the `DedupResult` interface is
currently typed as a generic string, but it should be narrowed to a union of
specific literal string values to match backend constraints and ensure type
safety. Update the `entity` field in the `DedupResult` interface to the literal
union 'events' | 'venues' | 'organizers' | 'all' instead of the broad string
type, making sure the allowed values correspond to what the backend supports.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5a2ec474-d055-4f17-97d1-4f3c8d967ef6

📥 Commits

Reviewing files that changed from the base of the PR and between c428a20 and 1c27a10.

📒 Files selected for processing (5)
  • apps/backend/events/urls.py
  • apps/backend/events/views.py
  • apps/frontend/src/lib/api.ts
  • apps/frontend/src/lib/types.ts
  • apps/frontend/src/routes/scrapers/+page.svelte

Comment thread apps/backend/events/views.py
Comment thread apps/backend/events/views.py
JESREAL1JDL7LUSTRE and others added 2 commits June 18, 2026 10:28
…row DedupResult entity type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/backend/events/views.py (1)

670-695: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Apply consistent auth and error handling to api_script_trigger.

This endpoint lacks the auth check and error sanitization that were just added to api_dedup_trigger. Line 695 returns str(exc) which can leak internal paths or stack details. For consistency:

Suggested fix
 @csrf_exempt
 @require_POST
 def api_script_trigger(request, script_name: str):
+    key = request.headers.get("X-Scraper-Key", "")
+    if not _WEBHOOK_SECRET or key != _WEBHOOK_SECRET:
+        return JsonResponse({"error": "unauthorized"}, status=401)
+
     ...
     try:
         process = subprocess.Popen(...)
         return JsonResponse({"started": True, "script": script_name, "pid": process.pid})
-    except Exception as exc:  # noqa: BLE001
-        return JsonResponse({"error": str(exc)}, status=500)
+    except Exception:  # noqa: BLE001
+        logger.exception("script trigger error (script=%s)", script_name)
+        return JsonResponse({"error": "Failed to start script"}, status=500)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/views.py` around lines 670 - 695, The api_script_trigger
function is missing the authentication check and error sanitization that were
applied to api_dedup_trigger. Add the same auth check decorator or guard that
protects api_dedup_trigger at the start of api_script_trigger, and replace the
raw str(exc) error response in the except block with a sanitized error message
that does not leak internal paths or stack details, consistent with the error
handling approach used in api_dedup_trigger.
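
The shared-secret check can be factored into one helper so that both endpoints use the same guard. The sketch below is an assumption-laden illustration: `key_is_valid` is a hypothetical name, and it swaps the plain `!=` comparison from the suggested fix for `hmac.compare_digest`, which compares in constant time.

```python
import hmac


def key_is_valid(provided: str, secret: str) -> bool:
    """Return True only when a non-empty secret is configured and the key matches."""
    if not secret:
        # An unset secret must never authorize a request.
        return False
    # Encode to bytes so non-ASCII input is compared rather than raising TypeError.
    return hmac.compare_digest(provided.encode("utf-8"), secret.encode("utf-8"))
```

A view would call it as `key_is_valid(request.headers.get("X-Scraper-Key", ""), _WEBHOOK_SECRET)` and return a 401 response when it is false.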
🧹 Nitpick comments (1)
apps/backend/events/views.py (1)

637-659: ⚖️ Poor tradeoff

Single-flight lock is process-local; consider multi-worker implications.

The threading.Lock correctly prevents concurrent dedup within a single process. However, production deployments with multiple workers (gunicorn, uWSGI) each have their own lock instance, so concurrent requests to different workers could still overlap.

If overlapping dedup runs are dangerous (they likely are given the merge/delete operations), consider:

  • A database-level advisory lock (SELECT pg_advisory_lock(...))
  • A file-based lock (fcntl.flock)
  • A Redis-based distributed lock

That said, the 120s timeout + typical request patterns may make collisions rare in practice.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/views.py` around lines 637 - 659, The `_DEDUP_LOCK` used
in the dedup trigger view is a process-local threading.Lock, which only prevents
concurrent execution within a single worker process. In multi-worker deployments
(gunicorn, uWSGI), each worker has its own lock instance, allowing concurrent
dedup runs across different workers which could cause data corruption. Replace
the threading.Lock mechanism with a distributed locking solution such as a
database-level advisory lock using PostgreSQL's pg_advisory_lock, a file-based
lock using fcntl.flock, or a Redis-based distributed lock, depending on your
infrastructure constraints and preferences.
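
Of the options listed, the file-based lock needs no extra infrastructure. The following is a minimal sketch assuming a POSIX host (the `fcntl` module is unavailable on Windows) and that all workers share one filesystem; `single_flight` and the lock path are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_flight(lock_path: str):
    """Yield True if this caller holds the exclusive lock, False if another holder exists."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A view could wrap the subprocess call in `with single_flight(path) as acquired:` and return a conflict response when `acquired` is false. Unlike `threading.Lock`, the lock is held by the kernel per open file, so it is visible to every worker process on the same machine.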
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/backend/events/views.py`:
- Around line 626-628: The authentication check at the beginning of this view
requires the X-Scraper-Key header which the frontend's generic post() helper
cannot provide. To fix this, you need to implement a different authentication
mechanism for frontend calls while preserving the webhook authentication for
n8n. Create a separate endpoint (or modify this one to handle both cases) that
uses Django's `@staff_member_required` decorator for the frontend UI calls, while
keeping the X-Scraper-Key header validation for n8n webhook calls. You can
determine which authentication path to use by checking the request origin (e.g.,
if the header is present use webhook auth, otherwise use session/staff auth).
This allows the frontend to authenticate via session while keeping webhook calls
secure with the secret header.

---

Outside diff comments:
In `@apps/backend/events/views.py`:
- Around line 670-695: The api_script_trigger function is missing the
authentication check and error sanitization that were applied to
api_dedup_trigger. Add the same auth check decorator or guard that protects
api_dedup_trigger at the start of api_script_trigger, and replace the raw
str(exc) error response in the except block with a sanitized error message that
does not leak internal paths or stack details, consistent with the error
handling approach used in api_dedup_trigger.

---

Nitpick comments:
In `@apps/backend/events/views.py`:
- Around line 637-659: The `_DEDUP_LOCK` used in the dedup trigger view is a
process-local threading.Lock, which only prevents concurrent execution within a
single worker process. In multi-worker deployments (gunicorn, uWSGI), each
worker has its own lock instance, allowing concurrent dedup runs across
different workers which could cause data corruption. Replace the threading.Lock
mechanism with a distributed locking solution such as a database-level advisory
lock using PostgreSQL's pg_advisory_lock, a file-based lock using fcntl.flock,
or a Redis-based distributed lock, depending on your infrastructure constraints
and preferences.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: edb1a7e5-ad42-44f6-8c2f-c1b5208fd78a

📥 Commits

Reviewing files that changed from the base of the PR and between 73fca24 and aeded92.

📒 Files selected for processing (2)
  • apps/backend/events/views.py
  • apps/frontend/src/lib/types.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/frontend/src/lib/types.ts

Comment thread apps/backend/events/views.py Outdated
@hdmGOAT hdmGOAT merged commit 0031782 into development Jun 18, 2026
3 checks passed
This was referenced Jun 18, 2026