Feat deduplication#20
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughIntroduces a two-layer cross-source deduplication system for Event, Venue, and Organizer records. A new ChangesCross-source deduplication system
Sequence Diagram(s)sequenceDiagram
rect rgba(100, 149, 237, 0.5)
Note over BaseScraper,Event: Layer 1 — Inline post-save dedup
end
participant BaseScraper
participant _dedup_after_save
participant _dedup_events_by_url
participant Event
BaseScraper->>_dedup_after_save: save_events(...) → _dedup_after_save("events", event_ids)
_dedup_after_save->>_dedup_events_by_url: event_ids
_dedup_events_by_url->>Event: Query by normalized URL
Event-->>_dedup_events_by_url: duplicate groups [[winner_id, loser_id, ...], ...]
_dedup_events_by_url->>_dedup_events_by_url: _apply_groups(Event, groups, protected)
_dedup_events_by_url->>Event: _fill_missing(winner, losers, protected)
_dedup_events_by_url->>Event: winner.save()
_dedup_events_by_url->>Event: loser_rows.delete()
_dedup_after_save-->>BaseScraper: (errors swallowed as warnings)
sequenceDiagram
rect rgba(60, 179, 113, 0.5)
Note over main,Postgres: Layer 2 — On-demand standalone script
end
participant main as main()
participant run_entity
participant dedup as dedup.py
participant Postgres
main->>run_entity: entity="venues", dry_run=false, verbose=true
run_entity->>dedup: find_venue_duplicates(cursor)
dedup->>Postgres: SELECT all venues
Postgres-->>dedup: all rows with normalized keys
dedup-->>run_entity: [[winner_id, loser_id, ...], ...]
loop per duplicate group
run_entity->>dedup: merge_venues(cursor, winner_id, loser_ids)
dedup->>Postgres: UPDATE events SET venue_id=winner WHERE venue_id IN (losers)
dedup->>Postgres: DELETE FROM venues WHERE id IN (losers)
Postgres-->>dedup: rows affected
alt on error
run_entity->>Postgres: ROLLBACK
else success
run_entity->>Postgres: COMMIT
end
end
run_entity-->>main: {groups, merged, deleted, failures}
main->>main: _print_summary(summaries)
sequenceDiagram
rect rgba(218, 112, 214, 0.5)
Note over Frontend,Postgres: API trigger for on-demand dedup
end
participant Frontend
participant api_dedup_trigger as api_dedup_trigger()
participant subprocess as deduplicate.py
participant Postgres
Frontend->>api_dedup_trigger: POST /scrapers/dedup/, entity="venues"
api_dedup_trigger->>api_dedup_trigger: validate entity in allowlist
api_dedup_trigger->>subprocess: subprocess.run(..., timeout=120s)
subprocess->>Postgres: find_venue_duplicates(cursor)
subprocess->>Postgres: merge_venues(...) per group, commit/rollback
subprocess-->>api_dedup_trigger: stdout/stderr, exit code
alt exit code = 0
api_dedup_trigger-->>Frontend: {output: "...", entity: "venues"}
else error
api_dedup_trigger-->>Frontend: {error: "...", status: 400/500/504}
end
Frontend->>Frontend: render output or error UI
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (12)
docs/deduplication/protocols.md (1)
163-163: ⚡ Quick winAdd language specifier to flow diagram code block.
Line 163 contains a flow/diagram block without language specification. Adding
textspecifier improves linting compliance.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deduplication/protocols.md` at line 163, The code block at line 163 in the flow diagram section is missing a language specifier on the opening triple backticks. Add `text` as the language specifier to the code fence (change ``` to ```text) to improve markdown linting compliance and properly indicate the code block language type.Source: Linters/SAST tools
docs/deduplication/overview.md (1)
23-23: ⚡ Quick winConsider adding language specifiers to fenced code blocks.
Lines 23, 41, 105, and 156 contain code/text blocks without language specifications. While not required for these diagram/example blocks, adding
textorbashspecifiers improves linting compliance and clarity.Also applies to: 41-41, 105-105, 156-156
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deduplication/overview.md` at line 23, The fenced code blocks at lines 23, 41, 105, and 156 in the deduplication overview markdown file are missing language specifiers after the opening triple backticks. Add an appropriate language identifier (such as `text` or `bash`) immediately after the opening triple backticks for each of these code blocks to improve linting compliance and clarity.Source: Linters/SAST tools
docs/deduplication/running-the-script.md (1)
31-31: ⚡ Quick winAdd language specifiers to output code blocks.
Lines 31 and 97 contain sample output blocks without language specifications. Adding
textspecifier improves linting compliance and clarity.Also applies to: 97-97
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deduplication/running-the-script.md` at line 31, The sample output code blocks at lines 31 and 97 in the markdown file are missing language specifiers. To fix this, locate both code blocks (the opening triple backticks that precede sample output sections) and add the text language specifier by changing the opening backticks from ``` to ```text. This will improve markdown linting compliance and make the output format explicit to readers.Source: Linters/SAST tools
docs/deduplication/api-reference.md (1)
170-170: ⚡ Quick winAdd SQL language specifiers to fenced code blocks.
Lines 170, 188, and 210 contain SQL code blocks that are missing the
sqllanguage identifier. This improves readability and allows Markdown linters to validate syntax.Proposed fix
-**FK remapping:** - -```sql +**FK remapping:** + +```sql UPDATE events_event SET venue_id = winner_id WHERE venue_id = ANY(loser_ids)Apply the same change (add `sql` language spec) to lines 188 and 210. </details> Also applies to: 188-188, 210-210 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@docs/deduplication/api-reference.mdat line 170, The SQL code blocks in the
markdown file are missing the language identifier in their fenced code block
declarations. Add thesqllanguage specifier to the opening fence of the three
SQL code blocks located around lines 170, 188, and 210. Change each opening
fence from three backticks to three backticks followed bysql(e.g.,sql instead ofbefore the UPDATE statement blocks). This will improve
readability and allow Markdown linters to properly validate the SQL syntax.</details> <!-- cr-comment:v1:054182d580119f9836bcf0e3 --> _Source: Linters/SAST tools_ </blockquote></details> <details> <summary>process/general-plans/active/deduplication_PLAN_18-06-26.md (1)</summary><blockquote> `203-203`: _⚡ Quick win_ **Add language specifiers to code example blocks.** Lines 203, 212, and 240 contain code blocks without language specifications. Adding `python` (for lines 203, 240) and `text` (for line 212) improves linting compliance and syntax highlighting. Also applies to: 212-212, 240-240 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@process/general-plans/active/deduplication_PLAN_18-06-26.mdat line 203,
Locate the three code blocks in this markdown document that currently lack
language specifiers (marked with justwithout a language identifier) and add the appropriate language specifier to each: changetopython for two of the code blocks andto ```text for one code block, as indicated in the
review comment to improve linting compliance and enable proper syntax
highlighting.</details> <!-- cr-comment:v1:4bf9838976bbf9a920427712 --> _Source: Linters/SAST tools_ </blockquote></details> <details> <summary>apps/backend/events/scrapers/base.py (2)</summary><blockquote> `205-220`: _💤 Low value_ **Loop variable shadows `dataclasses.field` import.** The `field` loop variable on line 208 shadows the `field` import from `dataclasses` on line 10. While this won't cause runtime issues (the import is only used in dataclass definitions), it's a code smell that static analysis correctly flags. <details> <summary>♻️ Suggested rename</summary> ```diff - for field in winner._meta.concrete_fields: - name = field.name - if name in skip or field.is_relation: + for model_field in winner._meta.concrete_fields: + name = model_field.name + if name in skip or model_field.is_relation: continue ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/scrapers/base.py` around lines 205 - 220, The loop variable `field` in the `_fill_missing` function shadows the `field` import from the `dataclasses` module. Rename the loop variable from `field` to something else (such as `model_field`) in the for loop that iterates over `winner._meta.concrete_fields`, and update all references to this variable within the loop body where `field.name` and `field.is_relation` are accessed. ``` </details> <!-- cr-comment:v1:fb19f996979b66e5121a6067 --> _Source: Linters/SAST tools_ --- `321-338`: _⚡ Quick win_ **Consider wrapping merge operations in a transaction for atomicity.** `_apply_groups` performs multiple database operations per group (save winner, update FKs, delete losers). If a failure occurs mid-sequence, the database could be left in an inconsistent state—for example, winner enriched but losers still present as orphans. The standalone script uses per-group transactions with rollback on failure. While the inline dedup is best-effort, wrapping in `transaction.atomic()` would ensure all-or-nothing semantics per group. <details> <summary>♻️ Suggested transaction wrapping</summary> ```diff +from django.db import transaction + def _apply_groups(model, groups, protected, venue_fk=False, organizer_fk=False): """Merge each group's losers into the winner and hard-delete the losers.""" for group in groups: winner_id, loser_ids = group[0], group[1:] if not loser_ids: continue - winner = model.objects.get(pk=winner_id) - losers = list(model.objects.filter(pk__in=loser_ids)) - _fill_missing(winner, losers, protected) - winner.save() - if venue_fk: - Event.objects.filter(venue_id__in=loser_ids).update(venue_id=winner_id) - if organizer_fk: - Event.objects.filter(organizer_ref_id__in=loser_ids).update( - organizer_ref_id=winner_id - ) - model.objects.filter(pk__in=loser_ids).delete() + with transaction.atomic(): + winner = model.objects.get(pk=winner_id) + losers = list(model.objects.filter(pk__in=loser_ids)) + _fill_missing(winner, losers, protected) + winner.save() + if venue_fk: + Event.objects.filter(venue_id__in=loser_ids).update(venue_id=winner_id) + if organizer_fk: + Event.objects.filter(organizer_ref_id__in=loser_ids).update( + organizer_ref_id=winner_id + ) + model.objects.filter(pk__in=loser_ids).delete() ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/scrapers/base.py` around lines 321 - 338, The _apply_groups function performs multiple database operations per group (winner.save(), Event.objects.filter().update() calls for venue and organizer, and model.objects.filter().delete()) without transactional protection, which could leave the database in an inconsistent state if any operation fails mid-sequence. Wrap all the database operations for each group iteration (starting from winner.save() through the final model.objects.filter().delete() call) inside a transaction.atomic() context manager to ensure all-or-nothing semantics per group. Import transaction from django.db if not already imported. ``` </details> <!-- cr-comment:v1:937691563c864e39966500c5 --> </blockquote></details> <details> <summary>apps/backend/scripts/dedup.py (1)</summary><blockquote> `141-155`: _💤 Low value_ **Static analysis SQL injection warnings are false positives here.** The f-string SQL construction is flagged by static analysis, but in this context: - `table` is always a hardcoded internal table name (`"events_event"`, `"events_venue"`, `"events_organizer"`) passed by callers - Column names in `set_clause` derive from database schema via `cursor.fetchone()`, not user input The pattern is safe for this internal-only module, though adding a comment noting the constraint would help future maintainers. <details> <summary>♻️ Optional: Minor style improvement for list concatenation (line 370)</summary> ```diff - params = list(updates.values()) + [winner_id] + params = [*updates.values(), winner_id] ``` </details> Also applies to: 346-373 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/scripts/dedup.py` around lines 141 - 155, Add a clarifying comment in the _select_winner function before the SQL query execution to document that the f-string construction is intentional and safe because the table parameter is restricted to hardcoded internal table names (not user input), and the values passed to the query are properly parameterized. This will help suppress static analysis warnings and inform future maintainers of the design constraint. ``` </details> <!-- cr-comment:v1:68f7ccc651b0b90165dba869 --> _Source: Linters/SAST tools_ </blockquote></details> <details> <summary>apps/backend/events/tests.py (2)</summary><blockquote> `1240-1242`: _💤 Low value_ **Consider using a direct import instead of `__import__`.** The inline `__import__("datetime")` works but is unconventional. A direct import would be clearer. <details> <summary>♻️ Suggested change</summary> ```diff +from datetime import timezone as dt_tz + class NormalizationTests(TestCase): # ... def test_normalize_date_datetime(self): dt = timezone.now() - self.assertEqual(dedup_utils.normalize_date(dt), dt.astimezone(__import__("datetime").timezone.utc).date()) + self.assertEqual(dedup_utils.normalize_date(dt), dt.astimezone(dt_tz.utc).date()) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/tests.py` around lines 1240 - 1242, In the test_normalize_date_datetime method, replace the inline __import__("datetime") call with a direct import of the datetime module at the top of the test file. Change the assertion to use the directly imported datetime.timezone.utc instead of accessing it through __import__. This will improve code readability and follow Python conventions. ``` </details> <!-- cr-comment:v1:5def510af346acdfce0a541b --> --- `1482-1486`: _💤 Low value_ **Missing blank lines before the next class.** PEP 8 recommends two blank lines between top-level class definitions. The new `DedupCommandTests` class ends without blank lines before `OrganizerExportTests`. <details> <summary>♻️ Add blank lines after test class</summary> ```diff def test_dedup_after_save_never_raises(self): from events.scrapers.base import _dedup_after_save # Unknown entity is silently ignored (no dispatch), never raises. _dedup_after_save("nonexistent", [1, 2, 3]) + + class OrganizerExportTests(TestCase): ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/tests.py` around lines 1482 - 1486, The test method test_dedup_after_save_never_raises is part of the DedupCommandTests class which is missing required blank lines before the next class definition. Add two blank lines after the DedupCommandTests class definition ends (after its last method) and before the OrganizerExportTests class begins to comply with PEP 8 style guidelines for top-level class definitions. ``` </details> <!-- cr-comment:v1:d5579c0425a158a6dccb57f2 --> </blockquote></details> <details> <summary>apps/backend/scripts/deduplicate.py (2)</summary><blockquote> `125-147`: _💤 Low value_ **`autocommit` is set to `False` but never restored.** After processing, `conn.autocommit` remains `False`. While this doesn't cause issues in the current code path (connection is closed after all entities are processed), it could lead to unexpected behavior if `run_entity` is used in other contexts or if the connection is reused. Consider resetting `autocommit` in a `finally` block or documenting that the caller is responsible for connection state. <details> <summary>♻️ Optional: Reset autocommit after processing</summary> ```diff # Each group is committed independently so one failure cannot corrupt or # roll back the others. conn.autocommit = False - for group in groups: - winner, losers = group[0], group[1:] - if not losers: - continue - try: - with conn.cursor() as cur: - merge_fn(cur, winner, losers) - conn.commit() - summary["merged"] += 1 - summary["deleted"] += len(losers) - if verbose: - print(f" [{label}] merged losers {losers} -> winner {winner}") - except Exception as exc: # noqa: BLE001 — isolate per-group failures - conn.rollback() - summary["errors"] += 1 - print( - f" [{label}] ERROR merging {group}: {exc}", file=sys.stderr - ) - - return summary + try: + for group in groups: + winner, losers = group[0], group[1:] + if not losers: + continue + try: + with conn.cursor() as cur: + merge_fn(cur, winner, losers) + conn.commit() + summary["merged"] += 1 + summary["deleted"] += len(losers) + if verbose: + print(f" [{label}] merged losers {losers} -> winner {winner}") + except Exception as exc: # noqa: BLE001 — isolate per-group failures + conn.rollback() + summary["errors"] += 1 + print( + f" [{label}] ERROR merging {group}: {exc}", file=sys.stderr + ) + finally: + conn.autocommit = True + + return summary ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/scripts/deduplicate.py` around lines 125 - 147, The autocommit property of the database connection is set to False at the beginning of the loop but is never restored to its original state. Save the original autocommit value before the for loop that iterates through groups, then restore it after processing is complete either by adding a finally block after the try/except that resets conn.autocommit to the saved original value, or by restoring it at the end of the function before returning the summary. This ensures that if the connection is reused elsewhere, it maintains the expected autocommit state. ``` </details> <!-- cr-comment:v1:4da3e599a92fd8d97b687358 --> --- `50-63`: _💤 Low value_ **Custom .env parser has edge cases with quoted values containing equals signs.** The `line.partition("=")` approach splits on the first `=`, which is correct. However, the double `.strip('"').strip("'")` only removes quotes at the very edges after stripping whitespace — a value like `"foo=bar"` would become `foo=bar` correctly, but `'foo\'s value'` or escaped quotes won't be handled. This is acceptable for typical DATABASE_URL values, but consider using `python-dotenv` if more complex .env files are expected in the future. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/scripts/deduplicate.py` around lines 50 - 63, The custom quote-stripping logic in the `_load_dotenv` function using `line.partition("=")` and chained `.strip('"').strip("'")` calls handles simple cases like DATABASE_URL values but doesn't properly handle escaped quotes or complex quote scenarios. If the application needs to support more complex .env files in the future, replace the custom parsing logic in the `_load_dotenv` function with the `python-dotenv` library instead, which provides robust handling of edge cases like escaped quotes, quoted values with special characters, and other .env file complexities. ``` </details> <!-- cr-comment:v1:8da06ff3206aca6d0a74f243 --> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.Inline comments:
In@apps/backend/events/scrapers/base.py:
- Around line 165-188: The
_dedup_normalize_urlfunction is dropping URL
fragments when reconstructing the normalized URL in its return statement, but
the standalone dedup script inscripts/dedup.pypreserves fragments. This
causes inconsistency where distinct myruntime events with different fragments
(e.g.,#/gala-nightvs#/concert-series) incorrectly merge into the same
event. Extract the fragment from the parsed URL object (available from the
urlparseresult) and include it in theurlunparsecall at the return
statement instead of passing an empty string for the fragment parameter,
ensuring fragments are preserved to match the standalone dedup behavior.In
@apps/backend/package.json:
- Line 5: The
devscript in package.json uses Windows-specific backslash path
syntax and the Windows-specificScripts/directory structure, which prevents
Unix/macOS developers from running the development server. Replace the Windows
path..\\..\\.venv\\Scripts\\python manage.py runserverwith a cross-platform
compatible path using forward slashes (e.g.,../../.venv/bin/python manage.py runserveror.venv/bin/python manage.py runserver), or alternatively use a
package likecross-envor create separate platform-specific script entries
(dev:winanddev:unix) to handle both Windows and Unix environments. Ensure
the updated path matches the actual project structure and venv location
described in the README setup instructions.
Nitpick comments:
In@apps/backend/events/scrapers/base.py:
- Around line 205-220: The loop variable
fieldin the_fill_missingfunction
shadows thefieldimport from thedataclassesmodule. Rename the loop
variable fromfieldto something else (such asmodel_field) in the for loop
that iterates overwinner._meta.concrete_fields, and update all references to
this variable within the loop body wherefield.nameandfield.is_relation
are accessed.- Around line 321-338: The _apply_groups function performs multiple database
operations per group (winner.save(), Event.objects.filter().update() calls for
venue and organizer, and model.objects.filter().delete()) without transactional
protection, which could leave the database in an inconsistent state if any
operation fails mid-sequence. Wrap all the database operations for each group
iteration (starting from winner.save() through the final
model.objects.filter().delete() call) inside a transaction.atomic() context
manager to ensure all-or-nothing semantics per group. Import transaction from
django.db if not already imported.In
@apps/backend/events/tests.py:
- Around line 1240-1242: In the test_normalize_date_datetime method, replace the
inline import("datetime") call with a direct import of the datetime module
at the top of the test file. Change the assertion to use the directly imported
datetime.timezone.utc instead of accessing it through import. This will
improve code readability and follow Python conventions.- Around line 1482-1486: The test method test_dedup_after_save_never_raises is
part of the DedupCommandTests class which is missing required blank lines before
the next class definition. Add two blank lines after the DedupCommandTests class
definition ends (after its last method) and before the OrganizerExportTests
class begins to comply with PEP 8 style guidelines for top-level class
definitions.In
@apps/backend/scripts/dedup.py:
- Around line 141-155: Add a clarifying comment in the _select_winner function
before the SQL query execution to document that the f-string construction is
intentional and safe because the table parameter is restricted to hardcoded
internal table names (not user input), and the values passed to the query are
properly parameterized. This will help suppress static analysis warnings and
inform future maintainers of the design constraint.In
@apps/backend/scripts/deduplicate.py:
- Around line 125-147: The autocommit property of the database connection is set
to False at the beginning of the loop but is never restored to its original
state. Save the original autocommit value before the for loop that iterates
through groups, then restore it after processing is complete either by adding a
finally block after the try/except that resets conn.autocommit to the saved
original value, or by restoring it at the end of the function before returning
the summary. This ensures that if the connection is reused elsewhere, it
maintains the expected autocommit state.- Around line 50-63: The custom quote-stripping logic in the
_load_dotenv
function usingline.partition("=")and chained.strip('"').strip("'")calls
handles simple cases like DATABASE_URL values but doesn't properly handle
escaped quotes or complex quote scenarios. If the application needs to support
more complex .env files in the future, replace the custom parsing logic in the
_load_dotenvfunction with thepython-dotenvlibrary instead, which provides
robust handling of edge cases like escaped quotes, quoted values with special
characters, and other .env file complexities.In
@docs/deduplication/api-reference.md:
- Line 170: The SQL code blocks in the markdown file are missing the language
identifier in their fenced code block declarations. Add thesqllanguage
specifier to the opening fence of the three SQL code blocks located around lines
170, 188, and 210. Change each opening fence from three backticks to three
backticks followed bysql(e.g.,sql instead ofbefore the UPDATE
statement blocks). This will improve readability and allow Markdown linters to
properly validate the SQL syntax.In
@docs/deduplication/overview.md:
- Line 23: The fenced code blocks at lines 23, 41, 105, and 156 in the
deduplication overview markdown file are missing language specifiers after the
opening triple backticks. Add an appropriate language identifier (such astext
orbash) immediately after the opening triple backticks for each of these code
blocks to improve linting compliance and clarity.In
@docs/deduplication/protocols.md:
- Line 163: The code block at line 163 in the flow diagram section is missing a
language specifier on the opening triple backticks. Addtextas the language
specifier to the code fence (changetotext) to improve markdown linting
compliance and properly indicate the code block language type.In
@docs/deduplication/running-the-script.md:
- Line 31: The sample output code blocks at lines 31 and 97 in the markdown file
are missing language specifiers. To fix this, locate both code blocks (the
opening triple backticks that precede sample output sections) and add the text
language specifier by changing the opening backticks fromtotext. This
will improve markdown linting compliance and make the output format explicit to
readers.In
@process/general-plans/active/deduplication_PLAN_18-06-26.md:
- Line 203: Locate the three code blocks in this markdown document that
currently lack language specifiers (marked with justwithout a language identifier) and add the appropriate language specifier to each: changeto
python for two of the code blocks andto ```text for one code block, as
indicated in the review comment to improve linting compliance and enable proper
syntax highlighting.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Repository UI **Review profile**: CHILL **Plan**: Pro Plus **Run ID**: `871cdfff-e30a-4476-900a-50a24437b20f` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 00e70d0305b4c8b220eeaf62c48dfe0ed68ed9b7 and c428a2026ca30589506a3ab36d20940fe6e857dd. </details> <details> <summary>📒 Files selected for processing (12)</summary> * `apps/backend/events/scrapers/base.py` * `apps/backend/events/tests.py` * `apps/backend/package.json` * `apps/backend/scripts/dedup.py` * `apps/backend/scripts/deduplicate.py` * `docs/README.md` * `docs/deduplication/README.md` * `docs/deduplication/api-reference.md` * `docs/deduplication/overview.md` * `docs/deduplication/protocols.md` * `docs/deduplication/running-the-script.md` * `process/general-plans/active/deduplication_PLAN_18-06-26.md` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
| def _dedup_normalize_url(url: str | None) -> str: | ||
| """Scheme-less, UTM-stripped, query-sorted, slash-trimmed URL key. | ||
|
|
||
| Unlike ``_normalize_url`` (which preserves the scheme for organizer | ||
| resolution), the dedup path drops the scheme so ``http://`` and ``https://`` | ||
| collapse together — matching scripts/dedup.normalize_url. | ||
| """ | ||
| from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse | ||
|
|
||
| if not url: | ||
| return "" | ||
| parsed = urlparse(str(url).strip().lower()) | ||
| netloc, path = parsed.netloc, parsed.path | ||
| if not netloc and path: | ||
| netloc, _, rest = path.partition("/") | ||
| path = "/" + rest if rest else "" | ||
| path = path.rstrip("/") | ||
| pairs = [ | ||
| (k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True) | ||
| if not k.startswith("utm_") | ||
| ] | ||
| pairs.sort() | ||
| return urlunparse(("", netloc, path, "", urlencode(pairs), "")) | ||
|
|
There was a problem hiding this comment.
URL fragment dropped—inconsistent with standalone dedup and may cause incorrect merges.
The inline _dedup_normalize_url drops URL fragments (line 187 returns empty fragment), but scripts/dedup.py:normalize_url preserves them. Per the comment in dedup.py lines 68-69, myruntime uses #/event-slug to differentiate events on the same base URL.
With the current implementation, two distinct myruntime events like:
http://site.com/events#/gala-nighthttp://site.com/events#/concert-series
Would both normalize to //site.com/events and be incorrectly merged by the inline dedup.
🐛 Proposed fix: preserve fragment to match standalone behavior
- return urlunparse(("", netloc, path, "", urlencode(pairs), ""))
+ return urlunparse(("", netloc, path, "", urlencode(pairs), parsed.fragment))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _dedup_normalize_url(url: str | None) -> str: | |
| """Scheme-less, UTM-stripped, query-sorted, slash-trimmed URL key. | |
| Unlike ``_normalize_url`` (which preserves the scheme for organizer | |
| resolution), the dedup path drops the scheme so ``http://`` and ``https://`` | |
| collapse together — matching scripts/dedup.normalize_url. | |
| """ | |
| from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse | |
| if not url: | |
| return "" | |
| parsed = urlparse(str(url).strip().lower()) | |
| netloc, path = parsed.netloc, parsed.path | |
| if not netloc and path: | |
| netloc, _, rest = path.partition("/") | |
| path = "/" + rest if rest else "" | |
| path = path.rstrip("/") | |
| pairs = [ | |
| (k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True) | |
| if not k.startswith("utm_") | |
| ] | |
| pairs.sort() | |
| return urlunparse(("", netloc, path, "", urlencode(pairs), "")) | |
| def _dedup_normalize_url(url: str | None) -> str: | |
| """Scheme-less, UTM-stripped, query-sorted, slash-trimmed URL key. | |
| Unlike ``_normalize_url`` (which preserves the scheme for organizer | |
| resolution), the dedup path drops the scheme so ``http://`` and ``https://`` | |
| collapse together — matching scripts/dedup.normalize_url. | |
| """ | |
| from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse | |
| if not url: | |
| return "" | |
| parsed = urlparse(str(url).strip().lower()) | |
| netloc, path = parsed.netloc, parsed.path | |
| if not netloc and path: | |
| netloc, _, rest = path.partition("/") | |
| path = "/" + rest if rest else "" | |
| path = path.rstrip("/") | |
| pairs = [ | |
| (k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True) | |
| if not k.startswith("utm_") | |
| ] | |
| pairs.sort() | |
| return urlunparse(("", netloc, path, "", urlencode(pairs), parsed.fragment)) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/backend/events/scrapers/base.py` around lines 165 - 188, The
`_dedup_normalize_url` function is dropping URL fragments when reconstructing
the normalized URL in its return statement, but the standalone dedup script in
`scripts/dedup.py` preserves fragments. This causes inconsistency where distinct
myruntime events with different fragments (e.g., `#/gala-night` vs
`#/concert-series`) incorrectly merge into the same event. Extract the fragment
from the parsed URL object (available from the `urlparse` result) and include it
in the `urlunparse` call at the return statement instead of passing an empty
string for the fragment parameter, ensuring fragments are preserved to match the
standalone dedup behavior.
| "private": true, | ||
| "scripts": { | ||
| "dev": "venv/bin/python manage.py runserver", | ||
| "dev": "..\\..\\.venv\\Scripts\\python manage.py runserver", |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Get repository structure overview
cd /tmp && find . -maxdepth 3 -type d -name ".venv" -o -type d -name "venv" 2>/dev/null | head -20Repository: hdmGOAT/veent-event-scraper
Length of output: 53
🏁 Script executed:
# Check the full package.json file at apps/backend
cat -n apps/backend/package.jsonRepository: hdmGOAT/veent-event-scraper
Length of output: 337
🏁 Script executed:
# Look for any venv or Python environment setup
git ls-files | grep -E "(venv|\.venv|requirements\.txt|pyproject\.toml|Pipfile)" | head -20Repository: hdmGOAT/veent-event-scraper
Length of output: 101
🏁 Script executed:
# Check for any platform-specific handling in scripts or config
rg "cross-env|process\.platform|os\.(win32|unix)" --type json --type js --type tsRepository: hdmGOAT/veent-event-scraper
Length of output: 53
🏁 Script executed:
# Check repository root structure to understand venv location relative to apps/backend
ls -la | head -20
ls -la apps/ 2>/dev/null | head -20Repository: hdmGOAT/veent-event-scraper
Length of output: 1445
🏁 Script executed:
# Check if .venv exists at project root or elsewhere
ls -la | grep -i venvRepository: hdmGOAT/veent-event-scraper
Length of output: 53
🏁 Script executed:
# Check README for platform requirements and setup instructions
cat README.mdRepository: hdmGOAT/veent-event-scraper
Length of output: 1934
🏁 Script executed:
# Check git history to see what the original dev script was
git log --oneline -10 apps/backend/package.json
git show HEAD:apps/backend/package.json 2>/dev/null || git show HEAD~1:apps/backend/package.json 2>/dev/nullRepository: hdmGOAT/veent-event-scraper
Length of output: 301
Dev script uses Windows-only path syntax; breaks cross-platform development.
The dev script at line 5 uses Windows-specific backslashes and Scripts/ subdirectory (..\\..\\.venv\\Scripts\\python manage.py runserver), which will fail on Unix/macOS systems. This conflicts with the README setup instructions, which use Unix conventions (source venv/bin/activate with forward slashes).
Fix by:
- Using forward slashes that work on all platforms:
.venv/bin/python manage.py runserveror../../.venv/bin/python manage.py runserver - Or adopting a cross-platform approach with
cross-envor platform-specific scripts (dev:win,dev:unix)
Ensure the venv path matches the actual project structure and documented setup process.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/backend/package.json` at line 5, The `dev` script in package.json uses
Windows-specific backslash path syntax and the Windows-specific `Scripts/`
directory structure, which prevents Unix/macOS developers from running the
development server. Replace the Windows path `..\\..\\.venv\\Scripts\\python
manage.py runserver` with a cross-platform compatible path using forward slashes
(e.g., `../../.venv/bin/python manage.py runserver` or `.venv/bin/python
manage.py runserver`), or alternatively use a package like `cross-env` or create
separate platform-specific script entries (`dev:win` and `dev:unix`) to handle
both Windows and Unix environments. Ensure the updated path matches the actual
project structure and venv location described in the README setup instructions.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
apps/frontend/src/lib/types.ts (1)
152-155: ⚡ Quick winNarrow
DedupResult.entityto known literals.Line 154 should mirror backend constraints to keep client-side state type-safe.
Suggested type refinement
export interface DedupResult { output: string; - entity: string; + entity: 'events' | 'venues' | 'organizers' | 'all'; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/frontend/src/lib/types.ts` around lines 152 - 155, The `entity` property in the `DedupResult` interface is currently typed as a generic string, but it should be narrowed to a union of specific literal string values to match backend constraints and ensure type safety. Update the `entity` field in the `DedupResult` interface to use literal union types (like "entityType1" | "entityType2") instead of the broad string type, making sure the allowed values correspond to what the backend supports.apps/backend/events/views.py (1)
629-635: Move dedup execution off the request thread.Running a potentially long subprocess inline (up to 120s) ties up web workers. Queueing this as a background job (and returning a job id) will improve resilience under concurrent admin actions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/views.py` around lines 629 - 635, The subprocess.run call that executes deduplicate.py is running synchronously on the request thread and blocking for up to 120 seconds, which can starve web workers. Instead of executing the subprocess inline, queue this operation as a background job (using your application's job queue system like Celery) and return a job identifier to the client immediately. Move the subprocess.run call with its parameters (the scripts_dir path, deduplicate.py script, entity argument, and 120-second timeout) into a separate background task, and have the view method return the job id or status response without waiting for completion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/events/views.py`:
- Around line 615-646: The api_dedup_trigger endpoint is missing authentication
protection and exposes internal error details. First, add an authentication
check at the beginning of the function that validates the X-Scraper-Key header
against _WEBHOOK_SECRET, similar to how scraper_webhook and
ingest_events_webhook protect their endpoints. Second, replace the two error
returns that currently expose internal details (line 640 returning stderr and
line 645 returning str(exc)) with generic error messages like "Dedup operation
failed" or "An error occurred". For both cases, capture the actual error details
and log them server-side using proper logging instead of exposing them in the
JsonResponse.
- Around line 629-636: The subprocess.run call for deduplicate.py in the
api_dedup_trigger view lacks synchronization, allowing concurrent requests to
execute overlapping dedup operations that can corrupt data integrity. Wrap the
entire subprocess.run block (including the call with timeout=120 and cwd
parameters) with a process-level lock mechanism using either a file-based lock
or a threading.Lock/multiprocessing.Lock to ensure only one dedup execution runs
at a time. Acquire the lock before spawning the subprocess and ensure it is
released after the process completes, handling both success and error cases
appropriately.
---
Nitpick comments:
In `@apps/backend/events/views.py`:
- Around line 629-635: The subprocess.run call that executes deduplicate.py is
running synchronously on the request thread and blocking for up to 120 seconds,
which can starve web workers. Instead of executing the subprocess inline, queue
this operation as a background job (using your application's job queue system
like Celery) and return a job identifier to the client immediately. Move the
subprocess.run call with its parameters (the scripts_dir path, deduplicate.py
script, entity argument, and 120-second timeout) into a separate background
task, and have the view method return the job id or status response without
waiting for completion.
In `@apps/frontend/src/lib/types.ts`:
- Around line 152-155: The `entity` property in the `DedupResult` interface is
currently typed as a generic string, but it should be narrowed to a union of
specific literal string values to match backend constraints and ensure type
safety. Update the `entity` field in the `DedupResult` interface to use literal
union types (like "entityType1" | "entityType2") instead of the broad string
type, making sure the allowed values correspond to what the backend supports.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 5a2ec474-d055-4f17-97d1-4f3c8d967ef6
📒 Files selected for processing (5)
apps/backend/events/urls.pyapps/backend/events/views.pyapps/frontend/src/lib/api.tsapps/frontend/src/lib/types.tsapps/frontend/src/routes/scrapers/+page.svelte
…row DedupResult entity type Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/backend/events/views.py (1)
670-695: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winApply consistent auth and error handling to
api_script_trigger.This endpoint lacks the auth check and error sanitization that were just added to
api_dedup_trigger. Line 695 returnsstr(exc)which can leak internal paths or stack details. For consistency:Suggested fix
`@csrf_exempt` `@require_POST` def api_script_trigger(request, script_name: str): + key = request.headers.get("X-Scraper-Key", "") + if not _WEBHOOK_SECRET or key != _WEBHOOK_SECRET: + return JsonResponse({"error": "unauthorized"}, status=401) + ... try: process = subprocess.Popen(...) return JsonResponse({"started": True, "script": script_name, "pid": process.pid}) - except Exception as exc: # noqa: BLE001 - return JsonResponse({"error": str(exc)}, status=500) + except Exception: # noqa: BLE001 + logger.exception("script trigger error (script=%s)", script_name) + return JsonResponse({"error": "Failed to start script"}, status=500)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/views.py` around lines 670 - 695, The api_script_trigger function is missing the authentication check and error sanitization that were applied to api_dedup_trigger. Add the same auth check decorator or guard that protects api_dedup_trigger at the start of api_script_trigger, and replace the raw str(exc) error response in the except block with a sanitized error message that does not leak internal paths or stack details, consistent with the error handling approach used in api_dedup_trigger.
🧹 Nitpick comments (1)
apps/backend/events/views.py (1)
637-659: ⚖️ Poor tradeoffSingle-flight lock is process-local; consider multi-worker implications.
The
threading.Lockcorrectly prevents concurrent dedup within a single process. However, production deployments with multiple workers (gunicorn, uWSGI) each have their own lock instance, so concurrent requests to different workers could still overlap.If overlapping dedup runs are dangerous (they likely are given the merge/delete operations), consider:
- A database-level advisory lock (
SELECT pg_advisory_lock(...))- A file-based lock (
fcntl.flock)- A Redis-based distributed lock
That said, the 120s timeout + typical request patterns may make collisions rare in practice.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/views.py` around lines 637 - 659, The `_DEDUP_LOCK` used in the dedup trigger view is a process-local threading.Lock, which only prevents concurrent execution within a single worker process. In multi-worker deployments (gunicorn, uWSGI), each worker has its own lock instance, allowing concurrent dedup runs across different workers which could cause data corruption. Replace the threading.Lock mechanism with a distributed locking solution such as a database-level advisory lock using PostgreSQL's pg_advisory_lock, a file-based lock using fcntl.flock, or a Redis-based distributed lock, depending on your infrastructure constraints and preferences.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/events/views.py`:
- Around line 626-628: The authentication check at the beginning of this view
requires the X-Scraper-Key header which the frontend's generic post() helper
cannot provide. To fix this, you need to implement a different authentication
mechanism for frontend calls while preserving the webhook authentication for
n8n. Create a separate endpoint (or modify this one to handle both cases) that
uses Django's `@staff_member_required` decorator for the frontend UI calls, while
keeping the X-Scraper-Key header validation for n8n webhook calls. You can
determine which authentication path to use by checking the request origin (e.g.,
if the header is present use webhook auth, otherwise use session/staff auth).
This allows the frontend to authenticate via session while keeping webhook calls
secure with the secret header.
---
Outside diff comments:
In `@apps/backend/events/views.py`:
- Around line 670-695: The api_script_trigger function is missing the
authentication check and error sanitization that were applied to
api_dedup_trigger. Add the same auth check decorator or guard that protects
api_dedup_trigger at the start of api_script_trigger, and replace the raw
str(exc) error response in the except block with a sanitized error message that
does not leak internal paths or stack details, consistent with the error
handling approach used in api_dedup_trigger.
---
Nitpick comments:
In `@apps/backend/events/views.py`:
- Around line 637-659: The `_DEDUP_LOCK` used in the dedup trigger view is a
process-local threading.Lock, which only prevents concurrent execution within a
single worker process. In multi-worker deployments (gunicorn, uWSGI), each
worker has its own lock instance, allowing concurrent dedup runs across
different workers which could cause data corruption. Replace the threading.Lock
mechanism with a distributed locking solution such as a database-level advisory
lock using PostgreSQL's pg_advisory_lock, a file-based lock using fcntl.flock,
or a Redis-based distributed lock, depending on your infrastructure constraints
and preferences.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: edb1a7e5-ad42-44f6-8c2f-c1b5208fd78a
📒 Files selected for processing (2)
apps/backend/events/views.pyapps/frontend/src/lib/types.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- apps/frontend/src/lib/types.ts
read docs\deduplication
Summary by CodeRabbit
--dry-run, per-entity execution, and verbose output/summary.