feat(organizers): crawler-based enrichment (no API keys) #21
- Add enriched_at and enrichment_source fields to Organizer model
- Fix save_organizers() to merge-not-overwrite existing contact data
- Add DIFFBOT_API_KEY and HUNTER_API_KEY to settings
- New management command: enrich_organizers (--limit, --dry-run, --force, --delay)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes Diffbot and Hunter.io API dependencies in favor of a self-contained HTML crawler. Adds contact_extractor.py as a shared extraction helper and rewrites the enrich_organizers management command to use direct HTTP crawling. Drops DIFFBOT_API_KEY and HUNTER_API_KEY from settings.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…et encoding Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough
Adds enrichment tracking fields (…
Changes
Organizer Enrichment Feature
Sequence Diagram
sequenceDiagram
participant CLI as enrich_organizers<br/>handle
participant DB as Organizer DB
participant Safe as _is_safe_public_url
participant HTTP as _http_get
participant Stealth as _stealth_get
participant Extract as extract_contact_info
CLI->>DB: Query eligible organizers
loop each organizer
alt no website
CLI->>DB: Save enrichment_source=skipped_no_website
else has website
CLI->>Safe: Check if public URL
alt unsafe/private
CLI-->>CLI: Skip without enrichment
else safe
CLI->>HTTP: GET homepage (plain HTTP)
alt HTTP 200
HTTP-->>CLI: HTML text
else error or non-200
CLI->>Stealth: StealthyFetcher.fetch
Stealth-->>CLI: HTML or None
end
alt HTML obtained
CLI->>Extract: extract_contact_info(homepage)
Extract-->>CLI: contact dict
CLI->>HTTP: GET /contact subpage
HTTP-->>CLI: HTML or None
alt HTML received
CLI->>Extract: extract_contact_info(/contact)
Extract-->>CLI: merged fields
end
CLI->>HTTP: GET /about subpage
HTTP-->>CLI: HTML or None
alt HTML received
CLI->>Extract: extract_contact_info(/about)
Extract-->>CLI: merged fields
end
CLI->>DB: Save only previously empty fields<br/>(update_fields restricted)
else failed to fetch
CLI-->>CLI: Warn and skip organizer
end
end
end
CLI-->>CLI: Apply per-organizer delay
end
CLI-->>CLI: Print enriched count summary
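The flow in the diagram can be sketched in plain Python as follows. This is a minimal illustration, not the command's actual code: the function names `crawl_contact_data` and `fill_blank_fields` are hypothetical, the fetchers and extractor are injected as callables standing in for `_http_get`, `_stealth_get`, and `extract_contact_info`, and two behaviors are assumptions (homepage values win over subpage values when both are present, and subpage URLs are safety-checked as well).

```python
from typing import Callable, Optional
from urllib.parse import urljoin

Fetcher = Callable[[str], Optional[str]]
SUBPAGES = ("/contact", "/about")


def crawl_contact_data(
    website: str,
    fetch_plain: Fetcher,
    fetch_stealth: Fetcher,
    extract: Callable[[str], dict],
    is_safe: Callable[[str], bool],
) -> Optional[dict]:
    """Return merged contact data, or None if the homepage is unsafe or unreachable."""
    if not is_safe(website):
        return None
    # Plain HTTP first; fall back to the stealth fetcher only if that fails.
    html = fetch_plain(website) or fetch_stealth(website)
    if not html:
        return None
    data = dict(extract(html))
    for path in SUBPAGES:
        url = urljoin(website, path)
        if not is_safe(url):
            continue
        sub_html = fetch_plain(url)
        if sub_html:
            # Assumption: values already found on the homepage are kept.
            for key, value in extract(sub_html).items():
                data.setdefault(key, value)
    return data


def fill_blank_fields(current: dict, data: dict) -> list:
    """Copy values into blank fields only; return the names that changed."""
    changed = []
    for field, value in data.items():
        if value and not current.get(field):
            current[field] = value
            changed.append(field)
    return changed
```

In the real command the list returned by `fill_blank_fields` would correspond to the restricted `update_fields` passed to the save step shown at the end of the diagram.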
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/events/management/commands/enrich_organizers.py`:
- Around line 88-113: The command-line argument parsing for --limit and --delay
does not validate input bounds, allowing invalid values to cause issues. Add
validation to the --limit and --delay argument definitions in the add_arguments
method to ensure --limit is a positive integer and --delay is a non-negative
float. You can use the type parameter with a custom validation function or add a
choices constraint. Alternatively, add validation checks in the handle method
before using these options to reject invalid values and raise a CommandError
with a descriptive message explaining the valid range for each argument.
- Around line 144-160: The code is vulnerable to SSRF attacks by fetching
untrusted Organizer.website URLs without validation. Before each call to
_http_get() and _stealth_get() in this command (including the initial homepage
fetch with org.website and the subpage fetches with urljoin), validate the URL
using _is_safe_public_url() to ensure it targets only safe public endpoints. If
the URL fails the safety check, skip processing that organizer similar to how
the code currently handles failed homepage fetches. This applies to both the
initial org.website validation and the constructed subpage URLs in the loop that
checks paths like "/contact" and "/about".
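A minimal sketch of the kind of check the SSRF finding asks for, assuming only the standard library. It is not the PR's `_is_safe_public_url` (whose body is not shown here); the name `is_safe_public_url` and the exact policy (http/https only, every resolved address must be globally routable) are assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_public_url(url: str) -> bool:
    """Return True only for http(s) URLs whose host resolves solely to public IPs."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            return False
        # Resolve every address the hostname maps to (IPv4 and IPv6).
        infos = socket.getaddrinfo(parsed.hostname, None)
        for info in infos:
            # Drop any IPv6 zone suffix such as "%eth0" before parsing.
            address = info[4][0].split("%", 1)[0]
            if not ipaddress.ip_address(address).is_global:
                return False
    except (OSError, UnicodeError, ValueError):
        # Resolver failures and malformed URLs are treated as unsafe.
        return False
    return bool(infos)
```

A check like this validates the address at check time only. The later HTTP request resolves the name again and may follow redirects, so a stricter setup would also validate redirect targets or pin the resolved address.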
In `@apps/backend/events/models.py`:
- Around line 188-195: The help_text for both the enriched_at and
enrichment_source fields in the organizer model references external APIs like
"diffbot,hunter" but the code now uses crawler-based enrichment instead. Update
the help_text for the enriched_at field to remove the reference to "external
API" and replace it with crawler-based language, and update the help_text for
the enrichment_source field to reflect crawler-oriented values instead of the
example API names "diffbot,hunter".
In `@apps/backend/events/scrapers/base.py`:
- Around line 286-295: The always_update set on line 286 includes external_id
and source_url, which causes these identifier fields to be unconditionally
overwritten even when the new contact_fields data does not contain them. This
erases previously known identifiers. Remove external_id and source_url from the
always_update set so that these fields only get updated when they have actual
values in contact_fields and the existing record is blank, matching the behavior
of the other optional fields handled in the elif branch.
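The `always_update` finding can be illustrated with plain dictionaries. This is a sketch under assumptions, not the code in `base.py`: the function name `merge_contact_fields`, the `FIELDS` tuple, and the shape of the loop are hypothetical, chosen only to show why an identifier listed in `always_update` is erased when the incoming data omits it.

```python
FIELDS = ("external_id", "source_url", "email", "phone")  # illustrative subset


def merge_contact_fields(existing, contact_fields, always_update=frozenset()):
    """Merge contact_fields into existing; return the names of changed fields."""
    changed = []
    for field in FIELDS:
        value = contact_fields.get(field, "")
        if field in always_update:
            # Unconditional overwrite: a missing value becomes "" and
            # replaces whatever was stored before.
            if existing.get(field) != value:
                existing[field] = value
                changed.append(field)
        elif value and not existing.get(field):
            # Additive merge: only fill fields that are currently blank.
            existing[field] = value
            changed.append(field)
    return changed


# With external_id in always_update, a scrape that lacks it wipes the stored id.
record = {"external_id": "abc", "email": ""}
merge_contact_fields(record, {"email": "a@example.org"}, frozenset({"external_id"}))

# With it removed from always_update, the stored id survives.
kept = {"external_id": "abc", "email": ""}
merge_contact_fields(kept, {"email": "a@example.org"})
```

After the two calls, `record["external_id"]` is empty while `kept["external_id"]` still holds `"abc"`, which is the behavior the review asks for.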
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 70df2b80-3b8a-4eee-955e-746e41cf7940
📒 Files selected for processing (6)
- apps/backend/config/settings.py
- apps/backend/events/management/commands/enrich_organizers.py
- apps/backend/events/migrations/0017_organizer_enrichment_fields.py
- apps/backend/events/models.py
- apps/backend/events/scrapers/base.py
- apps/backend/events/scrapers/contact_extractor.py
💤 Files with no reviewable changes (1)
- apps/backend/config/settings.py
Actionable comments posted: 1
🧹 Nitpick comments (1)
apps/backend/events/scrapers/contact_extractor.py (1)
46-53: ⚡ Quick win: Prefer hostname-based social detection to avoid false positives.
Substring matching can capture unrelated URLs that merely contain `facebook.com/` or `instagram.com/` in query/path text. Parse and validate the hostname instead.
Proposed refactor
+from urllib.parse import urlparse
...
- elif "facebook.com/" in lower and "facebook_url" not in result:
-     normalized = _normalize_social(href)
+ elif "facebook_url" not in result:
+     normalized = _normalize_social(href)
+     host = (urlparse(normalized).hostname or "").lower()
+     if host.endswith("facebook.com"):
+         result["facebook_url"] = normalized
-     if normalized:
-         result["facebook_url"] = normalized
- elif "instagram.com/" in lower and "instagram_url" not in result:
-     normalized = _normalize_social(href)
-     if normalized:
-         result["instagram_url"] = normalized
+ elif "instagram_url" not in result:
+     normalized = _normalize_social(href)
+     host = (urlparse(normalized).hostname or "").lower()
+     if host.endswith("instagram.com"):
+         result["instagram_url"] = normalized
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/scrapers/contact_extractor.py` around lines 46 - 53, Replace the substring-based social media detection logic in the elif conditions that check for "facebook.com/" and "instagram.com/" with hostname-based validation. Instead of using the in operator to search for these strings anywhere in the lowercased URL, parse the href using a URL parsing library to extract the actual hostname and compare it against the expected social media domains. This approach prevents false positives where these domain names might appear in query parameters or path segments. Apply this change to both the Facebook and Instagram detection blocks.
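A standalone sketch of hostname-based matching, assuming only `urllib.parse`. The helper name `social_field_for` and the `SOCIAL_DOMAINS` mapping are hypothetical and not part of `contact_extractor.py`. The sketch matches the exact domain or a true subdomain, since a bare `endswith("facebook.com")` would also accept a host such as `notfacebook.com`.

```python
from typing import Optional
from urllib.parse import urlparse

# Result key -> registered domain; the keys mirror the result dict in the review.
SOCIAL_DOMAINS = {
    "facebook_url": "facebook.com",
    "instagram_url": "instagram.com",
}


def social_field_for(href: str) -> Optional[str]:
    """Return the result key for a social link, or None if the host does not match."""
    try:
        host = (urlparse(href).hostname or "").lower()
    except ValueError:
        # Raised for malformed netlocs such as an unterminated IPv6 literal.
        return None
    for field, domain in SOCIAL_DOMAINS.items():
        if host == domain or host.endswith("." + domain):
            return field
    return None
```

Hrefs without a scheme (for example `facebook.com/acme`) have no parsed hostname and are not matched here, so this check would need to run after whatever normalization the extractor applies.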
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/events/scrapers/contact_extractor.py`:
- Around line 81-84: The `city` and `country` values being assigned to the
result dictionary in the contact_extractor code are not being truncated to match
the `CharField(max_length=120)` constraints of the Organizer model fields.
Modify the code where `result["city"]` and `result["country"]` are assigned by
applying string truncation (limiting to 120 characters) to the
`_stringify(locality)` and `_stringify(country)` values to ensure they conform
to the database field constraints before saving.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 332a32ca-9b04-47e4-9aef-4e94658b131a
📒 Files selected for processing (2)
- apps/backend/events/management/commands/enrich_organizers.py
- apps/backend/events/scrapers/contact_extractor.py
🚧 Files skipped from review as they are similar to previous changes (1)
- apps/backend/events/management/commands/enrich_organizers.py
if locality and "city" not in result:
    result["city"] = _stringify(locality)
if country and "country" not in result:
    result["country"] = _stringify(country)
Bound city/country to model field lengths before saving.
Organizer.city and Organizer.country are CharField(max_length=120), but extracted values are not truncated. Oversized JSON-LD values can break enrichment saves at runtime.
Proposed fix
  if address:
      street = address.get("streetAddress")
      locality = address.get("addressLocality")
      country = address.get("addressCountry")
      if street and "address" not in result:
          result["address"] = _stringify(street)[:500]
      if locality and "city" not in result:
-         result["city"] = _stringify(locality)
+         result["city"] = _stringify(locality)[:120]
      if country and "country" not in result:
-         result["country"] = _stringify(country)
+         result["country"] = _stringify(country)[:120]
…fix always_update set Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pdate Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/backend/events/management/commands/enrich_organizers.py (2)
198-203: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Validate extracted values against model field lengths before saving.
Crawler output is untrusted, and several target fields are bounded (`phone`, `city`, `country`, `address`, URL/email fields). An overlong value can make `org.save(...)` fail and abort the command.
Suggested fix
  changed_fields = []
  for field in _CONTACT_FIELDS:
      value = data.get(field)
      if value and not getattr(org, field):
+         max_length = getattr(Organizer._meta.get_field(field), "max_length", None)
+         if max_length is not None and len(value) > max_length:
+             self.stdout.write(
+                 self.style.WARNING(
+                     f" → skipped {field}: extracted value exceeds {max_length} chars"
+                 )
+             )
+             continue
          setattr(org, field, value)
          changed_fields.append(field)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/management/commands/enrich_organizers.py` around lines 198 - 203, The issue is that crawler output values are not validated against model field length constraints before being assigned to the organizer object in the loop that iterates through _CONTACT_FIELDS. Before calling setattr on the org object with the value extracted from data.get(field), you need to validate that the value does not exceed the maximum length defined on the corresponding model field. Check the field's max_length attribute from the model definition and only proceed with setattr and appending to changed_fields if the value length is within the allowed bounds. This will prevent org.save() from failing due to oversized field values from untrusted crawler output.
159-161: ⚠️ Potential issue | 🟡 Minor
Include `updated_at` in these targeted saves to maintain consistent modification timestamps.
The `Organizer.updated_at` field uses `auto_now=True`, but when `update_fields` is explicitly specified, Django bypasses `auto_now` behavior unless the field is included in the list. These saves at lines 161 and 209 modify the organizer (setting enrichment state) without including `updated_at`, leaving the modification timestamp stale. The rest of the codebase consistently includes `updated_at` in `update_fields` for all model updates (views.py, runner.py, run_scraper_job.py, tests.py). Update both save calls to include `updated_at`:
- Line 161: `org.save(update_fields=["enriched_at", "enrichment_source", "updated_at"])`
- Line 209: `org.save(update_fields=changed_fields)` should have `"updated_at"` added to `changed_fields` before the save
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/management/commands/enrich_organizers.py` around lines 159 - 161, The Organizer model's auto_now=True behavior on the updated_at field is bypassed when update_fields is explicitly specified in the save() call. Add "updated_at" to the update_fields list in both save operations to maintain consistent modification timestamps across the codebase. At line 161, include "updated_at" in the list with "enriched_at" and "enrichment_source". At line 209, ensure "updated_at" is added to the changed_fields collection before the save call. This aligns with the consistent pattern used throughout the rest of the codebase in views.py, runner.py, run_scraper_job.py, and tests.py.
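A small sketch of the line 209 case: building the `update_fields` list so that `updated_at` is written whenever another field changed. The helper name `fields_to_save` is hypothetical and not from the PR. It only manipulates a list; the `org.save(update_fields=...)` call itself stays in the command.

```python
from typing import Iterable, List


def fields_to_save(changed_fields: Iterable[str]) -> List[str]:
    """Return an update_fields list that also persists updated_at.

    Returns an empty list when nothing changed, so the caller can skip the save.
    """
    fields = list(dict.fromkeys(changed_fields))  # de-duplicate, keep order
    if fields and "updated_at" not in fields:
        fields.append("updated_at")
    return fields
```

A caller would then guard the save with `if fields:` before passing the list as `update_fields`.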
🧹 Nitpick comments (1)
apps/backend/events/management/commands/enrich_organizers.py (1)
61-62: ⚡ Quick win: Catch only expected URL/DNS failures here.
Ruff flags the blanket `Exception`; narrow it to resolver/parsing errors so unrelated defects are not silently converted into “unsafe URL.”
Suggested fix
- except Exception:
+ except (OSError, UnicodeError, ValueError):
      return False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/events/management/commands/enrich_organizers.py` around lines 61 - 62, The broad Exception catch on line 61-62 is masking unrelated errors by silently converting them to false. Instead of catching all exceptions in the try block (which appears to be validating organizer URLs), catch only the specific exceptions that represent expected URL/DNS resolution failures such as socket.gaierror for DNS failures and urllib-related exceptions for parsing errors. This way, unexpected errors will properly surface rather than being silently suppressed.
Source: Linters/SAST tools
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 9a207eee-1e52-4bc2-b663-4f4e9f301805
📒 Files selected for processing (3)
- apps/backend/events/management/commands/enrich_organizers.py
- apps/backend/events/models.py
- apps/backend/events/scrapers/base.py
🚧 Files skipped from review as they are similar to previous changes (1)
- apps/backend/events/scrapers/base.py
Summary
- `contact_extractor.py`: shared HTML parser that extracts email, phone, Facebook/Instagram URLs, description, and city/country from any webpage
- `enrich_organizers` management command rewritten to crawl organizer websites instead of calling Diffbot/Hunter APIs; no API keys required
- `requests.get()` first, `StealthyFetcher` (headless Playwright) fallback for Cloudflare-protected sites
- `/contact` + `/about` subpages; only fills blank fields (additive merge, never clobbers existing data)
- `DIFFBOT_API_KEY` and `HUNTER_API_KEY` dropped from settings
- `resp.apparent_encoding`
Test plan
- `manage.py enrich_organizers --dry-run --limit 10`: only shows orgs with websites
- `manage.py enrich_organizers --limit 20`: 19/20 enriched, emails/phones/descriptions filled
🤖 Generated with Claude Code
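The `--limit` and `--delay` options used in the test plan are the ones the first review asked to bound. A minimal sketch of that validation with `argparse` type callables follows. It uses a plain `ArgumentParser` so it runs without Django; in a management command the same `add_argument` calls would go in `add_arguments(self, parser)`. The helper names and the default values are assumptions, not taken from the PR.

```python
import argparse
import math


def positive_int(raw: str) -> int:
    """argparse type: accept integers >= 1 only."""
    try:
        value = int(raw)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{raw!r} is not an integer")
    if value < 1:
        raise argparse.ArgumentTypeError("must be a positive integer (>= 1)")
    return value


def non_negative_float(raw: str) -> float:
    """argparse type: accept finite floats >= 0 only."""
    try:
        value = float(raw)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{raw!r} is not a number")
    if not math.isfinite(value) or value < 0:
        raise argparse.ArgumentTypeError("must be a finite number >= 0")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="enrich_organizers")
    parser.add_argument("--limit", type=positive_int, default=None)  # default assumed
    parser.add_argument("--delay", type=non_negative_float, default=1.0)  # default assumed
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--force", action="store_true")
    return parser
```

With type callables, invalid input is rejected during parsing with a usage error, so `handle` never sees an out-of-range value.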
Summary by CodeRabbit
New Features
Improvements
Chores