Skip to content

feat(scrapers): run scrapers from the UI with live status, logs, and cancellation#7

Merged
hdmGOAT merged 10 commits into
developmentfrom
feat/run-scrapers-and-logs
Jun 17, 2026
Merged

feat(scrapers): run scrapers from the UI with live status, logs, and cancellation#7
hdmGOAT merged 10 commits into
developmentfrom
feat/run-scrapers-and-logs

Conversation

@hdmGOAT

@hdmGOAT hdmGOAT commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Summary

  • UI-triggered scraper runs: Run any registered scraper (or all at once) directly from the Scraper Center via POST to /api/scrapers/<key>/run/ or /api/scrapers/run-all/. Run buttons and a "Run All" header action are wired in the SvelteKit frontend.
  • Live polling: The Scraper Center polls /api/scrapers/runs/active/ every 2.5 s while any run is queued/running. The spinner stops and history auto-refreshes when all active runs complete.
  • ScraperRun table: New ScraperRun model (migrations 0010, 0011) tracks every run with status (queued / running / success / failed / cancelled), timestamps, item counts, error_message (full traceback on failure), and pid for process-kill cancellation.
  • Subprocess-based cancellation: The worker is spawned with start_new_session=True (POSIX setsid). cancel_run() sends SIGTERM to the entire process group via os.killpg, which also kills Playwright's chromium. A SELECT FOR UPDATE race guard prevents overwriting a worker that finished in the same instant.
  • New API endpoints: GET /api/scrapers/runs/, GET /api/scrapers/runs/active/, GET /api/scrapers/runs/<id>/, POST /api/scrapers/runs/<id>/cancel/ — all standard JSON; existing GET /api/scrapers/ enriched with a last_run field per key.
  • ScraperRunAdmin: registered as a read-only history view (no add/delete).
  • 84 backend tests passing (63 new: model, runner, API endpoints). svelte-check and build clean.

Auth posture / security follow-up

The three mutation endpoints (/run/, /run-all/, /cancel/) are @csrf_exempt and unauthenticated. The SvelteKit frontend has no Django session — there is no CSRF surface from a browser cross-site attack on this internal-only tool during development. The Vite proxy makes all /api/* calls same-origin in dev. These endpoints must be re-gated behind real auth when user authentication is added (planned Phase 2).

Not yet verified live

  • Real process-kill smoke test (cancel a running Playwright scraper and confirm chromium exits).
  • Header date / auto-refresh visually in-browser after a full run completes.

Test plan

  • cd apps/backend && ./venv/bin/python manage.py test events — all 84 tests pass
  • cd apps/frontend && pnpm svelte-check — no type errors
  • cd apps/frontend && pnpm build — build succeeds
  • Start Django (./venv/bin/python manage.py runserver) + SvelteKit dev (pnpm dev), open Scraper Center, trigger a fast scraper, verify status updates live
  • Trigger a run, then cancel before it finishes, verify status flips to cancelled
  • Trigger Run All, verify one row per registered scraper appears in the history table

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added scraper run management in the Scraper Center: trigger individual runs and “Run All”
    • Added active-run tracking with live polling, run history, and detailed run views (counts, timing, and errors)
    • Added cancellation for active runs, including status updates and safe handling of races
  • UI/UX

    • Updated status badges (now supports queued/running/success/failed/cancelled with optional category styling)
  • Tests

    • Expanded coverage for the scraper-run lifecycle, API behavior, and cancellation/trigger edge cases

hdmGOAT and others added 6 commits June 17, 2026 11:34
Introduces a ScraperRun model that serves as both the live job table
(while queued/running) and durable run history. Statuses: queued,
running, success, failed, cancelled. Stores pid for subprocess-kill
cancellation, triggered_by FK to auth.User, duration helpers, and
extra_counts for per-scraper summary fields.

Migrations 0010 (initial ScraperRun) and 0011 (pid + CANCELLED status
for subprocess cancellation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation

runner.py: trigger_scraper_run() guards against duplicate active runs,
creates a ScraperRun row, and spawns manage.py run_scraper_job as an
independent subprocess (start_new_session=True / POSIX setsid) so the
entire process tree — including Playwright's chromium — can be killed
wholesale. cancel_run() sends SIGTERM to the process group via
os.killpg, and uses SELECT FOR UPDATE inside a transaction to safely
race the worker writing SUCCESS/FAILED simultaneously.

run_scraper_job management command: executed in the worker subprocess,
looks up the ScraperRun row, runs the scraper, and writes status,
counts, and error_message back to the DB row.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erRunAdmin

views.py: six new endpoints — POST /api/scrapers/<key>/run/ (trigger),
POST /api/scrapers/run-all/, GET /api/scrapers/runs/, GET
/api/scrapers/runs/active/, GET /api/scrapers/runs/<id>/, POST
/api/scrapers/runs/<id>/cancel/. api_scrapers() enriched with
last_run from latest ScraperRun per key (single DISTINCT ON query).
All mutation endpoints are @csrf_exempt with a documented security note
— re-evaluate when real auth exists (the SvelteKit frontend has no
Django session today).

urls.py: routes wired with more-specific paths ordered before the
api/scrapers/ catch-all.

admin.py: ScraperRunAdmin registered as read-only history (no add/delete
permissions, all fields readonly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…PI endpoints

New test classes: ScraperRunModelTests, ScraperRunCancelledStatusTests,
RunnerSubprocessTests (TransactionTestCase — required for select_for_update
compatibility with Postgres), ScraperRunAPITests, and ScraperAdminAccessTests.

Covers: model defaults and properties, CANCELLED status and pid field,
trigger dedup (409 on active run), cancel race (select_for_update),
all six run-job API endpoints, run-all endpoint, ScraperRunAdmin
read-only permissions, and api_scrapers last_run enrichment.

All mutation tests use mock.patch on subprocess.Popen so no real worker
processes are spawned. Total test suite grows to 84 passing tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…polling

types.ts: ScraperRun, ScraperLastRun, RunAllResult interfaces; Scraper
extended with last_run and active_run fields.

api.ts: post() helper with CSRF token extraction; runScraper, runAll,
scraperRuns, activeRuns, scraperRun, cancelRun API methods.

+page.ts: parallel-loads scrapers + recentRuns on page load.

+page.svelte (Svelte 5 runes): per-card Run / Cancel buttons with
in-flight state; Run All button in header; 2.5 s polling via
setInterval that auto-stops when no active runs remain and refreshes
history + per-card last_run lines; spin indicator while active;
expandable failure tracebacks; recent-run history table with toggle
for full list; formatted duration helper.

PageHeader.svelte: optional action snippet slot — passes Run All button
in; falls back to the existing date/bell display when no action provided.

Badge.svelte: running / queued / success / failed / cancelled status
colour mappings for ScraperRun badges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mpleted plan

all-context.md (rev 3): updated current-state summary to reflect monorepo
layout, Neon Postgres, SvelteKit primary UI, 8 scrapers, ScraperRun
model; corrected path prefixes from flat-Django to apps/backend/;
added context-freshness rule.

tests/all-tests.md: updated test count to 64 (was empty placeholder),
corrected venv path to apps/backend/venv/bin/python, added
TestCase vs. TransactionTestCase guidance for runner tests.

Plans: archived completed_scraper-run-jobs_PLAN_17-06-26.md to
general-plans/completed/; new active plan
run-cancellation_PLAN_17-06-26.md for subprocess cancellation phase.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a ScraperRun Django model with a five-state lifecycle (queued/running/success/failed/cancelled), three migrations (model creation, pid/cancelled fields, unique constraint), a subprocess-based runner that spawns manage.py run_scraper_job in its own process group and cancels via SIGTERM, seven new CSRF-exempt JSON API endpoints, and replaces the static SvelteKit Scraper Center with an interactive polling UI supporting per-scraper run/cancel controls and a recent runs table.

Changes

ScraperRun Feature: Backend to Frontend

Layer / File(s) Summary
ScraperRun model, migrations, and admin registration
apps/backend/events/models.py, apps/backend/events/migrations/0010_scraperrun.py, apps/backend/events/migrations/0011_run_cancellation.py, apps/backend/events/migrations/0012_scraperrun_unique_active_constraint.py, apps/backend/events/admin.py
ScraperRun model defines Status TextChoices (queued/running/success/failed/cancelled), lifecycle timestamp fields, created/updated counters, extra_counts JSON, error_message, pid, triggered_by FK, and duration_seconds/is_active properties. Migration 0010 creates the table; migration 0011 adds pid and the cancelled choice; migration 0012 adds a UniqueConstraint on scraper_key for active runs. Admin registers ScraperRunAdmin as read-only (add/delete permissions disabled).
Subprocess runner and run_scraper_job management command
apps/backend/events/runner.py, apps/backend/events/management/commands/run_scraper_job.py
runner.py adds _map_result, trigger_scraper_run (concurrency guard → Popen(start_new_session=True) → stores pid), and cancel_run (SELECT FOR UPDATEos.killpg SIGTERM → race-safe refresh_from_db). run_scraper_job command takes --run-id, validates scraper key, transitions QUEUED→RUNNING→SUCCESS/FAILED with targeted update_fields.
Scraper run API views and URL routing
apps/backend/events/views.py, apps/backend/events/urls.py
Adds _serialize_run helper and seven CSRF-exempt endpoints: trigger by key, run-all, list runs, list active, run detail, and cancel (returning 404/409 by signal). Extends api_scrapers with a Postgres distinct("scraper_key") query to inject last_run per scraper. URLs insert new specific patterns before the existing api/scrapers/ catch-all.
Backend tests
apps/backend/events/tests.py
Adds ScraperRunModelTests, ScraperRunCancelledStatusTests, RunnerSubprocessTests (TransactionTestCase with mocked Popen/os.killpg, including SUCCESS-race guard), ScraperJobCommandTests, RunnerMappingTests, CancelEndpointTests (200/404/409/405), RunEndpointTests (trigger/run-all/public access), and ApiScrapersLastRunTests (last_run decoration and cross-key isolation).
Frontend types, API client, Badge, and PageHeader
apps/frontend/src/lib/types.ts, apps/frontend/src/lib/api.ts, apps/frontend/src/lib/components/Badge.svelte, apps/frontend/src/lib/components/PageHeader.svelte
types.ts adds ScraperRunStatus, ScraperRun, ScraperLastRun, RunAllResult, and extends Scraper with last_run/active_run. api.ts adds getCsrfToken(), typed post() with X-CSRFToken, and run-related API methods. Badge.svelte gains category prop with hash-based color and status entries for running/queued/success/failed/cancelled. PageHeader.svelte adds optional action Snippet prop.
Scraper Center interactive page
apps/frontend/src/routes/scrapers/+page.ts, apps/frontend/src/routes/scrapers/+page.svelte
+page.ts fetches scrapers and recentRuns in parallel. +page.svelte adds per-scraper Run/Cancel buttons with in-flight tracking, api.activeRuns() polling every 2.5s that refreshes cards and history when idle, expandable failure traces, Run All button, and a Recent Runs table with a 20-row limit toggle.

Process Documentation Updates

Layer / File(s) Summary
Context and test docs
process/context/all-context.md, process/context/tests/all-tests.md
all-context.md refreshed to rev 3: documents 10 scrapers, ScraperRun model with lifecycle fields, subprocess runner details (process-group SIGTERM), API endpoint table, CSRF-exempt public auth posture, frontend API client with CSRF header, and updated migration/test counts. all-tests.md updated with 97-test count, TestCase vs TransactionTestCase guidance for threaded runner tests, revised command matrix with venv/working-dir assumptions, expanded debugging reference with Neon Postgres behavior, and refreshed Known Gaps.
Plan files
process/general-plans/active/run-cancellation_PLAN_17-06-26.md, process/general-plans/completed/completed_scraper-run-jobs_PLAN_17-06-26.md
Adds the active run-cancellation plan (6-phase checklist covering CANCELLED status/pid field addition, management command lifecycle, subprocess runner rewrite with atomic cancel, cancel endpoint and URL routing, frontend cancelRun API and Cancel button, and test rewrites for subprocess behavior). Archives the completed scraper-run-jobs plan with explicit deviations record (Neon Postgres test harness instead of SQLite, public CSRF-exempt auth posture instead of staff-gated, and run-all scope addition).

Sequence Diagram(s)

sequenceDiagram
    participant Page as +page.svelte
    participant APIClient as api.ts
    participant Django as Django JSON API
    participant Runner as runner.py
    participant Subprocess as run_scraper_job subprocess

    Page->>Django: load() — api.scrapers + api.scraperRuns (parallel)
    Django-->>Page: scrapers[], recentRuns[]

    loop poll api.activeRuns() every 2.5s while active runs exist
        Page->>Django: GET /api/scrapers/runs/active/
        Django-->>Page: active ScraperRun[]
        alt no active runs left
            Page->>Django: GET /api/scrapers/runs/ + /api/scrapers/
            Django-->>Page: refreshed recentRuns + scrapers with last_run
        end
    end

    Page->>APIClient: api.runScraper(key)
    APIClient->>Django: POST /api/scrapers/key/run/ (X-CSRFToken)
    Django->>Runner: trigger_scraper_run(key)
    Runner->>Django: create ScraperRun QUEUED + Popen(start_new_session=True)
    Runner->>Django: save pid on row
    Django-->>Page: {id, status: "queued"}

    Subprocess->>Django: set RUNNING + started_at
    Subprocess->>Subprocess: SCRAPERS[key]().run()
    Subprocess->>Django: set SUCCESS/FAILED + finished_at

    Page->>APIClient: api.cancelRun(id)
    APIClient->>Django: POST /api/scrapers/runs/id/cancel/
    Django->>Runner: cancel_run(id) SELECT FOR UPDATE
    Runner->>Subprocess: os.killpg(os.getpgid(pid), SIGTERM)
    Runner->>Django: refresh_from_db → set CANCELLED + finished_at
    Django-->>Page: ScraperRun {status: "cancelled"}
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hdmGOAT/veent-event-scraper#3: This PR extends the scraper JSON API surface in apps/backend/events/views.py and apps/backend/events/urls.py—building on the prior API wiring by adding subprocess-backed run/cancel/list/detail endpoints that depend on the ScraperRun model.

Poem

🐇 Hop hop, a scraper runs free,
Spawned in a subprocess, PID stored with glee!
SIGTERM sent when Cancel is pressed,
The DB row refreshed—no race distressed.
Polling the frontend, badges aglow,
Run All, or just one—watch the statuses flow! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main feature addition: enabling scraper execution from the UI with live status tracking and cancellation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/run-scrapers-and-logs

Comment @coderabbitai help to get the list of available commands and usage tips.

@hdmGOAT hdmGOAT changed the base branch from main to development June 17, 2026 03:39
…ers-and-logs

# Conflicts:
#	apps/backend/events/tests.py
#	apps/backend/events/views.py
#	apps/frontend/src/routes/scrapers/+page.svelte
#	process/context/all-context.md
#	process/context/tests/all-tests.md

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Nitpick comments (2)
apps/frontend/src/lib/types.ts (1)

119-121: ⚡ Quick win

Tighten RunAllResult status typing to the known status union.

Line 120 currently allows any string, but this payload is constrained to scraper run statuses. Using ScraperRunStatus preserves compile-time contract checks and prevents accidental UI branch drift.

Proposed diff
 export interface RunAllResult {
-	created: { key: string; id: number; status: string }[];
+	created: { key: string; id: number; status: ScraperRunStatus }[];
 	skipped: string[];
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/frontend/src/lib/types.ts` around lines 119 - 121, In the RunAllResult
interface, the status field within the created array is typed as a plain string,
which allows any value and loses compile-time validation. Change the status
field type from string to ScraperRunStatus to enforce that only valid scraper
run statuses are used, ensuring the UI contract is correctly enforced at compile
time and preventing accidental branch drift.
apps/frontend/src/lib/api.ts (1)

78-78: ⚡ Quick win

Use ScraperRunStatus for runScraper response typing.

Line 78 uses a broad string for status; narrowing it to ScraperRunStatus keeps the client API contract strict and consistent with the rest of the scraper-run types.

Proposed diff
 import type {
 	CategoryCount,
 	EventRow,
 	Organizer,
 	OrganizerDetail,
 	Paginated,
 	RunAllResult,
 	Scraper,
+	ScraperRunStatus,
 	ScraperRun,
 	SourceCount,
 	Stats,
 	VenueRow
 } from './types';
@@
-	runScraper: (key: string) => post<{ id: number; status: string }>(`/scrapers/${key}/run/`),
+	runScraper: (key: string) =>
+		post<{ id: number; status: ScraperRunStatus }>(`/scrapers/${key}/run/`),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/frontend/src/lib/api.ts` at line 78, The runScraper function in the
api.ts file uses a generic string type for the status field in its response
type, which reduces type safety. Replace the string type with ScraperRunStatus
in the response type definition of the runScraper function (the post generic
parameter) to enforce a strict API contract and maintain consistency with other
scraper-run related types. Ensure ScraperRunStatus is imported if it is not
already available in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/backend/events/admin.py`:
- Around line 76-80: The `pid` field is missing from the readonly_fields tuple
in the admin configuration. Add `pid` to the readonly_fields tuple to prevent
manual editing of this field in the Django admin interface. This is critical
because the `cancel_run` function relies on the `pid` field value to send
SIGTERM signals to the correct process group, and allowing manual edits could
cause the signal to be sent to the wrong target. Include `pid` in the
readonly_fields tuple alongside the other read-only fields like scraper_key,
status, started_at, and finished_at.

In `@apps/backend/events/management/commands/run_scraper_job.py`:
- Around line 47-49: The status transition to RUNNING in the ScraperRun update
is unconditional and vulnerable to a race condition where a cancelled run can be
resurrected. Instead of directly setting run.status to ScraperRun.Status.RUNNING
and calling save(), use Django's atomic conditional update pattern (such as
filter().update() with a queryset condition) to ensure the status transitions
from QUEUED to RUNNING only if the current status is still QUEUED. This mirrors
the defensive approach already used in cancel_run with select_for_update() and
refresh_from_db(), preventing a worker process from overwriting a terminal
CANCELLED state that was set by an admin between the time the job was queued and
when the worker actually started execution.

In `@apps/backend/events/runner.py`:
- Around line 57-65: The code at lines 57-64 has a race condition where two
concurrent requests can both pass the exists() check and create duplicate active
ScraperRun entries for the same scraper_key. Add a partial unique constraint to
the ScraperRun model database to enforce that only one active run (with status
QUEUED or RUNNING) can exist per scraper_key, then wrap the
ScraperRun.objects.create() call in a try-except block to catch the database
integrity constraint violation and return the same response (None, True) that
would be returned if active_exists was true, ensuring the race condition is
prevented at the database level.

In `@apps/backend/events/tests.py`:
- Around line 757-765: Before releasing to production, verify that the three
public mutation endpoints (api_scraper_trigger, api_scraper_run_all, and
api_scraper_run_cancel) are properly gated at the infrastructure level since
they lack application-level authentication. Work with the DevOps and
Infrastructure team to confirm that production deployments have reverse proxy
rules, firewall restrictions, or network policies in place that restrict access
to these endpoints to only internal networks or trusted sources. Document the
deployment gates used (firewall rules, reverse proxy configuration, load
balancer restrictions) and ensure this security configuration is enforced
consistently across all production environments before the application is
released.

In `@apps/backend/events/views.py`:
- Around line 445-551: Add authentication and authorization checks to all
scraper-run endpoints to enforce staff-only access. The endpoints
api_scraper_trigger, api_scraper_run_all, api_scraper_runs,
api_scraper_runs_active, api_scraper_run_detail, and api_scraper_run_cancel
currently lack any access control despite being documented as staff-only. Add
either a decorator-based approach (such as Django's staff_member_required or
login_required with manual staff check) or implement explicit authentication
logic at the start of each function to verify the user is authenticated and has
staff privileges, returning an appropriate 403 Forbidden response if
authorization fails.
- Around line 513-516: The limit parameter parsing in the try block does not
enforce a minimum bound, allowing zero and negative values to pass through which
causes problematic queryset slicing behavior. Modify the limit assignment to
clamp the value to the range [1, 200] by using max() with a minimum value of 1
in addition to the existing min() call that enforces the maximum of 200,
ensuring the API behaves deterministically regardless of the input value.

In `@apps/frontend/src/routes/scrapers/`+page.svelte:
- Around line 176-177: The run variable at line 176 is only derived from
runningMap which contains only active runs (queued/running status), making the
success and failed status branches at lines 219 and 226 unreachable dead code.
Update the const run assignment to first check runningMap for an active run, and
if no active run exists, fall back to the latest run from an available data
source (such as a latest run property or API response). This will allow the
success and failed card details to be displayed when runs complete.
- Around line 123-125: The duration formatter has a bug where rounding the
remainder seconds can produce 60 as the seconds value, creating invalid time
strings like "1m 60s". Fix this by first normalizing the input seconds to a
whole number using Math.round before splitting into minutes and seconds. This
ensures the remainder operation always produces a value between 0 and 59,
avoiding the invalid 60-second bucket.

In `@process/context/all-context.md`:
- Around line 303-317: The ScraperRun model documentation section is incomplete
and needs two updates. First, in the status field documentation where it
currently lists the TextChoices as queued/running/success/failed, add CANCELLED
to this list to reflect the complete set of status options. Second, add a new
line documenting the pid field (PositiveIntegerField, null=True, blank=True)
which stores the OS PID of the worker subprocess and is used for cancellation
purposes. Insert the pid field documentation after the triggered_by field entry
to maintain logical grouping of model fields.
- Around line 376-394: The Run-Jobs Subsystem section in the documentation
incorrectly describes the system as thread-based using threading.Thread when the
actual implementation uses subprocess.Popen with start_new_session=True. Revise
the section starting at "The system is thread-based" to accurately document that
the system spawns independent process groups via subprocess.Popen, stores the
pid on the ScraperRun row, uses os.killpg for process group termination on
cancellation, and is controlled by the run_scraper_job management command (not
the _run_scraper thread function). Remove references to threading.Thread,
django.db.connection.close() in finally blocks, and the thread-based execution
model, replacing them with subprocess-based process group management details.

---

Nitpick comments:
In `@apps/frontend/src/lib/api.ts`:
- Line 78: The runScraper function in the api.ts file uses a generic string type
for the status field in its response type, which reduces type safety. Replace
the string type with ScraperRunStatus in the response type definition of the
runScraper function (the post generic parameter) to enforce a strict API
contract and maintain consistency with other scraper-run related types. Ensure
ScraperRunStatus is imported if it is not already available in the file.

In `@apps/frontend/src/lib/types.ts`:
- Around line 119-121: In the RunAllResult interface, the status field within
the created array is typed as a plain string, which allows any value and loses
compile-time validation. Change the status field type from string to
ScraperRunStatus to enforce that only valid scraper run statuses are used,
ensuring the UI contract is correctly enforced at compile time and preventing
accidental branch drift.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a78166f2-eeb1-4459-8e10-ed407292d798

📥 Commits

Reviewing files that changed from the base of the PR and between 417bade and 0062a3d.

📒 Files selected for processing (19)
  • apps/backend/events/admin.py
  • apps/backend/events/management/commands/run_scraper_job.py
  • apps/backend/events/migrations/0010_scraperrun.py
  • apps/backend/events/migrations/0011_run_cancellation.py
  • apps/backend/events/models.py
  • apps/backend/events/runner.py
  • apps/backend/events/tests.py
  • apps/backend/events/urls.py
  • apps/backend/events/views.py
  • apps/frontend/src/lib/api.ts
  • apps/frontend/src/lib/components/Badge.svelte
  • apps/frontend/src/lib/components/PageHeader.svelte
  • apps/frontend/src/lib/types.ts
  • apps/frontend/src/routes/scrapers/+page.svelte
  • apps/frontend/src/routes/scrapers/+page.ts
  • process/context/all-context.md
  • process/context/tests/all-tests.md
  • process/general-plans/active/run-cancellation_PLAN_17-06-26.md
  • process/general-plans/completed/completed_scraper-run-jobs_PLAN_17-06-26.md

Comment thread apps/backend/events/admin.py
Comment thread apps/backend/events/management/commands/run_scraper_job.py Outdated
Comment thread apps/backend/events/runner.py
Comment on lines +757 to +765
@mock.patch("events.views.cancel_run")
def test_cancel_is_public(self, mock_cancel):
run = ScraperRun.objects.create(
scraper_key="myruntime", status=ScraperRun.Status.CANCELLED
)
mock_cancel.return_value = (run, "ok")
resp = self.client.post(f"/api/scrapers/runs/{run.id}/cancel/")
self.assertEqual(resp.status_code, 200)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm public mutation behavior is intentionally encoded in tests and views.
rg -n "test_.*is_public|unauthenticated|/api/scrapers/.*/run/|/api/scrapers/run-all/|/cancel/" apps/backend/events/tests.py
rg -n "def api_scraper_trigger|def api_scraper_run_all|def api_scraper_run_cancel|csrf_exempt|unauthenticated intentionally" apps/backend/events/views.py

# Look for explicit environment gates controlling scraper mutation exposure.
rg -n "SCRAPER|INTERNAL|ALLOW_UNAUTH|FEATURE_FLAG|AUTH" -g "*settings*.py"

Repository: hdmGOAT/veent-event-scraper

Length of output: 2135


🏁 Script executed:

# Get full context of the SECURITY NOTE in views.py
sed -n '469,490p' apps/backend/events/views.py
sed -n '492,520p' apps/backend/events/views.py
sed -n '538,560p' apps/backend/events/views.py

Repository: hdmGOAT/veent-event-scraper

Length of output: 2883


🏁 Script executed:

# Check for any middleware or decorators that gate these endpoints
rg -n "middleware|permission|is_authenticated|request.user|INTERNAL|whitelist|allowed_hosts" apps/backend/events/views.py | head -30

Repository: hdmGOAT/veent-event-scraper

Length of output: 229


🏁 Script executed:

# Look for deployment configuration files (docker, env, nginx, etc.)
find . -type f \( -name "*.env*" -o -name "docker*" -o -name "Dockerfile" -o -name "nginx*" -o -name ".github" \) 2>/dev/null | head -20

Repository: hdmGOAT/veent-event-scraper

Length of output: 127


🏁 Script executed:

# Look for URL routing and middleware application
find apps/backend -type f \( -name "urls.py" -o -name "routing.py" \) | xargs cat -n 2>/dev/null | grep -A 5 -B 5 "api_scraper"

Repository: hdmGOAT/veent-event-scraper

Length of output: 1421


🏁 Script executed:

# Check for any nginx, Docker, or deployment config files
find . -type f \( -name "nginx.conf" -o -name "Dockerfile*" -o -name ".dockerignore" -o -name "docker-compose*" -o -name "deployment*" -o -name "k8s*" \) 2>/dev/null

Repository: hdmGOAT/veent-event-scraper

Length of output: 53


🏁 Script executed:

# Look for any middleware that might gate these paths
rg -n "class.*Middleware|def.*middleware|MIDDLEWARE" apps/backend --type py | head -20

Repository: hdmGOAT/veent-event-scraper

Length of output: 121


🏁 Script executed:

# Check the MIDDLEWARE setting to see if there's any authentication applied
sed -n '73,90p' apps/backend/config/settings.py

Repository: hdmGOAT/veent-event-scraper

Length of output: 694


🏁 Script executed:

# Check if there are any auth checks in the base URL config or middleware application
rg -n "login_required|permission_required|IsAuthenticated|BasicAuth" apps/backend/config

Repository: hdmGOAT/veent-event-scraper

Length of output: 53


Confirm production deployment gates these public mutation endpoints before release.

The endpoints are intentionally unauthenticated in Phase 1 design (documented in security notes), but no enforcement mechanism is visible in code or infrastructure. The tests correctly lock in the 200 response, but before release you must verify these endpoints are restricted to internal networks or protected by reverse proxy/firewall rules in production, since no application-level gating exists.

Applies to: api_scraper_trigger (lines 798–808), api_scraper_run_all (lines 896–903), and api_scraper_run_cancel (lines 757–765).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/tests.py` around lines 757 - 765, Before releasing to
production, verify that the three public mutation endpoints
(api_scraper_trigger, api_scraper_run_all, and api_scraper_run_cancel) are
properly gated at the infrastructure level since they lack application-level
authentication. Work with the DevOps and Infrastructure team to confirm that
production deployments have reverse proxy rules, firewall restrictions, or
network policies in place that restrict access to these endpoints to only
internal networks or trusted sources. Document the deployment gates used
(firewall rules, reverse proxy configuration, load balancer restrictions) and
ensure this security configuration is enforced consistently across all
production environments before the application is released.

Comment on lines +445 to +551
# ---------------------------------------------------------------------------
# Scraper run jobs — trigger runs from the UI and poll/list their status.
# All endpoints are staff-only, mirroring the /review/ convention.
# ---------------------------------------------------------------------------


def _serialize_run(run):
"""Serialise a ScraperRun to the standard dict shape used by all run endpoints."""
return {
"id": run.id,
"scraper_key": run.scraper_key,
"status": run.status,
"started_at": run.started_at.isoformat() if run.started_at else None,
"finished_at": run.finished_at.isoformat() if run.finished_at else None,
"created_count": run.created_count,
"updated_count": run.updated_count,
"extra_counts": run.extra_counts,
"error_message": run.error_message or None,
"triggered_by": run.triggered_by.username if run.triggered_by_id else None,
"created_at": run.created_at.isoformat(),
"duration_seconds": run.duration_seconds,
}


@csrf_exempt
@require_POST
def api_scraper_trigger(request, key):
# SECURITY NOTE: This endpoint is unauthenticated intentionally. The
# SvelteKit frontend has no Django session (the "Admin User" in the sidebar
# is static UI, not a real session), and the Vite proxy makes all /api/*
# calls same-origin in dev, so there is no CSRF surface from a browser
# cross-site attack on this internal-only tool. @csrf_exempt is consistent
# with all other JSON API endpoints in this file. Re-evaluate when real
# auth is added (Phase 2 roadmap).
from .scrapers import SCRAPERS

if key not in SCRAPERS:
return JsonResponse({"error": "Unknown scraper key"}, status=404)

triggered_by = request.user if request.user.is_authenticated else None
run, already_active = trigger_scraper_run(key, triggered_by=triggered_by)
if already_active:
return JsonResponse({"error": "Scraper already running"}, status=409)

return JsonResponse({"id": run.id, "status": run.status}, status=200)


@csrf_exempt
@require_POST
def api_scraper_run_all(request):
# SECURITY NOTE: Same posture as api_scraper_trigger above — unauthenticated
# intentionally for the same reasons. Revisit when real auth is added.
from .scrapers import SCRAPERS

triggered_by = request.user if request.user.is_authenticated else None
created = []
skipped = []
for key in SCRAPERS:
run, already_active = trigger_scraper_run(key, triggered_by=triggered_by)
if already_active:
skipped.append(key)
else:
created.append({"key": key, "id": run.id, "status": run.status})

return JsonResponse({"created": created, "skipped": skipped}, status=200)


def api_scraper_runs(request):
try:
limit = min(int(request.GET.get("limit", 50)), 200)
except ValueError:
limit = 50
runs = (
ScraperRun.objects.select_related("triggered_by")
.order_by("-created_at")[:limit]
)
return JsonResponse([_serialize_run(r) for r in runs], safe=False)


def api_scraper_runs_active(request):
runs = ScraperRun.objects.filter(
status__in=[ScraperRun.Status.QUEUED, ScraperRun.Status.RUNNING]
).select_related("triggered_by")
return JsonResponse([_serialize_run(r) for r in runs], safe=False)


def api_scraper_run_detail(request, run_id):
run = get_object_or_404(
ScraperRun.objects.select_related("triggered_by"), id=run_id
)
return JsonResponse(_serialize_run(run))


@csrf_exempt
@require_POST
def api_scraper_run_cancel(request, run_id):
# SECURITY NOTE: same posture as api_scraper_trigger — unauthenticated
# intentionally (no Django session from SvelteKit). Re-evaluate when real
# auth is added.
run, signal = cancel_run(run_id)
if signal == "not_found":
return JsonResponse({"error": "Run not found"}, status=404)
if signal == "not_active":
return JsonResponse(
{"error": "Run is not active", "run": _serialize_run(run)}, status=409
)
return JsonResponse(_serialize_run(run), status=200)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Lock down scraper-run endpoints before release.

This block is documented as staff-only, but the trigger/run-all/cancel routes are currently unauthenticated and CSRF-exempt, which lets any reachable caller start/stop scraper jobs. The same unauthenticated run APIs also expose run metadata (including failure details), so this is an immediate control-plane and information-exposure risk. Add a real authorization gate across all scraper-run endpoints (staff session or a dedicated signed API credential) before shipping.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/views.py` around lines 445 - 551, Add authentication and
authorization checks to all scraper-run endpoints to enforce staff-only access.
The endpoints api_scraper_trigger, api_scraper_run_all, api_scraper_runs,
api_scraper_runs_active, api_scraper_run_detail, and api_scraper_run_cancel
currently lack any access control despite being documented as staff-only. Add
either a decorator-based approach (such as Django's staff_member_required or
login_required with manual staff check) or implement explicit authentication
logic at the start of each function to verify the user is authenticated and has
staff privileges, returning an appropriate 403 Forbidden response if
authorization fails.

Comment thread apps/backend/events/views.py
Comment thread apps/frontend/src/routes/scrapers/+page.svelte Outdated
Comment thread apps/frontend/src/routes/scrapers/+page.svelte Outdated
Comment thread process/context/all-context.md
Comment thread process/context/all-context.md
- admin: add pid to ScraperRunAdmin readonly_fields
- run_scraper_job: guard QUEUED→RUNNING with conditional filter().update()
  so a cancelled run is never resurrected by a late-starting worker
- models/runner: add unique_active_scraper_run partial DB constraint
  (scraper_key unique where status IN queued/running) + catch IntegrityError
  in trigger_scraper_run to close the exists()+create() race window (0012 migration)
- views: clamp api_scraper_runs limit to [1, 200]
- page.svelte: fix formatDuration 60s bucket bug; fall back to recentRuns
  for per-card success/failed display when no active run is in runningMap
- api.ts/types.ts: type scraper run status fields as ScraperRunStatus
- tests: update RunEndpointTests fixtures to respect unique_active_scraper_run
- all-context.md: correct thread→subprocess model, add pid field and
  CANCELLED status to ScraperRun docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@process/context/all-context.md`:
- Around line 642-644: The phrase "not a local Neon PostgreSQL DB" in the test
database documentation is contradictory and unclear. Revise this text to
explicitly state that the test database uses SQLite (not Neon PostgreSQL) as a
contrast point, so readers understand clearly which database technology is being
used for testing. Replace the confusing negative phrasing with a direct,
positive statement about the actual database choice to make the guidance
unambiguous.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 37538897-ef1f-42d0-a1b3-494b5f0e5841

📥 Commits

Reviewing files that changed from the base of the PR and between 0062a3d and 7241faa.

📒 Files selected for processing (11)
  • apps/backend/events/admin.py
  • apps/backend/events/management/commands/run_scraper_job.py
  • apps/backend/events/migrations/0012_scraperrun_unique_active_constraint.py
  • apps/backend/events/models.py
  • apps/backend/events/runner.py
  • apps/backend/events/tests.py
  • apps/backend/events/views.py
  • apps/frontend/src/lib/api.ts
  • apps/frontend/src/lib/types.ts
  • apps/frontend/src/routes/scrapers/+page.svelte
  • process/context/all-context.md
✅ Files skipped from review due to trivial changes (1)
  • apps/frontend/src/lib/types.ts
🚧 Files skipped from review as they are similar to previous changes (4)
  • apps/backend/events/management/commands/run_scraper_job.py
  • apps/backend/events/runner.py
  • apps/backend/events/tests.py
  • apps/frontend/src/routes/scrapers/+page.svelte

Comment thread process/context/all-context.md

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
apps/backend/events/models.py (1)

96-96: 💤 Low value

Replace EN DASH with ASCII hyphen in comment.

The comment contains a Unicode en-dash character (–) instead of an ASCII hyphen-minus (-) in "1–2 labels". While functionally harmless, this can cause issues with some text editors or tools that don't handle Unicode well.

📝 Suggested fix
-    # AI-assigned canonical categories (1–2 labels from CANONICAL_CATEGORIES).
+    # AI-assigned canonical categories (1-2 labels from CANONICAL_CATEGORIES).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/events/models.py` at line 96, In the comment starting with
"AI-assigned canonical categories", replace the Unicode en-dash character (–)
with a standard ASCII hyphen (-) in the text "1–2 labels". Change it from "1–2
labels" to "1-2 labels" to ensure compatibility with text editors and tools that
don't handle Unicode characters well.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@apps/backend/events/models.py`:
- Line 96: In the comment starting with "AI-assigned canonical categories",
replace the Unicode en-dash character (–) with a standard ASCII hyphen (-) in
the text "1–2 labels". Change it from "1–2 labels" to "1-2 labels" to ensure
compatibility with text editors and tools that don't handle Unicode characters
well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 98d48337-102a-483f-9cfa-e8b89bb55e32

📥 Commits

Reviewing files that changed from the base of the PR and between 7241faa and 4f33e1d.

📒 Files selected for processing (4)
  • apps/backend/events/models.py
  • apps/backend/events/tests.py
  • apps/backend/events/views.py
  • apps/frontend/src/lib/types.ts
✅ Files skipped from review due to trivial changes (1)
  • apps/frontend/src/lib/types.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/backend/events/views.py

@hdmGOAT hdmGOAT merged commit 52c4867 into development Jun 17, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant