feat: process approved data source uploads into /explore (#190)#196
Conversation
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
📝 WalkthroughWalkthroughAdds a staff datasource upload pipeline: backend CSV/XLSX parsing/validation, immediate normalization and bulk-insert of rows at upload, two explore endpoints for approved uploads, schema and admin UI mapping inputs, frontend picker/integration on /explore, tests, and openpyxl dependency. Changes
Sequence Diagram(s)sequenceDiagram
participant User as Staff Contributor
participant Client as Browser
participant API as Backend API
participant Parser as Datasource Parser
participant DB as Database
User->>Client: Upload CSV/XLSX + mapping
Client->>API: POST /api/admin/uploads/datasource (file, mapping...)
API->>Parser: parse_datasource_file(bytes, ext, MappingConfig)
Parser->>Parser: read_csv_bytes / read_xlsx_bytes
Parser->>Parser: validate headers, normalize rows (FIPS, numeric, year)
Parser->>Parser: apply thresholds (FIPS %, numeric %, min rows)
alt Validation fails
Parser-->>API: DatasourceParseError with detail
API-->>Client: 422 with structured error.detail
else Success
Parser-->>API: ParseResult (normalized_rows, preview)
API->>DB: BEGIN
API->>DB: INSERT uploads (pending_review + metadata)
API->>DB: BULK INSERT uploaded_datasets (normalized rows as jsonb)
API->>DB: COMMIT
API-->>Client: 200 OK (upload recorded)
end
sequenceDiagram
participant User as End User
participant Client as Browser
participant API as Backend API
participant DB as Database
User->>Client: Open /explore, select "Staff Uploads"
Client->>API: GET /api/explore/staff-uploads/available
API->>DB: SELECT uploads WHERE upload_type='datasource' AND status='approved'
DB-->>API: [{upload_id, metric_name, has_race, row_count, ...}]
API-->>Client: list of available staff uploads
User->>Client: Choose dataset + filters (state/race/year)
Client->>API: GET /api/explore/staff-uploads?upload_id=...&state_fips=...&race=...&year=...
API->>DB: SELECT aggregated values FROM uploaded_datasets WHERE upload_id=... AND filters...
DB-->>API: aggregated rows, national_average, available_* values
API-->>Client: ExploreResponse-shaped payload
Client->>Client: Render map/chart based on response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 67df103039
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Greptile SummaryThis PR ships an end-to-end staff CSV/XLSX upload pipeline: files are parsed, validated, and normalized at submit time; rows land in Key findings:
Confidence Score: 3/5Not safe to merge as-is: two backend bugs (Infinity crash + pre-read size check) and one frontend bug (race filter reset on reload) need fixing before production use. Three P1 issues were found. The Infinity-in-json.dumps bug turns a legitimate 422 into an unhandled 500, the full-file-read-before-size-check is a DoS-adjacent pattern, and the race filter page-reload regression silently breaks the explore view for race-less staff datasets. The rest of the implementation — parse pipeline, DB transaction, explore endpoints, test coverage — is well-built. Fixing these three items should be straightforward and bring the PR to merge-ready.
Important Files Changed
Sequence DiagramsequenceDiagram
participant C as Contributor (browser)
participant API as FastAPI /api/admin/uploads/datasource
participant Parser as datasource_processing.parser
participant DB as PostgreSQL
C->>API: POST multipart (file + MappingConfig form fields)
API->>API: validate file ext + size
API->>API: validate DataSourceUploadRequest schema
API->>Parser: parse_datasource_file(content, ext, mapping) [thread]
Parser->>Parser: read_csv_bytes / read_xlsx_bytes
Parser->>Parser: _check_columns_exist
Parser->>Parser: _normalize_rows (FIPS/numeric/year coercion + drop tracking)
Parser->>Parser: quality gates (bad_fips ratio, numeric ratio, MIN_VALID_ROWS)
Parser-->>API: ParseResult (normalized_rows, preview_rows, dropped_counts)
API->>DB: BEGIN txn INSERT uploads + bulk INSERT uploaded_datasets chunks
DB-->>API: COMMIT
API-->>C: 200 UploadResponse
note over C,DB: Admin approval (status flip only)
C->>API: PATCH /api/admin/uploads/{id}/review
API->>DB: UPDATE uploads SET status=approved
DB-->>API: ok
API-->>C: status approved
note over C,DB: Explore
C->>API: GET /api/explore/staff-uploads?upload_id=...
API->>DB: SELECT from uploaded_datasets JOIN uploads WHERE status=approved
DB-->>API: aggregated rows AVG by state_fips/race/year
API-->>C: ExploreResponse
|
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/d4bl/app/schemas.py`:
- Around line 642-647: Add the same whitespace-normalizing and non-empty
validation used for other required string fields to the geo_column and
metric_value_column fields so whitespace-only values are rejected; locate the
model/class in src/d4bl/app/schemas.py (the class that declares geo_column,
metric_value_column, metric_name, etc.) and add validators (matching the pattern
used for source_name) that strip surrounding whitespace and raise a validation
error if the result is empty or only whitespace, ensuring consistent downstream
behavior.
In `@src/d4bl/app/upload_routes.py`:
- Around line 94-99: Make data_year optional when a year_column exists: change
the route parameter signature so data_year is Optional[int] = Form(None) instead
of required, and add validation in the same endpoint (using the parameter names
data_year and year_column) to raise an HTTPException if year_column is
None/empty and data_year is still None. Keep existing behavior that
MappingConfig.data_year remains a fallback for files without a year column, and
ensure any downstream code that uses data_year handles the optional type (e.g.,
fall back to MappingConfig.data_year or abort with the same validation message).
In `@src/d4bl/services/datasource_processing/parser.py`:
- Around line 58-61: The except block that catches StopIteration and raises
DatasourceParseError should preserve or intentionally suppress exception
chaining; update the raise to include an explicit "from None" (i.e., raise
DatasourceParseError("file has no header row") from None) so the StopIteration
context is not leaked; modify the try/except around next(reader) where
raw_header is set in parser.py accordingly.
- Around line 79-101: The workbook opened with load_workbook (variable wb) is
not explicitly closed; ensure wb.close() is always called to release resources
by wrapping workbook usage in a try/finally (create wb, then try: use
ws/rows_iter/raw_header/rows and return; finally: wb.close()) or use
contextlib.closing(wb) as a context manager so that wb.close() runs even on
errors or early returns.
In `@tests/test_datasource_processing.py`:
- Around line 76-86: Add a test to ensure coerce_year rejects boolean inputs:
update the TestCoerceYear test suite to include True and False (e.g., via
pytest.mark.parametrize or a new test method) and assert that calling
coerce_year(True) and coerce_year(False) raises ValueError; target the
coerce_year function so the validation branch that explicitly rejects booleans
(validation.py handling) is covered.
- Around line 54-73: Add unit tests to cover native numeric passthrough for
coerce_numeric by asserting that passing an int (e.g., 42) and a float (e.g.,
14.3) returns the same numeric values (42.0 or 42 and 14.3 respectively,
matching existing behavior); place these new assertions alongside the existing
TestCoerceNumeric tests so they exercise the logic in coerce_numeric that
handles int/float inputs directly.
In `@ui-nextjs/app/explore/page.tsx`:
- Around line 362-371: When handling dataset switches in the StaffDatasetPicker
onChange handler (the callback that calls setActiveUploadSummary and
setFilters), also clear filters.metric and clear the selectedState so leftover
metric or state from the previous upload doesn't mismatch the new dataset;
update the setFilters call that currently resets uploadId, race, and year to
also set metric: null (or a default) and ensure you call the state setter for
selectedState (e.g., setSelectedState(null)) so the UI, legend, detail card, and
chart reflect the newly selected dataset.
In `@ui-nextjs/components/admin/UploadDataSource.tsx`:
- Around line 329-337: The inputs bound to raceColumn/setRaceColumn (and the
similar yearColumn/setYearColumn input) lack accessible labels; add explicit
label elements tied to each input by adding unique id attributes (e.g.,
id="race-column" and id="year-column") on the inputs and corresponding <label
htmlFor="..."> elements that describe the field (or use a visually-hidden class
if you don't want visible text), ensuring the label text conveys purpose (e.g.,
"Race column" / "Year column") and keeping existing required/placeholder
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 93ed8b5d-e7bb-4c15-a434-822e647dc79d
📒 Files selected for processing (21)
docs/superpowers/plans/2026-04-18-datasource-upload-pipeline.mddocs/superpowers/specs/2026-04-18-datasource-upload-pipeline-design.mdpyproject.tomlsrc/d4bl/app/api.pysrc/d4bl/app/schemas.pysrc/d4bl/app/upload_routes.pysrc/d4bl/services/datasource_processing/__init__.pysrc/d4bl/services/datasource_processing/parser.pysrc/d4bl/services/datasource_processing/validation.pytests/conftest.pytests/test_datasource_processing.pytests/test_explore_api.pytests/test_settings.pytests/test_upload_api.pyui-nextjs/app/explore/page.tsxui-nextjs/app/guide/page.tsxui-nextjs/components/admin/ReviewDetail.tsxui-nextjs/components/admin/UploadDataSource.tsxui-nextjs/components/explore/MetricFilterPanel.tsxui-nextjs/components/explore/StaffDatasetPicker.tsxui-nextjs/lib/explore-config.ts
- Reject non-finite numerics before json serialization; pad 4-digit FIPS - Enforce max upload size with bounded read; optional data_year when year_column set - Pydantic: non-blank geo/metric columns; year from data_year or year_column - Parser: empty data rows, StopIteration chains, close XLSX workbooks in finally - Explore: staff-uploads persistence race default; clear metric/state on dataset change - Admin upload form: accessible labels; omit data_year when per-row year column - Staff picker: nullable data_year; staff-uploads learn-more links to /guide Made-with: Cursor
Review follow-up (commit d4083ab)Addressed Greptile / Codex / CodeRabbit items:
CI: Please re-run checks and resolve review threads if this matches your expectations. |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ui-nextjs/app/explore/page.tsx (1)
151-153:⚠️ Potential issue | 🟠 MajorClear
loadingon the no-session early return.If auth flips while a request is in flight, the aborted request skips the
finallyunset, and the next invocation returns here withloadingstilltrue. That leaves the explore view stuck behind its skeleton/overlay until a remount.💡 Proposed fix
const fetchData = useCallback(async (signal: AbortSignal) => { - if (!session?.access_token) return; + if (!session?.access_token) { + setLoading(false); + setExploreData(null); + setBills([]); + return; + }Based on learnings:
PolicyExploreViewintentionally resets loading on the missing-auth early return because an abort during auth change can otherwise leave the spinner stuck indefinitely.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui-nextjs/app/explore/page.tsx` around lines 151 - 153, The fetchData async callback (fetchData(signal: AbortSignal)) can early-return when session?.access_token is missing but leaves the loading flag true if a previous request was aborted; update fetchData to explicitly clear the loading state before returning on the no-session path (e.g., call the same setLoading(false)/resetLoading used in PolicyExploreView’s missing-auth handling) so the explore skeleton/overlay is not left visible after auth flips mid-request.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/d4bl/services/datasource_processing/parser.py`:
- Around line 203-206: The unsupported-extension error raised in parser.py (the
DatasourceParseError instance in the branch that checks file extension) sets
detail={"allowed": sorted(SUPPORTED_EXTS)} but omits a human-friendly "message"
key; update the raise to include a readable message in the detail payload (e.g.,
include "message": f"unsupported file type {ext!r}") so UI fallbacks get a
friendly string alongside the "allowed" list while keeping the existing error
text and keys.
- Around line 146-150: The persisted geo_fips value is taken directly from raw
text and can lose leading zeros (e.g., Excel-stored 01001 -> "1001"); update the
logic where geo_fips is assigned (the geo_fips variable populated from
mapping.geo_column near derive_state_fips) to canonicalize using the recovered
state_fips: after calling derive_state_fips(geo_raw), if state_fips exists and
geo_fips is digits-only and its length is shorter than the full FIPS length
(state_fips length + county code length), left-pad the numeric portion with
zeros to produce a canonical 5-digit county FIPS (state_fips + county.zfill(3))
and assign that back to geo_fips; apply the same canonicalization in the second
occurrence around the block referenced (lines ~179-185) so stored geo_fips
always preserves leading zeros.
In `@src/d4bl/services/datasource_processing/validation.py`:
- Around line 42-47: The helper pads county FIPS when len(s)==4 but misses tract
FIPS dropped to len(s)==10 and also silently accepts bad lengths; update the
logic around variable s so that you also pad when len(s)==10 (prepend "0") in
addition to the existing len==4 and len==1 cases, then validate the final length
and raise an error (or return a failure) for any s whose length is not one of
the expected canonical lengths (2, 5, or 11) before returning s[:2]; keep the
final return of s[:2] but ensure invalid inputs are rejected instead of
truncated.
In `@ui-nextjs/app/explore/page.tsx`:
- Around line 369-380: When handling StaffDatasetPicker's onChange, also clear
the upload-scoped sentinel and previous results: set
didAutoSelectDefaults.current = false and reset exploreData (via
setExploreData(null) or the relevant setter) immediately when switching uploads,
in addition to setActiveUploadSummary and setFilters so the new upload doesn't
inherit the old auto-select state or render stale exploreData if the new fetch
fails.
---
Outside diff comments:
In `@ui-nextjs/app/explore/page.tsx`:
- Around line 151-153: The fetchData async callback (fetchData(signal:
AbortSignal)) can early-return when session?.access_token is missing but leaves
the loading flag true if a previous request was aborted; update fetchData to
explicitly clear the loading state before returning on the no-session path
(e.g., call the same setLoading(false)/resetLoading used in PolicyExploreView’s
missing-auth handling) so the explore skeleton/overlay is not left visible after
auth flips mid-request.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9c824797-9c65-49c7-ac57-59a0bdf0b94a
📒 Files selected for processing (10)
src/d4bl/app/schemas.pysrc/d4bl/app/upload_routes.pysrc/d4bl/services/datasource_processing/parser.pysrc/d4bl/services/datasource_processing/validation.pytests/test_datasource_processing.pytests/test_upload_api.pyui-nextjs/app/explore/page.tsxui-nextjs/components/admin/UploadDataSource.tsxui-nextjs/components/explore/StaffDatasetPicker.tsxui-nextjs/lib/explore-config.ts
Summary
uploaded_datasetsin the same transaction as theUploadrow./explorewith a dataset picker; approved uploads render through the existing map/chart/table components.Closes #190.
Spec + plan
docs/superpowers/specs/2026-04-18-datasource-upload-pipeline-design.mddocs/superpowers/plans/2026-04-18-datasource-upload-pipeline.mdTest plan
pytest— 888 passed (excluding optionaltests/test_training/test_integration_models.py, which requires Ollama + fine-tuned models and can fail on non-JSON model output)npm run build && npm run lint— clean (one pre-existing hooks warning inapp/page.tsx)Made with Cursor
Summary by CodeRabbit
New Features
Explore
Admin Improvements
API Endpoints
Documentation & Tests