feat(postcodes): backfill state_id on 19,272 rows across 7 countries (#1039) #1542

Open
dr5hn wants to merge 1 commit into master from feat/postcodes-backfill-state-fk

@dr5hn (Owner) commented May 8, 2026

Summary

Backfills state_id (and state_code) on 19,272 existing postcode rows that previously shipped with state_id: null. No new external data is used: CSC's own contributions/cities/<iso2>.json serves as the city → state_id lookup table.

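| iso2 | Country | Total rows | Backfilled | Match % |
|------|---------|-----------:|-----------:|--------:|
| SE | Sweden | 16,392 | 13,162 | 80% |
| ZA | South Africa | 5,685 | 2,906 | 51% |
| NL | Netherlands | 4,072 | 2,463 | 60% |
| RS | Serbia | 1,170 | 343 | 29% |
| SI | Slovenia | 522 | 271 | 51% |
| GB | UK | 124 | 98 | 79% |
| VN | Vietnam | 63 | 29 | 46% |
| Total | | 28,028 | 19,272 | 69% |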

Why this is a strict improvement

  • Rows that do match are now fully FK-resolved.
  • Rows that don't match remain state_id: null, the same as before.
  • Names that map to multiple distinct states (e.g., NL's Rijswijk exists in both Noord-Brabant and Zuid-Holland) are deliberately not matched, since postcode data alone can't disambiguate. 15 ambiguous names were dropped from the NL lookup, 24 from SE, 29 from GB.
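A minimal sketch of how the ambiguous names could be dropped while building the lookup. This is not the script's actual code: the `build_lookup` helper, the `name` field on city records, and the numeric ids are assumptions made for illustration, and plain `casefold()` stands in for the real name folding.

```python
from collections import defaultdict


def build_lookup(cities):
    """Map city name -> state_id, keeping only names tied to exactly one state."""
    states_by_name = defaultdict(set)
    for city in cities:
        states_by_name[city["name"].casefold()].add(city["state_id"])
    # A name seen under more than one state_id is ambiguous and dropped entirely.
    return {
        name: next(iter(ids))
        for name, ids in states_by_name.items()
        if len(ids) == 1
    }


# Hypothetical records: the state_id values are made up for the example.
cities = [
    {"name": "Rijswijk", "state_id": 1},  # e.g. Noord-Brabant
    {"name": "Rijswijk", "state_id": 2},  # e.g. Zuid-Holland
    {"name": "Delft", "state_id": 2},
]
lookup = build_lookup(cities)
```

Collecting the state ids into a set first means a city listed twice under the same state still counts as unambiguous; only names spanning distinct states are excluded.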

Why these 7 countries (and not others)

Audit of all 0%-FK postcode files vs cities.json:

| Country | Verdict | Reason |
|---------|---------|--------|
| LV | skipped | source ships codes without locality_name (no field to match) |
| MT | skipped | "localities" are street names, not cities |
| MU/GR/KE/NP | skipped | localities are post-office names; <15% match |
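One way such an audit could be run is sketched below. The `match_rate` helper, the `MIN_MATCH_RATE` constant, and the sample rows are hypothetical; treating 15% as a hard cutoff is an assumption based only on the "<15% match" note above.

```python
def match_rate(rows, known_names):
    """Fraction of postcode rows whose locality_name is a known city name."""
    if not rows:
        return 0.0
    hits = 0
    for row in rows:
        name = row.get("locality_name")  # absent when the source ships bare codes
        if name and name.casefold() in known_names:
            hits += 1
    return hits / len(rows)


MIN_MATCH_RATE = 0.15  # assumed cutoff, not taken from the script

# Hypothetical sample data for illustration only.
known = {"stockholm", "uppsala"}
rows = [
    {"code": "111 20", "locality_name": "Stockholm"},
    {"code": "752 20", "locality_name": "Uppsala"},
    {"code": "000 00", "locality_name": "Main Street Post Office"},
    {"code": "000 01"},  # no locality_name at all
]
rate = match_rate(rows, known)
worth_backfilling = rate >= MIN_MATCH_RATE
```

Rows without a locality_name count as misses rather than being excluded, so a file with no locality field at all scores 0%.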

Implementation

bin/scripts/sync/backfill_postcode_state_fk.py: single pass, idempotent, supports --dry-run and --only=<iso2[,iso2...]>. Names are NFKD-folded for diacritic-insensitive matching. Rows whose state_id is already set are skipped.
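The folding and the skip-if-set rule might look roughly like the sketch below. It is illustrative rather than a copy of the script: the `fold` and `backfill` names, the `(state_id, state_code)` tuple layout of the lookup, and the sample values are all assumptions.

```python
import unicodedata


def fold(name):
    """NFKD-decompose, drop combining marks, then casefold: 'Malmö' -> 'malmo'."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def backfill(rows, lookup, dry_run=False):
    """Set state_id/state_code on unresolved rows; return how many rows matched."""
    changed = 0
    for row in rows:
        if row.get("state_id") is not None:
            continue  # already resolved: skipping these keeps reruns idempotent
        match = lookup.get(fold(row.get("locality_name") or ""))
        if match is None:
            continue  # unmatched rows stay state_id: null
        changed += 1
        if not dry_run:
            row["state_id"], row["state_code"] = match
    return changed


# Hypothetical data: the ids and codes are made up for the example.
lookup = {"malmo": (101, "M")}
rows = [
    {"code": "211 20", "locality_name": "Malmö", "state_id": None, "state_code": None},
    {"code": "999 99", "locality_name": "Nowhere", "state_id": None, "state_code": None},
    {"code": "211 21", "locality_name": "Malmö", "state_id": 7, "state_code": "X"},
]
dry = backfill(rows, lookup, dry_run=True)  # counts matches, mutates nothing
first = backfill(rows, lookup)              # applies the change
second = backfill(rows, lookup)             # nothing left to do
```

Because rows with a state_id are skipped up front, a second run finds nothing to change, which is what makes the pass safe to repeat.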

The diff size is large because the existing files were re-emitted with consistent JSON formatting (existing id, created_at, updated_at, flag fields preserved verbatim; only state_id and state_code mutated where matchable).

🤖 Generated with Claude Code

…1039)

Earlier postcode imports for 7 countries shipped with state_id=null
because upstream sources only carried (code, locality) pairs. CSC's
own contributions/cities/<iso2>.json already has city → state_id, so
unambiguous folded-name matching recovers the FKs without external
data.

| iso2 | total | backfilled | match% |
|------|------:|-----------:|-------:|
| SE   | 16,392| 13,162     | 80%    |
| ZA   |  5,685|  2,906     | 51%    |
| NL   |  4,072|  2,463     | 60%    |
| RS   |  1,170|    343     | 29%    |
| SI   |    522|    271     | 51%    |
| GB   |    124|     98     | 79%    |
| VN   |     63|     29     | 46%    |

Names that map to multiple distinct states are dropped (postcode
data alone can't disambiguate). Strict improvement: rows still
unmatched remain state_id=null as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dosubot (bot) added the size:XS label (This PR changes 0-9 lines, ignoring generated files.) May 8, 2026
@dosubot (bot) added the enhancement label (New feature or request) May 8, 2026
Labels

data:postcodes, enhancement, large-contribution, size:XS