fix(IT): restore corrupted native fields (#1349 follow-up) #1475
Adds a self-contained Python script that joins our IT cities to the ISTAT comuni list via the Sigla automobilistica (2-letter province code) to compute the correct (state_id, state_code) for each city, preferring metropolitan-city / province / free-consortium / autonomous-province / decentralized-regional-entity over the parent region.

Resolution order: name-match (region-validated) -> conjunction-half match (e.g. "Lampedusa" -> "Lampedusa e Linosa") -> k-NN proximity vote against the name-matched cluster, capped at 25 km. Aliases handle English-only names like "Venice" -> "Venezia".

Bundles the ISTAT CSV (CC-BY 3.0 IT) and the most recent run's report under bin/scripts/fixes/data/ for reproducibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
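The last step of the resolution order, the proximity vote, can be illustrated with a minimal sketch. It is not taken from `italy_remap_cities.py`: the function names, the tuple layout, and the value of `k` are assumptions made for illustration. Only the 25 km cap and the idea of voting among already name-matched cities come from the commit message.

```python
import math
from collections import Counter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def knn_vote(target, resolved, k=5, max_km=25.0):
    """Return the most common state_code among the k nearest resolved cities.

    target   -- (lat, lon) of the city that name matching could not resolve
    resolved -- iterable of (lat, lon, state_code) for name-matched cities
    k        -- hypothetical neighbour count (not stated in the source)
    max_km   -- distance cap; neighbours farther away are ignored
    """
    ranked = sorted(
        (haversine_km(target[0], target[1], lat, lon), code)
        for lat, lon, code in resolved
    )
    near = [(dist, code) for dist, code in ranked if dist <= max_km][:k]
    if not near:
        return None  # nothing within the cap: leave the city unmapped
    return Counter(code for _, code in near).most_common(1)[0][0]


# Illustrative coordinates only: two points near Milano, one near Bergamo.
name_matched = [
    (45.4773, 9.1815, "MI"),
    (45.5000, 9.2500, "MI"),
    (45.6983, 9.6773, "BG"),
]
vote_near_milano = knn_vote((45.4642, 9.1900), name_matched)
vote_far_away = knn_vote((40.0, 18.0), name_matched)
```

Returning `None` when no neighbour lies within the cap keeps the fallback conservative: a city is reported as unmapped rather than assigned to a distant province.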
Generated by bin/scripts/fixes/italy_remap_cities.py. Cities were
previously parented to the 20 Italian regions; they now point at the
correct ISO 3166-2 province-level entity:
- 14 metropolitan cities (BA, BO, CA, CT, FI, GE, ME, MI, NA, PA,
RC, RM, TO, VE) — all populated for the first time.
- 80 provinces, 6 free municipal consortia (Sicilia), 2 autonomous
provinces (BZ, TN), 4 decentralized regional entities (Friuli).
- Aosta Valley comuni stay on the autonomous region (no province-
level entity exists for sigla AO).
Counts: 9947 input -> 9828 changed, 119 unchanged, 0 unmapped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the mapping source (ISTAT comune list joined on Sigla automobilistica), per-state-type counts after the remap, edge cases (corrupt native fields, Lampedusa bounds gap, Tessera airport), the 8 possible-duplicate pairs flagged for maintainer review, and the validation checks run locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Past machine-translation runs polluted the native field for many IT cities (e.g. Pero -> native "Ma", Postal -> native "Postale", Panchià -> native "Possono agganciare", Pareto -> native "Libbra", Italian for "pound"). The name field already holds the canonical Italian form (Pomigliano d'Arco, Sant'Ambrogio di Torino, etc.), so where the city's name matches an ISTAT comune, native is now copied from name. Cities whose name is not an ISTAT comune (~2,500 frazioni) are left untouched because no authoritative replacement exists.

Counts: 9947 input -> 6070 already correct, 2499 no ISTAT match, 1378 restored.

Stacks on top of #1395 (the city remap PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
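The restore rule described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the contents of `italy_restore_native.py`: the function name `restore_native`, the `istat_names` set, exact string equality as the match test, and the order of the checks are all hypothetical. The three counters mirror the categories in the counts line (already correct, no ISTAT match, restored).

```python
def restore_native(cities, istat_names):
    """Set native = name for cities whose name is an ISTAT comune.

    cities      -- list of dicts with at least "name" and "native" keys
    istat_names -- set of comune names (assumed to be compared verbatim)
    Mutates the city dicts in place and returns per-category counts.
    """
    counts = {"already_correct": 0, "no_istat_match": 0, "restored": 0}
    for city in cities:
        name = city["name"]
        if name not in istat_names:
            # No authoritative replacement exists, so leave native untouched.
            counts["no_istat_match"] += 1
        elif city.get("native") == name:
            counts["already_correct"] += 1
        else:
            city["native"] = name
            counts["restored"] += 1
    return counts


# Toy data: the corrupted values come from the commit message, while
# "Frazione Esempio" is a made-up stand-in for a non-ISTAT locality.
sample_cities = [
    {"name": "Pero", "native": "Ma"},
    {"name": "Postal", "native": "Postale"},
    {"name": "Pomigliano d'Arco", "native": "Pomigliano d'Arco"},
    {"name": "Frazione Esempio", "native": "Esempio"},
]
sample_istat = {"Pero", "Postal", "Pomigliano d'Arco"}
sample_counts = restore_native(sample_cities, sample_istat)
```

Because every city falls into exactly one category, the three counters always sum to the input size, which is the same invariant the reported counts satisfy (6070 + 2499 + 1378 = 9947).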
Superseded by clean rebased branch; see new PR.
Pull request overview
Restores corrupted native fields for Italian cities by aligning native with name when the city name matches an ISTAT comune, and adds supporting scripts/reports/documentation.
Changes:
- Add a native-field restoration script for Italy that updates `contributions/cities/IT.json` in place.
- Add structured JSON reports capturing before/after statistics and example changes.
- Add Italy fix summary documentation (currently focused on the city remap).
Reviewed changes
Copilot reviewed 1 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| bin/scripts/fixes/italy_restore_native.py | New script to restore corrupted native values using an ISTAT name index. |
| bin/scripts/fixes/italy_remap_cities.py | Adds/contains the Italy city remap implementation (scope appears broader than this PR’s description). |
| bin/scripts/fixes/data/it_remap_report.json | Structured report output for the remap script run. |
| bin/scripts/fixes/data/it_native_restore_report.json | Structured report output for the native-restore script run. |
| .github/fixes-docs/FIX_1349_SUMMARY.md | Summary doc for FIX-1349 (currently describes remap; needs alignment with native-restore work in this PR). |
> ## Edge cases
>
> ### Native field is unreliable
> Many `native` values in `IT.json` have been corrupted by past machine-translation runs (e.g. `"Pero" → native: "Ma"`, `"Postal" → native: "Postale"`). The script intentionally matches on `name`, not `native`. Fixing `native` is out of scope for this PR but should be tracked separately.
This section says fixing the `native` field is out of scope, but this PR explicitly restores corrupted `native` values (and adds `italy_restore_native.py` plus its report). Please update the summary so it matches the current PR scope (e.g., reference the native-restore fix, or move this note to the remap-only doc if it is meant to describe a different PR).
Suggested change, before:
> Many `native` values in `IT.json` have been corrupted by past machine-translation runs (e.g. `"Pero" → native: "Ma"`, `"Postal" → native: "Postale"`). The script intentionally matches on `name`, not `native`. Fixing `native` is out of scope for this PR but should be tracked separately.

Suggested change, after:
> Many `native` values in `IT.json` have been corrupted by past machine-translation runs (e.g. `"Pero" → native: "Ma"`, `"Postal" → native: "Postale"`). The remap script intentionally matches on `name`, not `native`, to avoid propagating those bad values. This PR also restores corrupted `native` values via the dedicated native-restore step (`italy_restore_native.py`) and its report, so the remap and native-field repair are both covered here.
> ## Files Modified
>
> - `contributions/cities/IT.json` — 9,828 records updated (only `state_id` and `state_code` fields).
The “Files Modified” bullet claims the `IT.json` updates touched only `state_id`/`state_code`, but this PR also changes `native` for many cities. Please update this line (or split the documentation) so the described fields match what the PR actually mutates.
Suggested change, before:
> - `contributions/cities/IT.json` — 9,828 records updated (only `state_id` and `state_code` fields).

Suggested change, after:
> - `contributions/cities/IT.json` — 9,828 records updated (`state_id`, `state_code`, and `native` fields).
Re-targeted from #1397 after the parent (#1395) was merged.
Refs #1349: restores 1,378 corrupted `native` fields on Italian cities (machine-translation artefacts like `Pero` → `Ma`, `Postal` → `Postale`, `Pareto` → `Libbra`).

Where a city's `name` matches an ISTAT comune, this PR sets `native = name`. Cities whose `name` is not an ISTAT comune (~2,500 frazioni) are intentionally left untouched.

Counts
🤖 Generated with Claude Code