feat(IT): remap cities to metropolitan cities and provinces (#1349) #1395
Adds a self-contained Python script that joins our IT cities to the ISTAT comuni list via the Sigla automobilistica (2-letter province code) to compute the correct (state_id, state_code) for each city, preferring metropolitan-city / province / free-consortium / autonomous-province / decentralized-regional-entity over the parent region. Resolution order: name-match (region-validated) -> conjunction-half match (e.g. "Lampedusa" -> "Lampedusa e Linosa") -> k-NN proximity vote against the name-matched cluster, capped at 25km. Aliases handle English-only names like "Venice" -> "Venezia". Bundles the ISTAT CSV (CC-BY 3.0 IT) and the most recent run's report under bin/scripts/fixes/data/ for reproducibility. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
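The first two stages of the resolution order above (name match, then conjunction-half match, with an alias lookup in front) can be sketched as follows. This is a minimal illustration, not the code in `italy_remap_cities.py`: the function names, the `ALIASES` table and the three-row sample are hypothetical, region validation is left out, and only the " e " conjunction is split.

```python
from __future__ import annotations

# Hypothetical alias table for English-only names; the real script's table is not shown here.
ALIASES = {"venice": "venezia"}


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so lookups ignore case and spacing."""
    return " ".join(name.lower().split())


def build_indexes(comuni: list[tuple[str, str]]):
    """Return (exact, halves): full comune names and 'X e Y' halves, each mapped to siglas."""
    exact: dict[str, set[str]] = {}
    halves: dict[str, set[str]] = {}
    for comune, sigla in comuni:
        key = normalise(comune)
        exact.setdefault(key, set()).add(sigla)
        if " e " in key:
            for half in key.split(" e "):
                halves.setdefault(half, set()).add(sigla)
    return exact, halves


def resolve(city: str, exact, halves):
    """Return (method, sigla), or None when the city must fall through to proximity."""
    key = normalise(city)
    key = ALIASES.get(key, key)
    for method, index in (("name", exact), ("conjunction", halves)):
        siglas = index.get(key, set())
        if len(siglas) == 1:  # ambiguous names are left for a later stage
            return method, next(iter(siglas))
    return None


# Tiny illustrative sample, not the bundled ISTAT list.
EXACT, HALVES = build_indexes(
    [("Venezia", "VE"), ("Lampedusa e Linosa", "AG"), ("Pero", "MI")]
)
print(resolve("Venice", EXACT, HALVES))
print(resolve("Lampedusa", EXACT, HALVES))
print(resolve("Tessera", EXACT, HALVES))
```

Keeping the exact and half-name indexes separate means a full-name match always wins over a conjunction-half match, which mirrors the stated order.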
Generated by bin/scripts/fixes/italy_remap_cities.py. Cities were
previously parented to the 20 Italian regions; they now point at the
correct ISO 3166-2 province-level entity:
- 14 metropolitan cities (BA, BO, CA, CT, FI, GE, ME, MI, NA, PA,
RC, RM, TO, VE) — all populated for the first time.
- 80 provinces, 6 free municipal consortia (Sicilia), 2 autonomous
provinces (BZ, TN), 4 decentralized regional entities (Friuli).
- Aosta Valley comuni stay on the autonomous region (no province-
level entity exists for sigla AO).
Counts: 9947 input -> 9828 changed, 119 unchanged, 0 unmapped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the mapping source (ISTAT comune list joined on Sigla automobilistica), per-state-type counts after the remap, edge cases (corrupt native fields, Lampedusa bounds gap, Tessera airport), the 8 possible-duplicate pairs flagged for maintainer review, and the validation checks run locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
) Lampedusa (lat ~35.50, the southernmost Italian comune in the Pelagie archipelago, part of Agrigento free municipal consortium) was outside country-bounds.json's IT box (minLat 36.65), causing validate-coordinates.js to flag the record. Lower minLat to 35.49 so Lampedusa and nearby Linosa fall inside the box. Discovered while validating PR #1395 (issue #1349 Italy city remap). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
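A small sketch of why the old box rejected Lampedusa and the new one accepts it. Only the `minLat` values (36.65 and 35.49) come from the commit message; the `maxLat`, `minLon` and `maxLon` key names and values, the Lampedusa longitude, and the `in_box` helper are assumptions made for illustration.

```python
# Only minLat comes from the commit message; the other three edges are illustrative.
OLD_BOX = {"minLat": 36.65, "maxLat": 47.10, "minLon": 6.60, "maxLon": 18.60}
NEW_BOX = {**OLD_BOX, "minLat": 35.49}

# Approximate coordinates for Lampedusa (latitude from the commit message).
LAMPEDUSA = (35.50, 12.60)


def in_box(lat: float, lon: float, box: dict) -> bool:
    """True when the point lies inside the inclusive bounding box."""
    return (
        box["minLat"] <= lat <= box["maxLat"]
        and box["minLon"] <= lon <= box["maxLon"]
    )


print(in_box(*LAMPEDUSA, OLD_BOX))
print(in_box(*LAMPEDUSA, NEW_BOX))
```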
Weekly data-quality review (2026-04-27)
Verdict: clean
Checks
Advisory (non-blocking)
🤖 Automated weekly review: Claude (sonnet-4-6). Generated by Claude Code
…#1397/#1399) The remap is a behavior change for downstream consumers — region-level state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays because cities live under provinces/metropolitan cities, not regions. Documents the traversal pattern (states.parent_id) needed for region-aggregate queries so users know how to migrate.
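The `states.parent_id` traversal mentioned above might look like this for a consumer working from the JSON exports. It is a sketch under assumptions: the ids are invented, the state list is assumed to be pre-filtered to Italy, and `cities_in_region` is a hypothetical helper rather than part of any SDK.

```python
# Invented ids; iso2 codes for Sicily (82), Lombardy (25), Palermo (PA), Milan (MI).
STATES = [
    {"id": 1, "iso2": "82", "parent_id": None},
    {"id": 2, "iso2": "PA", "parent_id": 1},
    {"id": 3, "iso2": "25", "parent_id": None},
    {"id": 4, "iso2": "MI", "parent_id": 3},
]
CITIES = [
    {"name": "Palermo", "state_id": 2, "state_code": "PA"},
    {"name": "Cefalù", "state_id": 2, "state_code": "PA"},
    {"name": "Milan", "state_id": 4, "state_code": "MI"},
]


def cities_in_region(region_iso2, states, cities):
    """Collect cities whose state is the region itself or one of its children."""
    region_ids = {s["id"] for s in states if s["iso2"] == region_iso2}
    child_ids = {s["id"] for s in states if s.get("parent_id") in region_ids}
    wanted = region_ids | child_ids
    return [c for c in cities if c["state_id"] in wanted]


# The old-style direct filter now finds nothing for a region code.
print([c["name"] for c in CITIES if c["state_code"] == "82"])
# The parent_id traversal recovers the region aggregate.
print([c["name"] for c in cities_in_region("82", STATES, CITIES)])
```

Including the region's own id in the lookup keeps cases such as the Aosta Valley working, where comuni remain attached to the region.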
Pull request overview
This PR addresses issue #1349 by remapping Italy’s city records from region-level parents to the correct province-level entities (metropolitan cities, provinces, autonomous provinces, etc.), using ISTAT’s official comune list as the authoritative mapping source.
Changes:
- Adds a reproducible remap script that resolves each city via name/alias/conjunction matching with a proximity (k-NN) fallback.
- Bundles ISTAT mapping data and a generated structured remap report to support auditing/reproducibility.
- Documents the methodology, counts, validation, and flagged potential duplicates for follow-up.
Reviewed changes
Copilot reviewed 1 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| bin/scripts/fixes/italy_remap_cities.py | Implements the two-pass city→province remapping logic (ISTAT join + proximity fallback) and rewrites contributions/cities/IT.json. |
| bin/scripts/fixes/data/it_remap_report.json | Stores the structured output from a remap run (counts, distributions, possible duplicates, samples). |
| .github/fixes-docs/FIX_1349_SUMMARY.md | Fix documentation summarizing rationale, approach, and validation results for the Italy city remap. |
| Key | Count | Meaning |
|---|---|---|
| input | 9947 | |
| name_unique | 7373 | exact ISTAT match in same region |
| name_region | 13 | multiple ISTAT matches, region tie-break |
| name_ambiguous | 69 | matches existed but in another region (rejected -> proximity) |
| name_conjunction | 25 | matched via "X e Y" half |
| no_match | 2467 | no ISTAT name (frazione / hamlet / historical name) |
| proximity_assigned | 2512 | resolved by 5-NN vote within 25 km |
| proximity_skipped | 0 | none rejected for being too far |
| changed | 9828 | state_id and/or state_code rewritten |
| unchanged | 119 | already pointed at the correct province (mostly RA, BT) |
| unmapped | 0 | every record reached a final assignment |
In the counts block, the key proximity_skipped doesn’t match the script/report field name (proximity_skipped_or_far). For consistency (and to make it easier to cross-check with it_remap_report.json), consider updating the label in this doc to use the same key name as the generated report.
| Key | Count | Meaning |
|---|---|---|
| input | 9947 | |
| name_unique | 7373 | exact ISTAT match in same region |
| name_region | 13 | multiple ISTAT matches, region tie-break |
| name_ambiguous | 69 | matches existed but in another region (rejected -> proximity) |
| name_conjunction | 25 | matched via "X e Y" half |
| no_match | 2467 | no ISTAT name (frazione / hamlet / historical name) |
| proximity_assigned | 2512 | resolved by 5-NN vote within 25 km |
| proximity_skipped_or_far | 0 | none rejected for being too far |
| changed | 9828 | state_id and/or state_code rewritten |
| unchanged | 119 | already pointed at the correct province (mostly RA, BT) |
| unmapped | 0 | every record reached a final assignment |
Past machine-translation runs polluted the native field for many IT cities (e.g. Pero -> native "Ma", Postal -> native "Postale", Panchià -> native "Possono agganciare", Pareto -> native "Libbra" which is "pound"). The name field already holds the canonical Italian form (Pomigliano d'Arco, Sant'Ambrogio di Torino, etc.), so where the city's name matches an ISTAT comune, native is now copied from name. Cities whose name is not an ISTAT comune (~2,500 frazioni) are left untouched — no authoritative replacement exists. Counts: 9947 input -> 6070 already correct, 2499 no ISTAT match, 1378 restored. Stacks on top of #1395 (the city remap PR). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
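The restore rule described above (copy `name` into `native` only when `name` is an ISTAT comune) can be sketched like this. The `restore_native` function, the counter labels, and the order in which the categories are checked are assumptions for illustration; the sample rows reuse examples from the commit message plus invented ones.

```python
def restore_native(cities, istat_names):
    """Overwrite `native` with `name` for ISTAT comuni; leave all other rows untouched."""
    counts = {"already_correct": 0, "no_istat_match": 0, "restored": 0}
    for city in cities:
        if city["name"] not in istat_names:
            counts["no_istat_match"] += 1  # frazioni etc.: no authoritative replacement
        elif city.get("native") == city["name"]:
            counts["already_correct"] += 1
        else:
            city["native"] = city["name"]
            counts["restored"] += 1
    return counts


# Illustrative rows; "Tessera" stands in for a non-comune locality.
SAMPLE = [
    {"name": "Pero", "native": "Ma"},
    {"name": "Postal", "native": "Postale"},
    {"name": "Milano", "native": "Milano"},
    {"name": "Tessera", "native": "Tessera"},
]
print(restore_native(SAMPLE, {"Pero", "Postal", "Milano"}))
```

Running the function a second time on the same rows reports nothing to restore, which is the property an idempotent fix script would want.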
…p) (#1479)
Dropped the legacy half of each pair, keeping the ISTAT-canonical (or English-name, per repo convention) record:
- id 58976 'Pozzaglio' -> kept 58977 'Pozzaglio ed Uniti'
- id 61329 'Torino' -> kept 61575 'Turin'
- id 61530 "Trinità d'Agultu" -> kept 61531 "Trinità d'Agultu e Vignola"
- id 139215 'Inverno' -> kept 139216 'Inverno e Monteleone'
- id 139523 'Limite' -> kept 136799 'Capraia e Limite'
- id 140714 'Napoli' -> kept 140713 'Naples'
Two pairs are intentionally NOT touched and require maintainer review, since neither record carries the ISTAT-canonical merged name:
- MN: 'Sermide' (id 60744) + 'Felonica' (id 138474); canonical comune is 'Sermide e Felonica' (since 2017).
- PV: 'Corteolona' (id 138065) + 'Genzone' (id 138905); canonical comune is 'Corteolona e Genzone' (since 2018).
Stacks on top of #1395 and #1397.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#1481)
PR #1479 deferred these two merged-comune pairs because neither half carried the modern ISTAT-canonical name. Resolves them now:
- Pair MN: rename id 60744 'Sermide' → 'Sermide e Felonica' (native + name); set wikiDataId Q39681; drop id 138474 'Felonica'.
- Pair PV: rename id 138065 'Corteolona' → 'Corteolona e Genzone' (native + name); set wikiDataId Q3702780; drop id 138905 'Genzone'.
Row count: 9941 → 9939. Refs #1349, #1479.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(postcodes/KH): bulk-import 1,640 Cambodia postcodes via Cambodia Post + fix KH regex (#1039)
Source: Cambodia Post 2017-reform 6-digit catalogue redistributed via the seanghay/cambodia-postal-codes JSON. All 25 provinces resolve at 100% via direct numeric-iso2 lookup: the source's "id" field (1-25) is identical to CSC's state.iso2 for Cambodia provinces. Records dedupe at (postcode, sangkat + district) granularity.
Also fixes the Cambodia postal_code_regex/format in countries.json: the previous "#####" / "^(\\d{5})$" never matched Cambodia Post's post-2017 6-digit codes (e.g. 120101 for Phnom Penh / Khan Chamkar Mon / Tonle Basak) and would have rejected every legitimate row. Updated to "######" / "^(\\d{6})$".
Refs #1039.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(IT): drop 87 placeholder Provincia rows (#1349 follow-up)
Removes 87 placeholder "Provincia ..." records (ids 59104-59190) from contributions/cities/IT.json. These were leftover province-level pseudo-cities from the pre-#1395 schema; after the city→province remap, every real comune resolves directly to its province via state_id, so the placeholders are duplicate concepts.
contributions/cities/IT.json: 9,947 → 9,860.
Refs #1349.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
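The Cambodia regex change in the first commit above is easy to check directly. The two patterns and the sample code 120101 are taken from the commit message (written here as Python raw strings rather than JSON-escaped); the variable names are illustrative.

```python
import re

OLD_KH = re.compile(r"^(\d{5})$")  # previous countries.json pattern
NEW_KH = re.compile(r"^(\d{6})$")  # updated pattern for post-2017 codes

sample = "120101"  # example code quoted in the commit message
print(bool(OLD_KH.match(sample)))
print(bool(NEW_KH.match(sample)))
```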
…n) (#1352 PR-C) (#1392) * feat(postcodes/DK): bulk-import 1,089 codes via DAWA (#1039) Adds Danish postcodes via DAWA (Danmarks Adressers Web API) — public sector data published under CC-0 by SDFI/Dataforsyningen. 1. bin/scripts/sync/import_denmark_postcodes.py — pipeline that fetches /kommuner to build a kommune-code -> region-name map, then resolves each /postnumre record's region via its first kommune. Maps the 5 Danish region names to states.json iso2 codes: Region Hovedstaden -> 84 (called "Denmark" in states.json) Region Sjælland -> 85 (Zealand) Region Syddanmark -> 83 (Southern Denmark) Region Midtjylland -> 82 (Central Denmark) Region Nordjylland -> 81 (North Denmark) 2. contributions/postcodes/DK.json — 1,089 codes covering all 5 regions with 100% state_id + 100% coordinate resolution. Validation (zero errors) - All codes match countries.postal_code_regex (^(\\d{4})\$) - All FKs resolve, all state_codes agree with state.iso2 License & attribution - Source: SDFI / Dataforsyningen DAWA (CC-0) - Each row: source: "dawa" Refs: #1039 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(postcodes/IS): bulk-import 195 codes via iceaddr (#1039) Adds Icelandic postcodes via the sveinbjornt/iceaddr Python package which embeds the canonical postcode metadata under MIT licence. 1. bin/scripts/sync/import_iceland_postcodes.py — pipeline that dynamically imports the iceaddr POSTCODES dict and resolves each code's region via prefix range to states.json iso2 1-8 (Statistics Iceland's NUTS-3 boundaries: 1xx-2xx Capital, 3xx Western, 4xx Westfjords, 5xx Northwestern, 6xx Northeastern, 7xx Eastern, 8xx-9xx Southern). 2. contributions/postcodes/IS.json — 195 records with 100% state_id resolution. Locality names combine stadur_nf + lysing (e.g. "Reykjavík, Miðborg"). 
License & attribution - Source: iceaddr (MIT) embedding Pósturinn data - Each row: source: "iceaddr" Refs: #1039 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(postcodes/SK+RO+SI): batch-import 15,585 codes via 3 community mirrors (#1039) Bundles three small-to-medium European countries with confirmed redistributable postcode mirrors into a single batch importer. 1. bin/scripts/sync/import_eu_batch1_postcodes.py — pipeline that ingests three different shapes (SK JSON, RO CSV, SI CSV) and writes per-country JSON files. ASCII-folding + dash-to-space normalisation handles the Romanian Caraș-Severin / Bistrița-Năsăud cases where the CSV uses spaces and states.json uses hyphens. 2. contributions/postcodes/SK.json — 1,312 records (100% state via KRAJ -> states.iso2 direct match) 3. contributions/postcodes/RO.json — 13,751 records (100% state via ASCII-folded judet name match; all 6 Bucharest sectors mapped to 'B') 4. contributions/postcodes/SI.json — 522 records, country-only by design (source has no municipality info; SI postcodes don't map cleanly to administrative regions) Validation (zero errors) - All codes match countries.postal_code_regex - All FKs resolve, all state_codes agree with state.iso2 License & attribution - SK source: github.com/FeroVolar/PSC-JSON (community Slovenská pošta data) - RO source: github.com/alexionegit/coduripostaleRomaniaPS - SI source: github.com/dlabs/postcode_si (community Posta Slovenije data) Refs: #1039 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(changelog): add notable callout for IT city→province remap (#1395/#1397/#1399) The remap is a behavior change for downstream consumers — region-level state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays because cities live under provinces/metropolitan cities, not regions. Documents the traversal pattern (states.parent_id) needed for region-aggregate queries so users know how to migrate. 
* docs: multi-level territories policy (FR overseas, dual representation) (#1352 PR-C) Adds MULTI_LEVEL_TERRITORIES.md documenting why 12 French overseas territories (and analogous US/CN/NO entities) appear simultaneously as ISO 3166-1 countries and as ISO 3166-2 subdivisions of their parent state. Captures the maintainer's Option C decision on #1352: keep both representations because (1) downstream API/SDK consumers filter on country_code, (2) ISO 3166-1 lists them as countries, and (3) the breaking change is unjustified for a labelling concern. Cross-links the new policy doc from .claude/CLAUDE.md (Important Rules) and README.md (contributing section). No data changes. Refs: #1352 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
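The ASCII-folding plus dash-to-space normalisation mentioned for the Romanian import can be sketched with the standard library alone. The `fold` helper is hypothetical (the importer's own function is not shown in this thread); it relies on Unicode decomposition to strip diacritics.

```python
import unicodedata


def fold(name: str) -> str:
    """Strip diacritics, turn hyphens into spaces, lower-case, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(ascii_only.replace("-", " ").lower().split())


# The hyphenated states.json spelling and a space-separated CSV spelling
# reduce to the same key.
print(fold("Caraș-Severin") == fold("Caras Severin"))
print(fold("Bistrița-Năsăud"))
```

Comparing folded keys on both sides avoids editing either dataset just to make the join succeed.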
Reassigns 8,727 of 10,079 French cities from the 12 metropolitan regions plus the Corsica collectivity (20R) to the correct INSEE department-level state (01-95, 2A, 2B, 75C). Mirrors the IT remap shipped in #1395.
Endpoints like GET /v1/countries/FR/states/03/cities (Allier) used to return [] because all of Allier's communes sat under the parent region ARA. After this fix Allier holds 59 cities. Same was true for every other metropolitan department.
Resolution cascade (offline, dependency-free, idempotent):
1. INSEE name match in current region (region tie-break + nearest coord)
2. INSEE name match anywhere within 25km
3. 5-NN proximity vote weighted by inverse distance, capped at 25km
Only state_id / state_code are mutated. name, native, latitude, longitude, wikiDataId, translations, population, timezone are preserved verbatim. 0 unmapped, 0 deleted; re-run produces 0 changes.
Bundles the geo.api.gouv.fr commune dataset (Etalab Licence Ouverte v2.0, ODbL-1.0 compatible) under bin/scripts/fixes/data/ for reproducibility.
Refs #1352; does not close (sibling PRs A/B/C/D handle other facets).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
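The third stage of the cascade, a 5-NN vote weighted by inverse distance and capped at 25 km, might be sketched as below. This is not the shipped script: `haversine_km`, `proximity_vote`, the distance floor, and the synthetic coordinates are all assumptions made for illustration.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def proximity_vote(lat, lon, anchors, k=5, max_km=25.0):
    """anchors: (lat, lon, state_code) tuples that were already name-matched."""
    ranked = sorted(
        (haversine_km(lat, lon, a_lat, a_lon), code) for a_lat, a_lon, code in anchors
    )
    nearby = [(d, code) for d, code in ranked if d <= max_km][:k]
    if not nearby:
        return None  # nothing close enough: leave the record unresolved
    weights = defaultdict(float)
    for d, code in nearby:
        weights[code] += 1.0 / max(d, 0.001)  # floor avoids division by zero
    return max(weights, key=weights.get)


# Synthetic anchors: one very close "03" point outweighs two "58" points about 20 km away.
ANCHORS = [(46.60, 3.31, "03"), (46.78, 3.30, "58"), (46.42, 3.30, "58")]
print(proximity_vote(46.60, 3.30, ANCHORS))
```

With plain majority voting the two farther anchors would win here; the inverse-distance weighting is what lets the nearest neighbour dominate.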
…hy (#1489)
Customer-facing follow-up to #1349 (Italy) and #1352 (France). Cities were re-parented onto departments (FR) and provinces (IT) by #1395 / #1394 / #1393 / #1400 / #1484, but the state records themselves still carried inconsistent 'level' values, blocking downstream filters like "all departments == level=2" or "all regions == level=1".
bin/scripts/fixes/states_level_normalise.py drives the change:
- FR: 29 region-tier rows None -> 1 (13 metro regions, 3 special metro collectivities incl. Corse + Alsace + Métropole de Lyon, 13 overseas regions/collectivities/territories/dependency). 95 metropolitan departments unchanged at level=2.
- IT: 103 rows updated. Final state: 20 at level=1 (15 region + 5 autonomous region) and 106 at level=2 (80 province + 14 metropolitan city + 6 free municipal consortium + 4 decentralized regional entity + 2 autonomous province).
Only the 'level' field is touched; idempotent on re-run; non-FR/IT states untouched.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
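A type-to-level mapping along the lines described for Italy could look like the following. The `LEVEL_BY_TYPE` table and `normalise_levels` are hypothetical, and the exact `type` strings are assumed from the wording of the commit message rather than read from states.json.

```python
# Assumed type strings; only "metropolitan city" is confirmed by the jq check in this PR.
LEVEL_BY_TYPE = {
    "region": 1,
    "autonomous region": 1,
    "province": 2,
    "metropolitan city": 2,
    "free municipal consortium": 2,
    "decentralized regional entity": 2,
    "autonomous province": 2,
}


def normalise_levels(states, country="IT"):
    """Set `level` from `type` for one country; return how many rows changed."""
    changed = 0
    for state in states:
        if state.get("country_code") != country:
            continue  # other countries are left untouched
        target = LEVEL_BY_TYPE.get(state.get("type"))
        if target is not None and state.get("level") != target:
            state["level"] = target
            changed += 1
    return changed


SAMPLE_STATES = [
    {"country_code": "IT", "type": "region", "level": None},
    {"country_code": "IT", "type": "metropolitan city", "level": None},
    {"country_code": "DE", "type": "state", "level": None},
]
print(normalise_levels(SAMPLE_STATES))  # first run changes the IT rows
print(normalise_levels(SAMPLE_STATES))  # second run changes nothing
```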
Refs #1349 (Italy data: cities reported as 'totally wrong'). Do not auto-close — this PR is the city-level follow-up to the earlier states-only fix; the issue should remain open until cleanup of the duplicates flagged below.
Summary
After-state distribution
Approach
Authoritative source: ISTAT Elenco dei comuni italiani (CC-BY 3.0 IT, 7,896 comuni), bundled at `bin/scripts/fixes/data/istat-elenco-comuni-italiani.csv`. The join key is Sigla automobilistica, the 2-letter province plate code (`RM` for Rome, `TO` for Turin, `BZ` for Bolzano), which matches our `state.iso2` 1:1 for every province-level entity.
`bin/scripts/fixes/italy_remap_cities.py` resolves each city in this order:
… `parent_id`). 7,373 cities matched.
The script is idempotent: re-running on the rewritten data produces 0 changes.
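The join on Sigla automobilistica described above can be sketched with the standard `csv` module. Everything concrete here is an assumption for illustration: the column headers, the `;` delimiter, the state ids, and the helper names may differ from the bundled file and the real script.

```python
import csv
import io

# Illustrative extract; header names and delimiter are assumed, not verified.
SAMPLE_CSV = """Denominazione in italiano;Sigla automobilistica
Roma;RM
Torino;TO
Aosta;AO
"""

# Invented ids; iso2 values for the two province-level entities.
STATES_BY_ISO2 = {
    "RM": {"id": 101, "iso2": "RM"},
    "TO": {"id": 102, "iso2": "TO"},
}


def sigla_by_comune(csv_text: str) -> dict:
    """Map comune name -> 2-letter plate code."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    return {
        row["Denominazione in italiano"]: row["Sigla automobilistica"]
        for row in reader
    }


def state_for(comune: str, siglas: dict, states_by_iso2: dict):
    """Return (state_id, state_code), or None when no province-level state exists."""
    state = states_by_iso2.get(siglas.get(comune))
    return (state["id"], state["iso2"]) if state else None


SIGLAS = sigla_by_comune(SAMPLE_CSV)
print(state_for("Roma", SIGLAS, STATES_BY_ISO2))
print(state_for("Aosta", SIGLAS, STATES_BY_ISO2))  # no entity for sigla AO
```

A `None` result is where a caller would keep the existing region-level parent, as the PR does for Aosta Valley comuni.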
Local validation (mirrors `.github/scripts/validate-*`)
- … `state_id` resolves to an IT state; every `state_code` equals `state.iso2`.
- … `^Q\d+$`.
- … `country-bounds.json` IT box. Lampedusa (lat 35.5°) sits just south of the bounds, but that's a pre-existing `country-bounds.json` gap: Lampedusa is the southernmost Italian comune, geographically closer to Tunisia than the mainland. Not caused by this remap.
Possible duplicates flagged for maintainer review (NOT deleted)
8 pairs / groups now map to a single comune. Per the task constraints, no records were deleted:
Most are conjunction-merger artefacts (older comuni that ISTAT now lists under unified names) or English/Italian language duplicates. Recommend keeping the ISTAT-canonical row and removing the legacy half in a follow-up.
Out-of-scope but flagged for follow-up
- … `native` field. Many `native` values look machine-translated (e.g. `Pero` → native `Ma`; `Postal` → native `Postale`). The script intentionally matches on `name` rather than `native`. A separate fix should restore correct Italian comune names to `native`.
- … `country-bounds.json` IT box does not include the Pelagie islands (Lampedusa/Linosa). Worth widening `IT.minLat` to 35.5° in a one-line PR.
Commits
- `feat(IT): add italy_remap_cities.py + ISTAT comune mapping data (#1349)`: script + bundled CSV + structured JSON report.
- `feat(IT): remap 9,828 cities to provinces / metropolitan cities (#1349)`: pure data diff (only `state_id` and `state_code` fields touched).
- `docs(IT): add FIX_1349_SUMMARY for cities remap (#1349)`: fix-docs entry.
Test plan
- `python3 bin/scripts/fixes/italy_remap_cities.py --dry-run` reports 9947 input / 0 changes after merge (idempotent).
- `jq '[.[] | select(.country_code=="IT" and .type=="metropolitan city")] | length' contributions/states/states.json` returns 14, and each metro state has cities (`jq '[.[] | select(.state_code=="MI")] | length' contributions/cities/IT.json` etc.).
- `validate-schema`, `validate-cross-reference`, `validate-coordinates`, `detect-duplicates` pass.
🤖 Generated with Claude Code
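The cross-reference checks named in the validation notes and test plan can be approximated in a few lines of Python. This is a sketch, not the repository's `validate-*` scripts: the `validate` function and its error strings are hypothetical, and applying the `^Q\d+$` pattern to `wikiDataId` is an assumption based on the surviving fragment.

```python
import re

WIKIDATA_RE = re.compile(r"^Q\d+$")


def validate(cities, states):
    """Return (city_id, message) pairs for every cross-reference problem found."""
    it_states = {s["id"]: s for s in states if s.get("country_code") == "IT"}
    errors = []
    for city in cities:
        state = it_states.get(city["state_id"])
        if state is None:
            errors.append((city["id"], "state_id does not resolve to an IT state"))
        elif city["state_code"] != state["iso2"]:
            errors.append((city["id"], "state_code differs from state.iso2"))
        wikidata = city.get("wikiDataId")
        if wikidata and not WIKIDATA_RE.match(wikidata):
            errors.append((city["id"], "wikiDataId is not of the form Q<digits>"))
    return errors


# Invented ids for illustration.
STATES_SAMPLE = [{"id": 10, "iso2": "MI", "country_code": "IT"}]
CITIES_SAMPLE = [
    {"id": 1, "state_id": 10, "state_code": "MI", "wikiDataId": "Q490"},
    {"id": 2, "state_id": 10, "state_code": "25", "wikiDataId": "Q1"},
    {"id": 3, "state_id": 99, "state_code": "MI", "wikiDataId": "490"},
]
for city_id, message in validate(CITIES_SAMPLE, STATES_SAMPLE):
    print(city_id, message)
```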