
feat(IT): remap cities to metropolitan cities and provinces (#1349) #1395

Merged
dr5hn merged 3 commits into master from feat/issue-1349-italy-city-remap
Apr 27, 2026

Conversation

@dr5hn
Owner

@dr5hn dr5hn commented Apr 25, 2026

Refs #1349 (Italy data: cities reported as 'totally wrong'). Do not auto-close — this PR is the city-level follow-up to the earlier states-only fix; the issue should remain open until cleanup of the duplicates flagged below.

Summary

  • 9,947 IT cities were parented to the 20 region-level entities. None pointed at any of the 14 metropolitan cities, and the more granular provinces / free consortia / autonomous provinces / decentralized regional entities were similarly under-used.
  • This PR remaps 9,828 cities to the correct ISO 3166-2:IT province-level entity. 119 records (mostly Ravenna and Barletta-Andria-Trani frazioni already correctly parented) are unchanged. 0 records remain unmapped.

After-state distribution

| Type | Cities |
| --- | ---: |
| Metropolitan city | 1,682 |
| Province | 7,337 |
| Autonomous province | 390 |
| Free municipal consortium | 176 |
| Decentralized regional entity | 271 |
| Autonomous region (Aosta only) | 91 |
| Total | 9,947 |

Approach

Authoritative source: ISTAT Elenco dei comuni italiani (CC-BY 3.0 IT, 7,896 comuni), bundled at bin/scripts/fixes/data/istat-elenco-comuni-italiani.csv. The join key is Sigla automobilistica — the 2-letter province plate code (RM for Rome, TO for Turin, BZ for Bolzano), which matches our state.iso2 1:1 for every province-level entity.
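
The join on the plate code can be sketched roughly as follows. This is an illustration only, not the code in italy_remap_cities.py: the column names `denominazione` and `sigla`, the sample rows, and the state ids are all hypothetical stand-ins (the real ISTAT file uses different headers and many more columns).

```python
import csv
import io

# Hypothetical, simplified stand-in for the ISTAT CSV; the real file uses
# different (Italian) column headers and many more columns.
SAMPLE_CSV = """denominazione;sigla
Roma;RM
Torino;TO
Bolzano;BZ
"""

# Hypothetical subset of IT state records (id values are made up).
STATES = [
    {"id": 1, "iso2": "RM", "type": "metropolitan city"},
    {"id": 2, "iso2": "TO", "type": "metropolitan city"},
    {"id": 3, "iso2": "BZ", "type": "autonomous province"},
]


def build_comune_index(csv_text, states):
    """Map comune name -> (state_id, state_code) via the plate code."""
    by_iso2 = {s["iso2"]: s for s in states}
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=";"):
        state = by_iso2.get(row["sigla"])
        if state is not None:  # skip siglas with no province-level entity
            index[row["denominazione"]] = (state["id"], state["iso2"])
    return index


comune_index = build_comune_index(SAMPLE_CSV, STATES)
```

Keying the lookup on `iso2` works here only because, as stated above, the plate code matches `state.iso2` 1:1 for every province-level entity.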

bin/scripts/fixes/italy_remap_cities.py resolves each city in this order:

  1. Name match, region-validated. Folds diacritics / apostrophes, looks up ISTAT denomination, requires the candidate's region to equal the city's current region (walked up via parent_id). 7,373 cities matched.
  2. English-name aliases (Venice → Venezia, Florence → Firenze, etc.).
  3. Conjunction-half match for "X e Y"-style comuni (Lampedusa → Lampedusa e Linosa). 25 cities.
  4. k-NN proximity vote (k=5, ≤25km cap) for frazioni and historical names. 2,512 cities. More robust than single-nearest matching at province borders: Mestre, Murano, Lido and Burano now snap to the Venice metropolitan city instead of Treviso province.
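
A minimal sketch of the proximity vote in step 4, assuming a plain majority vote over haversine distances. The function names and the tie-breaking behaviour (the nearest anchor's code wins a tie, a side effect of `Counter` ordering) are assumptions for illustration, not taken from the script.

```python
import math
from collections import Counter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def knn_vote(lat, lon, anchors, k=5, max_km=25.0):
    """Return the majority state_code among the k nearest anchors within max_km.

    anchors: iterable of (lat, lon, state_code) for already-resolved cities.
    Returns None when no anchor lies within the cap.
    """
    near = sorted(
        (haversine_km(lat, lon, a_lat, a_lon), code)
        for a_lat, a_lon, code in anchors
    )
    votes = Counter(code for dist, code in near[:k] if dist <= max_km)
    return votes.most_common(1)[0][0] if votes else None


# Made-up anchor coordinates, for illustration only.
ANCHORS = [
    (45.44, 12.33, "VE"),
    (45.49, 12.24, "VE"),
    (45.46, 12.35, "VE"),
    (45.67, 12.24, "TV"),
    (45.60, 12.30, "TV"),
]
winner = knn_vote(45.48, 12.28, ANCHORS)
```

With a single-nearest rule, one stray anchor across a province border decides the outcome; voting over five neighbours lets the surrounding cluster outweigh it.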

The script is idempotent: re-running on the rewritten data produces 0 changes.

Local validation (mirrors .github/scripts/validate-*)

  • ✅ Schema: required fields present, country_code=IT, country_id=107.
  • ✅ Cross-reference: every state_id resolves to an IT state; every state_code equals state.iso2.
  • ✅ wikiDataId: all match ^Q\d+$.
  • ✅ Coordinates: 9,946/9,947 within country-bounds.json IT box. Lampedusa (lat 35.5°) sits just south of the bounds, but that's a pre-existing country-bounds.json gap — Lampedusa is the southernmost Italian comune, geographically closer to Tunisia than the mainland. Not caused by this remap.
  • ✅ Same-name + ≤5km duplicate scan: 0 hits.
  • ✅ JSON parses cleanly.
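
The cross-reference, wikiDataId and bounds checks above could look roughly like this in Python. It is a sketch under assumptions, not the repository's validators: only `minLat` is named in this PR, so the other bounds keys, the bounds values other than 36.65 and 35.49, and the sample record are hypothetical.

```python
import re

WIKIDATA_RE = re.compile(r"^Q\d+$")


def validate_city(city, states_by_id, bounds):
    """Return a list of problems for one city record (empty list = clean)."""
    problems = []
    state = states_by_id.get(city["state_id"])
    if state is None or state["country_code"] != "IT":
        problems.append("state_id does not resolve to an IT state")
    elif city["state_code"] != state["iso2"]:
        problems.append("state_code != state.iso2")
    if not WIKIDATA_RE.match(city.get("wikiDataId") or ""):
        problems.append("bad wikiDataId")
    lat, lon = float(city["latitude"]), float(city["longitude"])
    if not (bounds["minLat"] <= lat <= bounds["maxLat"]
            and bounds["minLng"] <= lon <= bounds["maxLng"]):
        problems.append("outside country bounds")
    return problems


# Hypothetical records; ids, the Wikidata id and most bounds values are made up.
STATES_BY_ID = {10: {"id": 10, "iso2": "AG", "country_code": "IT"}}
LAMPEDUSA = {
    "state_id": 10, "state_code": "AG", "wikiDataId": "Q123",
    "latitude": "35.50", "longitude": "12.60",
}
OLD_BOUNDS = {"minLat": 36.65, "maxLat": 47.1, "minLng": 6.6, "maxLng": 18.6}
NEW_BOUNDS = dict(OLD_BOUNDS, minLat=35.49)
```

Run against `OLD_BOUNDS`, the Lampedusa record fails only the bounds check; with `minLat` lowered to 35.49 it passes.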

Possible duplicates flagged for maintainer review (NOT deleted)

8 pairs / groups now map to a single comune. Per the task constraints, no records were deleted:

| Sigla | Comune | Cities |
| --- | --- | --- |
| CR | Pozzaglio ed Uniti | "Pozzaglio", "Pozzaglio ed Uniti" |
| FI | Capraia e Limite | "Capraia e Limite", "Limite" |
| MN | Sermide e Felonica | "Sermide", "Felonica" |
| NA | Napoli | "Naples", "Napoli" |
| PV | Corteolona e Genzone | "Corteolona", "Genzone" |
| PV | Inverno e Monteleone | "Inverno", "Inverno e Monteleone" |
| SS | Trinità d'Agultu e Vignola | "Trinità d'Agultu", "Trinità d'Agultu e Vignola" |
| TO | Torino | "Torino", "Turin" |

Most are conjunction-merger artefacts (older comuni that ISTAT now lists under unified names) or English/Italian language duplicates. Recommend keeping the ISTAT-canonical row and removing the legacy half in a follow-up.

Out-of-scope but flagged for follow-up

  • Corrupt native field. Many native values look machine-translated (e.g. Pero → native Ma; Postal → native Postale). The script intentionally matches on name rather than native. A separate fix should restore correct Italian comune names to native.
  • country-bounds.json IT box does not include the Pelagie islands (Lampedusa/Linosa). Worth widening IT.minLat to 35.5° in a one-line PR.
  • Tessera (Venice airport area) still maps to Treviso because its 5-NN cluster is mixed; not enough impact to hand-tune.

Commits

  1. feat(IT): add italy_remap_cities.py + ISTAT comune mapping data (#1349) — script + bundled CSV + structured JSON report.
  2. feat(IT): remap 9,828 cities to provinces / metropolitan cities (#1349) — pure data diff (only state_id and state_code fields touched).
  3. docs(IT): add FIX_1349_SUMMARY for cities remap (#1349) — fix-docs entry.

Test plan

  • python3 bin/scripts/fixes/italy_remap_cities.py --dry-run reports 9947 input / 0 changes after merge (idempotent).
  • jq '[.[] | select(.country_code=="IT" and .type=="metropolitan city")] | length' contributions/states/states.json returns 14, and each metro state has cities (jq '[.[] | select(.state_code=="MI")] | length' contributions/cities/IT.json etc.).
  • CI's validate-schema, validate-cross-reference, validate-coordinates, detect-duplicates pass.
  • After Phinx import + JSON export, the round-trip preserves the new state assignments.

🤖 Generated with Claude Code

dr5hn and others added 3 commits April 25, 2026 19:11
Adds a self-contained Python script that joins our IT cities to the
ISTAT comuni list via the Sigla automobilistica (2-letter province code)
to compute the correct (state_id, state_code) for each city, preferring
metropolitan-city / province / free-consortium / autonomous-province /
decentralized-regional-entity over the parent region.

Resolution order: name-match (region-validated) -> conjunction-half
match (e.g. "Lampedusa" -> "Lampedusa e Linosa") -> k-NN proximity
vote against the name-matched cluster, capped at 25km. Aliases handle
English-only names like "Venice" -> "Venezia".

Bundles the ISTAT CSV (CC-BY 3.0 IT) and the most recent run's report
under bin/scripts/fixes/data/ for reproducibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generated by bin/scripts/fixes/italy_remap_cities.py. Cities were
previously parented to the 20 Italian regions; they now point at the
correct ISO 3166-2 province-level entity:
  - 14 metropolitan cities (BA, BO, CA, CT, FI, GE, ME, MI, NA, PA,
    RC, RM, TO, VE) — all populated for the first time.
  - 80 provinces, 6 free municipal consortia (Sicilia), 2 autonomous
    provinces (BZ, TN), 4 decentralized regional entities (Friuli).
  - Aosta Valley comuni stay on the autonomous region (no province-
    level entity exists for sigla AO).

Counts: 9947 input -> 9828 changed, 119 unchanged, 0 unmapped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the mapping source (ISTAT comune list joined on Sigla
automobilistica), per-state-type counts after the remap, edge cases
(corrupt native fields, Lampedusa bounds gap, Tessera airport), the
8 possible-duplicate pairs flagged for maintainer review, and the
validation checks run locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn added a commit that referenced this pull request Apr 25, 2026
)

Lampedusa (lat ~35.50, the southernmost Italian comune in the Pelagie
archipelago, part of Agrigento free municipal consortium) was outside
country-bounds.json's IT box (minLat 36.65), causing
validate-coordinates.js to flag the record. Lower minLat to 35.49 so
Lampedusa and nearby Linosa fall inside the box.

Discovered while validating PR #1395 (issue #1349 Italy city remap).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner Author

dr5hn commented Apr 27, 2026

Weekly data-quality review (2026-04-27)

Verdict: clean

Checks

  • Schema: ✅ Only state_id and state_code modified on existing records; no new records and no auto-managed fields added.
  • FK integrity: ✅ Author ran cross-reference validator: all 9,947 cities resolve state_id to an IT state whose iso2 matches state_code. ISTAT CSV join + k-NN proximity fallback methodology is sound and independently reproducible.
  • Coordinates: ✅ No coordinate changes. Pre-existing edge case: Lampedusa (~35.5°N) sits south of the country-bounds.json IT minLat of 36.65 (lowered to 35.49 by the follow-up commit referenced above); not introduced by this PR.
  • Wikidata: N/A (no Wikidata field changes)
  • Naming convention: ✅ No name/native field changes. English-name convention preserved (Turin, Naples retained); duplicate Italian-name rows are handled in PR #1399, fix(IT): drop 6 duplicate pairs flagged by remap (#1349 follow-up).

Advisory (non-blocking)

  • Scale — 9,828 of 9,947 records remapped in a single PR. Individual mapping verification is impractical; the ISTAT join methodology, the idempotency --dry-run check, and the bundled it_remap_report.json are the primary audit trails.
  • 8 flagged duplicate pairs, 2 still pending: Sermide/Felonica and Corteolona/Genzone (both merged comuni where neither row carries the modern ISTAT canonical name) remain open and require maintainer sign-off on which ID to rename and which to drop. The 6 unambiguous pairs are addressed in PR #1399, fix(IT): drop 6 duplicate pairs flagged by remap (#1349 follow-up).
  • country-bounds.json IT minLat gap — The Pelagie islands (Lampedusa ~35.5°N, Linosa ~35.8°N) lie south of the current bound. A one-line change widening IT.minLat to 35.0 would eliminate false coordinate-validation failures for those islands.

🤖 Automated weekly review — Claude (sonnet-4-6).


Generated by Claude Code

dr5hn added a commit that referenced this pull request Apr 27, 2026
…#1397/#1399)

The remap is a behavior change for downstream consumers — region-level
state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays
because cities live under provinces/metropolitan cities, not regions.
Documents the traversal pattern (states.parent_id) needed for
region-aggregate queries so users know how to migrate.
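
A hedged sketch of the traversal this commit message refers to, assuming a state list already filtered to Italy and a single level of nesting (region, then province-level entity). The ids and sample rows are made up; only the `parent_id` field and the region codes 25 and 82 come from the text above.

```python
def cities_in_region(region_code, states, cities):
    """Collect cities whose state is the region itself or one of its children."""
    region = next(s for s in states if s["iso2"] == region_code)
    wanted = {region["id"]} | {
        s["id"] for s in states if s.get("parent_id") == region["id"]
    }
    return [c for c in cities if c["state_id"] in wanted]


# Hypothetical IT-only sample data (ids are made up).
STATES = [
    {"id": 100, "iso2": "25", "type": "region", "parent_id": None},
    {"id": 101, "iso2": "MI", "type": "metropolitan city", "parent_id": 100},
    {"id": 102, "iso2": "BG", "type": "province", "parent_id": 100},
    {"id": 200, "iso2": "82", "type": "region", "parent_id": None},
    {"id": 201, "iso2": "PA", "type": "metropolitan city", "parent_id": 200},
]
CITIES = [
    {"name": "Milan", "state_id": 101},
    {"name": "Bergamo", "state_id": 102},
    {"name": "Palermo", "state_id": 201},
]
lombardy = cities_in_region("25", STATES, CITIES)
```

The region's own id stays in the lookup set so that cities left on a region, as with Aosta Valley, are still returned.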
@dr5hn dr5hn marked this pull request as ready for review April 27, 2026 14:44
Copilot AI review requested due to automatic review settings April 27, 2026 14:44
@dr5hn dr5hn merged commit 6d05e6e into master Apr 27, 2026
1 check passed
@dosubot dosubot Bot added the size:L label Apr 27, 2026
@dr5hn dr5hn deleted the feat/issue-1349-italy-city-remap branch April 27, 2026 14:44
@dosubot dosubot Bot added the enhancement New feature or request label Apr 27, 2026

Copilot AI left a comment


Pull request overview

This PR addresses issue #1349 by remapping Italy’s city records from region-level parents to the correct province-level entities (metropolitan cities, provinces, autonomous provinces, etc.), using ISTAT’s official comune list as the authoritative mapping source.

Changes:

  • Adds a reproducible remap script that resolves each city via name/alias/conjunction matching with a proximity (k-NN) fallback.
  • Bundles ISTAT mapping data and a generated structured remap report to support auditing/reproducibility.
  • Documents the methodology, counts, validation, and flagged potential duplicates for follow-up.

Reviewed changes

Copilot reviewed 1 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| bin/scripts/fixes/italy_remap_cities.py | Implements the two-pass city→province remapping logic (ISTAT join + proximity fallback) and rewrites contributions/cities/IT.json. |
| bin/scripts/fixes/data/it_remap_report.json | Stores the structured output from a remap run (counts, distributions, possible duplicates, samples). |
| .github/fixes-docs/FIX_1349_SUMMARY.md | Fix documentation summarizing rationale, approach, and validation results for the Italy city remap. |

Comment on lines +50 to +62
input 9947
name_unique 7373 exact ISTAT match in same region
name_region 13 multiple ISTAT matches, region tie-break
name_ambiguous 69 matches existed but in another region (rejected -> proximity)
name_conjunction 25 matched via "X e Y" half
no_match 2467 no ISTAT name (frazione / hamlet / historical name)

proximity_assigned 2512 resolved by 5-NN vote within 25 km
proximity_skipped 0 none rejected for being too far

changed 9828 state_id and/or state_code rewritten
unchanged 119 already pointed at the correct province (mostly RA, BT)
unmapped 0 every record reached a final assignment

Copilot AI Apr 27, 2026


In the counts block, the key proximity_skipped doesn’t match the script/report field name (proximity_skipped_or_far). For consistency (and to make it easier to cross-check with it_remap_report.json), consider updating the label in this doc to use the same key name as the generated report.

Suggested change
  input 9947
  name_unique 7373 exact ISTAT match in same region
  name_region 13 multiple ISTAT matches, region tie-break
  name_ambiguous 69 matches existed but in another region (rejected -> proximity)
  name_conjunction 25 matched via "X e Y" half
  no_match 2467 no ISTAT name (frazione / hamlet / historical name)
  proximity_assigned 2512 resolved by 5-NN vote within 25 km
- proximity_skipped 0 none rejected for being too far
+ proximity_skipped_or_far 0 none rejected for being too far
  changed 9828 state_id and/or state_code rewritten
  unchanged 119 already pointed at the correct province (mostly RA, BT)
  unmapped 0 every record reached a final assignment

dr5hn added a commit that referenced this pull request Apr 27, 2026
Past machine-translation runs polluted the native field for many IT
cities (e.g. Pero -> native "Ma", Postal -> native "Postale",
Panchià -> native "Possono agganciare", Pareto -> native "Libbra"
which is "pound"). The name field already holds the canonical Italian
form (Pomigliano d'Arco, Sant'Ambrogio di Torino, etc.), so where the
city's name matches an ISTAT comune, native is now copied from name.

Cities whose name is not an ISTAT comune (~2,500 frazioni) are left
untouched — no authoritative replacement exists.

Counts: 9947 input -> 6070 already correct, 2499 no ISTAT match, 1378
restored.
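
The restore rule described above might be sketched like this. It assumes exact string matching against a set of ISTAT comune names; the actual script may fold names first, and the function name, outcome labels and sample rows here are hypothetical.

```python
def restore_native(cities, istat_names):
    """Copy name into native where name is an ISTAT comune; count outcomes."""
    counts = {"already_correct": 0, "no_istat_match": 0, "restored": 0}
    for city in cities:
        if city["name"] not in istat_names:
            counts["no_istat_match"] += 1  # e.g. frazioni: left untouched
        elif city.get("native") == city["name"]:
            counts["already_correct"] += 1
        else:
            city["native"] = city["name"]
            counts["restored"] += 1
    return counts


# Hypothetical sample; "Ma" is the corrupted native value quoted above.
ISTAT_NAMES = {"Pero", "Roma"}
SAMPLE = [
    {"name": "Pero", "native": "Ma"},
    {"name": "Roma", "native": "Roma"},
    {"name": "Mestre", "native": "Mestre"},
]
outcome = restore_native(SAMPLE, ISTAT_NAMES)
```

Running it a second time on the same rows restores nothing, because every matched row already has `native == name`.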

Stacks on top of #1395 (the city remap PR).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn added a commit that referenced this pull request Apr 27, 2026
…p) (#1479)

Dropped the legacy half of each pair, keeping the ISTAT-canonical (or
English-name, per repo convention) record:

  id 58976  'Pozzaglio'              -> kept 58977  'Pozzaglio ed Uniti'
  id 61329  'Torino'                 -> kept 61575  'Turin'
  id 61530  'Trinità d\'Agultu'      -> kept 61531  'Trinità d\'Agultu e Vignola'
  id 139215 'Inverno'                -> kept 139216 'Inverno e Monteleone'
  id 139523 'Limite'                 -> kept 136799 'Capraia e Limite'
  id 140714 'Napoli'                 -> kept 140713 'Naples'

Two pairs are intentionally NOT touched and require maintainer review,
since neither record carries the ISTAT-canonical merged name:
  - MN: 'Sermide' (id 60744) + 'Felonica' (id 138474)
        canonical comune is 'Sermide e Felonica' (since 2017).
  - PV: 'Corteolona' (id 138065) + 'Genzone' (id 138905)
        canonical comune is 'Corteolona e Genzone' (since 2018).

Stacks on top of #1395 and #1397.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn added a commit that referenced this pull request Apr 27, 2026
…#1481)

PR #1479 deferred these two merged-comune pairs because neither half
carried the modern ISTAT-canonical name. Resolves them now:

- Pair MN: rename id 60744 'Sermide' → 'Sermide e Felonica'
  (native + name); set wikiDataId Q39681; drop id 138474 'Felonica'.
- Pair PV: rename id 138065 'Corteolona' → 'Corteolona e Genzone'
  (native + name); set wikiDataId Q3702780; drop id 138905 'Genzone'.

Row count: 9941 → 9939. Refs #1349, #1479.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn added a commit that referenced this pull request Apr 27, 2026
* feat(postcodes/KH): bulk-import 1,640 Cambodia postcodes via Cambodia Post + fix KH regex (#1039)

Source: Cambodia Post 2017-reform 6-digit catalogue redistributed via
the seanghay/cambodia-postal-codes JSON. All 25 provinces resolve at
100% via direct numeric-iso2 lookup — the source's "id" field (1-25)
is identical to CSC's state.iso2 for Cambodia provinces. Records
dedupe at (postcode, sangkat + district) granularity.

Also fixes the Cambodia postal_code_regex/format in countries.json:
the previous "#####" / "^(\\d{5})$" never matched Cambodia Post's
post-2017 6-digit codes (e.g. 120101 for Phnom Penh / Khan Chamkar
Mon / Tonle Basak) and would have rejected every legitimate row.
Updated to "######" / "^(\\d{6})$".
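
For illustration, a quick Python check of the two patterns quoted above against the example code 120101 (variable names are hypothetical):

```python
import re

OLD_KH = re.compile(r"^(\d{5})$")  # previous pattern: exactly 5 digits
NEW_KH = re.compile(r"^(\d{6})$")  # updated pattern: exactly 6 digits

code = "120101"  # example code quoted in the commit message
old_ok = OLD_KH.match(code) is not None
new_ok = NEW_KH.match(code) is not None
```

Here `old_ok` is false and `new_ok` is true, which is the mismatch the regex fix addresses.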

Refs #1039.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(IT): drop 87 placeholder Provincia rows (#1349 follow-up)

Removes 87 placeholder "Provincia ..." records (ids 59104-59190)
from contributions/cities/IT.json. These were leftover province-level
pseudo-cities from the pre-#1395 schema; after the city→province
remap, every real comune resolves directly to its province via
state_id, so the placeholders are duplicate concepts.

contributions/cities/IT.json: 9,947 → 9,860.

Refs #1349.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn added a commit that referenced this pull request Apr 27, 2026
…n) (#1352 PR-C) (#1392)

* feat(postcodes/DK): bulk-import 1,089 codes via DAWA (#1039)

Adds Danish postcodes via DAWA (Danmarks Adressers Web API) — public
sector data published under CC-0 by SDFI/Dataforsyningen.

1. bin/scripts/sync/import_denmark_postcodes.py — pipeline that fetches
   /kommuner to build a kommune-code -> region-name map, then resolves
   each /postnumre record's region via its first kommune. Maps the 5
   Danish region names to states.json iso2 codes:
     Region Hovedstaden -> 84 (called "Denmark" in states.json)
     Region Sjælland    -> 85 (Zealand)
     Region Syddanmark  -> 83 (Southern Denmark)
     Region Midtjylland -> 82 (Central Denmark)
     Region Nordjylland -> 81 (North Denmark)

2. contributions/postcodes/DK.json — 1,089 codes covering all 5 regions
   with 100% state_id + 100% coordinate resolution.

Validation (zero errors)
- All codes match countries.postal_code_regex (^(\\d{4})\$)
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- Source: SDFI / Dataforsyningen DAWA (CC-0)
- Each row: source: "dawa"

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postcodes/IS): bulk-import 195 codes via iceaddr (#1039)

Adds Icelandic postcodes via the sveinbjornt/iceaddr Python package
which embeds the canonical postcode metadata under MIT licence.

1. bin/scripts/sync/import_iceland_postcodes.py — pipeline that
   dynamically imports the iceaddr POSTCODES dict and resolves each
   code's region via prefix range to states.json iso2 1-8 (Statistics
   Iceland's NUTS-3 boundaries: 1xx-2xx Capital, 3xx Western, 4xx
   Westfjords, 5xx Northwestern, 6xx Northeastern, 7xx Eastern,
   8xx-9xx Southern).

2. contributions/postcodes/IS.json — 195 records with 100% state_id
   resolution. Locality names combine stadur_nf + lysing
   (e.g. "Reykjavík, Miðborg").

License & attribution
- Source: iceaddr (MIT) embedding Pósturinn data
- Each row: source: "iceaddr"

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(postcodes/SK+RO+SI): batch-import 15,585 codes via 3 community mirrors (#1039)

Bundles three small-to-medium European countries with confirmed
redistributable postcode mirrors into a single batch importer.

1. bin/scripts/sync/import_eu_batch1_postcodes.py — pipeline that
   ingests three different shapes (SK JSON, RO CSV, SI CSV) and writes
   per-country JSON files. ASCII-folding + dash-to-space normalisation
   handles the Romanian Caraș-Severin / Bistrița-Năsăud cases where
   the CSV uses spaces and states.json uses hyphens.

2. contributions/postcodes/SK.json — 1,312 records (100% state via
   KRAJ -> states.iso2 direct match)
3. contributions/postcodes/RO.json — 13,751 records (100% state via
   ASCII-folded judet name match; all 6 Bucharest sectors mapped to 'B')
4. contributions/postcodes/SI.json — 522 records, country-only by
   design (source has no municipality info; SI postcodes don't map
   cleanly to administrative regions)
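
The ASCII-folding plus dash-to-space normalisation mentioned in item 1 can be sketched with the standard library as below. The function name `fold_name` is hypothetical and the importer's own implementation may differ.

```python
import unicodedata


def fold_name(name):
    """Lower-case, strip diacritics and treat hyphens as spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(ascii_only.replace("-", " ").lower().split())


csv_form = fold_name("Caras Severin")      # spelling with a space, as in the CSV
states_form = fold_name("Caraș-Severin")   # spelling with a hyphen, as in states.json
```

Both spellings fold to the same key, so the two sources can be joined on the folded name.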

Validation (zero errors)
- All codes match countries.postal_code_regex
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- SK source: github.com/FeroVolar/PSC-JSON (community Slovenská pošta data)
- RO source: github.com/alexionegit/coduripostaleRomaniaPS
- SI source: github.com/dlabs/postcode_si (community Posta Slovenije data)

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): add notable callout for IT city→province remap (#1395/#1397/#1399)

The remap is a behavior change for downstream consumers — region-level
state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays
because cities live under provinces/metropolitan cities, not regions.
Documents the traversal pattern (states.parent_id) needed for
region-aggregate queries so users know how to migrate.

* docs: multi-level territories policy (FR overseas, dual representation) (#1352 PR-C)

Adds MULTI_LEVEL_TERRITORIES.md documenting why 12 French overseas
territories (and analogous US/CN/NO entities) appear simultaneously as
ISO 3166-1 countries and as ISO 3166-2 subdivisions of their parent state.

Captures the maintainer's Option C decision on #1352: keep both
representations because (1) downstream API/SDK consumers filter on
country_code, (2) ISO 3166-1 lists them as countries, and (3) the
breaking change is unjustified for a labelling concern.

Cross-links the new policy doc from .claude/CLAUDE.md (Important Rules)
and README.md (contributing section).

No data changes.

Refs: #1352

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn added a commit that referenced this pull request Apr 27, 2026
Reassigns 8,727 of 10,079 French cities from the 12 metropolitan regions
plus the Corsica collectivity (20R) to the correct INSEE department-level
state (01-95, 2A, 2B, 75C). Mirrors the IT remap shipped in #1395.

Endpoints like GET /v1/countries/FR/states/03/cities (Allier) used to
return [] because all of Allier's communes sat under the parent region
ARA. After this fix Allier holds 59 cities. Same was true for every
other metropolitan department.

Resolution cascade (offline, dependency-free, idempotent):
1. INSEE name match in current region (region tie-break + nearest coord)
2. INSEE name match anywhere within 25km
3. 5-NN proximity vote weighted by inverse distance, capped at 25km
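
Step 3 differs from a plain majority vote in that closer neighbours count for more. A minimal sketch, assuming weights of 1/distance with a small floor to avoid division by zero; the function names, the `eps_km` floor and the sample coordinates are assumptions for illustration.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def weighted_knn_vote(lat, lon, anchors, k=5, max_km=25.0, eps_km=0.1):
    """Pick the state_code with the largest sum of 1/distance over the k nearest."""
    scored = sorted(
        (haversine_km(lat, lon, a_lat, a_lon), code)
        for a_lat, a_lon, code in anchors
    )
    weights = defaultdict(float)
    for dist, code in scored[:k]:
        if dist <= max_km:
            weights[code] += 1.0 / max(dist, eps_km)  # floor avoids division by zero
    return max(weights, key=weights.get) if weights else None


# Made-up coordinates: one very close "03" anchor, three farther "63" anchors.
ANCHORS = [
    (46.01, 3.00, "03"),
    (46.15, 3.00, "63"),
    (46.15, 3.05, "63"),
    (46.15, 2.95, "63"),
]
winner = weighted_knn_vote(46.0, 3.0, ANCHORS)
```

In this sample the single nearby anchor outweighs three distant ones, which an unweighted vote would not allow.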

Only state_id / state_code are mutated. name, native, latitude, longitude,
wikiDataId, translations, population, timezone are preserved verbatim.
0 unmapped, 0 deleted; re-run produces 0 changes.

Bundles the geo.api.gouv.fr commune dataset (Etalab Licence Ouverte v2.0,
ODbL-1.0 compatible) under bin/scripts/fixes/data/ for reproducibility.
Refs #1352 — does not close (sibling PRs A/B/C/D handle other facets).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn added a commit that referenced this pull request Apr 27, 2026
…hy (#1489)

Customer-facing follow-up to #1349 (Italy) and #1352 (France). Cities
were re-parented onto departments (FR) and provinces (IT) by #1395 /
#1394 / #1393 / #1400 / #1484, but the state records themselves still
carried inconsistent 'level' values, blocking downstream filters like
"all departments == level=2" or "all regions == level=1".

bin/scripts/fixes/states_level_normalise.py drives the change:
  - FR: 29 region-tier rows None -> 1 (13 metro regions, 3 special
        metro collectivities incl. Corse + Alsace + Métropole de Lyon,
        13 overseas regions/collectivities/territories/dependency).
        95 metropolitan departments unchanged at level=2.
  - IT: 103 rows updated. Final state: 20 at level=1
        (15 region + 5 autonomous region) and 106 at level=2
        (80 province + 14 metropolitan city + 6 free municipal
        consortium + 4 decentralized regional entity + 2 autonomous
        province).

Only the 'level' field is touched; idempotent on re-run; non-FR/IT
states untouched.
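
A rough sketch of a type-to-level normalisation for the IT rows, assuming the `type` strings are the lower-cased forms of the categories listed above (only "metropolitan city" is confirmed by the jq check in this PR). It is not the code in states_level_normalise.py.

```python
# Assumed type strings; only "metropolitan city" appears verbatim in the PR.
LEVEL_BY_TYPE = {
    "region": 1,
    "autonomous region": 1,
    "province": 2,
    "metropolitan city": 2,
    "free municipal consortium": 2,
    "decentralized regional entity": 2,
    "autonomous province": 2,
}


def normalise_levels(states, country_code="IT"):
    """Set 'level' from 'type' for one country; return number of rows changed."""
    changed = 0
    for state in states:
        if state.get("country_code") != country_code:
            continue  # other countries are left untouched
        level = LEVEL_BY_TYPE.get(state.get("type"))
        if level is not None and state.get("level") != level:
            state["level"] = level
            changed += 1
    return changed


# Hypothetical sample rows.
SAMPLE = [
    {"country_code": "IT", "type": "region", "level": None},
    {"country_code": "IT", "type": "metropolitan city", "level": None},
    {"country_code": "DE", "type": "state", "level": None},
]
first_run = normalise_levels(SAMPLE)
second_run = normalise_levels(SAMPLE)
```

The second call changes nothing, which is the idempotency property the commit message claims.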

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files.

2 participants