
fix(IT): restore corrupted native fields (#1349 follow-up) #1475

Closed
dr5hn wants to merge 4 commits into master from fix/issue-1349-italy-native-field

@dr5hn (Owner) commented Apr 27, 2026

Re-targeted from #1397 after the parent (#1395) was merged.

Refs #1349. Restores 1,378 corrupted native fields on Italian cities (machine-translation artefacts such as Pero → Ma, Postal → Postale, Pareto → Libbra).

Where a city's name matches an ISTAT comune, this PR sets native = name. Cities whose name is not an ISTAT comune (~2,500 frazioni) are intentionally left untouched.

Counts

  • input 9,947
  • already correct 6,070 (native already == name)
  • no ISTAT match 2,499 (left untouched — frazioni)
  • restored 1,378
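
For reviewers, the rule above can be sketched as follows. This is a minimal illustration, not the actual `italy_restore_native.py`: the function name `restore_native`, the exact-string comparison, and the order of the two checks are assumptions, and the sample records (including the frazione name) are made up.

```python
from collections import Counter

def restore_native(cities, istat_names):
    """Set native = name where name is an ISTAT comune (hypothetical helper).

    Mutates the city dicts in place and returns per-bucket counts
    using the same buckets as the PR description.
    """
    stats = Counter()
    for city in cities:
        stats["input"] += 1
        name = city["name"]
        if city.get("native") == name:
            stats["already_correct"] += 1
        elif name not in istat_names:
            # Assumed to be a frazione: no authoritative value, leave untouched.
            stats["no_istat_match"] += 1
        else:
            city["native"] = name
            stats["restored"] += 1
    return stats

# Illustrative records only; the real input is contributions/cities/IT.json.
cities = [
    {"name": "Pero", "native": "Ma"},
    {"name": "Pareto", "native": "Libbra"},
    {"name": "Milano", "native": "Milano"},
    {"name": "Frazione Esempio", "native": "Esempio"},  # made-up non-comune
]
istat_names = {"Pero", "Pareto", "Milano"}
stats = restore_native(cities, istat_names)
```

The three output buckets should always sum to the input count, which is the invariant the counts above satisfy (6,070 + 2,499 + 1,378 = 9,947).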

🤖 Generated with Claude Code

dr5hn and others added 4 commits April 25, 2026 19:11
Adds a self-contained Python script that joins our IT cities to the
ISTAT comuni list via the Sigla automobilistica (2-letter province code)
to compute the correct (state_id, state_code) for each city, preferring
metropolitan-city / province / free-consortium / autonomous-province /
decentralized-regional-entity over the parent region.

Resolution order: name-match (region-validated) -> conjunction-half
match (e.g. "Lampedusa" -> "Lampedusa e Linosa") -> k-NN proximity
vote against the name-matched cluster, capped at 25km. Aliases handle
English-only names like "Venice" -> "Venezia".

Bundles the ISTAT CSV (CC-BY 3.0 IT) and the most recent run's report
under bin/scripts/fixes/data/ for reproducibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
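
The resolution order described in this commit could look roughly like the sketch below. It is not the code in `italy_remap_cities.py`: the function names, the data shapes, the value of `k`, and the simple majority vote are assumptions, and the region validation of name matches is omitted for brevity. Only the three-step order, the 25 km cap, and the alias idea come from the commit message.

```python
import math
from collections import Counter

MAX_KM = 25.0  # proximity cap stated in the commit message

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def resolve(city, comuni, anchors, aliases=None, k=5):
    """Return (sigla, method); (None, "unmapped") if nothing applies.

    comuni:  ISTAT comune name -> sigla (hypothetical shape)
    anchors: (lat, lon, sigla) of cities already resolved by name
    """
    name = (aliases or {}).get(city["name"], city["name"])
    # 1. Direct name match (region validation omitted in this sketch).
    if name in comuni:
        return comuni[name], "name"
    # 2. Conjunction-half match, e.g. "Lampedusa" in "Lampedusa e Linosa".
    for full, sigla in comuni.items():
        parts = full.split(" e ")
        if len(parts) > 1 and name in parts:
            return sigla, "conjunction"
    # 3. k-NN vote among name-matched anchors within the distance cap.
    near = sorted(
        (haversine_km(city["lat"], city["lon"], lat, lon), sigla)
        for lat, lon, sigla in anchors
    )
    votes = Counter(sigla for dist, sigla in near[:k] if dist <= MAX_KM)
    if votes:
        return votes.most_common(1)[0][0], "knn"
    return None, "unmapped"

# Illustrative data; coordinates are approximate.
comuni = {"Venezia": "VE", "Padova": "PD", "Lampedusa e Linosa": "AG"}
aliases = {"Venice": "Venezia"}
anchors = [(45.4408, 12.3155, "VE"), (45.4064, 11.8768, "PD")]

venice = resolve({"name": "Venice", "lat": 45.44, "lon": 12.33}, comuni, anchors, aliases)
lampedusa = resolve({"name": "Lampedusa", "lat": 35.50, "lon": 12.61}, comuni, anchors, aliases)
mestre = resolve({"name": "Mestre", "lat": 45.49, "lon": 12.24}, comuni, anchors, aliases)
nowhere = resolve({"name": "Nowhere", "lat": 0.0, "lon": 0.0}, comuni, anchors, aliases)
```

Ordering the steps from most to least certain means the proximity vote only ever decides cities that no name-based rule could place.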

Generated by bin/scripts/fixes/italy_remap_cities.py. Cities were
previously parented to the 20 Italian regions; they now point at the
correct ISO 3166-2 province-level entity:
  - 14 metropolitan cities (BA, BO, CA, CT, FI, GE, ME, MI, NA, PA,
    RC, RM, TO, VE) — all populated for the first time.
  - 80 provinces, 6 free municipal consortia (Sicilia), 2 autonomous
    provinces (BZ, TN), 4 decentralized regional entities (Friuli).
  - Aosta Valley comuni stay on the autonomous region (no province-
    level entity exists for sigla AO).

Counts: 9947 input -> 9828 changed, 119 unchanged, 0 unmapped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
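
A rough sketch of the re-parenting step, preferring the province-level entity and falling back to the region where none exists (the Aosta Valley case). The `state_id` values, dictionary names, and the `remap` helper are hypothetical; only the preference rule and the changed/unchanged/unmapped buckets come from the commit message.

```python
from collections import Counter

# Hypothetical state rows; the ids are made up for illustration.
PROVINCE_LEVEL = {
    "MI": {"state_id": 1001, "state_code": "MI"},  # metropolitan city
    "BZ": {"state_id": 1002, "state_code": "BZ"},  # autonomous province
}
REGION_FALLBACK = {
    "AO": {"state_id": 2001, "state_code": "23"},  # Aosta Valley stays on the region
}

def remap(cities, sigla_of):
    """Point each city at its province-level entity, else its region."""
    stats = Counter()
    for city in cities:
        sigla = sigla_of.get(city["name"])
        target = PROVINCE_LEVEL.get(sigla) or REGION_FALLBACK.get(sigla)
        if target is None:
            stats["unmapped"] += 1
        elif (city["state_id"], city["state_code"]) == (target["state_id"], target["state_code"]):
            stats["unchanged"] += 1
        else:
            city.update(target)  # only state_id and state_code are touched
            stats["changed"] += 1
    return stats

remap_cities = [
    {"name": "Milano", "state_id": 2002, "state_code": "25"},  # was on the region
    {"name": "Aosta", "state_id": 2001, "state_code": "23"},
]
remap_stats = remap(remap_cities, {"Milano": "MI", "Aosta": "AO"})
```

As with the reported run (9,828 changed + 119 unchanged + 0 unmapped = 9,947), the three buckets should add up to the input size.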

Documents the mapping source (ISTAT comune list joined on Sigla
automobilistica), per-state-type counts after the remap, edge cases
(corrupt native fields, Lampedusa bounds gap, Tessera airport), the
8 possible-duplicate pairs flagged for maintainer review, and the
validation checks run locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Past machine-translation runs polluted the native field for many IT
cities (e.g. Pero -> native "Ma", Postal -> native "Postale",
Panchià -> native "Possono agganciare", Pareto -> native "Libbra"
which is "pound"). The name field already holds the canonical Italian
form (Pomigliano d'Arco, Sant'Ambrogio di Torino, etc.), so where the
city's name matches an ISTAT comune, native is now copied from name.

Cities whose name is not an ISTAT comune (~2,500 frazioni) are left
untouched — no authoritative replacement exists.

Counts: 9947 input -> 6070 already correct, 2499 no ISTAT match, 1378
restored.

Stacks on top of #1395 (the city remap PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 14:50
@dosubot (bot) added the size:L label (This PR changes 100-499 lines, ignoring generated files.) Apr 27, 2026
@dr5hn (Owner, Author) commented Apr 27, 2026

Superseded by clean rebased branch — see new PR.

@dr5hn dr5hn closed this Apr 27, 2026

Copilot AI left a comment


Pull request overview

Restores corrupted native fields for Italian cities by aligning native with name when the city name matches an ISTAT comune, and adds supporting scripts/reports/documentation.

Changes:

  • Add a native-field restoration script for Italy that updates contributions/cities/IT.json in place.
  • Add structured JSON reports capturing before/after statistics and example changes.
  • Add Italy fix summary documentation (currently focused on the city remap).

Reviewed changes

Copilot reviewed 1 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `bin/scripts/fixes/italy_restore_native.py` | New script to restore corrupted native values using an ISTAT name index. |
| `bin/scripts/fixes/italy_remap_cities.py` | Adds/contains the Italy city remap implementation (scope appears broader than this PR’s description). |
| `bin/scripts/fixes/data/it_remap_report.json` | Structured report output for the remap script run. |
| `bin/scripts/fixes/data/it_native_restore_report.json` | Structured report output for the native-restore script run. |
| `.github/fixes-docs/FIX_1349_SUMMARY.md` | Summary doc for FIX-1349 (currently describes remap; needs alignment with native-restore work in this PR). |

## Edge cases

### Native field is unreliable
Many `native` values in `IT.json` have been corrupted by past machine-translation runs (e.g. `"Pero" → native: "Ma"`, `"Postal" → native: "Postale"`). The script intentionally matches on `name`, not `native`. Fixing `native` is out of scope for this PR but should be tracked separately.

Copilot AI Apr 27, 2026


This section says fixing the native field is out of scope, but this PR is explicitly restoring corrupted native values (and adds italy_restore_native.py + its report). Please update the summary so it matches the current PR scope (e.g., reference the native-restore fix or move this note to the remap-only doc if it’s meant to describe a different PR).

Suggested change
Many `native` values in `IT.json` have been corrupted by past machine-translation runs (e.g. `"Pero" → native: "Ma"`, `"Postal" → native: "Postale"`). The script intentionally matches on `name`, not `native`. Fixing `native` is out of scope for this PR but should be tracked separately.
Many `native` values in `IT.json` have been corrupted by past machine-translation runs (e.g. `"Pero" → native: "Ma"`, `"Postal" → native: "Postale"`). The remap script intentionally matches on `name`, not `native`, to avoid propagating those bad values. This PR also restores corrupted `native` values via the dedicated native-restore step (`italy_restore_native.py`) and its report, so the remap and native-field repair are both covered here.


## Files Modified

- `contributions/cities/IT.json` — 9,828 records updated (only `state_id` and `state_code` fields).

Copilot AI Apr 27, 2026


The “Files Modified” bullet claims IT.json updates were only state_id/state_code, but this PR also changes native for many cities. Please update this line (or split the documentation) so the described fields match what the PR actually mutates.

Suggested change
- `contributions/cities/IT.json` — 9,828 records updated (only `state_id` and `state_code` fields).
- `contributions/cities/IT.json` — 9,828 records updated (`state_id`, `state_code`, and `native` fields).


Labels

data:cities, large-contribution, size:L (This PR changes 100-499 lines, ignoring generated files.)
