Skip to content

Commit 2952960

Browse files
dr5hnclaude
andauthored
fix(IT): drop 88 placeholder Provincia rows (#1349 follow-up) (#1482)
* feat(postcodes/KH): bulk-import 1,640 Cambodia postcodes via Cambodia Post + fix KH regex (#1039) Source: Cambodia Post 2017-reform 6-digit catalogue redistributed via the seanghay/cambodia-postal-codes JSON. All 25 provinces resolve at 100% via direct numeric-iso2 lookup — the source's "id" field (1-25) is identical to CSC's state.iso2 for Cambodia provinces. Records dedupe at (postcode, sangkat + district) granularity. Also fixes the Cambodia postal_code_regex/format in countries.json: the previous "#####" / "^(\\d{5})$" never matched Cambodia Post's post-2017 6-digit codes (e.g. 120101 for Phnom Penh / Khan Chamkar Mon / Tonle Basak) and would have rejected every legitimate row. Updated to "######" / "^(\\d{6})$". Refs #1039. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(IT): drop 87 placeholder Provincia rows (#1349 follow-up) Removes 87 placeholder "Provincia ..." records (ids 59104-59190) from contributions/cities/IT.json. These were leftover province-level pseudo-cities from the pre-#1395 schema; after the city→province remap, every real comune resolves directly to its province via state_id, so the placeholders are duplicate concepts. contributions/cities/IT.json: 9,947 → 9,860. Refs #1349. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a95f2fb commit 2952960

3 files changed

Lines changed: 179 additions & 3567 deletions

File tree

.github/fixes-docs/FIX_1349_SUMMARY.md

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -148,3 +148,46 @@ The script is idempotent: a second run on already-remapped data produces 0 chang
148148
- Wikipedia — Provinces of Italy: https://en.wikipedia.org/wiki/Provinces_of_Italy
149149
- Wikipedia — Metropolitan cities of Italy: https://en.wikipedia.org/wiki/Metropolitan_cities_of_Italy
150150
- Predecessor fix: `.github/fixes-docs/ITALY_MISSING_METROPOLITAN_CITIES_AND_AUTONOMOUS_PROVINCES.md`
151+
152+
---
153+
154+
## Follow-up: drop placeholder "Provincia ..." rows (2026-04-27)
155+
156+
The reporter on #1349 specifically called out "Provincia di Lucca don't have sense to exist". Those rows were not real comuni — they were **province-level placeholder cities** carried over from the pre-#1395 schema, when cities pointed at regions and a separate "Provincia di X" pseudo-city stood in for the province below the region. After the #1395 remap, every real comune resolves directly to its province via `state_id` / `state_code`, so the placeholders are a duplicate concept and have been removed.
157+
158+
### Selection
159+
160+
- **Range:** ids 59104–59190 inclusive (contiguous, no gaps).
161+
- **Count:** 87 rows.
162+
- **Names:** all start with `Provincia ` — covers the standard `Provincia di X` form plus a handful of variants (`Provincia autonoma di Trento`, `Provincia dell' Aquila`, `Provincia Verbano-Cusio-Ossola`).
163+
164+
### Counts
165+
166+
| | Before | After |
167+
|---|--:|--:|
168+
| `IT.json` city records | 9,947 | 9,860 |
169+
| Rows starting with `Provincia ` | 87 | 0 |
170+
171+
(Note: the prompt's expected post-#1479 baseline of 9,941 was off by 6; the correct baseline at the time of this PR was 9,947, confirmed by `git log` and `jq '. \| length'`. Range arithmetic 59190 − 59104 + 1 = 87, not 88.)
172+
173+
### Implementation
174+
175+
`bin/scripts/fixes/italy_drop_provincia_placeholders.py` — defensive double-predicate filter (id range AND name prefix). Refuses to touch rows in the id range that don't match the name prefix (exit 2). Idempotent — second run is a no-op.
176+
177+
### Validation
178+
179+
-`jq '. | length'` → 9,860.
180+
-`jq '[.[] | select(.name | startswith("Provincia "))] | length'` → 0.
181+
- ✅ No `parent_id` references to the dropped id range.
182+
- ✅ Neighbour ids 59103 and 59191 preserved.
183+
- ✅ JSON parses cleanly via `python3 -m json.tool`.
184+
-`normalize_json.py` reports "All records already have IDs and timestamps" — no other fields touched.
185+
186+
### Files
187+
188+
- Modified: `contributions/cities/IT.json` (87 rows removed).
189+
- Added: `bin/scripts/fixes/italy_drop_provincia_placeholders.py`.
190+
191+
### Still open on #1349
192+
193+
Two dedup pairs from the eight flagged in the original remap remain to be merged. They are tracked separately and not closed by this PR.
Lines changed: 136 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,136 @@
1+
#!/usr/bin/env python3
2+
"""Drop placeholder "Provincia ..." rows from contributions/cities/IT.json.
3+
4+
Issue #1349 follow-up. The reporter flagged entries like "Provincia di Lucca"
5+
that "don't have sense to exist" — they are not comuni, they are leftover
6+
province-level placeholder records from the pre-#1395 schema (cities used
7+
to point at regions, so a separate "Provincia di X" pseudo-city stood in
8+
for the province). Since #1395 (`italy_remap_cities.py`) re-parented every
9+
real comune onto its province via `state_id` / `state_code`, these 87
10+
placeholder rows are now duplicate concepts and should be removed.
11+
12+
Selection (defensive — both predicates must hold):
13+
- id in the contiguous range [59104, 59190] (87 records, no gaps).
14+
- name starts with "Provincia " (covers "Provincia di X",
15+
"Provincia autonoma di Trento", "Provincia dell' Aquila",
16+
"Provincia Verbano-Cusio-Ossola").
17+
18+
The script preserves every other record byte-for-byte; canonical
19+
re-formatting is delegated to `bin/scripts/sync/normalize_json.py`.
20+
21+
Idempotent: a second run on already-cleaned data writes nothing and exits 0.
22+
23+
Usage:
24+
python3 bin/scripts/fixes/italy_drop_provincia_placeholders.py [--dry-run]
25+
"""
26+
27+
from __future__ import annotations
28+
29+
import argparse
30+
import json
31+
import sys
32+
from pathlib import Path
33+
from typing import List
34+
35+
REPO_ROOT = Path(__file__).resolve().parents[3]
36+
CITIES_JSON = REPO_ROOT / "contributions/cities/IT.json"
37+
38+
ID_MIN = 59104
39+
ID_MAX = 59190
40+
NAME_PREFIX = "Provincia "
41+
EXPECTED_DROP_COUNT = 87
42+
EXPECTED_PRE_COUNT = 9947
43+
EXPECTED_POST_COUNT = EXPECTED_PRE_COUNT - EXPECTED_DROP_COUNT # 9860
44+
45+
46+
def load_cities(path: Path) -> List[dict]:
47+
"""Read the cities JSON file and verify the top-level shape."""
48+
with path.open("r", encoding="utf-8") as fh:
49+
data = json.load(fh)
50+
if not isinstance(data, list):
51+
raise SystemExit(f"Expected JSON array at {path}, got {type(data).__name__}")
52+
return data
53+
54+
55+
def is_placeholder(row: dict) -> bool:
56+
"""Return True if the row is a legacy 'Provincia ...' placeholder."""
57+
rid = row.get("id")
58+
name = row.get("name", "")
59+
return (
60+
isinstance(rid, int)
61+
and ID_MIN <= rid <= ID_MAX
62+
and isinstance(name, str)
63+
and name.startswith(NAME_PREFIX)
64+
)
65+
66+
67+
def main() -> int:
68+
parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
69+
parser.add_argument(
70+
"--dry-run",
71+
action="store_true",
72+
help="Report what would be dropped without rewriting the file.",
73+
)
74+
args = parser.parse_args()
75+
76+
if not CITIES_JSON.exists():
77+
raise SystemExit(f"Cities file not found: {CITIES_JSON}")
78+
79+
cities = load_cities(CITIES_JSON)
80+
pre_count = len(cities)
81+
82+
to_drop = [row for row in cities if is_placeholder(row)]
83+
kept = [row for row in cities if not is_placeholder(row)]
84+
85+
drop_count = len(to_drop)
86+
post_count = len(kept)
87+
88+
# Cross-check that nothing else in the id range slipped in.
89+
in_range = [row for row in cities if isinstance(row.get("id"), int)
90+
and ID_MIN <= row["id"] <= ID_MAX]
91+
non_placeholder_in_range = [row for row in in_range if not is_placeholder(row)]
92+
93+
print(f"Input file: {CITIES_JSON.relative_to(REPO_ROOT)}")
94+
print(f"Pre-clean count: {pre_count}")
95+
print(f"Rows in id range: {len(in_range)} (ids {ID_MIN}-{ID_MAX})")
96+
print(f"Placeholder matches: {drop_count}")
97+
print(f"Post-clean count: {post_count}")
98+
99+
if non_placeholder_in_range:
100+
print("\nERROR: rows in the placeholder id range that don't match "
101+
"the name prefix — refusing to touch them:", file=sys.stderr)
102+
for row in non_placeholder_in_range:
103+
print(f" id={row.get('id')} name={row.get('name')!r}", file=sys.stderr)
104+
return 2
105+
106+
if drop_count == 0:
107+
print("\nNo placeholder rows found — nothing to do (idempotent).")
108+
return 0
109+
110+
if drop_count != EXPECTED_DROP_COUNT or pre_count != EXPECTED_PRE_COUNT:
111+
print(
112+
f"\nWARNING: counts diverge from the expected baseline "
113+
f"({EXPECTED_PRE_COUNT} → drop {EXPECTED_DROP_COUNT} → "
114+
f"{EXPECTED_POST_COUNT}). Got {pre_count} → drop {drop_count} "
115+
f"→ {post_count}. Review the diff before merging.",
116+
file=sys.stderr,
117+
)
118+
119+
if args.dry_run:
120+
print("\n--dry-run: not writing.")
121+
sample = ", ".join(repr(r["name"]) for r in to_drop[:3])
122+
print(f"Sample drops: {sample}, ...")
123+
return 0
124+
125+
with CITIES_JSON.open("w", encoding="utf-8") as fh:
126+
json.dump(kept, fh, ensure_ascii=False, indent=2)
127+
fh.write("\n")
128+
129+
print(f"\nWrote {post_count} records to {CITIES_JSON.relative_to(REPO_ROOT)}.")
130+
print("Run `python3 bin/scripts/sync/normalize_json.py "
131+
"contributions/cities/IT.json` next to canonicalize formatting.")
132+
return 0
133+
134+
135+
if __name__ == "__main__":
136+
sys.exit(main())

0 commit comments

Comments
 (0)