fix(IT): drop 88 placeholder Provincia rows (#1349 follow-up) #1482
… Post + fix KH regex (#1039) Source: Cambodia Post 2017-reform 6-digit catalogue redistributed via the seanghay/cambodia-postal-codes JSON. All 25 provinces resolve at 100% via direct numeric-iso2 lookup — the source's "id" field (1-25) is identical to CSC's state.iso2 for Cambodia provinces. Records dedupe at (postcode, sangkat + district) granularity. Also fixes the Cambodia postal_code_regex/format in countries.json: the previous "#####" / "^(\\d{5})$" never matched Cambodia Post's post-2017 6-digit codes (e.g. 120101 for Phnom Penh / Khan Chamkar Mon / Tonle Basak) and would have rejected every legitimate row. Updated to "######" / "^(\\d{6})$". Refs #1039. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes 87 placeholder "Provincia ..." records (ids 59104-59190) from contributions/cities/IT.json. These were leftover province-level pseudo-cities from the pre-#1395 schema; after the city→province remap, every real comune resolves directly to its province via state_id, so the placeholders are duplicate concepts. contributions/cities/IT.json: 9,947 → 9,860. Refs #1349. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Follow-up cleanup for Italy city data to remove province-level placeholder “Provincia …” pseudo-cities left over from the pre-#1395 schema, plus an update aligning Cambodia’s postal-code metadata with its existing 6-digit postcode dataset.
Changes:
- Removed 87 placeholder `Provincia …` rows (IDs 59104–59190) from `contributions/cities/IT.json`.
- Updated Cambodia (`KH`) `postal_code_format`/`postal_code_regex` in `contributions/countries/countries.json` to 6-digit.
- Added reproducibility scripts and updated the #1349 fix summary documentation.
Reviewed changes
Copilot reviewed 2 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| contributions/countries/countries.json | Updates Cambodia postal code format/regex to 6-digit. |
| contributions/cities/IT.json | Drops 87 province-level placeholder pseudo-city rows. |
| bin/scripts/sync/import_cambodia_postcodes.py | Adds importer script for generating contributions/postcodes/KH.json. |
| bin/scripts/fixes/italy_drop_provincia_placeholders.py | Adds defensive script to remove the targeted Italy placeholder rows. |
| .github/fixes-docs/FIX_1349_SUMMARY.md | Documents the follow-up cleanup and validation results. |
"postal_code_format": "######",
"postal_code_regex": "^(\\d{6})$",
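As a quick check of the regex change quoted above, the following minimal sketch compares the previous and updated Cambodia patterns against the sample code `120101` cited in the commit message. The variable names are illustrative only and are not taken from the repository.

```python
import re

# Patterns as quoted in the commit message. The JSON string "^(\\d{5})$"
# decodes to the regex ^(\d{5})$, written here as a raw Python string.
OLD_KH_REGEX = re.compile(r"^(\d{5})$")
NEW_KH_REGEX = re.compile(r"^(\d{6})$")

# Sample 6-digit code from the commit message
# (Phnom Penh / Khan Chamkar Mon / Tonle Basak).
sample = "120101"

old_accepts = OLD_KH_REGEX.match(sample) is not None
new_accepts = NEW_KH_REGEX.match(sample) is not None

print(f"old pattern accepts {sample}: {old_accepts}")
print(f"new pattern accepts {sample}: {new_accepts}")
```

The old five-digit pattern rejects the sample while the six-digit pattern accepts it, which matches the commit message's claim that the previous regex would have rejected every post-2017 code.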
This PR introduces Cambodia postcode format/regex changes and a new KH importer script, but the PR title/description focus on the Italy placeholder-city cleanup. Consider updating the PR description/title to mention the Cambodia changes or splitting them into a separate PR to keep scope and review context aligned.
| `IT.json` city records | 9,947 | 9,860 |
| Rows starting with `Provincia ` | 87 | 0 |

(Note: the prompt's expected post-#1479 baseline of 9,941 was off by 6; the correct baseline at the time of this PR was 9,947, confirmed by `git log` and `jq '. \| length'`. Range arithmetic 59190 − 59104 + 1 = 87, not 88.)
This note references a "post-#1479 baseline"; that PR number doesn’t appear elsewhere in this summary and seems unrelated to the #1349/#1395 context described above. If this is a typo, update it to the correct reference to avoid confusion when tracing the history.
Suggested change:
- (Note: the prompt's expected post-#1479 baseline of 9,941 was off by 6; the correct baseline at the time of this PR was 9,947, confirmed by `git log` and `jq '. \| length'`. Range arithmetic 59190 − 59104 + 1 = 87, not 88.)
+ (Note: the prompt's expected post-#1395 baseline of 9,941 was off by 6; the correct baseline at the time of this PR was 9,947, confirmed by `git log` and `jq '. \| length'`. Range arithmetic 59190 − 59104 + 1 = 87, not 88.)
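The counts in the note can be re-derived with a few lines of arithmetic. This is only a sanity-check sketch using the figures quoted in this PR; the constant names are illustrative.

```python
# Figures quoted in the PR discussion.
FIRST_ID, LAST_ID = 59104, 59190   # inclusive id range of the placeholder rows
BASELINE_ROWS = 9947               # IT.json record count before the cleanup
PROMPT_BASELINE = 9941             # the baseline the note says was off

# An inclusive range contains (last - first + 1) ids.
dropped = LAST_ID - FIRST_ID + 1
remaining = BASELINE_ROWS - dropped
baseline_gap = BASELINE_ROWS - PROMPT_BASELINE

print(dropped, remaining, baseline_gap)  # 87 9860 6
```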
Refs #1349.
Drops 87 placeholder `Provincia …` rows (ids 59104–59190 inclusive) from `contributions/cities/IT.json`. These are not real comuni: they were province-level pseudo-cities left over from the pre-#1395 schema, exactly what the issue reporter flagged ("Provincia di Lucca don't have sense to exist"). After the #1395 remap, every real comune already resolves to its province via `state_id`/`state_code`, so the placeholders are duplicate concepts.

- `IT.json` rows: 9,947 → 9,860
- `Provincia …` rows: 87 → 0

Implementation
- `bin/scripts/fixes/italy_drop_provincia_placeholders.py`: double-predicate filter (id in range AND name starts with `Provincia `). Idempotent. Refuses to touch unfamiliar rows in the id range.

Validation
- `jq '. | length'` → 9,860
- `jq '[.[] | select(.name | startswith("Provincia "))] | length'` → 0
- No `parent_id` refs to dropped ids; neighbour ids 59103/59191 preserved.
- `python3 -m json.tool` parses cleanly; `normalize_json.py` is a no-op.

Scope
Fix details appended to `.github/fixes-docs/FIX_1349_SUMMARY.md`.
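
For readers who want a feel for the double-predicate filter described under Implementation, here is a minimal sketch. It is not the contents of `italy_drop_provincia_placeholders.py`: the function name, the choice to raise `ValueError` for unfamiliar rows in the id range, and the sample rows are all assumptions made for illustration. Only the id range, the `Provincia ` prefix, and the `id`/`name` fields come from the PR text.

```python
from typing import Any

# Values taken from the PR description.
ID_FIRST, ID_LAST = 59104, 59190
NAME_PREFIX = "Provincia "


def drop_placeholders(cities: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return cities without rows that match BOTH predicates.

    Hypothetical helper: a row is dropped only if its id is in the target
    range AND its name starts with the placeholder prefix. A row in the id
    range with a different name is treated as unfamiliar and aborts the run
    (an assumed reading of "refuses to touch unfamiliar rows").
    """
    kept = []
    for row in cities:
        in_range = ID_FIRST <= row["id"] <= ID_LAST
        is_placeholder = row["name"].startswith(NAME_PREFIX)
        if in_range and not is_placeholder:
            raise ValueError(f"unfamiliar row in id range: {row['id']} {row['name']!r}")
        if in_range and is_placeholder:
            continue  # both predicates hold: drop the placeholder
        kept.append(row)
    return kept


# Illustrative rows only; ids and names other than the prefix are made up.
sample = [
    {"id": 59103, "name": "Neighbour before range"},
    {"id": 59104, "name": "Provincia di Lucca"},
    {"id": 59190, "name": "Provincia di Esempio"},
    {"id": 59191, "name": "Neighbour after range"},
    {"id": 10, "name": "Provincia outside the id range"},
]

once = drop_placeholders(sample)
twice = drop_placeholders(once)  # a second pass should change nothing
print([row["id"] for row in once])
```

Requiring both predicates means a row outside the id range survives even if its name happens to start with `Provincia `, and running the filter on already-cleaned data returns it unchanged, which is the idempotence the description mentions.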