Commit 57eb424

fix(ES): drop 22 admin-level placeholder rows from cities (#1498) (#1516)
Drops 22 placeholder records from contributions/cities/ES.json:

- 21 "Provincia de X" / "Província de X" rows (ids 36362, 36364, 36365, 36373, 36375, 36376, 36377, 36379, 36381, 36383, 36385, 36386, 36387, 36389, 36390, 36391, 36392, 36393, 36394, 36396, 36400). Spanish provinces are already represented as proper states in states.json, making these pseudo-cities duplicate concepts. Their own state_code values are inconsistent (e.g. "Provincia de Burgos" parented under state_code=LE), confirming stub-data status.
- 1 cross-state Alicante stub (id 32244, state_code=V) flagged by the reporter as a cross-province leak in Valencia's city list. Canonical row is id 152158 ("Alicante/Alacant", state_code=A).

Counts: 8,427 -> 8,405 rows. Out-of-bounds coordinate violations drop from 129 to 127 (the dropped stubs included 2 invalid coords). 0 schema errors, 0 cross-reference errors, same-name <5km duplicate pairs unchanged at 45 (all pre-existing).

Refs #1498. Does not close it -- PR-B follow-up retags ~6,920 mistyped admin-level rows to type=city.
1 parent 085bfd5 commit 57eb424

3 files changed

Lines changed: 262 additions & 903 deletions

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# FIX #1498 - Spain: drop admin-level placeholders from cities (PR-A of 2)

**Issue:** [#1498 - Bug ES: GetCity returns province-level admin entries as cities](https://github.com/dr5hn/countries-states-cities-database/issues/1498)
**Scope:** Drop 22 placeholder rows from `contributions/cities/ES.json`.
**Sibling PR:** PR-B retags ~6,920 mistyped admin-level rows (currently `type` in `adm1`/`adm2`/`adm3`) to `type='city'`. PR-B closes #1498.
**Date:** 2026-05-04

## Problem

The reporter flagged that `GetCity(country=ES, state=Madrid)` returns "Provincia de Madrid" as a city, with the same pattern across other Spanish provinces, plus a cross-province leak ("Provincia de Alicante" inside Valencia's city list). These are admin-level placeholder rows, not municipalities.

Spain's `states.json` already lists the 50 provinces as proper states, so the "Provincia de X" pseudo-cities are duplicate concepts and can be dropped without re-parenting any other data.

## Drops (22 rows)

### 21 "Provincia de X" / "Província de X" placeholders

| id | name | state_code | type |
|----|------|-----------|------|
| 36362 | Provincia de Alicante | V | city |
| 36364 | Provincia de Burgos | LE | adm2 |
| 36365 | Provincia de Cantabria | S | adm3 |
| 36373 | Provincia de Huesca | HU | adm3 |
| 36375 | Provincia de La Rioja | LO | adm3 |
| 36376 | Provincia de Las Palmas | GC | city |
| 36377 | Provincia de León | LE | adm3 |
| 36379 | Provincia de Madrid | M | section |
| 36381 | Provincia de Navarra | NA | adm1 |
| 36383 | Provincia de Palencia | LE | adm3 |
| 36385 | Provincia de Salamanca | LE | adm3 |
| 36386 | Provincia de Santa Cruz de Tenerife | GC | city |
| 36387 | Provincia de Segovia | LE | adm3 |
| 36389 | Provincia de Soria | LE | adm3 |
| 36390 | Provincia de Teruel | HU | adm3 |
| 36391 | Provincia de Valladolid | LE | city |
| 36392 | Provincia de Zamora | LE | adm3 |
| 36393 | Provincia de Zaragoza | HU | adm3 |
| 36394 | Provincia de Ávila | LE | adm3 |
| 36396 | Província de Castelló | V | adm3 |
| 36400 | Província de València | V | city |

The state_code values are themselves messy (e.g. "Provincia de Burgos" sits under `LE`/León, "Provincia de Zaragoza" under `HU`/Huesca), which is further evidence that these rows are stub data, not curated municipality records.

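The parentage mismatch can be checked mechanically. The following is a minimal sketch, not part of this PR: `EXPECTED_CODE` is a hypothetical lookup holding only the four ISO 3166-2:ES suffixes needed for the sample, and `mismatched_parents` is an illustrative helper name.

```python
# Hypothetical lookup: province name -> expected state_code (ISO 3166-2:ES suffix).
# Only the entries needed for this sample are listed.
EXPECTED_CODE = {
    "Burgos": "BU",
    "Zaragoza": "Z",
    "León": "LE",
    "Huesca": "HU",
}

PREFIX = "Provincia de "

# Four rows copied from the table above.
rows = [
    {"id": 36364, "name": "Provincia de Burgos", "state_code": "LE"},
    {"id": 36377, "name": "Provincia de León", "state_code": "LE"},
    {"id": 36373, "name": "Provincia de Huesca", "state_code": "HU"},
    {"id": 36393, "name": "Provincia de Zaragoza", "state_code": "HU"},
]


def mismatched_parents(rows, expected):
    """Return ids of 'Provincia de X' rows whose state_code is not X's own code."""
    bad = []
    for row in rows:
        if not row["name"].startswith(PREFIX):
            continue
        province = row["name"][len(PREFIX):]
        want = expected.get(province)
        # Skip provinces missing from the lookup rather than guessing.
        if want is not None and row["state_code"] != want:
            bad.append(row["id"])
    return bad


print(mismatched_parents(rows, EXPECTED_CODE))
```

On this sample the helper flags the Burgos and Zaragoza rows and leaves León and Huesca alone, since those two happen to sit under their own codes.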
### 1 cross-state Alicante stub

| id | name | state_code | reason |
|----|------|-----------|--------|
| 32244 | Alicante | V (Valencia) | Wrong province; missing Valencian endonym |

The canonical Alicante row is **id 152158** (`Alicante/Alacant`, state_code `A`) under the Alicante province. Row 32244 is a legacy duplicate from when the Valencia community was the parent of three provinces, and its `state_code='V'` is what the issue reporter flagged as the cross-province leak.

## Counts

| | Before | After |
|---|--:|--:|
| `ES.json` rows | 8,427 | 8,405 |
| Rows named `Provincia *` / `Província *` | 21 | 0 |
| Rows where `state_code='V'` and `name='Alicante'` | 1 | 0 |

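The three rows of the counts table can be recomputed from the data with a few lines. This is a sketch under assumptions: `summarize` is a hypothetical helper, and the three-row `before` list is a stand-in for `ES.json`, not the real file.

```python
PREFIXES = ("Provincia ", "Província ")


def summarize(rows):
    """Count total rows, 'Provincia *' rows, and Alicante rows parented under V."""
    return {
        "rows": len(rows),
        "provincia_named": sum(1 for r in rows if r["name"].startswith(PREFIXES)),
        "alicante_under_v": sum(
            1 for r in rows if r["name"] == "Alicante" and r["state_code"] == "V"
        ),
    }


# Tiny stand-in for ES.json: two rows dropped by this PR plus the canonical Alicante row.
before = [
    {"id": 36379, "name": "Provincia de Madrid", "state_code": "M"},
    {"id": 32244, "name": "Alicante", "state_code": "V"},
    {"id": 152158, "name": "Alicante/Alacant", "state_code": "A"},
]
dropped_ids = {36379, 32244}
after = [r for r in before if r["id"] not in dropped_ids]

print(summarize(before))
print(summarize(after))
```

Run against the real file, the same counts should match the table: 8,427 rows before, 8,405 after (8,427 minus the 22 drops), with both placeholder counts at zero.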
## Implementation

`bin/scripts/fixes/spain_drop_provincia_placeholders.py` uses an explicit id allowlist and verifies each id before dropping: the 21 "Provincia" rows must still carry the expected name prefix, and the Alicante stub must still have name `Alicante` and `state_code` `V`. If any allowlisted row has shifted from what was audited, the script refuses to write and exits 2. Idempotent: a second run on cleaned data writes nothing and exits 0.

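The allowlist-plus-verification pattern can be sketched in memory as follows. This is an illustration rather than the script itself: `AUDITED` and `drop_audited` are hypothetical names, id 1 is made up, and the sketch compares exact names where the real script checks a name prefix.

```python
from typing import Dict, List, Tuple

# Hypothetical allowlist: id -> name recorded at audit time.
AUDITED: Dict[int, str] = {
    36379: "Provincia de Madrid",
    36400: "Província de València",
}


def drop_audited(
    rows: List[dict], audited: Dict[int, str]
) -> Tuple[List[dict], List[dict]]:
    """Return (kept, dropped); raise if an allowlisted id no longer matches its audit."""
    shifted = [
        r for r in rows if r["id"] in audited and r["name"] != audited[r["id"]]
    ]
    if shifted:
        ids = [r["id"] for r in shifted]
        raise ValueError(f"allowlisted rows changed since audit: {ids}")
    kept = [r for r in rows if r["id"] not in audited]
    dropped = [r for r in rows if r["id"] in audited]
    return kept, dropped


rows = [
    {"id": 36379, "name": "Provincia de Madrid"},
    {"id": 1, "name": "Madrid"},  # made-up id for a real municipality row
]

kept, dropped = drop_audited(rows, AUDITED)
# Second pass over the cleaned data: nothing left to drop.
kept_again, dropped_again = drop_audited(kept, AUDITED)
print(len(dropped), len(dropped_again))
```

Keying on ids and then re-checking the name gives two independent guards: a renumbered or repurposed id fails the name check, and a real municipality with a similar name is never matched because its id is not on the list.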
## Validation (mirrors `.github/scripts/validate-*`)

- Schema: 0 errors. All rows still have name/state_id/state_code/country_id/country_code/lat/lon. country_code/country_id consistent (ES/207).
- Cross-reference: 0 errors. Every `state_id` resolves to an ES state and `state_code` matches the resolved state's `iso2`.
- Coordinates: 127 rows fall outside the `country-bounds.json` ES box (down from 129 on master; the drop reduced OOB by 2). The 127 remaining are all Canary Islands (state_codes `TF`, `GC`), pre-existing, same pattern as IT/Lampedusa noted in #1395.
- Duplicate scan (same name + ≤5km): 45 pairs, **unchanged from master**.
- `python3 -m json.tool` parses cleanly; `normalize_json.py` is a no-op.

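For readers unfamiliar with the duplicate scan, one way to implement a "same name within 5 km" check is a haversine distance over rows grouped by name. The sketch below is an assumption about the approach, not the repository's validator: the function names, the `lat`/`lon` float keys, and the sample ids and coordinates are all hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_duplicates(rows, max_km=5.0):
    """Return (id, id) pairs of rows that share a name and lie within max_km."""
    by_name = {}
    for row in rows:
        by_name.setdefault(row["name"], []).append(row)
    pairs = []
    for group in by_name.values():
        # Only rows with identical names are compared with each other.
        for a, b in combinations(group, 2):
            if haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km:
                pairs.append((a["id"], b["id"]))
    return pairs


# Hypothetical sample: two same-name rows under 1 km apart, one far away.
sample = [
    {"id": 1, "name": "Alicante", "lat": 38.3452, "lon": -0.4810},
    {"id": 2, "name": "Alicante", "lat": 38.3460, "lon": -0.4900},
    {"id": 3, "name": "Alicante", "lat": 39.4700, "lon": -0.3760},
]
print(near_duplicates(sample))
```

Grouping by name first keeps the pairwise comparison confined to each name group, so the cost depends on how many rows share a name rather than on the size of the whole file.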
## Scope

- Touches **only** the 22 placeholder rows.
- Does **not** modify `states.json` or `countries.json`.
- Does **not** close #1498. PR-B (the type-field retag of ~6,920 mistyped rows) is the closing PR.
Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
#!/usr/bin/env python3
"""Drop admin-level placeholder rows from contributions/cities/ES.json.

Issue #1498. The reporter flagged that `GetCity(...)` returns province-level
admin entries inside Spanish city dropdowns ("Provincia de Madrid" etc.).
This is the city-level cleanup PR; it drops 22 rows that are not real cities:

- 21 "Provincia de X" / "Província de X" rows (ids 36362, 36364, 36365,
  36373, 36375, 36376, 36377, 36379, 36381, 36383, 36385, 36386, 36387,
  36389, 36390, 36391, 36392, 36393, 36394, 36396, 36400). All are admin
  placeholders that duplicate concepts already represented by entries in
  contributions/states/states.json. The 50 Spanish provinces are already
  states, so a separate "Provincia de X" pseudo-city is redundant.

- 1 cross-state placeholder (id 32244, name "Alicante", state_code=V).
  The real Alicante city lives at id 152158 ("Alicante/Alacant",
  state_code=A) under the Alicante province (iso2=A). Row 32244 is a
  legacy stub from when Valencia community was the parent of three
  provinces (Valencia, Alicante, Castellon). Alicante now has its own
  state, so the stub is wrong both in name (missing the Valencian
  endonym) and in parentage.

The 21 "Provincia ..." ids are an explicit allowlist rather than a
name-prefix filter, because two real Spanish municipalities also begin with
"Provincia" in obscure forms; using ids guarantees we touch only what we
mean to. The script verifies each id's current name still matches the
expected prefix before dropping, refusing to touch rows that don't.

Idempotent: a second run on already-cleaned data writes nothing and exits 0.

Usage:
    python3 bin/scripts/fixes/spain_drop_provincia_placeholders.py [--dry-run]
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import List

REPO_ROOT = Path(__file__).resolve().parents[3]
CITIES_JSON = REPO_ROOT / "contributions/cities/ES.json"

# 21 admin-placeholder ids, expected to start with "Provincia " or "Província ".
PROVINCIA_IDS = frozenset({
    36362, 36364, 36365, 36373, 36375, 36376, 36377, 36379, 36381, 36383,
    36385, 36386, 36387, 36389, 36390, 36391, 36392, 36393, 36394, 36396,
    36400,
})

# 1 cross-state placeholder. id, expected name, expected wrong state_code.
CROSS_STATE_ALICANTE_ID = 32244
CROSS_STATE_ALICANTE_NAME = "Alicante"
CROSS_STATE_ALICANTE_STATE_CODE = "V"

EXPECTED_DROP_COUNT = len(PROVINCIA_IDS) + 1  # 22
EXPECTED_PRE_COUNT = 8427
EXPECTED_POST_COUNT = EXPECTED_PRE_COUNT - EXPECTED_DROP_COUNT  # 8405


def load_cities(path: Path) -> List[dict]:
    """Read the cities JSON file and verify the top-level shape."""
    with path.open("r", encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise SystemExit(f"Expected JSON array at {path}, got {type(data).__name__}")
    return data


def matches_provincia(row: dict) -> bool:
    """Row is one of the 21 'Provincia ...' admin placeholders, name verified."""
    if row.get("id") not in PROVINCIA_IDS:
        return False
    name = row.get("name", "")
    return isinstance(name, str) and (
        name.startswith("Provincia ") or name.startswith("Província ")
    )


def matches_cross_state_alicante(row: dict) -> bool:
    """Row is the cross-state Alicante stub (id 32244, name 'Alicante', state V)."""
    return (
        row.get("id") == CROSS_STATE_ALICANTE_ID
        and row.get("name") == CROSS_STATE_ALICANTE_NAME
        and row.get("state_code") == CROSS_STATE_ALICANTE_STATE_CODE
    )


def is_target(row: dict) -> bool:
    """True if this row is on the drop list."""
    return matches_provincia(row) or matches_cross_state_alicante(row)


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Report what would be dropped without rewriting the file.",
    )
    args = parser.parse_args()

    if not CITIES_JSON.exists():
        raise SystemExit(f"Cities file not found: {CITIES_JSON}")

    cities = load_cities(CITIES_JSON)
    pre_count = len(cities)

    to_drop = [row for row in cities if is_target(row)]
    kept = [row for row in cities if not is_target(row)]

    drop_count = len(to_drop)
    post_count = len(kept)

    # Defensive check: any row whose id is in our allowlist but whose name
    # doesn't match the expected pattern (could indicate the data has
    # shifted since this script was written).
    rows_by_id = {row.get("id"): row for row in cities}
    suspicious: List[dict] = []
    for pid in PROVINCIA_IDS:
        row = rows_by_id.get(pid)
        if row is not None and not matches_provincia(row):
            suspicious.append(row)
    cross = rows_by_id.get(CROSS_STATE_ALICANTE_ID)
    if cross is not None and not matches_cross_state_alicante(cross):
        suspicious.append(cross)

    print(f"Input file: {CITIES_JSON.relative_to(REPO_ROOT)}")
    print(f"Pre-clean count: {pre_count}")
    print(f"Provincia drops: "
          f"{sum(1 for r in to_drop if matches_provincia(r))} / "
          f"{len(PROVINCIA_IDS)} expected")
    print(f"Cross-state Alicante drop: "
          f"{sum(1 for r in to_drop if matches_cross_state_alicante(r))} / 1 expected")
    print(f"Total drops: {drop_count}")
    print(f"Post-clean count: {post_count}")

    if suspicious:
        print(
            "\nERROR: target ids exist but do not match expected name/state; "
            "refusing to drop them:",
            file=sys.stderr,
        )
        for row in suspicious:
            print(
                f"  id={row.get('id')} name={row.get('name')!r} "
                f"state_code={row.get('state_code')!r}",
                file=sys.stderr,
            )
        return 2

    if drop_count == 0:
        print("\nNo placeholder rows found; nothing to do (idempotent).")
        return 0

    if drop_count != EXPECTED_DROP_COUNT or pre_count != EXPECTED_PRE_COUNT:
        print(
            f"\nWARNING: counts diverge from the expected baseline "
            f"({EXPECTED_PRE_COUNT} -> drop {EXPECTED_DROP_COUNT} -> "
            f"{EXPECTED_POST_COUNT}). Got {pre_count} -> drop {drop_count} "
            f"-> {post_count}. Review the diff before merging.",
            file=sys.stderr,
        )

    if args.dry_run:
        print("\n--dry-run: not writing.")
        for r in to_drop:
            print(f"  would drop id={r['id']} name={r['name']!r} "
                  f"state_code={r.get('state_code')!r} type={r.get('type')!r}")
        return 0

    with CITIES_JSON.open("w", encoding="utf-8") as fh:
        json.dump(kept, fh, ensure_ascii=False, indent=2)
        fh.write("\n")

    print(f"\nWrote {post_count} records to {CITIES_JSON.relative_to(REPO_ROOT)}.")
    print("Run `python3 bin/scripts/sync/normalize_json.py "
          "contributions/cities/ES.json` next to canonicalize formatting.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
