Commit 8662afc

fix(ES): retag 6,920 mistyped admin rows to type=city (#1498 follow-up)
After PR-A dropped 22 admin-level placeholders, 6,920 rows in contributions/cities/ES.json still carried type='adm1', 'adm2', or 'adm3' even though they are real Spanish municipalities (provincial capitals typed adm1/adm2; small towns typed adm3). Consuming apps that filter on type='city' were excluding these from city dropdowns -- the failure mode the issue reporter described.

Spot-check (60 random rows across all three adm levels) confirmed every sample is a real Spanish municipality with coordinates, and 99%+ have wikiDataId references.

Aggregate signals: 6,913/6,920 have wikiDataId, 6,516/6,920 have non-zero population, 6,920/6,920 have coordinates.

Type counts (post PR-A -> post PR-B):
  adm3: 6,860 -> 0
  adm2: 40 -> 0
  adm1: 20 -> 0
  city: 1,416 -> 8,336
  section: 60 -> 60 (left alone -- mixed quality, needs row-level review)
  locality: 6 -> 6
  capital/historical_capital/adm4: 1 each -> 1 each

Diff is 6,920 single-field mutations only -- coordinates, names, state codes, populations all byte-for-byte unchanged. Schema/cross-reference validators report 0 errors; coord-bounds and same-name<5km duplicate counts identical to PR-A head.

Closes #1498.
1 parent 8ee17a4 commit 8662afc

3 files changed

Lines changed: 7138 additions & 6920 deletions

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# FIX #1498 - Spain: retag mistyped admin rows to type=city (PR-B of 2)

**Issue:** [#1498 - Bug ES: GetCity returns province-level admin entries as cities](https://github.com/dr5hn/countries-states-cities-database/issues/1498)
**Scope:** Bulk-retag the 6,920 rows in `contributions/cities/ES.json` whose `type` is `adm1`, `adm2`, or `adm3` to `type='city'`. Touches only the `type` field.
**Sibling PR:** PR-A (already merged / in review) dropped 22 admin-level placeholder rows.
**Closes:** #1498.
**Date:** 2026-05-04

## Problem

After PR-A removed the 22 admin-level placeholders, `ES.json` still carried 6,920 records whose `type` field was set to `adm1`, `adm2`, or `adm3` even though the rows themselves are real Spanish municipalities. Consuming apps that filter on `type='city'` (or otherwise separate city records from admin records) treat these rows as administrative regions and exclude them from city dropdowns, which is exactly the failure mode the reporter described.

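
To make the failure mode concrete, here is a minimal sketch of a consumer-side filter. The `city_dropdown` helper and the sample rows are hypothetical illustrations, not code from any real consuming app; only the `type` values and the retag rule come from this PR.

```python
# Hypothetical sample rows. "Example Town" is a made-up name; Barcelona and
# Soria are listed in the spot-check as typed adm1 and adm2 respectively.
rows = [
    {"name": "Example Town", "type": "city"},
    {"name": "Barcelona", "type": "adm1"},  # mistyped before this PR
    {"name": "Soria", "type": "adm2"},      # mistyped before this PR
]


def city_dropdown(rows):
    """Names a consumer would show when filtering on type == 'city'."""
    return [r["name"] for r in rows if r.get("type") == "city"]


before = city_dropdown(rows)

# Apply the same rule as this PR: adm1/adm2/adm3 -> city, nothing else changes.
retagged = [
    {**r, "type": "city"} if r.get("type") in ("adm1", "adm2", "adm3") else r
    for r in rows
]
after = city_dropdown(retagged)

print(before)  # ['Example Town']
print(after)   # ['Example Town', 'Barcelona', 'Soria']
```
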
## Spot-check (why these are real cities)

A 60-row random sample (20 of each adm type) gave:

- **adm1 (20 rows total in file):** every sample is a major Spanish city: Barcelona, Valencia, Sevilla, Zaragoza, Murcia, Pamplona, Valladolid, Las Palmas de Gran Canaria, Santiago de Compostela, Santander, Toledo, Mérida, Logroño, Vitoria-Gasteiz, Oviedo, Palma, etc.
- **adm2 (40 rows total):** every sample is a provincial capital or municipality: Córdoba, Málaga, Lleida, Girona, Soria, Segovia, Jaén, Lugo, Albacete, Guadalajara, Palencia, Zamora, Castelló de la Plana, Ciudad Real, Alicante/Alacant (id 152158, the canonical Alicante), etc.
- **adm3 (6,860 rows total):** every sample is a real Spanish municipality with coordinates and (mostly) population data.

Aggregate quality signals across the 6,920 candidates:

| Type | Count | Has wikiDataId | Has population > 0 | Has coords |
|------|------:|---------------:|-------------------:|-----------:|
| adm1 | 20 | 20/20 | 20/20 | 20/20 |
| adm2 | 40 | 39/40 | 38/40 | 40/40 |
| adm3 | 6,860 | 6,854/6,860 | 6,458/6,860 | 6,860/6,860 |

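
A table like the one above could be tallied with a short pass over the rows. This is a sketch under assumptions, not the script that produced the numbers: the `quality_signals` helper is hypothetical, the coordinate field names `latitude` and `longitude` are assumed, `population` is assumed to be an integer or null, and the demo rows are invented.

```python
import collections

ADMIN_TYPES = ("adm1", "adm2", "adm3")


def quality_signals(rows):
    """Per admin type, count rows with a wikiDataId, population > 0, and coordinates."""
    stats = {t: collections.Counter() for t in ADMIN_TYPES}
    for r in rows:
        t = r.get("type")
        if t not in stats:
            continue  # only the retag candidates are tallied
        stats[t]["count"] += 1
        stats[t]["wikidata"] += bool(r.get("wikiDataId"))
        stats[t]["population"] += (r.get("population") or 0) > 0
        # Assumed field names for coordinates.
        stats[t]["coords"] += bool(r.get("latitude")) and bool(r.get("longitude"))
    return stats


# Invented demo rows; the wikiDataId values are placeholders.
demo = [
    {"type": "adm2", "wikiDataId": "Q-placeholder", "population": 100,
     "latitude": "41.0", "longitude": "-2.0"},
    {"type": "adm2", "wikiDataId": None, "population": 0,
     "latitude": "40.0", "longitude": "-3.0"},
    {"type": "city", "wikiDataId": "Q-placeholder", "population": 5,
     "latitude": "39.0", "longitude": "-1.0"},
]
stats = quality_signals(demo)
for t in ADMIN_TYPES:
    print(t, dict(stats[t]))
```
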
## Counts (post PR-A → post PR-B)

| Type | Before (post PR-A) | After (post PR-B) |
|------|--:|--:|
| city | 1,416 | **8,336** |
| section | 60 | 60 |
| adm3 | 6,860 | 0 |
| adm2 | 40 | 0 |
| adm1 | 20 | 0 |
| locality | 6 | 6 |
| historical_capital | 1 | 1 |
| capital | 1 | 1 |
| adm4 | 1 | 1 |
| **Total** | **8,405** | **8,405** |

(Note: the brief gave a back-of-envelope estimate of "6,936 to retag → 8,357 city". The actual numbers are 6,920 → 8,336, because 16 of the 22 rows PR-A dropped were themselves typed adm1/adm2/adm3.)

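
The arithmetic behind the counts table can be checked in a few lines. The numbers are copied from the "before" column; the variable names are only for illustration.

```python
ADMIN_TYPES = ("adm1", "adm2", "adm3")

# "Before" column of the counts table (state after PR-A).
before = {
    "city": 1416, "section": 60, "adm3": 6860, "adm2": 40, "adm1": 20,
    "locality": 6, "historical_capital": 1, "capital": 1, "adm4": 1,
}

# Every adm1/adm2/adm3 row moves to 'city'; all other types are untouched.
retagged = sum(before[t] for t in ADMIN_TYPES)
after = {t: n for t, n in before.items() if t not in ADMIN_TYPES}
after["city"] += retagged

print(retagged)              # 6920
print(after["city"])         # 8336
print(sum(before.values()))  # 8405
print(sum(after.values()))   # 8405
```
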
## Out of scope (deliberately not touched)

- **`type='section'` (60 rows):** mixed quality. Some are real neighbourhoods of Barcelona/Madrid that should stay typed `section` (they're already correctly excluded from city dropdowns); some look like real towns. Needs row-by-row review, not bulk retag.
- **`type='locality'` (6 rows):** small, defensible category; left alone.
- **`type='capital'`, `type='historical_capital'`, `type='adm4'` (1 each):** not city-equivalents in this dataset's vocabulary. Left alone.

## Implementation

`bin/scripts/fixes/spain_retag_admin_types.py` is a single-pass mutation that touches only the `type` field, and only for ES rows whose current type is in `{adm1, adm2, adm3}`. It asserts that the row count is preserved and that no admin-typed rows remain. Idempotent.

## Validation (mirrors `.github/scripts/validate-*`)

- Schema: 0 errors.
- Cross-reference: 0 errors. Every `state_id` resolves to an ES state and `state_code` matches the resolved state's `iso2`.
- Coordinate-bounds: 127 out-of-box (Canary Islands, `TF`/`GC`; pre-existing, identical to PR-A head).
- Same-name + ≤5km duplicate pairs: 45, **identical to PR-A head**. The retag only changed the `type` field; coordinates and names are byte-for-byte unchanged.
- Diff inspection: 6,920 rows changed, every change is exactly one field (`type`), source value in `{adm1, adm2, adm3}`, target value `'city'`. Zero collateral changes.
- Idempotent re-run: 0 candidates remaining.

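
For readers unfamiliar with the same-name + ≤5km check, the sketch below shows one way such pairs could be counted, and why a `type`-only retag cannot change the result. It is an illustration, not the repository's validator: the haversine distance, the `near_duplicate_pairs` helper, the `latitude`/`longitude` field names, and the demo rows are all assumptions.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_duplicate_pairs(rows, max_km=5.0):
    """Return (id, id) pairs of rows that share a name and lie within max_km."""
    by_name = defaultdict(list)
    for r in rows:
        by_name[r["name"]].append(r)
    pairs = []
    for group in by_name.values():
        for a, b in combinations(group, 2):
            d = haversine_km(float(a["latitude"]), float(a["longitude"]),
                             float(b["latitude"]), float(b["longitude"]))
            if d <= max_km:
                pairs.append((a["id"], b["id"]))
    return pairs


# Invented demo rows: ids 1 and 2 are about 1.1 km apart, id 3 is about 111 km away.
rows = [
    {"id": 1, "name": "Villa Ejemplo", "type": "adm3", "latitude": "40.00", "longitude": "-3.00"},
    {"id": 2, "name": "Villa Ejemplo", "type": "city", "latitude": "40.01", "longitude": "-3.00"},
    {"id": 3, "name": "Villa Ejemplo", "type": "adm3", "latitude": "41.00", "longitude": "-3.00"},
]
pairs_before = near_duplicate_pairs(rows)

# The check reads only name and coordinates, so retagging `type` leaves it unchanged.
pairs_after = near_duplicate_pairs([{**r, "type": "city"} for r in rows])
print(pairs_before, pairs_before == pairs_after)
```
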
## Constraints honoured

- Touches **only** the `type` field of `country_code='ES'` rows.
- Does **not** touch `state_code`, `state_id`, coordinates, or any other field.
- Does **not** modify `states.json` or `countries.json`.
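
A check along the following lines could confirm these constraints against a before/after pair of row lists. It is a sketch, not the diff-inspection tooling actually used: the `verify_retag` helper is hypothetical, rows are assumed to carry a unique `id`, and the demo rows are invented.

```python
ADMIN_TYPES = {"adm1", "adm2", "adm3"}


def verify_retag(old_rows, new_rows):
    """Return the number of changed rows; raise AssertionError on any collateral change."""
    assert len(old_rows) == len(new_rows), "row count drift"
    new_by_id = {r["id"]: r for r in new_rows}
    changed = 0
    for old in old_rows:
        new = new_by_id[old["id"]]
        # Fields whose value differs between the two versions of the row.
        diff = {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
        if not diff:
            continue
        assert diff == {"type"}, f"id={old['id']}: unexpected fields changed: {sorted(diff)}"
        assert old["type"] in ADMIN_TYPES and new["type"] == "city"
        changed += 1
    return changed


# Invented demo rows (ids and state codes are placeholders).
old_rows = [
    {"id": 1, "name": "Barcelona", "state_code": "XX", "type": "adm1"},
    {"id": 2, "name": "Example Town", "state_code": "YY", "type": "city"},
]
new_rows = [
    {**r, "type": "city"} if r["type"] in ADMIN_TYPES else dict(r)
    for r in old_rows
]
changed = verify_retag(old_rows, new_rows)
print(changed)  # 1
```

Any edit outside `type`, such as a changed `state_code`, would make `verify_retag` raise instead of returning a count.
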
Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
#!/usr/bin/env python3
"""Retag mistyped admin-level rows in contributions/cities/ES.json to type='city'.

Issue #1498 follow-up to PR-A (which dropped 22 admin-level placeholders).
After PR-A, ES.json contains 6,920 rows whose `type` field is set to
`adm1`, `adm2`, or `adm3` even though they are real Spanish municipalities
(provincial capitals, big cities, and small towns alike). This script
retags those rows to `type='city'` so consuming apps stop misclassifying
real cities as administrative regions.

Why these rows are real cities (spot-checked, see PR description for
samples):
  - 100% have geographic coordinates.
  - 99%+ have a `wikiDataId` reference.
  - About 94% (6,516 of 6,920) have a non-zero `population`.
  - The names map cleanly to municipalities visible on Spanish government
    sources (provincial capitals like Zaragoza/Murcia/Sevilla typed
    `adm1`; province-capitals-or-comparable typed `adm2`; small towns
    typed `adm3`).

Conservative scope:
  - Only `country_code='ES'` rows are touched.
  - Only `type` in {'adm1', 'adm2', 'adm3'} is rewritten.
  - **Not** touched: `type='section'` (60 rows, mixed quality; needs
    separate per-row review), `type='locality'` (6 rows), `type='city'`
    (already correct), `type='capital'`/`'historical_capital'`/`'adm4'`
    (1 each, left alone).

Idempotent: re-running on already-retagged data writes nothing and exits 0.

Usage:
    python3 bin/scripts/fixes/spain_retag_admin_types.py [--dry-run]
"""

from __future__ import annotations

import argparse
import collections
import json
import sys
from pathlib import Path
from typing import List

REPO_ROOT = Path(__file__).resolve().parents[3]
CITIES_JSON = REPO_ROOT / "contributions/cities/ES.json"

ADMIN_TYPES = ("adm1", "adm2", "adm3")
TARGET_TYPE = "city"
COUNTRY_CODE = "ES"


def load_cities(path: Path) -> List[dict]:
    """Read the cities JSON file and verify the top-level shape."""
    with path.open("r", encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise SystemExit(f"Expected JSON array at {path}, got {type(data).__name__}")
    return data


def is_target(row: dict) -> bool:
    """Row is an ES record whose type is one of the admin levels we retag."""
    return row.get("country_code") == COUNTRY_CODE and row.get("type") in ADMIN_TYPES


def type_distribution(rows: List[dict]) -> "collections.Counter[str]":
    """Counter of `type` values across rows."""
    return collections.Counter(r.get("type") for r in rows)


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Report the rewrite plan without rewriting the file.",
    )
    args = parser.parse_args()

    if not CITIES_JSON.exists():
        raise SystemExit(f"Cities file not found: {CITIES_JSON}")

    cities = load_cities(CITIES_JSON)
    pre_count = len(cities)

    before = type_distribution(cities)
    targets = [r for r in cities if is_target(r)]
    retag_count = len(targets)

    # Track per-source-type retag counts for reporting.
    by_source = collections.Counter(r.get("type") for r in targets)

    # First 5 retag candidates, printed for a quick human check.
    sample = targets[:5]

    # Snapshot state_code per row so we can verify after the retag that
    # no row changed state_code as a side-effect.
    state_codes_before = [r.get("state_code") for r in cities]

    print(f"Input file: {CITIES_JSON.relative_to(REPO_ROOT)}")
    print(f"Pre-retag count: {pre_count}")
    print("Type distribution before:")
    for t, c in sorted(before.items(), key=lambda x: -x[1]):
        print(f"  {t!r:>20}: {c}")
    print(f"Retag candidates: {retag_count}")
    for src in ADMIN_TYPES:
        print(f"  from {src!r}: {by_source.get(src, 0)}")

    if retag_count == 0:
        print("\nNo admin-typed rows found; nothing to do (idempotent).")
        return 0

    print("\nSample of rows that will be retagged:")
    for r in sample:
        print(
            f"  id={r['id']} {r['name']!r} state_code={r.get('state_code')!r} "
            f"old_type={r.get('type')!r} -> {TARGET_TYPE!r}"
        )

    if args.dry_run:
        print("\n--dry-run: not writing.")
        return 0

    for row in cities:
        if is_target(row):
            row["type"] = TARGET_TYPE

    after = type_distribution(cities)
    print("\nType distribution after:")
    for t, c in sorted(after.items(), key=lambda x: -x[1]):
        print(f"  {t!r:>20}: {c}")
    assert sum(after.values()) == sum(before.values()), "row count drift!"
    # Sanity: the retag must not have altered any state_code.
    state_codes_after = [r.get("state_code") for r in cities]
    assert state_codes_after == state_codes_before, "state_code drift!"
    # Sanity: no row in ADMIN_TYPES remains.
    leftover = sum(after[t] for t in ADMIN_TYPES if t in after)
    assert leftover == 0, f"{leftover} admin-typed rows still present"

    with CITIES_JSON.open("w", encoding="utf-8") as fh:
        json.dump(cities, fh, ensure_ascii=False, indent=2)
        fh.write("\n")

    print(f"\nWrote {pre_count} records to {CITIES_JSON.relative_to(REPO_ROOT)}.")
    print("Run `python3 bin/scripts/sync/normalize_json.py "
          "contributions/cities/ES.json` next to canonicalize formatting.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
