Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
76 changes: 76 additions & 0 deletions .github/fixes-docs/FIX_1498_PR_A_SUMMARY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
# FIX #1498 — Spain: drop admin-level placeholders from cities (PR-A of 2)

**Issue:** [#1498 — Bug ES: GetCity returns province-level admin entries as cities](https://github.com/dr5hn/countries-states-cities-database/issues/1498)
**Scope:** Drop 22 placeholder rows from `contributions/cities/ES.json`.
**Sibling PR:** PR-B retags ~6,920 mistyped admin-level rows (currently `type` in `adm1`/`adm2`/`adm3`) to `type='city'`. PR-B closes #1498.
**Date:** 2026-05-04

## Problem

The reporter flagged that `GetCity(country=ES, state=Madrid)` returns "Provincia de Madrid" as a city, with the same pattern across other Spanish provinces, plus a cross-province leak ("Provincia de Alicante" inside Valencia's city list). These are admin-level placeholder rows, not municipalities.

Spain's `states.json` already lists the 50 provinces as proper states, so the "Provincia de X" pseudo-cities are duplicate concepts and can be dropped without re-parenting any other data.

## Drops (22 rows)

### 21 "Provincia de X" / "Província de X" placeholders

| id | name | state_code | type |
|----|------|-----------|------|
| 36362 | Provincia de Alicante | V | city |
| 36364 | Provincia de Burgos | LE | adm2 |
| 36365 | Provincia de Cantabria | S | adm3 |
| 36373 | Provincia de Huesca | HU | adm3 |
| 36375 | Provincia de La Rioja | LO | adm3 |
| 36376 | Provincia de Las Palmas | GC | city |
| 36377 | Provincia de León | LE | adm3 |
| 36379 | Provincia de Madrid | M | section |
| 36381 | Provincia de Navarra | NA | adm1 |
| 36383 | Provincia de Palencia | LE | adm3 |
| 36385 | Provincia de Salamanca | LE | adm3 |
| 36386 | Provincia de Santa Cruz de Tenerife | GC | city |
| 36387 | Provincia de Segovia | LE | adm3 |
| 36389 | Provincia de Soria | LE | adm3 |
| 36390 | Provincia de Teruel | HU | adm3 |
| 36391 | Provincia de Valladolid | LE | city |
| 36392 | Provincia de Zamora | LE | adm3 |
| 36393 | Provincia de Zaragoza | HU | adm3 |
| 36394 | Provincia de Ávila | LE | adm3 |
| 36396 | Província de Castelló | V | adm3 |
| 36400 | Província de València | V | city |

The state_code values are themselves messy (e.g. "Provincia de Burgos" sits under `LE`/León, "Provincia de Zaragoza" under `HU`/Huesca) — further evidence these rows are stub data, not curated municipality records.

### 1 cross-state Alicante stub

| id | name | state_code | reason |
|----|------|-----------|--------|
| 32244 | Alicante | V (Valencia) | Wrong province; missing Valencian endonym |

The canonical Alicante row is **id 152158** (`Alicante/Alacant`, state_code `A`) under the Alicante province. Row 32244 is a legacy duplicate from when Valencia community was the parent of three provinces, and its `state_code='V'` is what the issue reporter flagged as the cross-province leak.

## Counts

| | Before | After |
|---|--:|--:|
| `ES.json` rows | 8,427 | 8,405 |
| Rows named `Provincia *` / `Província *` | 21 | 0 |
| Rows where `state_code='V'` and name='Alicante' | 1 | 0 |

## Implementation

`bin/scripts/fixes/spain_drop_provincia_placeholders.py` — explicit id allowlist + name/state verification per id. Refuses to touch rows in the allowlist if their name/state has shifted from what was audited. Idempotent: a second run on cleaned data writes nothing and exits 0.

## Validation (mirrors `.github/scripts/validate-*`)

- Schema: 0 errors. All rows still have name/state_id/state_code/country_id/country_code/lat/lon. country_code/country_id consistent (ES/207).
- Cross-reference: 0 errors. Every `state_id` resolves to an ES state and `state_code` matches the resolved state's `iso2`.
- Coordinates: 127 rows out of `country-bounds.json` ES box (down from 129 on master — the drop reduced OOB by 2). The 127 remaining are all Canary Islands (state_codes `TF`, `GC`) — pre-existing, same pattern as IT/Lampedusa noted in #1395.
- Duplicate scan (same name + ≤5km): 45 pairs, **unchanged from master**.
- `python3 -m json.tool` parses cleanly; `normalize_json.py` is a no-op.

## Scope

- Touches **only** the 22 placeholder rows.
- Does **not** modify `states.json` or `countries.json`.
- Does **not** close #1498. PR-B (the type-field retag of ~6,920 mistyped rows) is the closing PR.
185 changes: 185 additions & 0 deletions bin/scripts/fixes/spain_drop_provincia_placeholders.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,185 @@
#!/usr/bin/env python3
"""Drop admin-level placeholder rows from contributions/cities/ES.json.

Issue #1498. The reporter flagged that `GetCity(...)` returns province-level
admin entries inside Spanish city dropdowns ("Provincia de Madrid" etc.).
This is the city-level cleanup PR — drops 22 rows that are not real cities:

- 21 "Provincia de X" / "Província de X" rows (ids 36362, 36364, 36365,
36373, 36375, 36376, 36377, 36379, 36381, 36383, 36385, 36386, 36387,
36389, 36390, 36391, 36392, 36393, 36394, 36396, 36400). All are admin
placeholders that duplicate concepts already represented by entries in
contributions/states/states.json. The 50 Spanish provinces are already
states, so a separate "Provincia de X" pseudo-city is redundant.

- 1 cross-state placeholder (id 32244, name "Alicante", state_code=V).
The real Alicante city lives at id 152158 ("Alicante/Alacant",
state_code=A) under the Alicante province (iso2=A). Row 32244 is a
legacy stub from when Valencia community was the parent of three
provinces (Valencia, Alicante, Castellon) — Alicante now has its own
state, so the stub is wrong both in name (missing the Valencian
endonym) and in parentage.

The 21 "Provincia ..." ids are an explicit allowlist rather than a name-
prefix filter, because two real Spanish municipalities also begin with
"Provincia" in obscure forms — using ids guarantees we touch only what we
mean to. The script verifies each id's current name still matches the
expected prefix before dropping, refusing to touch rows that don't.

Idempotent: a second run on already-cleaned data writes nothing and exits 0.

Usage:
python3 bin/scripts/fixes/spain_drop_provincia_placeholders.py [--dry-run]
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import List

REPO_ROOT = Path(__file__).resolve().parents[3]
CITIES_JSON = REPO_ROOT / "contributions/cities/ES.json"

# 21 admin-placeholder ids, expected to start with "Provincia " or "Província ".
PROVINCIA_IDS = frozenset({
36362, 36364, 36365, 36373, 36375, 36376, 36377, 36379, 36381, 36383,
36385, 36386, 36387, 36389, 36390, 36391, 36392, 36393, 36394, 36396,
36400,
})

# 1 cross-state placeholder. id, expected name, expected wrong state_code.
CROSS_STATE_ALICANTE_ID = 32244
CROSS_STATE_ALICANTE_NAME = "Alicante"
CROSS_STATE_ALICANTE_STATE_CODE = "V"

EXPECTED_DROP_COUNT = len(PROVINCIA_IDS) + 1 # 22
EXPECTED_PRE_COUNT = 8427
EXPECTED_POST_COUNT = EXPECTED_PRE_COUNT - EXPECTED_DROP_COUNT # 8405


def load_cities(path: Path) -> List[dict]:
"""Read the cities JSON file and verify the top-level shape."""
with path.open("r", encoding="utf-8") as fh:
data = json.load(fh)
if not isinstance(data, list):
raise SystemExit(f"Expected JSON array at {path}, got {type(data).__name__}")
return data


def matches_provincia(row: dict) -> bool:
"""Row is one of the 21 'Provincia ...' admin placeholders, name verified."""
if row.get("id") not in PROVINCIA_IDS:
return False
name = row.get("name", "")
return isinstance(name, str) and (
name.startswith("Provincia ") or name.startswith("Província ")
)


def matches_cross_state_alicante(row: dict) -> bool:
"""Row is the cross-state Alicante stub (id 32244, name 'Alicante', state V)."""
return (
row.get("id") == CROSS_STATE_ALICANTE_ID
and row.get("name") == CROSS_STATE_ALICANTE_NAME
and row.get("state_code") == CROSS_STATE_ALICANTE_STATE_CODE
)


def is_target(row: dict) -> bool:
"""True if this row is on the drop list."""
return matches_provincia(row) or matches_cross_state_alicante(row)


def main() -> int:
parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
parser.add_argument(
"--dry-run",
action="store_true",
help="Report what would be dropped without rewriting the file.",
)
args = parser.parse_args()

if not CITIES_JSON.exists():
raise SystemExit(f"Cities file not found: {CITIES_JSON}")

cities = load_cities(CITIES_JSON)
pre_count = len(cities)

to_drop = [row for row in cities if is_target(row)]
kept = [row for row in cities if not is_target(row)]

drop_count = len(to_drop)
post_count = len(kept)

# Defensive check: any row whose id is in our allowlist but whose name
# doesn't match the expected pattern (could indicate the data has
# shifted since this script was written).
rows_by_id = {row.get("id"): row for row in cities}
suspicious: List[dict] = []
for pid in PROVINCIA_IDS:
row = rows_by_id.get(pid)
if row is not None and not matches_provincia(row):
suspicious.append(row)
cross = rows_by_id.get(CROSS_STATE_ALICANTE_ID)
if cross is not None and not matches_cross_state_alicante(cross):
suspicious.append(cross)

print(f"Input file: {CITIES_JSON.relative_to(REPO_ROOT)}")
print(f"Pre-clean count: {pre_count}")
print(f"Provincia drops: "
f"{sum(1 for r in to_drop if matches_provincia(r))} / "
f"{len(PROVINCIA_IDS)} expected")
print(f"Cross-state Alicante drop: "
f"{sum(1 for r in to_drop if matches_cross_state_alicante(r))} / 1 expected")
print(f"Total drops: {drop_count}")
print(f"Post-clean count: {post_count}")

if suspicious:
print(
"\nERROR: target ids exist but do not match expected name/state — "
"refusing to drop them:",
file=sys.stderr,
)
for row in suspicious:
print(
f" id={row.get('id')} name={row.get('name')!r} "
f"state_code={row.get('state_code')!r}",
file=sys.stderr,
)
return 2

if drop_count == 0:
print("\nNo placeholder rows found — nothing to do (idempotent).")
return 0

if drop_count != EXPECTED_DROP_COUNT or pre_count != EXPECTED_PRE_COUNT:
print(
f"\nWARNING: counts diverge from the expected baseline "
f"({EXPECTED_PRE_COUNT} -> drop {EXPECTED_DROP_COUNT} -> "
f"{EXPECTED_POST_COUNT}). Got {pre_count} -> drop {drop_count} "
f"-> {post_count}. Review the diff before merging.",
file=sys.stderr,
)

if args.dry_run:
print("\n--dry-run: not writing.")
for r in to_drop:
print(f" would drop id={r['id']} name={r['name']!r} "
f"state_code={r.get('state_code')!r} type={r.get('type')!r}")
return 0

with CITIES_JSON.open("w", encoding="utf-8") as fh:
json.dump(kept, fh, ensure_ascii=False, indent=2)
fh.write("\n")

print(f"\nWrote {post_count} records to {CITIES_JSON.relative_to(REPO_ROOT)}.")
print("Run `python3 bin/scripts/sync/normalize_json.py "
"contributions/cities/ES.json` next to canonicalize formatting.")
return 0


if __name__ == "__main__":
sys.exit(main())
Loading
Loading