fix(ES): retag 6,920 mistyped admin rows to type=city (#1498 follow-up) #1517

Merged: dr5hn merged 2 commits into master from fix/issue-1498-es-retag-admin-types on May 5, 2026

Conversation

@dr5hn (Owner) commented May 4, 2026

Refs #1498. Stacks on top of #1516 — please review/merge that one first.

Bulk-retags 6,920 rows in contributions/cities/ES.json whose type field is adm1/adm2/adm3 to type='city'. Touches only the type field — coordinates, names, state codes, populations all unchanged.

Why these are real cities, not admin regions

Spot-check across a 60-row random sample (20 of each adm type):

  • adm1 (20 in file): Barcelona, Valencia, Sevilla, Zaragoza, Murcia, Pamplona, Valladolid, Las Palmas, Santiago de Compostela, Santander, Toledo, Mérida, Logroño, Vitoria-Gasteiz, Oviedo, Palma, …
  • adm2 (40 in file): Córdoba, Málaga, Lleida, Girona, Soria, Segovia, Jaén, Lugo, Albacete, Guadalajara, Palencia, Zamora, Castelló de la Plana, Ciudad Real, Alicante/Alacant, …
  • adm3 (6,860 in file): real Spanish municipalities, all with coordinates, most with population.

Type   Count   wikiDataId    population>0   coords
adm1   20      20/20         20/20          20/20
adm2   40      39/40         38/40          40/40
adm3   6,860   6,854/6,860   6,458/6,860    6,860/6,860
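
The per-type signals in the table can be recomputed with a few lines of Python. This is a minimal sketch, not the tooling used for this PR. It assumes the file is a JSON array of objects, that coordinates live in `latitude` and `longitude` fields, and that `population` is numeric or null. Those field names and types are assumptions; only `type`, `wikiDataId`, and population are named in this description.

```python
from collections import defaultdict

ADMIN_TYPES = {"adm1", "adm2", "adm3"}


def signal_counts(rows):
    """Per admin type, count rows with a wikiDataId, population > 0, and coordinates."""
    stats = defaultdict(
        lambda: {"count": 0, "wikiDataId": 0, "population": 0, "coords": 0}
    )
    for row in rows:
        row_type = row.get("type")
        if row_type not in ADMIN_TYPES:
            continue
        s = stats[row_type]
        s["count"] += 1
        # Truthy wikiDataId counts as present (None and "" do not).
        s["wikiDataId"] += bool(row.get("wikiDataId"))
        # Assumes population is a number or None.
        s["population"] += (row.get("population") or 0) > 0
        # "latitude"/"longitude" are assumed field names.
        has_lat = row.get("latitude") is not None
        has_lon = row.get("longitude") is not None
        s["coords"] += has_lat and has_lon
    return dict(stats)
```

Running it on the pre-retag rows loaded with `json.load` would give one dictionary per adm type to compare against the table.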

Type counts

Type                                  Before   After
city                                  1,416    8,336
section                               60       60
adm3                                  6,860    0
adm2                                  40       0
adm1                                  20       0
locality                              6        6
capital / historical_capital / adm4   1 each   1 each
Total                                 8,405    8,405

(The brief estimated 6,936 retagged rows and 8,357 city rows afterwards. The actual figures are 6,920 and 8,336, because 16 of the 22 rows PR-A dropped were already typed adm1/adm2/adm3.)
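
The arithmetic behind the type counts can be checked directly. The sketch below hard-codes the "Before" column from this description and confirms the retag total, the resulting city count, and the file total. `type_counts` is a hypothetical helper for recomputing the same column from loaded rows.

```python
from collections import Counter


def type_counts(rows):
    """Count rows per value of the type field (hypothetical helper)."""
    return Counter(row.get("type") for row in rows)


# "Before" column, copied from the type-count figures in this description.
before = {
    "city": 1416, "section": 60, "adm3": 6860, "adm2": 40, "adm1": 20,
    "locality": 6, "capital": 1, "historical_capital": 1, "adm4": 1,
}

retagged = sum(before[t] for t in ("adm1", "adm2", "adm3"))
city_after = before["city"] + retagged
total = sum(before.values())

print(retagged, city_after, total)  # 6920 8336 8405
```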

Out of scope (deliberately not touched)

  • type='section' (60 rows): mixed — some are real Madrid/Barcelona neighbourhoods, some look like real towns. Needs row-by-row review, not bulk retag.
  • type='locality' (6 rows), type='capital' / 'historical_capital' / 'adm4' (1 each): left alone.

Implementation

bin/scripts/fixes/spain_retag_admin_types.py — single-pass mutation. Asserts row count is preserved and zero admin-typed rows remain. Idempotent: re-run produces 0 candidates.
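
For readers without the script open, the core of such a single-pass retag might look like the sketch below. It is an illustration under assumptions, not the contents of spain_retag_admin_types.py: the function name is hypothetical, and file loading and writing are left out.

```python
ADMIN_TYPES = {"adm1", "adm2", "adm3"}


def retag_admin_types(rows):
    """Set type='city' on admin-typed rows in place and return how many changed."""
    row_count = len(rows)
    changed = 0
    for row in rows:
        if row.get("type") in ADMIN_TYPES:
            row["type"] = "city"  # the only field that is ever written
            changed += 1
    # Post-conditions mirrored from the description: row count is preserved
    # and no admin-typed rows remain.
    assert len(rows) == row_count
    assert not any(row.get("type") in ADMIN_TYPES for row in rows)
    return changed
```

Because already-retagged rows no longer match `ADMIN_TYPES`, a second call returns 0, which is the idempotence property described above.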

Validation

  • Schema: 0 errors.
  • Cross-reference: 0 errors. State codes byte-for-byte unchanged, so 0 leakage.
  • Coordinate-bounds: 127 out-of-bounds (OOB) violations (Canary Islands TF/GC), identical to PR-A head.
  • Same-name + ≤5km duplicate pairs: 45, identical to PR-A head.
  • Diff inspection: 6,920 row mutations, every change is exactly the type field. Zero collateral changes.
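
The diff inspection can be approximated programmatically by comparing the file before and after, row by row. This is a hedged sketch: it assumes every row carries a unique `id`, and it compares parsed JSON values rather than raw bytes, so it is weaker than a byte-for-byte check. Both function names are hypothetical.

```python
def changed_fields(before_rows, after_rows, key="id"):
    """Map each mutated row's id to the set of field names whose values differ."""
    after_by_id = {row[key]: row for row in after_rows}
    # Assumes ids are unique and the same set of rows exists on both sides.
    assert len(after_by_id) == len(after_rows)
    assert {row[key] for row in before_rows} == set(after_by_id)

    diffs = {}
    for old in before_rows:
        new = after_by_id[old[key]]
        fields = {f for f in old.keys() | new.keys() if old.get(f) != new.get(f)}
        if fields:
            diffs[old[key]] = fields
    return diffs


def only_type_changed(before_rows, after_rows):
    """True if every mutated row differs in exactly the type field."""
    diffs = changed_fields(before_rows, after_rows)
    return all(fields == {"type"} for fields in diffs.values())
```

On this PR, `len(changed_fields(before, after))` would be expected to equal 6,920 and `only_type_changed(before, after)` to be true.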

Constraints honoured

  • Touches only the type field of country_code='ES' rows.
  • Does not modify states.json or countries.json.

Closes #1498.

dr5hn added 2 commits May 4, 2026 19:48
Drops 22 placeholder records from contributions/cities/ES.json:

- 21 "Provincia de X" / "Província de X" rows (ids 36362, 36364, 36365,
  36373, 36375, 36376, 36377, 36379, 36381, 36383, 36385, 36386, 36387,
  36389, 36390, 36391, 36392, 36393, 36394, 36396, 36400). Spanish
  provinces are already represented as proper states in states.json,
  making these pseudo-cities duplicate concepts. Their own state_code
  values are inconsistent (e.g. "Provincia de Burgos" parented under
  state_code=LE), confirming stub-data status.

- 1 cross-state Alicante stub (id 32244, state_code=V) flagged by the
  reporter as a cross-province leak in Valencia's city list. Canonical
  row is id 152158 ("Alicante/Alacant", state_code=A).

Counts: 8,427 -> 8,405 rows. Out-of-bounds coordinate violations drop
from 129 to 127 (the dropped stubs included 2 invalid coords). 0 schema
errors, 0 cross-reference errors, same-name <5km duplicate pairs
unchanged at 45 (all pre-existing).

Refs #1498. Does not close it -- PR-B follow-up retags ~6,920 mistyped
admin-level rows to type=city.

After PR-A dropped 22 admin-level placeholders, 6,920 rows in
contributions/cities/ES.json still carried type='adm1', 'adm2', or 'adm3'
even though they are real Spanish municipalities (provincial capitals
typed adm1/adm2; small towns typed adm3). Consuming apps that filter on
type='city' were excluding these from city dropdowns -- the failure mode
the issue reporter described.

Spot-check (60 random rows across all three adm levels) confirmed every
sample is a real Spanish municipality with coordinates, and 99%+ have
wikiDataId references. Aggregate signals: 6,913/6,920 have wikiDataId,
6,516/6,920 have non-zero population, 6,920/6,920 have coordinates.

Type counts (post PR-A -> post PR-B):
  adm3:    6,860 ->     0
  adm2:       40 ->     0
  adm1:       20 ->     0
  city:    1,416 -> 8,336
  section:    60 ->    60  (left alone -- mixed quality, needs row-level review)
  locality:    6 ->     6
  capital/historical_capital/adm4: 1 each -> 1 each

Diff is 6,920 single-field mutations only -- coordinates, names, state
codes, populations all byte-for-byte unchanged. Schema/cross-reference
validators report 0 errors; coord-bounds and same-name<5km duplicate
counts identical to PR-A head.

Closes #1498.

dosubot (bot) added the size:M label (this PR changes 30-99 lines, ignoring generated files) May 4, 2026
dosubot (bot) added the fixed label (issue has been fixed) May 4, 2026
Base automatically changed from fix/issue-1498-es-drop-provincia-rows to master May 5, 2026 10:03
dosubot (bot) added the size:L label (this PR changes 100-499 lines, ignoring generated files) and removed the size:M label May 5, 2026
dr5hn merged commit de838cd into master May 5, 2026
0 of 2 checks passed
dr5hn deleted the fix/issue-1498-es-retag-admin-types branch May 5, 2026 10:04

Labels

data:cities, fixed, large-contribution, size:L

Development

Successfully merging this pull request may close these issues.

[Bug][ES] GetCity returns province-level administrative entries as cities (e.g., 'Provincia de Madrid')
