fix(ES): retag 6,920 mistyped admin rows to type=city (#1498 follow-up) #1517
Merged
Drops 22 placeholder records from contributions/cities/ES.json:
- 21 "Provincia de X" / "Província de X" rows (ids 36362, 36364, 36365,
36373, 36375, 36376, 36377, 36379, 36381, 36383, 36385, 36386, 36387,
36389, 36390, 36391, 36392, 36393, 36394, 36396, 36400). Spanish
provinces are already represented as proper states in states.json,
making these pseudo-cities duplicate concepts. Their own state_code
values are inconsistent (e.g. "Provincia de Burgos" parented under
state_code=LE), confirming stub-data status.
- 1 cross-state Alicante stub (id 32244, state_code=V) flagged by the
reporter as a cross-province leak in Valencia's city list. Canonical
row is id 152158 ("Alicante/Alacant", state_code=A).
Counts: 8,427 -> 8,405 rows. Out-of-bounds coordinate violations drop
from 129 to 127 (the dropped stubs included 2 invalid coords). 0 schema
errors, 0 cross-reference errors, same-name <5km duplicate pairs
unchanged at 45 (all pre-existing).
Refs #1498. Does not close it -- PR-B follow-up retags ~6,920 mistyped
admin-level rows to type=city.
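The drop described in this commit message amounts to filtering rows by id and checking the resulting count. A minimal sketch, assuming each row is a dict with a unique `id` key; the helper name `drop_rows`, the sample rows, and the `name` values are hypothetical and not taken from the actual fix script:

```python
# Hypothetical subset of the ids listed above: the Alicante stub plus
# two of the 21 province placeholders.
DROP_IDS = {32244, 36362, 36364}


def drop_rows(rows, drop_ids):
    """Return the rows whose id is not in drop_ids.

    Raises ValueError if any requested id is absent, so a stale id list
    cannot silently drop fewer rows than intended.
    """
    present = {row["id"] for row in rows}
    missing = drop_ids - present
    if missing:
        raise ValueError(f"ids not found: {sorted(missing)}")
    kept = [row for row in rows if row["id"] not in drop_ids]
    # Holds as long as ids are unique (an assumption of this sketch).
    assert len(kept) == len(rows) - len(drop_ids)
    return kept


# Illustrative rows only; names are placeholders.
sample = [
    {"id": 32244, "name": "stub", "state_code": "V"},
    {"id": 152158, "name": "Alicante/Alacant", "state_code": "A"},
    {"id": 36362, "name": "Provincia de X", "state_code": "LE"},
    {"id": 36364, "name": "Provincia de Y", "state_code": "LE"},
]
kept = drop_rows(sample, DROP_IDS)
```

Failing loudly on a missing id is a design choice for this sketch: it keeps the before/after row counts predictable (here 4 -> 1; in the PR, 8,427 -> 8,405).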
After PR-A dropped 22 admin-level placeholders, 6,920 rows in
contributions/cities/ES.json still carried type='adm1', 'adm2', or
'adm3' even though they are real Spanish municipalities (provincial
capitals typed adm1/adm2; small towns typed adm3). Consuming apps that
filter on type='city' were excluding these from city dropdowns -- the
failure mode the issue reporter described.
Spot-check (60 random rows across all three adm levels) confirmed every
sample is a real Spanish municipality with coordinates, and 99%+ have
wikiDataId references. Aggregate signals: 6,913/6,920 have wikiDataId,
6,516/6,920 have non-zero population, 6,920/6,920 have coordinates.
Type counts (post PR-A -> post PR-B):
- adm3: 6,860 -> 0
- adm2: 40 -> 0
- adm1: 20 -> 0
- city: 1,416 -> 8,336
- section: 60 -> 60 (left alone -- mixed quality, needs row-level review)
- locality: 6 -> 6
- capital/historical_capital/adm4: 1 each -> 1 each
Diff is 6,920 single-field mutations only -- coordinates, names, state
codes, populations all byte-for-byte unchanged. Schema/cross-reference
validators report 0 errors; coord-bounds and same-name <5km duplicate
counts identical to PR-A head.
Closes #1498.
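The retag itself is a one-field rewrite guarded by two assertions. A minimal sketch under stated assumptions: rows are dicts with a `type` key, and the function name `retag_admin_rows` and the sample rows are hypothetical. It illustrates the technique and is not the contents of the actual script:

```python
ADMIN_TYPES = {"adm1", "adm2", "adm3"}


def retag_admin_rows(rows):
    """Set type='city' on every adm1/adm2/adm3 row, in place.

    Returns the number of rows changed. No other field is touched.
    """
    count_before = len(rows)
    changed = 0
    for row in rows:
        if row.get("type") in ADMIN_TYPES:
            row["type"] = "city"
            changed += 1
    # Row count is preserved and no admin-typed rows remain.
    assert len(rows) == count_before
    assert not any(row.get("type") in ADMIN_TYPES for row in rows)
    return changed


# Illustrative rows only.
rows = [
    {"id": 1, "name": "A", "type": "adm1"},
    {"id": 2, "name": "B", "type": "adm3"},
    {"id": 3, "name": "C", "type": "section"},
    {"id": 4, "name": "D", "type": "city"},
]
first_pass = retag_admin_rows(rows)
second_pass = retag_admin_rows(rows)  # idempotent: nothing left to retag
```

Because the candidate test is "type is one of the admin values", a second run finds no candidates, which is the idempotence property the PR relies on.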
Refs #1498. Stacks on top of #1516 — please review/merge that one first.
Bulk-retags 6,920 rows in contributions/cities/ES.json whose type field is adm1/adm2/adm3 to type='city'. Touches only the type field: coordinates, names, state codes, populations all unchanged.
Why these are real cities, not admin regions
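Spot-check across a 60-row random sample (20 of each adm type): every sampled row is a real Spanish municipality with coordinates. Across all 6,920 retagged rows, 6,913 have a wikiDataId, 6,516 have non-zero population, and all 6,920 have coordinates.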
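Type counts (post PR-A -> post PR-B)
- adm3: 6,860 -> 0
- adm2: 40 -> 0
- adm1: 20 -> 0
- city: 1,416 -> 8,336
- section: 60 -> 60
- locality: 6 -> 6
- capital/historical_capital/adm4: 1 each -> 1 each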
(The brief estimated "6,936 → 8,357 city". Actual is 6,920 → 8,336 because 16 of the 22 rows PR-A dropped were already typed adm1/adm2/adm3.)
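The type counts quoted in this PR can be cross-checked arithmetically: the three admin types should empty into city, and every other type should be unchanged. A minimal sketch using `collections.Counter`; the function names are hypothetical, while the numbers are the ones stated in the PR:

```python
from collections import Counter

ADMIN_TYPES = ("adm1", "adm2", "adm3")


def type_counts(rows):
    """Tally rows by their type field."""
    return Counter(row.get("type") for row in rows)


def check_retag_counts(before, after):
    """Verify that admin types moved into 'city' and nothing else changed.

    Returns the number of rows that moved.
    """
    moved = sum(before[t] for t in ADMIN_TYPES)
    assert all(after[t] == 0 for t in ADMIN_TYPES)
    assert after["city"] == before["city"] + moved
    others = (set(before) | set(after)) - set(ADMIN_TYPES) - {"city"}
    assert all(before[t] == after[t] for t in others)
    return moved


# Counts as quoted in the PR (post PR-A, then post PR-B).
before = Counter({
    "adm3": 6860, "adm2": 40, "adm1": 20, "city": 1416, "section": 60,
    "locality": 6, "capital": 1, "historical_capital": 1, "adm4": 1,
})
after = Counter({
    "city": 8336, "section": 60, "locality": 6,
    "capital": 1, "historical_capital": 1, "adm4": 1,
})
moved = check_retag_counts(before, after)
total_before = sum(before.values())
total_after = sum(after.values())
```

On a real file, `type_counts(rows)` would produce the two counters from the before and after snapshots instead of the literals used here.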
Out of scope (deliberately not touched)
- type='section' (60 rows): mixed; some are real Madrid/Barcelona neighbourhoods, some look like real towns. Needs row-by-row review, not bulk retag.
- type='locality' (6 rows), type='capital'/'historical_capital'/'adm4' (1 each): left alone.
Implementation
bin/scripts/fixes/spain_retag_admin_types.py: single-pass mutation. Asserts row count is preserved and zero admin-typed rows remain. Idempotent: re-run produces 0 candidates.
Validation
- [...] TF/GC), identical to PR-A head.
- [...] type field. Zero collateral changes.
Constraints honoured
- [...] type field of country_code='ES' rows.
- [...] states.json or countries.json.
Closes #1498.
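The "zero collateral changes" claim under Validation can be checked by comparing the before and after snapshots field by field. A minimal sketch, assuming rows are dicts keyed by a unique `id`; the function name `changed_fields` and the sample data are hypothetical:

```python
_MISSING = object()  # sentinel so an absent key differs from a None value


def changed_fields(before_rows, after_rows):
    """Map row id -> set of field names whose values differ.

    Rows with no differences are omitted from the result.
    """
    after_by_id = {row["id"]: row for row in after_rows}
    # Same set of ids on both sides: nothing added, nothing dropped.
    assert set(after_by_id) == {row["id"] for row in before_rows}
    diffs = {}
    for old in before_rows:
        new = after_by_id[old["id"]]
        fields = {
            key
            for key in old.keys() | new.keys()
            if old.get(key, _MISSING) != new.get(key, _MISSING)
        }
        if fields:
            diffs[old["id"]] = fields
    return diffs


# Illustrative snapshots only.
before_rows = [
    {"id": 1, "name": "A", "state_code": "M", "type": "adm3"},
    {"id": 2, "name": "B", "state_code": "B", "type": "city"},
]
after_rows = [
    {"id": 1, "name": "A", "state_code": "M", "type": "city"},
    {"id": 2, "name": "B", "state_code": "B", "type": "city"},
]
diffs = changed_fields(before_rows, after_rows)
only_type_changed = all(fields == {"type"} for fields in diffs.values())
```

For this PR the expectation would be 6,920 entries in `diffs`, each equal to `{"type"}`.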