
feat(FR): diff cities against data.gouv.fr, add missing metropolitan communes (#1352 PR-A)#1394

Merged: dr5hn merged 2 commits into master from feat/issue-1352-france-cities-diff on Apr 27, 2026

Owner

@dr5hn dr5hn commented Apr 25, 2026

Summary

Adds 455 metropolitan French communes (population ≥ 2,000) that were missing from contributions/cities/FR.json relative to the canonical INSEE commune list at data.gouv.fr.

This is PR-A of 4 in the issue #1352 plan — siblings PR-B (region reclassification), PR-C (obsolete/merged commune cleanup), and PR-D (overseas territories) are tracked separately and explicitly out of scope here.

Refs #1352; does not close the issue.

What this PR does

  • +455 records in contributions/cities/FR.json. All metropolitan, all population ≥ 2,000, all with coordinates and population from geo.api.gouv.fr. Top adds include Cherbourg-en-Cotentin (78K), Évry-Courcouronnes (66K), Saint-Ouen-sur-Seine (53K), Oullins-Pierre-Bénite (38K), Herblay-sur-Seine (32K), Le Chesnay-Rocquencourt (31K).
  • New script bin/scripts/fixes/france_cities_diff.py — pure-Python, dependency-free; matches by (state_code, normalised_name) with œ/æ ligature and lès/lez preposition handling.
  • Diagnostic report committed at bin/scripts/fixes/france_cities_diff.report.json for reviewer audit and follow-up PRs.
  • Full methodology in .github/fixes-docs/FIX_1352_PR_A_SUMMARY.md.
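The matching key described above can be sketched roughly as follows. This is a minimal illustration, not the code in france_cities_diff.py: the helper names `normalise_name` and `match_key`, the ligature table, and the choice to fold "lez" into "les" are all assumptions made for the example.

```python
import re
import unicodedata

# Assumed ligature table; the PR only states that œ/æ are handled.
_LIGATURES = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"}


def normalise_name(name: str) -> str:
    """Return a comparison key for a commune name (hypothetical helper)."""
    # Expand ligatures first: NFKD does not decompose œ or æ.
    for ligature, expansion in _LIGATURES.items():
        name = name.replace(ligature, expansion)
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then turn hyphens, apostrophes and other punctuation into spaces.
    key = re.sub(r"[^a-z0-9]+", " ", stripped.lower()).strip()
    # Assumption: treat the "lez" spelling like "lès" (which is "les" by now).
    tokens = ["les" if token == "lez" else token for token in key.split()]
    return " ".join(tokens)


def match_key(state_code: str, name: str) -> tuple:
    """Build the (state_code, normalised_name) pair used for matching."""
    return (state_code, normalise_name(name))
```

With this sketch, `normalise_name("Villeneuve-lès-Avignon")` and `normalise_name("Villeneuve-lez-Avignon")` produce the same key. One side effect worth noting: the plural article "les" also collapses into the same token, which is acceptable for a conservative matcher but may need refinement.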

Diagnostic findings flagged for sibling PRs (NOT fixed here)

| Category | Count | Owner |
| --- | --- | --- |
| cross_region_matches (CSC city under wrong region, incl. Ajaccio/Bastia under 20R instead of 2A/2B) | 1,194 | PR-B |
| extra CSC records (obsolete/merged communes like Aime/Annecy-le-Vieux/Ancenis; Marseille quartiers like Arenc/La Villette; dept names mistakenly stored as cities like Alpes-Maritimes/Ardennes) | 643 | PR-C |
| Lower-population missing communes (deferred) | 23,663 | future PR-A.2 |
| Overseas territories (excluded) | | PR-D |

Test plan

  • Local schema validator: 0 errors, 2,275 warnings — all "unknown field" warnings for type/level/parent_id/native/population, identical to those produced by the existing 10,079 records (project convention; warnings, not blockers).
  • Cross-reference validator: every new record's state_id/country_id resolves; codes match.
  • Coordinate-bounds validator: every new record inside FR metropolitan box (lat 41.36–51.09, lon −5.14–9.56).
  • Duplicate detector: 0 exact-name same-state duplicates; 1 fuzzy match (Bréhan vs Rohan, dept 56) — verified to be two genuinely different communes 4.3 km apart with INSEE codes 56025 vs 56196.
  • JSON formatting: matches canonical 2-space indent.
  • No id/created_at/updated_at/flag fields on new records.
  • CI PR validators (run automatically on this PR).
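Two of the checks above (coordinate bounds and the absence of managed fields) are simple enough to sketch. The bounding box and field names come from the test plan; the function name `check_record` and the message strings are hypothetical, and the project's actual validators may work differently.

```python
# Bounding box quoted in the test plan for metropolitan France.
LAT_MIN, LAT_MAX = 41.36, 51.09
LON_MIN, LON_MAX = -5.14, 9.56

# Fields that new contribution records must not carry, per the test plan.
MANAGED_FIELDS = ("id", "created_at", "updated_at", "flag")


def check_record(record: dict) -> list:
    """Return a list of problems for one city record (hypothetical helper)."""
    problems = []
    # float() accepts both numeric and string-encoded coordinates.
    lat = float(record["latitude"])
    lon = float(record["longitude"])
    if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
        problems.append("coordinates outside metropolitan box")
    for field in MANAGED_FIELDS:
        if field in record:
            problems.append(f"unexpected field: {field}")
    return problems
```

Running it over every new record and asserting that the combined problem list is empty would reproduce the "0 violations" result reported above, under the stated assumptions.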

Reviewer notes

  • The script can be re-run by reviewers with their own villes.min.json and geo.api.gouv.fr dumps. The --pop-threshold arg is exposed so the threshold can be tuned without code changes.
  • france_cities_diff.merge.json and france_cities_diff.deferred.json are not committed — the merge is now in FR.json and the deferred set (~7 MB) is regeneratable.
  • Translations and wikiDataId are intentionally left empty for new records rather than synthesised; can be backfilled in a future pass.
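For reviewers unfamiliar with how such a threshold is typically wired up, here is a minimal sketch of a `--pop-threshold` option and the merge/deferred split it would drive. Only the flag name comes from the notes above; the default of 2000, the `build_parser` and `split_by_threshold` names, and the record shape are assumptions, not the script's actual interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the population threshold (illustrative only)."""
    parser = argparse.ArgumentParser(description="Diff FR cities against INSEE")
    parser.add_argument(
        "--pop-threshold",
        type=int,
        default=2000,  # assumed default, matching the threshold used in this PR
        help="minimum population for a missing commune to be merged",
    )
    return parser


def split_by_threshold(communes: list, threshold: int) -> tuple:
    """Split missing communes into (merge, deferred) by population."""
    merge = [c for c in communes if c["population"] >= threshold]
    deferred = [c for c in communes if c["population"] < threshold]
    return merge, deferred
```

Raising the threshold shrinks the merge set and grows the deferred set without any code change, which is the behaviour the reviewer note describes.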

🤖 Generated with Claude Code

Owner Author

dr5hn commented Apr 27, 2026

Weekly data-quality review (2026-04-27)

Verdict: clean

Checks

  • Schema: ✅ Spot-checked tail of contributions/cities/FR.json on the PR branch: all new records correctly omit id, flag, created_at, updated_at. Fields present (name, state_id, state_code, country_id, country_code, latitude, longitude, type, level, parent_id, native, population, timezone) are consistent with existing FR city records.
  • FK integrity: ✅ Author ran cross-reference validator; all new state_id/country_id values resolve. Spot-checked: state_code values ARA, GES, IDF, NOR, BRE, BFC, PDL all correspond to valid FR region states in states.json. country_id=75 / country_code="FR" consistent throughout.
  • Coordinates: ✅ All 455 records within FR metropolitan bounds (41.36–51.09°N, −5.14–9.56°E). Spot-checked: Évry-Courcouronnes 48.63°N 2.43°E ✅; Wingersheim 48.72°N 7.62°E ✅; Vézeronce-Curtin 45.65°N 5.47°E ✅.
  • Wikidata: N/A — new records intentionally omit wikiDataId (to be backfilled in a future pass; stated in PR description).
  • Naming convention: ✅ French commune names used in both name and native — correct convention for French municipalities where French is the canonical name. No English-only text in native; no non-Latin script in name.

Advisory (non-blocking)

🤖 Automated weekly review — Claude (sonnet-4-6).


Generated by Claude Code

dr5hn and others added 2 commits April 27, 2026 21:18
…communes (#1352 PR-A)

Adds 455 metropolitan French communes (population ≥ 2,000) that were missing
from contributions/cities/FR.json relative to the canonical INSEE list at
data.gouv.fr. Includes large communes-nouvelles created since 2015 — e.g.,
Cherbourg-en-Cotentin (78K), Évry-Courcouronnes (66K), Saint-Ouen-sur-Seine
(53K), Oullins-Pierre-Bénite (38K).

The diff script (bin/scripts/fixes/france_cities_diff.py) produces a structured
report and a conservative merge proposal:
  - Matches by (state_code, normalised name); normalisation handles œ/æ
    ligatures and lès/lez preposition variants.
  - Department overrides for 2A/2B/48/52/55 follow existing FR.json
    convention.
  - 1,194 cross-region matches (cities under wrong state) are flagged for
    PR-B, not auto-moved.
  - 643 "extra" CSC records (obsolete/merged communes, quartiers, dept names)
    are flagged for PR-C.
  - Overseas territories excluded (PR-D).

Validation: 0 schema errors, 0 cross-reference errors, 0 coord-bounds
violations, 0 exact-name same-state duplicates. Full breakdown in
.github/fixes-docs/FIX_1352_PR_A_SUMMARY.md.

Refs #1352

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e_code

After cherry-picking PR-A onto post-PR-E master, ran
france_cities_remap.py (the script committed in PR-E #1484) to
remap the 455 newly-added communes from their authored region codes
(NOR, PDL, ARA, etc.) to the correct INSEE department codes
(50 for Manche, 14 for Calvados, etc.).

Verified: 0 region-coded rows remain, 0 invalid state_ids.
Allier (state_code=03) goes from 59 to 60 cities.

Refs: #1352

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
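The "0 region-coded rows remain" check in the commit message above could be approximated as shown below. This is a sketch under the assumption that metropolitan department codes are either two digits or `2A`/`2B`; the pattern and the `region_coded_rows` name are illustrative and do not come from france_cities_remap.py.

```python
import re

# Assumption: a metropolitan department code is two digits, or 2A/2B for Corsica.
DEPT_CODE = re.compile(r"(?:\d{2}|2[AB])")


def region_coded_rows(records: list) -> list:
    """Return records whose state_code does not look like a department code."""
    return [r for r in records if not DEPT_CODE.fullmatch(r["state_code"])]
```

An empty result over the 455 new records would correspond to the verification reported in the commit. Any special collectivity that uses a different code shape would need an explicit allowance.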
@dr5hn dr5hn force-pushed the feat/issue-1352-france-cities-diff branch from caac707 to a674624 April 27, 2026 15:49
@dr5hn dr5hn marked this pull request as ready for review April 27, 2026 15:49
Copilot AI review requested due to automatic review settings April 27, 2026 15:49
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 27, 2026
@dr5hn dr5hn merged commit 9cdc00c into master Apr 27, 2026
2 checks passed
@dr5hn dr5hn deleted the feat/issue-1352-france-cities-diff branch April 27, 2026 15:49
@dosubot dosubot Bot added the enhancement New feature or request label Apr 27, 2026

Copilot AI left a comment


Pull request overview

Adds a France-focused diff/merge workflow to identify missing metropolitan communes against INSEE (data.gouv.fr), commit an audit report, and document the methodology for issue #1352 (PR-A of a planned series).

Changes:

  • Add 455 metropolitan French commune records to contributions/cities/FR.json (population ≥ 2,000).
  • Add a new diff script (bin/scripts/fixes/france_cities_diff.py) plus a committed diagnostic report JSON artifact.
  • Add a methodology write-up under .github/fixes-docs/.

Reviewed changes

Copilot reviewed 1 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| contributions/cities/FR.json | Adds missing metropolitan communes to the contributed France cities dataset. |
| bin/scripts/fixes/france_cities_diff.py | New script to diff CSC vs INSEE/geo.api.gouv.fr and generate merge/report artifacts. |
| bin/scripts/fixes/france_cities_diff.report.json | Committed snapshot of diff statistics and representative samples for reviewer audit. |
| .github/fixes-docs/FIX_1352_PR_A_SUMMARY.md | Documents scope, sources, matching strategy, and validation for PR-A. |

Comment on lines +78 to +85
- `target_state_code` is derived from the upstream `(departement, region)` pair using:
  - INSEE region code → CSC region iso2 (e.g., `84 → ARA`, `11 → IDF`, `94 → 2A/2B`).
  - **Department-level overrides** for the five departments that the existing
    FR.json stores at department-level rather than region-level: `2A`, `2B`,
    `48`, `52`, `55`. (Discovered empirically; new records in those depts must
    follow suit to match existing convention.)

If the primary state lookup misses, the script also tries every other metropolitan CSC state code. A hit there is **not** a successful match; it is a *cross-region match*, flagged for PR-B (region reclassification). The upstream record is counted as "missing" only if no fallback hit exists either.
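The lookup order described in this excerpt can be sketched as follows. The region table contains only the two pairs quoted above and is deliberately incomplete; `target_state_code`, `classify`, the `name_key` field, and the set-based index are hypothetical names chosen for the illustration, not the script's own.

```python
from typing import Optional

# Only the pairs quoted in the excerpt; the real table covers every region.
REGION_TO_CSC = {"84": "ARA", "11": "IDF"}
# Departments stored at department level in the existing FR.json.
DEPT_OVERRIDES = {"2A", "2B", "48", "52", "55"}


def target_state_code(departement: str, region: str) -> Optional[str]:
    """Pick the CSC state code for an upstream (departement, region) pair."""
    if departement in DEPT_OVERRIDES:
        return departement
    return REGION_TO_CSC.get(region)


def classify(upstream: dict, csc_index: set, all_state_codes: list) -> str:
    """Return 'matched', 'cross_region' or 'missing' for one upstream commune."""
    primary = target_state_code(upstream["departement"], upstream["region"])
    name_key = upstream["name_key"]
    if primary is not None and (primary, name_key) in csc_index:
        return "matched"
    # Fallback: same name under any other state is a cross-region match.
    for code in all_state_codes:
        if code != primary and (code, name_key) in csc_index:
            return "cross_region"
    return "missing"
```

Keeping cross-region hits out of the "missing" bucket is what prevents the merge from adding a duplicate of a city that already exists under the wrong state.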
Comment on lines +100 to +105
Following the existing FR.json convention:

```json
{
"name": "Évry-Courcouronnes", // INSEE official French name
"state_id": 4796, "state_code": "IDF",
```
dr5hn added a commit that referenced this pull request Apr 27, 2026
…hy (#1489)

Customer-facing follow-up to #1349 (Italy) and #1352 (France). Cities
were re-parented onto departments (FR) and provinces (IT) by #1395 /
#1394 / #1393 / #1400 / #1484, but the state records themselves still
carried inconsistent 'level' values, blocking downstream filters like
"all departments == level=2" or "all regions == level=1".

bin/scripts/fixes/states_level_normalise.py drives the change:
  - FR: 29 region-tier rows None -> 1 (13 metro regions, 3 special
        metro collectivities incl. Corse + Alsace + Métropole de Lyon,
        13 overseas regions/collectivities/territories/dependency).
        95 metropolitan departments unchanged at level=2.
  - IT: 103 rows updated. Final state: 20 at level=1
        (15 region + 5 autonomous region) and 106 at level=2
        (80 province + 14 metropolitan city + 6 free municipal
        consortium + 4 decentralized regional entity + 2 autonomous
        province).

Only the 'level' field is touched; idempotent on re-run; non-FR/IT
states untouched.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
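The idempotent, level-only update described in this commit might look like the sketch below. The level assignments for Italy follow the commit message; the exact `type` strings stored in the data, the `country_code` field on state records, and the `normalise_levels` name are assumptions and may not match states_level_normalise.py.

```python
# Level per state type for Italy, following the commit message.
# The exact type strings are an assumption about how the data spells them.
IT_LEVEL_BY_TYPE = {
    "region": 1,
    "autonomous region": 1,
    "province": 2,
    "metropolitan city": 2,
    "free municipal consortium": 2,
    "decentralized regional entity": 2,
    "autonomous province": 2,
}


def normalise_levels(states: list, level_by_type: dict, country_code: str) -> int:
    """Set 'level' from 'type' for one country; return the number of rows changed."""
    changed = 0
    for state in states:
        if state.get("country_code") != country_code:
            continue  # states of other countries are left untouched
        target = level_by_type.get(state.get("type"))
        if target is not None and state.get("level") != target:
            state["level"] = target  # only the 'level' field is modified
            changed += 1
    return changed
```

Because a row is rewritten only when its level differs from the target, a second run reports zero changes, which is the idempotence property the commit message claims.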
