docs: multi-level territories policy (FR overseas, dual representation) (#1352 PR-C) #1392

Merged: dr5hn merged 7 commits into master from feat/issue-1352-multi-level-territories-doc on Apr 27, 2026

dr5hn (Owner) commented Apr 25, 2026

Summary

Adds MULTI_LEVEL_TERRITORIES.md at the repo root, documenting the policy decision on issue #1352 (Option C — keep both representations):

  • 12 French overseas territories (GF, PF, TF, GP, MQ, YT, NC, PM, BL, MF, WF, RE) appear simultaneously as ISO 3166-1 countries and as ISO 3166-2 subdivisions of FR. Both representations are kept; neither is canonical.
  • Documents the rationale (ISO compliance, downstream-consumer compatibility via country_code filtering, locale/timezone metadata, reversibility).
  • Provides query patterns for downstream consumers ("everything in FR", "Martinique alone", "metropolitan only").
  • Lists the existing precedent (US territories, CN SARs, NO Svalbard) and the known gaps (Greenland, Aruba, Crown Dependencies, etc.) that are not dual-modeled today and remain out of scope.
  • "Future considerations" section preserves Option A (full reclassify) as documented-but-rejected, with a sketch of what a future migration would require.

Cross-references added in:

  • .claude/CLAUDE.md — one-line bullet under Important Rules → DO
  • README.md — one-line link in the contributing/full-guide row

This is PR-C of a 4-PR Option-C plan for #1352. Pure documentation: no data changes.

Test plan

  • Markdown renders cleanly on GitHub (tables, anchors, relative links)
  • Relative links resolve: MULTI_LEVEL_TERRITORIES.md, .claude/CLAUDE.md, README.md
  • Diff is exactly 3 files: 1 new, 2 single-line edits — git diff --stat HEAD~1 confirms
  • No files in contributions/, json/, csv/, xml/, yml/, sql/, sqlite/, mongodb/, etc. were touched
  • Issue #1352 ("[Bug]: France data — missing cities and regions misclassified") stays open (this is one of four PRs against it)

Refs: #1352

🤖 Generated with Claude Code

dr5hn (Owner, Author) commented Apr 27, 2026

Weekly data-quality review (2026-04-27)

Verdict: needs-discussion

Checks

  • Schema: ❌ PR description states "Pure documentation: no data changes" and "Diff is exactly 3 files: 1 new, 2 single-line edits" — but the actual PR modifies contributions/countries/countries.json (25 additions, 25 deletions — postal code fields for 12 countries) and adds .github/fixes-docs/FIX_1039_SUMMARY.md. These are data changes, not just documentation.
  • FK integrity: ✅ Postal code fields (postal_code_format, postal_code_regex) have no FK relationships.
  • Coordinates: N/A (no city or state records)
  • Wikidata: N/A
  • Naming convention: ✅

Concerns

  1. PR description vs. actual diff mismatch: contributions/countries/countries.json is listed as modified in the PR's changed files, contradicting the "no data changes" claim. Either the description needs updating, or the FIX_1039 commits landed on this branch unintentionally.

  2. Cross-PR FIX_1039 postal conflict: the identical postal code changes (12 countries, postal_code_format/postal_code_regex) and FIX_1039_SUMMARY.md also appear in PR #1393 (feat(FR): polish department + region metadata vs data.gouv.fr, #1352 PR-B), PR #1394 (feat(FR): diff cities against data.gouv.fr, add missing metropolitan communes, #1352 PR-A), and PR #1400 (feat(FR-overseas): populate missing city files for GF, BL, MF, PM, TF, #1352 PR-D). Merging any two of these four PRs sequentially will produce a conflict on contributions/countries/countries.json and a "file already exists" error for FIX_1039_SUMMARY.md. The FIX_1039 changes should be isolated to a single PR (or its own standalone PR) before any of the four are merged.

🤖 Automated weekly review — Claude (sonnet-4-6).


Generated by Claude Code

dr5hn and others added 7 commits April 27, 2026 16:56
Adds Danish postcodes via DAWA (Danmarks Adressers Web API) — public
sector data published under CC-0 by SDFI/Dataforsyningen.

1. bin/scripts/sync/import_denmark_postcodes.py — pipeline that fetches
   /kommuner to build a kommune-code -> region-name map, then resolves
   each /postnumre record's region via its first kommune. Maps the 5
   Danish region names to states.json iso2 codes:
     Region Hovedstaden -> 84 (called "Denmark" in states.json)
     Region Sjælland    -> 85 (Zealand)
     Region Syddanmark  -> 83 (Southern Denmark)
     Region Midtjylland -> 82 (Central Denmark)
     Region Nordjylland -> 81 (North Denmark)

2. contributions/postcodes/DK.json — 1,089 codes covering all 5 regions
   with 100% state_id + 100% coordinate resolution.

Validation (zero errors)
- All codes match countries.postal_code_regex (^(\\d{4})\$)
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- Source: SDFI / Dataforsyningen DAWA (CC-0)
- Each row: source: "dawa"

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
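
The region-resolution step described in this commit could be sketched roughly as follows. This is a minimal illustration, not the actual importer: the record shapes (`kommuner` list, `kommune_to_region` dict) and the sample kommune codes are assumptions rather than DAWA's response schema. Only the five region-to-iso2 mappings and the four-digit pattern come from the commit message.

```python
import re

# Region name -> states.json iso2, as listed in the commit message.
REGION_TO_ISO2 = {
    "Region Hovedstaden": "84",
    "Region Sjælland": "85",
    "Region Syddanmark": "83",
    "Region Midtjylland": "82",
    "Region Nordjylland": "81",
}

# Four-digit postcode pattern (the commit's regex with shell escaping removed).
DK_POSTCODE = re.compile(r"^(\d{4})$")


def resolve_state_code(postnummer, kommune_to_region):
    """Return the iso2 of the region of the record's first kommune, or None."""
    kommuner = postnummer.get("kommuner") or []
    if not kommuner:
        return None
    region = kommune_to_region.get(kommuner[0])
    return REGION_TO_ISO2.get(region)


# Hypothetical sample data: codes and field names are assumptions.
kommune_to_region = {"0101": "Region Hovedstaden", "0851": "Region Nordjylland"}
record = {"code": "9000", "kommuner": ["0851"]}
state_code = resolve_state_code(record, kommune_to_region)
```

Taking only the first kommune keeps the lookup simple, at the cost of an arbitrary choice for postcodes that span several kommuner.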
Adds Icelandic postcodes via the sveinbjornt/iceaddr Python package
which embeds the canonical postcode metadata under MIT licence.

1. bin/scripts/sync/import_iceland_postcodes.py — pipeline that
   dynamically imports the iceaddr POSTCODES dict and resolves each
   code's region via prefix range to states.json iso2 1-8 (Statistics
   Iceland's NUTS-3 boundaries: 1xx-2xx Capital, 3xx Western, 4xx
   Westfjords, 5xx Northwestern, 6xx Northeastern, 7xx Eastern,
   8xx-9xx Southern).

2. contributions/postcodes/IS.json — 195 records with 100% state_id
   resolution. Locality names combine stadur_nf + lysing
   (e.g. "Reykjavík, Miðborg").

License & attribution
- Source: iceaddr (MIT) embedding Pósturinn data
- Each row: source: "iceaddr"

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
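
As a rough sketch of the prefix-range lookup this commit describes: the function name and the three-digit check are assumptions, and because the commit message does not spell out which iso2 value (1 to 8) each range maps to, the sketch stops at the region label.

```python
# First digit of the postcode -> region label, following the prefix ranges
# quoted in the commit message. The exact iso2 value per region is not given
# there, so it is deliberately left out of this sketch.
PREFIX_TO_REGION = {
    "1": "Capital",
    "2": "Capital",
    "3": "Western",
    "4": "Westfjords",
    "5": "Northwestern",
    "6": "Northeastern",
    "7": "Eastern",
    "8": "Southern",
    "9": "Southern",
}


def region_for_postcode(code):
    """Return the region label for a three-digit postcode string, else None."""
    if len(code) != 3 or not code.isdigit():
        return None
    return PREFIX_TO_REGION.get(code[0])
```

Anything that is not a three-digit string, or that starts with a digit outside the listed ranges, resolves to `None` so the caller can decide how to handle it.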
…irrors (#1039)

Bundles three small-to-medium European countries with confirmed
redistributable postcode mirrors into a single batch importer.

1. bin/scripts/sync/import_eu_batch1_postcodes.py — pipeline that
   ingests three different shapes (SK JSON, RO CSV, SI CSV) and writes
   per-country JSON files. ASCII-folding + dash-to-space normalisation
   handles the Romanian Caraș-Severin / Bistrița-Năsăud cases where
   the CSV uses spaces and states.json uses hyphens.

2. contributions/postcodes/SK.json — 1,312 records (100% state via
   KRAJ -> states.iso2 direct match)
3. contributions/postcodes/RO.json — 13,751 records (100% state via
   ASCII-folded judet name match; all 6 Bucharest sectors mapped to 'B')
4. contributions/postcodes/SI.json — 522 records, country-only by
   design (source has no municipality info; SI postcodes don't map
   cleanly to administrative regions)

Validation (zero errors)
- All codes match countries.postal_code_regex
- All FKs resolve, all state_codes agree with state.iso2

License & attribution
- SK source: github.com/FeroVolar/PSC-JSON (community Slovenská pošta data)
- RO source: github.com/alexionegit/coduripostaleRomaniaPS
- SI source: github.com/dlabs/postcode_si (community Posta Slovenije data)

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
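
The ASCII-folding and dash-to-space normalisation mentioned in this commit might look something like the sketch below. The function name, the lowercasing, and the whitespace collapsing are assumptions added for illustration; the importer's real implementation is not shown in this thread.

```python
import unicodedata


def normalise_name(name):
    """ASCII-fold, turn dashes into spaces, collapse whitespace, lowercase."""
    # NFKD splits accented letters into a base letter plus combining marks,
    # which the ASCII encode step then drops.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.replace("-", " ").split()).lower()


# The hyphenated states.json spelling and a space-separated CSV spelling
# reduce to the same key.
same_key = normalise_name("Bistrița-Năsăud") == normalise_name("Bistrita Nasaud")
```

Matching on the normalised key on both sides avoids having to decide which spelling is canonical.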
…#1397/#1399)

The remap is a behavior change for downstream consumers — region-level
state_code queries (e.g. Sicily=82, Lombardy=25) now return empty arrays
because cities live under provinces/metropolitan cities, not regions.
Documents the traversal pattern (states.parent_id) needed for
region-aggregate queries so users know how to migrate.
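
A minimal sketch of the `states.parent_id` traversal this commit refers to, assuming each state row carries `id` and `parent_id` and each city row carries `state_id`. The rows below are made up for illustration and do not use real ids.

```python
def cities_in_region(region_id, states, cities):
    """Collect cities whose state is the region itself or any descendant of it."""
    children = {}
    for s in states:
        children.setdefault(s.get("parent_id"), []).append(s["id"])

    wanted, stack = set(), [region_id]
    while stack:
        sid = stack.pop()
        if sid in wanted:
            continue  # guards against accidental cycles in parent_id
        wanted.add(sid)
        stack.extend(children.get(sid, []))

    return [c for c in cities if c["state_id"] in wanted]


# Hypothetical rows: ids and names are placeholders.
states = [
    {"id": 1, "name": "Region R", "parent_id": None},
    {"id": 2, "name": "Province P1", "parent_id": 1},
    {"id": 3, "name": "Province P2", "parent_id": 1},
    {"id": 4, "name": "Other region", "parent_id": None},
]
cities = [
    {"name": "A", "state_id": 2},
    {"name": "B", "state_id": 3},
    {"name": "C", "state_id": 4},
]

direct = [c["name"] for c in cities if c["state_id"] == 1]
aggregated = [c["name"] for c in cities_in_region(1, states, cities)]
```

Here `direct` comes back empty, which mirrors the behaviour change the commit documents, while `aggregated` picks up the cities under both provinces.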
…n) (#1352 PR-C)

Adds MULTI_LEVEL_TERRITORIES.md documenting why 12 French overseas
territories (and analogous US/CN/NO entities) appear simultaneously as
ISO 3166-1 countries and as ISO 3166-2 subdivisions of their parent state.

Captures the maintainer's Option C decision on #1352: keep both
representations because (1) downstream API/SDK consumers filter on
country_code, (2) ISO 3166-1 lists them as countries, and (3) the
breaking change is unjustified for a labelling concern.

Cross-links the new policy doc from .claude/CLAUDE.md (Important Rules)
and README.md (contributing section).

No data changes.

Refs: #1352

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dr5hn dr5hn force-pushed the feat/issue-1352-multi-level-territories-doc branch from c57a9db to a30f8ea Compare April 27, 2026 15:35
@dr5hn dr5hn marked this pull request as ready for review April 27, 2026 15:35
Copilot AI review requested due to automatic review settings April 27, 2026 15:35
@dosubot dosubot Bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) Apr 27, 2026
@dr5hn dr5hn merged commit 2a09a95 into master Apr 27, 2026
2 checks passed
@dr5hn dr5hn deleted the feat/issue-1352-multi-level-territories-doc branch April 27, 2026 15:36
Copilot AI left a comment

Pull request overview

Adds a new top-level documentation page describing the repository policy for “multi-level territories” (dual ISO 3166-1 country + ISO 3166-2 subdivision representation), and links it from existing maintainer/contributor docs.

Changes:

  • Add MULTI_LEVEL_TERRITORIES.md documenting the Option C policy for dual-modeled territories (with rationale and suggested consumer query patterns).
  • Link the new policy doc from README.md.
  • Add a cross-reference in .claude/CLAUDE.md to guide maintainers before changing country/state records for dual-ISO territories.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| :--- | :---------- |
| README.md | Adds a link to the new multi-level territories policy doc in the contributing section. |
| MULTI_LEVEL_TERRITORIES.md | Introduces the policy document explaining dual representations and downstream querying guidance. |
| .claude/CLAUDE.md | Adds a maintainer-facing reminder/link to consult the new policy before modifying dual-ISO territory records. |

Comment on lines +23 to +25
1. There is a row in `contributions/countries/countries.json` (with its own `id`, `iso2`, `iso3`).
2. There is a row in `contributions/states/states.json` whose `country_code` points at the **parent** state (e.g. `FR`, `US`, `CN`), and whose `iso2` / `state_code` matches the territory.
3. Cities under the territory live in `contributions/cities/<TERRITORY_ISO2>.json` (e.g. `MQ.json`), and reference both their `country_id` (= the territory) and their `state_id` (= the territory-as-subdivision-of-parent).
Copilot AI Apr 27, 2026

The Policy bullets describe territory cities as referencing state_id of the territory-as-subdivision-of-parent (e.g., FR-level state row). In the current dataset, territory city files (e.g., contributions/cities/MQ.json) use state_id for the territory’s own internal subdivisions (whose states.country_code is MQ, not FR). Please reword these bullets to match the actual model (country row + separate parent-country state row, but cities still point to the territory’s own states), or explicitly describe what would need to change in the data model for the stated state_id linkage to be true.

Suggested change (replaces items 1-3 quoted above):
1. There is a row in `contributions/countries/countries.json` for the territory itself (with its own `id`, `iso2`, `iso3`).
2. There is also a separate row in `contributions/states/states.json` representing the territory as a subdivision of its **parent** country (e.g. `FR`, `US`, `CN`), with `country_code` set to the parent and `iso2` / `state_code` matching the territory.
3. Cities under the territory live in `contributions/cities/<TERRITORY_ISO2>.json` (e.g. `MQ.json`), with `country_id` pointing to the territory country row. In the current data model, their `state_id` points to the territory's own internal subdivisions in `contributions/states/states.json` (where `states.country_code` is the territory ISO2, e.g. `MQ`), not to the parent-country subdivision row described in item 2.
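
To make the reworded model concrete, here is a small sketch with illustrative rows. `countries.id` 138 and `states.id` 4827 for Martinique come from the table quoted later in this review; state id 9001, its name, and the city are hypothetical placeholders.

```python
countries = [{"id": 138, "iso2": "MQ"}]

states = [
    # Martinique as a subdivision of its parent country (item 2).
    {"id": 4827, "country_code": "FR", "name": "Martinique"},
    # One of Martinique's own internal subdivisions (hypothetical row).
    {"id": 9001, "country_code": "MQ", "name": "Hypothetical subdivision"},
]

# A territory city points at the territory's own subdivision (item 3).
cities = [{"name": "Hypothetical town", "country_id": 138, "state_id": 9001}]


def city_state_country(city, states):
    """Return the country_code of the state row a city points at."""
    by_id = {s["id"]: s for s in states}
    return by_id[city["state_id"]]["country_code"]


owner = city_state_country(cities[0], states)
```

Under these assumptions `owner` is the territory code rather than `FR`, which is the distinction the suggested wording is meant to capture.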

| `MF` | `FR-MF` | Saint-Martin (French part) | 190 | 4809 | overseas collectivity |
| `WF` | `FR-WF` | Wallis and Futuna | 243 | 4810 | overseas collectivity |

> The five **DROM** (Départements et régions d'outre-mer) — `GF`, `GP`, `MQ`, `RE`, `YT` — currently use **INSEE numeric codes** (`971`–`976`) as their `state_code` in `states.json`, while the overseas collectivities use the ISO 3166-2 alphabetic codes. Aligning the DROM to ISO 3166-2 alphabetic codes is tracked separately and is **out of scope** for this policy doc.
Copilot AI Apr 27, 2026

This note says the DROM use INSEE numeric codes as their state_code in states.json, but states.json does not contain a state_code field (it uses iso2 / iso3166_2). Consider updating the wording to reference the actual fields (states.iso2 and the corresponding cities.state_code values) so maintainers don’t look for a non-existent key.

Comment on lines +31 to +44
| ISO 3166-1 | ISO 3166-2 / INSEE | Name (English) | `countries.id` | `states.id` | State `type` |
| :--------- | :----------------- | :----------------------------------- | -------------: | ----------: | :------------------------------------------ |
| `GF` | `FR-GF` / `973` | French Guiana | 76 | 4822 | overseas region |
| `PF` | `FR-PF` | French Polynesia | 77 | 4824 | overseas collectivity |
| `TF` | `FR-TF` | French Southern and Antarctic Lands | 78 | 5065 | overseas territory |
| `GP` | `FR-GP` / `971` | Guadeloupe | 88 | 4829 | overseas region |
| `MQ` | `FR-MQ` / `972` | Martinique | 138 | 4827 | overseas region |
| `YT` | `FR-YT` / `976` | Mayotte | 141 | 4797 | overseas region |
| `NC` | `FR-NC` | New Caledonia | 157 | 5538 | overseas collectivity with special status |
| `RE` | `FR-RE` / `974` | Réunion | 180 | 4823 | overseas region |
| `PM` | `FR-PM` | Saint Pierre and Miquelon | 187 | 4821 | overseas collectivity |
| `BL` | `FR-BL` | Saint-Barthélemy | 189 | 4794 | overseas collectivity |
| `MF` | `FR-MF` | Saint-Martin (French part) | 190 | 4809 | overseas collectivity |
| `WF` | `FR-WF` | Wallis and Futuna | 243 | 4810 | overseas collectivity |
Copilot AI Apr 27, 2026

The “ISO 3166-2 / INSEE” column shows codes like FR-GF / 973, but the current states.iso3166_2 values for DROM in contributions/states/states.json are stored as FR-973, FR-972, etc. (numeric), not FR-GF, FR-MQ, etc. Please either align this table with the repository’s actual stored codes, or clearly label which value is the official ISO 3166-2 code vs. what the repo currently stores.

Suggested change (replaces the table quoted above):
| ISO 3166-1 | Official ISO 3166-2 | Stored `states.iso3166_2` | INSEE | Name (English)                      | `countries.id` | `states.id` | State `type`                              |
| :--------- | :------------------ | :------------------------ | :---- | :---------------------------------- | -------------: | ----------: | :---------------------------------------- |
| `GF`       | `FR-GF`             | `FR-973`                  | `973` | French Guiana                       |             76 |        4822 | overseas region                           |
| `PF`       | `FR-PF`             | `FR-PF`                   |       | French Polynesia                    |             77 |        4824 | overseas collectivity                     |
| `TF`       | `FR-TF`             | `FR-TF`                   |       | French Southern and Antarctic Lands |             78 |        5065 | overseas territory                        |
| `GP`       | `FR-GP`             | `FR-971`                  | `971` | Guadeloupe                          |             88 |        4829 | overseas region                           |
| `MQ`       | `FR-MQ`             | `FR-972`                  | `972` | Martinique                          |            138 |        4827 | overseas region                           |
| `YT`       | `FR-YT`             | `FR-976`                  | `976` | Mayotte                             |            141 |        4797 | overseas region                           |
| `NC`       | `FR-NC`             | `FR-NC`                   |       | New Caledonia                       |            157 |        5538 | overseas collectivity with special status |
| `RE`       | `FR-RE`             | `FR-974`                  | `974` | Réunion                             |            180 |        4823 | overseas region                           |
| `PM`       | `FR-PM`             | `FR-PM`                   |       | Saint Pierre and Miquelon           |            187 |        4821 | overseas collectivity                     |
| `BL`       | `FR-BL`             | `FR-BL`                   |       | Saint-Barthélemy                    |            189 |        4794 | overseas collectivity                     |
| `MF`       | `FR-MF`             | `FR-MF`                   |       | Saint-Martin (French part)          |            190 |        4809 | overseas collectivity                     |
| `WF`       | `FR-WF`             | `FR-WF`                   |       | Wallis and Futuna                   |            243 |        4810 | overseas collectivity                     |
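
The official-versus-stored distinction can be expressed as a small lookup. This is only a sketch: the helper names are hypothetical, the INSEE codes are taken from the table above, and the `FR-97x` stored form is as reported in this review comment.

```python
# DROM ISO 3166-1 alpha-2 -> INSEE code, copied from the table above.
DROM_INSEE = {"GP": "971", "MQ": "972", "GF": "973", "RE": "974", "YT": "976"}


def official_iso3166_2(territory):
    """Official ISO 3166-2 code for a French overseas territory."""
    return f"FR-{territory}"


def stored_iso3166_2(territory):
    """Code as this review says states.json currently stores it."""
    return f"FR-{DROM_INSEE.get(territory, territory)}"


# Territories where the two forms differ (the five DROM, in sorted order).
mismatched = sorted(
    t for t in ["GF", "PF", "TF", "GP", "MQ", "YT", "NC", "RE", "PM", "BL", "MF", "WF"]
    if official_iso3166_2(t) != stored_iso3166_2(t)
)
```

A consumer that needs official codes could apply the reverse mapping on read instead of waiting for the stored values to be aligned.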

Comment on lines +64 to +73
Use the `FR` country, then traverse via `state_id`:

```sql
SELECT c.*
FROM cities c
JOIN states s ON c.state_id = s.id
WHERE s.country_code = 'FR'; -- includes all 12 overseas territories
```

This works because every overseas territory has a state row whose `country_code = 'FR'`.
Copilot AI Apr 27, 2026

The “everything in the French Republic” SQL example won’t include overseas-territory cities with the current data model: for territory city files (e.g., MQ.json, GP.json), cities.state_id points to a state whose states.country_code is the territory (e.g., MQ), not FR, so WHERE s.country_code = 'FR' filters them out. Consider revising this section to a query pattern that matches current exports (e.g., cities.country_code IN ('FR', ...territory codes...), or another explicit mapping approach).
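
The difference between the two query patterns can be reproduced on a toy in-memory database. The schema below is a deliberately reduced assumption (only the columns the queries touch), and the rows are made up; the territory list is the one from the PR description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE states (id INTEGER PRIMARY KEY, country_code TEXT);
CREATE TABLE cities (name TEXT, country_code TEXT, state_id INTEGER);
-- state 1: a metropolitan FR state; state 2: MQ as a subdivision of FR;
-- state 3: one of MQ's own internal subdivisions.
INSERT INTO states VALUES (1, 'FR'), (2, 'FR'), (3, 'MQ');
INSERT INTO cities VALUES ('Metro town', 'FR', 1), ('Island town', 'MQ', 3);
""")

# Pattern from the quoted doc: join through state_id, filter on the state.
join_rows = [r[0] for r in conn.execute(
    "SELECT c.name FROM cities c JOIN states s ON c.state_id = s.id "
    "WHERE s.country_code = 'FR' ORDER BY c.name")]

# Pattern suggested in the review: explicit list of country codes.
TERRITORIES = ["GF", "PF", "TF", "GP", "MQ", "YT", "NC", "PM", "BL", "MF", "WF", "RE"]
codes = ["FR"] + TERRITORIES
placeholders = ",".join("?" * len(codes))
in_rows = [r[0] for r in conn.execute(
    f"SELECT name FROM cities WHERE country_code IN ({placeholders}) ORDER BY name",
    codes)]
```

With these rows the join returns only the metropolitan city, while the `IN` list returns both, which matches the behaviour the comment describes.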

Comment on lines +85 to +94
### "Give me metropolitan France only" (exclude overseas)

Exclude the 12 overseas codes explicitly. The metropolitan vs. overseas split is a political/administrative distinction, not a data-model distinction:

```sql
SELECT * FROM cities
WHERE country_code = 'FR'
AND state_code NOT IN ('GF','PF','TF','GP','MQ','YT','NC','RE','PM','BL','MF','WF',
'971','972','973','974','976'); -- INSEE for DROM
```
Copilot AI Apr 27, 2026

The “metropolitan France only” query uses country_code = 'FR' and then excludes overseas codes via state_code NOT IN (...). Given territory cities use country_code equal to the territory ISO2 (e.g., MQ, GP) rather than FR, country_code = 'FR' already excludes them, making the state_code exclusion redundant/misleading. Either simplify this query to just filter country_code = 'FR', or (if you intend to start from the earlier JOIN states ... WHERE s.country_code='FR' pattern) show the correct exclusion at the joined states level.
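
A quick way to check the redundancy claim is to run both forms side by side. The sketch assumes, as the comment states, that territory cities carry the territory's own `country_code`; the table is reduced to three columns and the rows and state codes (`S1` to `S3`) are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (name TEXT, country_code TEXT, state_code TEXT);
INSERT INTO cities VALUES
  ('Metro town A', 'FR', 'S1'),
  ('Metro town B', 'FR', 'S2'),
  ('Island town',  'MQ', 'S3');
""")

# Exclusion list copied from the quoted query.
OVERSEAS = ["GF", "PF", "TF", "GP", "MQ", "YT", "NC", "RE", "PM", "BL", "MF", "WF",
            "971", "972", "973", "974", "976"]

simple = [r[0] for r in conn.execute(
    "SELECT name FROM cities WHERE country_code = 'FR' ORDER BY name")]

placeholders = ",".join("?" * len(OVERSEAS))
with_exclusion = [r[0] for r in conn.execute(
    "SELECT name FROM cities WHERE country_code = 'FR' "
    f"AND state_code NOT IN ({placeholders}) ORDER BY name", OVERSEAS)]
```

Both lists come out identical here. The exclusion would only matter if some rows with `country_code = 'FR'` also carried an overseas `state_code`, which is worth verifying against the real exports before simplifying the documented query.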

Labels

size:L This PR changes 100-499 lines, ignoring generated files.
