
Commit 0cadfff

dr5hn and claude committed
feat(postcodes): add postcodes table and infrastructure (#1039)
Introduces a separate `postcodes` table (Tier 4 architecture) for storing postal codes as their own entity, FK'd to country (required) and state/city (both nullable). Captures multi-postcode-per-state and multi-state-per-postcode cases that flat-column shapes cannot.

This is a foundation-only PR. Country data lands in follow-up PRs sourced from OpenPLZ (DACH), Wikidata (long tail), and per-country official sources (US Census, India Post, Japan Post, La Poste, Australia Post). GeoNames is deliberately not used.

Includes:
- Phinx migration creating `postcodes` table with FKs and indexes
- Manual mirror in bin/db/schema.sql for review readability
- contributions/postcodes/ directory with field-shape README
- import_postcodes() in bin/scripts/sync/import_json_to_mysql.py; gracefully no-ops when table or directory is absent
- Validator: postcodes entity recognised in .github/scripts/utils.js
- Cross-reference validator checks country_id, state_id, and matches postcode against countries.postal_code_regex when defined
- ADR in .github/fixes-docs/FIX_1039_POSTCODES_TABLE.md covering the Shape A/B/C/D decision, sourcing plan, and roll-out sequence

Out of scope (deferred to follow-up PRs):
- Country data files (no contributions/postcodes/{ISO2}.json yet)
- Export commands (Csv/Json/MongoDB/Plist/SqlServer/Xml/Yaml), which must be updated to emit the postcodes table
- sync_mysql_to_json.py (reverse sync)
- Coordinate validator and duplicate detector for postcodes

Refs: #1039

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2505f92 commit 0cadfff

7 files changed

Lines changed: 539 additions & 2 deletions

.github/fixes-docs/FIX_1039_POSTCODES_TABLE.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
# FIX #1039: `postcodes` Table (Tier 4 Architecture)

**Issue:** [#1039: Can we add a postcode for this?](https://github.com/dr5hn/the-countries-states-cities-database/issues/1039)
**Scope:** New entity table for postal codes (Tier 4 from the roadmap).
**Date:** 2026-04-25
**Companion PR:** Country-level `postal_code_format`/`postal_code_regex` backfill (#1391).

## Decision

Four shape options for postcode storage were explored:
- **Shape A**: single prefix string per state (lossy: ~40% of states have multiple prefixes)
- **Shape B**: array of prefixes per state (lossy at sub-state granularity)
- **Shape C**: min/max range per state (fails for non-contiguous and alphanumeric systems)
- **Shape D**: separate `postcodes` table with FKs to country/state/city ✅

**Shape D was chosen** because:
- Lossless at every granularity (full / outward / sector / district / area)
- Naturally handles "state has many postcodes" (US, UK, DE, etc.)
- Naturally handles "postcode spans multiple states" (rare but real, e.g. some US ZIPs)
- FK to existing `cities` lets postcodes be linked to the existing city dataset
- Independent of state-level schema decisions made earlier or later
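
To make the two cardinality cases above concrete, here is a minimal sketch that models Shape D rows as plain objects. The rows, IDs, and the `groupDistinct` helper are hypothetical and exist only for illustration; they are not part of this PR.

```javascript
// Hypothetical rows in the Shape D layout; every ID and code is made up.
const rows = [
  { code: '10001', country_id: 1, state_id: 10 },
  { code: '10002', country_id: 1, state_id: 10 }, // same state, second postcode
  { code: '20001', country_id: 1, state_id: 20 },
  { code: '20001', country_id: 1, state_id: 30 }, // same postcode, second state
];

// Collect the distinct values of `valueField` seen for each value of `keyField`.
function groupDistinct(records, keyField, valueField) {
  const groups = new Map();
  for (const record of records) {
    if (!groups.has(record[keyField])) groups.set(record[keyField], new Set());
    groups.get(record[keyField]).add(record[valueField]);
  }
  return groups;
}

const codesByState = groupDistinct(rows, 'state_id', 'code');
const statesByCode = groupDistinct(rows, 'code', 'state_id');

console.log(codesByState.get(10).size); // 2: one state, many postcodes
console.log(statesByCode.get('20001').size); // 2: one postcode, many states
```

Both directions fall out of the same row-per-association layout, which is what a single column on `states` (Shapes A to C) cannot express.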

## Schema Shape

```sql
CREATE TABLE `postcodes` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `code` varchar(20) NOT NULL,
  `country_id` mediumint unsigned NOT NULL,   -- FK countries.id
  `country_code` char(2) NOT NULL,            -- denormalised
  `state_id` mediumint unsigned NULL,         -- FK states.id (nullable)
  `state_code` varchar(255) NULL,             -- denormalised
  `city_id` mediumint unsigned NULL,          -- FK cities.id (nullable)
  `locality_name` varchar(255) NULL,
  `type` varchar(32) NULL,                    -- full | outward | sector | district | area
  `latitude` decimal(10,8) NULL,
  `longitude` decimal(11,8) NULL,
  `source` varchar(64) NULL,                  -- attribution: openplz | wikidata | census | ...
  `wikiDataId` varchar(255) NULL,
  `created_at` timestamp NOT NULL DEFAULT '2014-01-01 12:01:01',
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `flag` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  KEY `idx_postcodes_code` (`code`),
  KEY `idx_postcodes_country_code` (`country_id`,`code`),
  KEY `idx_postcodes_state` (`state_id`),
  KEY `idx_postcodes_city` (`city_id`),
  CONSTRAINT `postcodes_country_fk` FOREIGN KEY (`country_id`) REFERENCES `countries` (`id`),
  CONSTRAINT `postcodes_state_fk` FOREIGN KEY (`state_id`) REFERENCES `states` (`id`) ON DELETE SET NULL,
  CONSTRAINT `postcodes_city_fk` FOREIGN KEY (`city_id`) REFERENCES `cities` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
```
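
As a rough illustration of how a contributed JSON record might relate to this schema, the sketch below strips the columns that the database fills in by itself. The list of auto-managed columns (`id`, `created_at`, `updated_at`, `flag`) is an assumption read off the defaults in the DDL, and `toContribution` and the sample values are hypothetical.

```javascript
// Assumption: these columns are populated by MySQL defaults, so contributors omit them.
const AUTO_MANAGED = ['id', 'created_at', 'updated_at', 'flag'];

// Return a copy of a table row without auto-managed columns or NULL-valued columns.
function toContribution(row) {
  const record = {};
  for (const [column, value] of Object.entries(row)) {
    if (AUTO_MANAGED.includes(column)) continue;
    if (value === null || value === undefined) continue;
    record[column] = value;
  }
  return record;
}

// Hypothetical row; the IDs and the code are illustrative, not real data.
const row = {
  id: 1,
  code: '9490',
  country_id: 1,
  country_code: 'LI',
  state_id: null,
  city_id: null,
  type: 'full',
  source: 'openplz',
  flag: 1,
};

console.log(toContribution(row));
```

Only `code`, `country_id`, and `country_code` are `NOT NULL` without a default, so a record carrying just those three fields would already satisfy the table definition.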

## What this PR includes (foundation only)

- Phinx migration: `bin/db/migrations/20260425000000_create_postcodes_table.php`
- Manual mirror in `bin/db/schema.sql` so reviewers can read the schema without running Phinx
- New contributions directory: `contributions/postcodes/` with `README.md` documenting field shape and sourcing plan
- Importer support: `import_postcodes()` in `bin/scripts/sync/import_json_to_mysql.py`, which gracefully skips if table or directory is absent
- Validator support: `postcodes` entity recognised in `.github/scripts/utils.js` with field rules
- Cross-reference validator: FK checks for `country_id`, `state_id`; format check against `countries.postal_code_regex`
- Docs in `.github/fixes-docs/`

## What this PR does NOT include (deliberate)

1. **Country data**: no `contributions/postcodes/{ISO2}.json` files yet. Each country comes in a follow-up PR sourced from one of the providers in the sourcing plan.
2. **Export commands**: `bin/Commands/Export*.php` still emit only `regions, subregions, countries, states, cities`. Adding `postcodes` to each of the 7 export formats (Csv, Json, MongoDB, Plist, SqlServer, Xml, Yaml) is a separate PR; that work is mechanical but touches every format and needs separate review.
3. **`sync_mysql_to_json.py`**: the reverse-sync script does not yet know about `postcodes`. Same scope decision as exports; add in follow-up.
4. **PR validator updates beyond cross-reference**: `validate-coordinates.js` and `detect-duplicates.js` are not yet postcode-aware; coordinates on postcodes are coarse centroids and duplicate detection has different semantics for postcodes (exact-code-match vs. fuzzy-name-match).
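
The exact-code-match semantics mentioned in item 4 could look roughly like the sketch below. This is not the deferred `detect-duplicates.js` change; `postcodeKey`, `findExactDuplicates`, and the choice of key fields are assumptions made for illustration.

```javascript
// Build an identity key for a postcode record. The field choice is an assumption:
// the same code may legitimately repeat across states or cities, so those are part of the key.
function postcodeKey(record) {
  return [record.country_id, record.code, record.state_id ?? '', record.city_id ?? ''].join('|');
}

// Return one entry per record whose key was already seen, using 1-based record numbers.
function findExactDuplicates(records) {
  const firstSeen = new Map();
  const duplicates = [];
  records.forEach((record, index) => {
    const key = postcodeKey(record);
    if (firstSeen.has(key)) {
      duplicates.push({ first: firstSeen.get(key), duplicate: index + 1, key });
    } else {
      firstSeen.set(key, index + 1);
    }
  });
  return duplicates;
}

// Hypothetical records: the second shares a code but not a state; the third repeats the first.
const sample = [
  { code: '1000', country_id: 1, state_id: 10 },
  { code: '1000', country_id: 1, state_id: 20 },
  { code: '1000', country_id: 1, state_id: 10 },
];

console.log(findExactDuplicates(sample));
```

Unlike fuzzy name matching for cities, nothing here compares strings approximately: two records collide only when every key field is identical.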

## Sourcing Plan (Combo B, GeoNames-free)

Per discussion in the issue, GeoNames is excluded. Each country is sourced from one of:

| Source | License | Countries |
|--------|---------|-----------|
| **OpenPLZ API** (https://openplzapi.org) | **ODbL-1.0** ← matches repo | DE, AT, CH, LI |
| **Wikidata P281** (SPARQL) | CC-0 | Long-tail backfill |
| **US Census ZCTA** | Public domain | US |
| **India Post pincode CSV** | Open (gov.in) | IN |
| **Japan Post KEN_ALL.csv** | Free | JP |
| **France La Poste** (data.gouv.fr) | etalab-2.0 | FR |
| **Australia Post Boundaries** | CC-BY 4.0 | AU |
| **Statistics Canada FSA** | Open Government | CA |

The `source` field on each postcode record tracks attribution, so README/footer attribution can be programmatically generated from data presence.
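
One way such generation could work is sketched below. The `ATTRIBUTION` map, its keys beyond the `openplz | wikidata | census` values named in the schema comment, and the `buildAttribution` helper are hypothetical; the license strings are copied from the table above.

```javascript
// Hypothetical mapping from `source` values to attribution text.
const ATTRIBUTION = {
  openplz: 'OpenPLZ API (ODbL-1.0)',
  wikidata: 'Wikidata P281 (CC-0)',
  census: 'US Census ZCTA (Public domain)',
};

// List attribution lines only for sources that actually occur in the data.
function buildAttribution(records) {
  const present = new Set();
  for (const record of records) {
    if (record.source) present.add(record.source);
  }
  return [...present]
    .sort()
    .map((source) => ATTRIBUTION[source] ?? `${source} (license not mapped)`);
}

// Hypothetical records; one has no source and one uses an unmapped source.
const records = [
  { code: '0001', source: 'wikidata' },
  { code: '0002', source: 'openplz' },
  { code: '0003', source: 'openplz' },
  { code: '0004' },
  { code: '0005', source: 'laposte' },
];

console.log(buildAttribution(records).join('\n'));
```

Surfacing unmapped sources instead of dropping them keeps a missing license entry visible during review.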

**Coverage projection:** Combo B can reach ~30–40% of the world's postcodes by row count. Reaching higher coverage (UK Royal Mail, Eircode, Deutsche Post) is structurally blocked by license restrictions, not by effort.

## Validation Strategy

PRs adding `contributions/postcodes/{ISO2}.json` are validated by:

1. **Schema validator** (`validate-schema.js`): required fields, type rules, no auto-managed fields
2. **Cross-reference validator** (`validate-cross-reference.js`): `country_id` and `country_code` agreement, `state_id` belongs to declared country, postcode `code` matches `countries.postal_code_regex` if defined
3. **JSON syntax**: standard JSON parsing
4. **PR format checks**: source URL required in PR body for license attribution
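
One detail of the regex check in item 2 is worth illustrating: `RegExp.prototype.test` succeeds on any matching substring, so an unanchored pattern accepts codes with extra characters. Whether the stored `postal_code_regex` values are anchored is not stated here, so the sketch below is a defensive assumption, and `matchesWholeCode` is a hypothetical helper rather than code from this PR.

```javascript
// Wrap a pattern so it must match the entire code.
// The non-capturing group keeps top-level alternations (a|b) intact.
function matchesWholeCode(pattern, code) {
  const anchored = new RegExp(`^(?:${pattern})$`);
  return anchored.test(code);
}

const pattern = '\\d{4}'; // hypothetical unanchored four-digit pattern

console.log(new RegExp(pattern).test('12345')); // true: the substring "1234" matches
console.log(matchesWholeCode(pattern, '12345')); // false
console.log(matchesWholeCode(pattern, '1234')); // true
```

Wrapping is harmless for patterns that are already anchored, since `^(?:^\d{4}$)$` accepts exactly the same strings as `^\d{4}$`.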

## Roll-Out Plan (Suggested)

| PR | Country | Source | Approx rows | Notes |
|----|---------|--------|-------------|-------|
| **This PR** | (foundation only) | | 0 | Schema + infra |
| Next | Liechtenstein | OpenPLZ | ~9 | Smallest, proves OpenPLZ adapter |
| | Luxembourg | OpenPLZ-style or PT.LU | ~200 | Small, fits in tiny JSON |
| | Iceland | Iceland Post | ~150 | Small |
| | Estonia | Eesti Post | ~700 | Small |
| | Switzerland | OpenPLZ | ~3,200 | DACH first wave |
| | Austria | OpenPLZ | ~2,100 | DACH first wave |
| | Germany | OpenPLZ | ~8,200 | Largest DACH |
| | India | India Post | ~19,000 | First non-DACH at scale |
| | France | La Poste | ~6,400 | etalab license |
| | Australia | Australia Post | ~3,500 | CC-BY |
| | Japan | KEN_ALL | ~150,000 | Largest single pipeline; will need `.gz` distribution |
| | US | Census ZCTA | ~33,000 | Public domain |

After ~5 country PRs, schedule the export-command update PR (touches all 7 formats).

## Rollback

If the postcodes table or any country data needs to be removed:

```sql
DROP TABLE IF EXISTS `postcodes`;
```

The Phinx migration is a `change()` method without an explicit `down()`; rollback is via the manual DROP above. Removing `contributions/postcodes/` is a plain directory deletion.

No existing tables or columns are modified by this PR; rollback is clean.

.github/scripts/utils.js

Lines changed: 24 additions & 1 deletion
@@ -57,6 +57,27 @@ const SCHEMA = {
       wikiDataId: { type: 'string', pattern: /^Q\d+$/ },
     },
   },
+  postcodes: {
+    required: ['code', 'country_id', 'country_code'],
+    optional: [
+      'state_id', 'state_code', 'city_id', 'locality_name', 'type',
+      'latitude', 'longitude', 'source', 'wikiDataId',
+    ],
+    rules: {
+      code: { type: 'string', maxLength: 20, nonEmpty: true },
+      country_id: { type: 'integer', positive: true },
+      country_code: { type: 'string', exactLength: 2 },
+      state_id: { type: 'integer', positive: true },
+      state_code: { type: 'string', maxLength: 255 },
+      city_id: { type: 'integer', positive: true },
+      locality_name: { type: 'string', maxLength: 255 },
+      type: { type: 'string', maxLength: 32 },
+      latitude: { type: 'coordinate', min: -90, max: 90 },
+      longitude: { type: 'coordinate', min: -180, max: 180 },
+      source: { type: 'string', maxLength: 64 },
+      wikiDataId: { type: 'string', pattern: /^Q\d+$/ },
+    },
+  },
 };

 /**
@@ -66,6 +87,7 @@ const SCHEMA = {
  */
 function getEntityType(filePath) {
   const normalized = filePath.toLowerCase();
+  if (normalized.includes('postcodes')) return 'postcodes';
   if (normalized.includes('cities')) return 'cities';
   if (normalized.includes('states')) return 'states';
   if (normalized.includes('countries')) return 'countries';
@@ -143,7 +165,8 @@ function validateRecord(record, entityType, index) {

   const errors = [];
   const warnings = [];
-  const prefix = `Record ${index + 1}${record.name ? ` ("${record.name}")` : ''}`;
+  const label = record.name || record.code;
+  const prefix = `Record ${index + 1}${label ? ` ("${label}")` : ''}`;

   // Check for auto-managed fields that should NOT be present
   for (const field of AUTO_MANAGED_FIELDS) {

.github/scripts/validate-cross-reference.js

Lines changed: 53 additions & 1 deletion
@@ -81,7 +81,8 @@ async function run() {

     for (let i = 0; i < records.length; i++) {
       const record = records[i];
-      const prefix = `${filePath}: Record ${i + 1}${record.name ? ` ("${record.name}")` : ''}`;
+      const label = record.name || record.code;
+      const prefix = `${filePath}: Record ${i + 1}${label ? ` ("${label}")` : ''}`;

       if (entityType === 'cities') {
         // Validate country_id exists
@@ -146,6 +147,57 @@ async function run() {
           }
         }
       }
+
+      if (entityType === 'postcodes') {
+        // Validate country_id exists (required FK)
+        if (record.country_id) {
+          const country = countryById.get(Number(record.country_id));
+          if (!country) {
+            errors.push(`${prefix}: country_id ${record.country_id} does not exist`);
+          } else {
+            validCount++;
+            if (record.country_code && country.iso2) {
+              if (record.country_code.toUpperCase() !== country.iso2.toUpperCase()) {
+                errors.push(
+                  `${prefix}: country_code "${record.country_code}" does not match country_id ${record.country_id} (expected "${country.iso2}")`
+                );
+              }
+            }
+          }
+        }
+
+        // Validate state_id exists if provided (optional FK)
+        if (record.state_id != null && states) {
+          const state = stateById.get(Number(record.state_id));
+          if (!state) {
+            errors.push(`${prefix}: state_id ${record.state_id} does not exist`);
+          } else {
+            validCount++;
+            if (record.country_id && Number(state.country_id) !== Number(record.country_id)) {
+              errors.push(
+                `${prefix}: state_id ${record.state_id} ("${state.name}") belongs to country_id ${state.country_id}, not ${record.country_id}`
+              );
+            }
+          }
+        }
+
+        // Validate postcode format against country regex if defined
+        if (record.code && record.country_id) {
+          const country = countryById.get(Number(record.country_id));
+          if (country && country.postal_code_regex) {
+            try {
+              const re = new RegExp(country.postal_code_regex);
+              if (!re.test(record.code)) {
+                errors.push(
+                  `${prefix}: code "${record.code}" does not match postal_code_regex "${country.postal_code_regex}" of ${country.iso2}`
+                );
+              }
+            } catch (e) {
+              // Invalid regex on the country side; skip silently rather than blocking PR
+            }
+          }
+        }
+      }
     }
   }


bin/db/migrations/20260425000000_create_postcodes_table.php

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
<?php

declare(strict_types=1);

use Phinx\Migration\AbstractMigration;

/**
 * Creates the `postcodes` table for issue #1039.
 *
 * Stores postal codes as their own entity (Tier 4 architecture):
 * one row per postcode, foreign-keyed to country (required) and
 * state/city (both nullable to handle country-only and disputed regions).
 *
 * Mirrors the column conventions of the `cities` table (denormalised
 * country_code/state_code, lat/lng decimals, auto-managed timestamps).
 */
final class CreatePostcodesTable extends AbstractMigration
{
    public function change(): void
    {
        if ($this->hasTable('postcodes')) {
            return;
        }

        $this->table('postcodes', [
            'id' => false,
            'primary_key' => ['id'],
            'engine' => 'InnoDB',
            'collation' => 'utf8mb4_unicode_ci',
            'comment' => 'Postal codes (issue #1039) - Tier 4: one row per postcode',
        ])
            ->addColumn('id', 'integer', [
                'identity' => true,
                'signed' => false,
                'limit' => \Phinx\Db\Adapter\MysqlAdapter::INT_REGULAR,
            ])
            ->addColumn('code', 'string', [
                'limit' => 20,
                'null' => false,
                'comment' => 'The postal code value (alphanumeric, country-specific format)',
            ])
            ->addColumn('country_id', 'integer', [
                'signed' => false,
                'limit' => \Phinx\Db\Adapter\MysqlAdapter::INT_MEDIUM,
                'null' => false,
            ])
            ->addColumn('country_code', 'char', [
                'limit' => 2,
                'null' => false,
            ])
            ->addColumn('state_id', 'integer', [
                'signed' => false,
                'limit' => \Phinx\Db\Adapter\MysqlAdapter::INT_MEDIUM,
                'null' => true,
                'default' => null,
            ])
            ->addColumn('state_code', 'string', [
                'limit' => 255,
                'null' => true,
                'default' => null,
            ])
            ->addColumn('city_id', 'integer', [
                'signed' => false,
                'limit' => \Phinx\Db\Adapter\MysqlAdapter::INT_MEDIUM,
                'null' => true,
                'default' => null,
            ])
            ->addColumn('locality_name', 'string', [
                'limit' => 255,
                'null' => true,
                'default' => null,
                'comment' => 'Human-readable place name associated with the postcode',
            ])
            ->addColumn('type', 'string', [
                'limit' => 32,
                'null' => true,
                'default' => null,
                'comment' => 'Granularity: full | outward | sector | district | area',
            ])
            ->addColumn('latitude', 'decimal', [
                'precision' => 10,
                'scale' => 8,
                'null' => true,
                'default' => null,
            ])
            ->addColumn('longitude', 'decimal', [
                'precision' => 11,
                'scale' => 8,
                'null' => true,
                'default' => null,
            ])
            ->addColumn('source', 'string', [
                'limit' => 64,
                'null' => true,
                'default' => null,
                'comment' => 'Originating data source for license/attribution tracking (e.g. openplz, wikidata, census)',
            ])
            ->addColumn('wikiDataId', 'string', [
                'limit' => 255,
                'null' => true,
                'default' => null,
                'comment' => 'Wikidata Q-ID for cross-referencing',
            ])
            ->addColumn('created_at', 'timestamp', [
                'default' => '2014-01-01 12:01:01',
                'null' => false,
            ])
            ->addColumn('updated_at', 'timestamp', [
                'default' => 'CURRENT_TIMESTAMP',
                'update' => 'CURRENT_TIMESTAMP',
                'null' => false,
            ])
            ->addColumn('flag', 'boolean', [
                'default' => true,
                'null' => false,
            ])
            ->addIndex(['code'], ['name' => 'idx_postcodes_code'])
            ->addIndex(['country_id', 'code'], ['name' => 'idx_postcodes_country_code'])
            ->addIndex(['state_id'], ['name' => 'idx_postcodes_state'])
            ->addIndex(['city_id'], ['name' => 'idx_postcodes_city'])
            ->addForeignKey('country_id', 'countries', 'id', [
                'constraint' => 'postcodes_country_fk',
            ])
            ->addForeignKey('state_id', 'states', 'id', [
                'constraint' => 'postcodes_state_fk',
                'delete' => 'SET_NULL',
                'update' => 'NO_ACTION',
            ])
            ->addForeignKey('city_id', 'cities', 'id', [
                'constraint' => 'postcodes_city_fk',
                'delete' => 'SET_NULL',
                'update' => 'NO_ACTION',
            ])
            ->create();
    }
}
