
fix(export): gzip postcode files + gitignore raw versions #1490

Merged
dr5hn merged 2 commits into master from fix/export-postcodes-gzip on Apr 27, 2026

Conversation

@dr5hn
Owner

@dr5hn dr5hn commented Apr 27, 2026

The 15:45 UTC export run (#25004960042) failed with `pre-receive hook declined` because the raw postcode export files exceed GitHub's 100 MB hard limit:

remote: error: File xml/postcodes.xml is 250.80 MB ... exceeds 100.00 MB
remote: error: File yml/postcodes.yml is 163.30 MB ... exceeds 100.00 MB
remote: error: File json/postcodes.json is 228.02 MB ... exceeds 100.00 MB
remote: error: File psql/postcodes.sql is 117.25 MB ... exceeds 100.00 MB
remote: error: File sqlite/postcodes.sqlite3 is 101.77 MB ... exceeds 100.00 MB

The cities.* files have always been gzipped + gitignored for the same reason. This PR mirrors that pattern for the postcodes.* files (added by #1039 / #1403).

Changes

  • export.yml: add gzip -9 -k -f for json/postcodes.json, xml/postcodes.xml, yml/postcodes.yml, csv/postcodes.csv, sqlite/postcodes.sqlite3, sqlserver/postcodes.sql (the existing gzip for sql/postcodes.sql and psql/postcodes.sql is unchanged).
  • .gitignore: add raw uncompressed postcodes.* for all 7 directories so the export commit only carries small files; large files only land in the GitHub Release.
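The workflow change can be sketched as a shell loop. This is an illustrative sketch, not the workflow's literal YAML; the file list comes from the bullet above, and it is demonstrated on scratch files so it runs anywhere:

```shell
# Illustrative sketch of the gzip step added to export.yml (the real step
# covers all the postcode formats listed above).
# gzip -9 = maximum compression, -k = keep the raw file for the Release
# upload, -f = overwrite any stale .gz left by a previous run.
set -eu
workdir=$(mktemp -d)
mkdir -p "$workdir/json" "$workdir/xml"
printf '[]' > "$workdir/json/postcodes.json"
printf '<postcodes/>' > "$workdir/xml/postcodes.xml"
for f in json/postcodes.json xml/postcodes.xml; do
  gzip -9 -k -f "$workdir/$f"
done
ls "$workdir/json"   # both postcodes.json and postcodes.json.gz present
```

With `-k` the raw file stays on disk for the Release upload step, while the new .gitignore entries keep that raw copy out of the export commit; only the small `.gz` is pushed.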

Why this matters

The failing export blocks today's data fixes from reaching the live API. Once this merges and the export workflow is re-triggered, the API ingest picks up today's pending fixes (#1349, #1352).

Test plan

  • Workflow can run on master after this merges.
  • Re-run export.yml, verify the PR pushes successfully and .gz postcode files appear on the Release.
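One way to sanity-check the size constraint locally, demonstrated here in a scratch repo (in the real repo you would run only the final pipeline from the repo root after an export): print any git-tracked file of 100 MB or more. No output means GitHub's pre-receive size check should pass.

```shell
# Scratch-repo demonstration of a pre-push size check; `du -m` reports each
# tracked file's size in MB and awk keeps anything at or above GitHub's limit.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'small placeholder' > ok.txt
git add ok.txt
oversized=$(git ls-files -z | xargs -0 du -m | awk '$1 >= 100 {print $2}')
echo "oversized files: '${oversized}'"
```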

Refs: #1039, #1349, #1352

dr5hn and others added 2 commits April 27, 2026 21:24
Adds the official Luxembourg postcode dataset from CACLR (Centre des
Adresses du Cadastre du Luxembourg) via data.public.lu, CC-Zero.

Why
---
Closes the LU gap on issue #1039. The CACLR registry is the
canonical reference for Luxembourgish addresses, published by the
LU government under public-domain CC-Zero.

Coverage
--------
- 4,491 unique (code, locality, canton) tuples / 100% state FK
- All 12 CSC cantons covered

Source pipeline
---------------
1. data.public.lu API resolves the latest caclr.xlsx URL (URL is
   date-stamped and rotates every refresh)
2. Importer parses the denormalised TR.DiCaCoLo.RuCp join sheet
   directly via openpyxl
3. SOURCE_TO_ISO2 maps 13 source canton labels to 12 CSC iso2
   ('LUXEMBOURG-VILLE' capital sub-classification collapses to L)
4. 118 '?' postcodes (newly named streets without assigned codes)
   are filtered out
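The canton-mapping and filtering steps (3 and 4 above) can be sketched in pure Python. The rows, the `SOURCE_TO_ISO2` entries shown, and the helper name are all hypothetical, chosen only to illustrate the described behaviour; the real importer parses the CACLR XLSX via openpyxl.

```python
# Illustrative sketch of pipeline steps 3-4; all names and data are hypothetical.
SOURCE_TO_ISO2 = {
    "LUXEMBOURG": "L",        # per the commit, the capital's sub-classification
    "LUXEMBOURG-VILLE": "L",  # 'LUXEMBOURG-VILLE' collapses to the same code
    "CLERVAUX": "CL",         # hypothetical iso2 for illustration
}

def clean_rows(rows):
    """Resolve each source canton label to a CSC iso2 and drop '?' postcodes
    (newly named streets without assigned codes)."""
    out = []
    for code, locality, canton in rows:
        if code == "?":
            continue
        out.append({"code": code, "locality": locality,
                    "canton_iso2": SOURCE_TO_ISO2[canton]})
    return out

rows = [
    ("L-1111", "Luxembourg", "LUXEMBOURG-VILLE"),
    ("?", "Rue Nouvelle", "CLERVAUX"),   # filtered out: no assigned code yet
    ("L-9710", "Clervaux", "CLERVAUX"),
]
cleaned = clean_rows(rows)
print(len(cleaned))  # → 2
```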

License
-------
CC-Zero (public domain). Each row carries
`source: "caclr-data-public-lu"` for export-time provenance.

Validation
----------
- python3 -m py_compile passes
- 100% regex match (^(?:L-)?\d{4}$)
- 100% state_id valid + state.country_id == 127 + state_code agrees
- No auto-managed fields (id, created_at, updated_at, flag)
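The regex check from the validation list can be reproduced with a few lines of stdlib Python; the sample codes are illustrative, not taken from the dataset.

```python
import re

# The postcode pattern from the validation notes above: an optional 'L-'
# prefix followed by exactly four digits.
POSTCODE_RE = re.compile(r"^(?:L-)?\d{4}$")

samples = ["L-1111", "4361", "L-97100", "?"]  # last two should fail
valid = [s for s in samples if POSTCODE_RE.match(s)]
print(valid)  # → ['L-1111', '4361']
```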

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The export.yml workflow currently produces raw uncompressed postcode
exports (json/postcodes.json 228MB, xml/postcodes.xml 250MB,
yml/postcodes.yml 163MB, sqlite/postcodes.sqlite3 101MB,
sqlserver/postcodes.sql 70MB, sql/postcodes.sql 86MB) that exceed
GitHub's 100MB hard limit when peter-evans/create-pull-request tries
to push them, breaking the export PR.

Mirror the existing cities.* gzip+gitignore pattern for postcodes.*:
- gzip -9 every generated postcodes.* file alongside cities.*
- gitignore the raw uncompressed postcodes.* in all 7 directories so
  the export commit doesn't include them; only the .gz goes to the
  GitHub Release.

Restores the export pipeline so today's data fixes (#1349, #1352:
PR-A/B/C/D/E + leveling) can reach the live API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 16:04
@dosubot dosubot Bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Apr 27, 2026
@dr5hn dr5hn merged commit e6b4f8d into master Apr 27, 2026
3 checks passed
@dr5hn dr5hn deleted the fix/export-postcodes-gzip branch April 27, 2026 16:04
@dosubot dosubot Bot added the fixed Issue has been fixed label Apr 27, 2026
@github-actions
Contributor

CSC Validation Report

PR Format

  • ✅ Description provided
  • ❌ Data source linked
  • ✅ Issue linked (recommended for data changes)
  • ✅ Justification / context provided

Labels applied: data:postcodes, large-contribution

⚠️ Large Contribution

This PR contains 4491 records. Large contributions require manual review.

Schema Validation (4491 records)

✅ All records passed validation

Cross-Reference Validation

✅ 8982 reference(s) verified


All checks passed | Status: Ready for review


Copilot AI left a comment


Pull request overview

Fixes export workflow push failures caused by newly large postcodes.* export artifacts exceeding GitHub’s 100MB limit by aligning postcodes handling with the existing “compress + don’t commit large exports” approach.

Changes:

  • Add gzip compression for additional postcodes.* export formats in the export workflow.
  • Update .gitignore to ignore uncompressed postcodes.* export artifacts across export directories.
  • Add a new Luxembourg (LU) postcode importer script that generates contributions/postcodes/LU.json from the CACLR XLSX dataset.

Reviewed changes

Copilot reviewed 1 out of 4 changed files in this pull request and generated no comments.

  • bin/scripts/sync/import_luxembourg_postcodes.py: New LU postcode importer (fetch + parse XLSX, map cantons → state FK, write LU.json).
  • .gitignore: Ignores raw uncompressed postcodes.* exports to avoid committing oversized artifacts.
  • .github/workflows/export.yml: Compresses additional postcode export files into .gz for Release uploads.

dr5hn added a commit that referenced this pull request Apr 27, 2026
…g, postcodes (#1491)

Adds:
- v3.2 release section summarising today's work (#1349, #1352, #1039,
  #1481-#1490).
- Notable callout for FR mainland city region->department remap
  (mirrors the existing IT one), explicitly calling out the behaviour
  change for consumers querying by region state_code.
- Chronological entries for each of the 19 PRs that landed today
  (changelog automation only runs weekly).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dr5hn added a commit that referenced this pull request Apr 28, 2026
These were committed at tiny placeholder sizes during #1039's initial
exports wiring (#1403), but the export.yml workflow regenerates them
at full size every run — 239 MB json/postcodes.json, 263 MB
xml/postcodes.xml, 171 MB yml/postcodes.yml, 123 MB psql/postcodes.sql,
105 MB sqlite/postcodes.sqlite3, 90 MB sql/postcodes.sql, 73 MB
sqlserver/postcodes.sql — all over GitHub's 100 MB hard limit.

#1490 added .gitignore entries for the same paths but gitignore is
inert against tracked files, so the export PR's git push still failed.
Untrack here so the gitignore actually applies; large compressed
.gz versions continue to ship via GitHub Releases.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
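The untrack step that commit describes can be demonstrated in a scratch repo. This is a sketch of the general git behaviour, not the repository's actual commands; file names and commit messages are illustrative.

```shell
# Why .gitignore alone is inert for files git already tracks, and the
# `git rm --cached` untrack step that makes it apply.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email ci@example.com && git config user.name ci
mkdir json
printf '[]' > json/postcodes.json
git add json/postcodes.json
git commit -qm 'placeholder-size export'   # file is now tracked
echo 'json/postcodes.json' > .gitignore    # inert: ignores only untracked files
git rm -q --cached json/postcodes.json     # untrack; working copy stays on disk
git add .gitignore
git commit -qm 'untrack raw export so gitignore applies'
git ls-files > tracked.txt
cat tracked.txt   # only .gitignore remains tracked
```

After the untrack commit the raw file still exists on disk for gzipping and Release upload, but it no longer rides along in the export PR's push.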

Labels

data:postcodes · fixed (Issue has been fixed) · large-contribution · ready-for-review · size:XS (This PR changes 0-9 lines, ignoring generated files)
