feat: authoritative Albania block resolver (#118) #133
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec2290baa2
    n = int(postal_code)
    if n < _STARTS[0]:
        return None
    return _NUTS3[bisect_right(_STARTS, n) - 1]
Reject unallocated AL prefixes instead of range-filling them
This lookup uses the previous center code for every numeric value until the next center, so an AL input like 1900 or 9999 is returned as a high-confidence Tirana/Sarandë estimate even though the module's own allocation describes district identity by the first two digits and BLOCKS only lists prefixes such as 10, 15, 17, 18, 20, etc. The new tier therefore turns unallocated 4-digit AL codes in the gaps between listed prefixes into successful NUTS matches; use the first-two-digit block keys (or explicit valid ranges) rather than a continuous bisect interval if these should keep returning no match.
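The difference between the two behaviors can be seen in a minimal sketch. The prefixes (10, 15, 17, 18, 20) are the ones named in the comment above; the labels, the table layout, and both function names are placeholders invented for illustration and are not taken from app/albania_blocks.py.

```python
from bisect import bisect_right
from typing import Optional

# Toy table. Prefixes come from the review comment; labels are placeholders,
# not real NUTS3 codes, and the real BLOCKS table is larger.
_BLOCKS = {10: "block-10", 15: "block-15", 17: "block-17", 18: "block-18", 20: "block-20"}
_STARTS = [prefix * 100 for prefix in sorted(_BLOCKS)]
_LABELS = [_BLOCKS[prefix] for prefix in sorted(_BLOCKS)]


def resolve_range_fill(postal_code: str) -> Optional[str]:
    """Continuous intervals: every value up to the next start inherits the previous block."""
    n = int(postal_code)
    if n < _STARTS[0]:
        return None
    return _LABELS[bisect_right(_STARTS, n) - 1]


def resolve_prefix_strict(postal_code: str) -> Optional[str]:
    """Hypothetical alternative: only listed two-digit prefixes match."""
    if len(postal_code) != 4 or not (postal_code.isascii() and postal_code.isdigit()):
        return None
    return _BLOCKS.get(int(postal_code[:2]))


for code in ("1055", "1900"):
    print(code, resolve_range_fill(code), resolve_prefix_strict(code))
```

In this toy table, 1055 resolves to the same block either way, while 1900 is absorbed into the block starting at 1800 under range-fill and returns no match under the prefix-keyed lookup.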
Correct that the bisect range-fills: any code ≥ 1000 resolves to its enclosing district block, so unallocated-prefix inputs like 1900 or 9999 return a best-effort region rather than no-match. This is a deliberate design choice, not an oversight — the goal of #118 was to stop valid AL codes 404-ing, and the resolver leans toward best-effort coverage (documented in the docstring's "9800–9999 → Sarandë as best-effort" note).
Worth stating precisely, since it bounds the concern: all 489 real AL codes — and every gap code #118 was about — sit in allocated 2-digit prefixes, so they resolve identically under either approach (the golden regression test covers this). The only inputs affected by range-fill vs. prefix-strict are ~5,500 codes in unallocated prefixes, which are almost certainly non-existent codes. Range-fill returns a plausible neighbor region for those; prefix-strict would 404 them.
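The ~5,500 figure is consistent with a quick count, assuming the affected domain is every 4-digit value from 1000 to 9999 and taking the 35 allocated two-digit prefixes quoted in this thread. The constant names below are illustrative only.

```python
# Two-digit prefixes that can begin a code in 1000-9999: 10 through 99.
FOUR_DIGIT_PREFIXES = range(10, 100)
ALLOCATED_PREFIXES = 35   # figure quoted in this thread, not recomputed here
CODES_PER_PREFIX = 100    # e.g. 1900-1999 for prefix 19

total_codes = len(FOUR_DIGIT_PREFIXES) * CODES_PER_PREFIX
allocated_codes = ALLOCATED_PREFIXES * CODES_PER_PREFIX
unallocated_codes = total_codes - allocated_codes

print(total_codes, allocated_codes, unallocated_codes)  # 9000 3500 5500
```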
Your alternative (key on the 35 allocated 2-digit prefixes, 404 the gaps) is a reasonable and arguably more honest tradeoff — it stops asserting confidence 0.9 for codes with no allocated district — and it's free for real coverage. I've flagged it to the maintainer as a follow-up decision, since switching reverses the approved best-effort behavior. If we adopt it, it'll be a small follow-up PR against the _STARTS/bisect logic. Thanks for the catch.
Closes #118.
Problem
Albania has no Eurostat TERCET file, so v0.21.0 shipped AL as incomplete GeoNames estimates (~489 codes). Officially-valid codes GeoNames omitted returned 404 — e.g. Tirana 1055/1057/1060–1065, and whole districts GeoNames never covered (Gramsh 33xx, Peqin 35xx, Tepelenë 63xx, Përmet 64xx). The gap spanned all 12 qarks.
Approach
Albanian postal codes are block-allocated by district: the first two digits identify one of ~33 postal districts, each belonging to exactly one of the 12 NUTS3 qarks. This branch replaces the GeoNames enumeration with an authoritative in-code block resolver (app/albania_blocks.py) that maps any well-formed 4-digit code to its qark by the block it falls into, covering the gaps by construction at NUTS3 granularity.
- app/albania_blocks.py: ~35-row block table + resolve_al_block() (bisect).
- lookup() gains an AL-only "Tier 2b" consulting it; get_loaded_countries() keeps AL supported.
- scripts/build_albania_estimates.py are removed: the block map (code, not data) is now AL's sole resolver, so a PC2NUTS_ESTIMATES_REFRESH_URL full-replace can no longer clobber AL coverage.
Continuity
match_type="estimated" and high confidence (nuts3 0.9) are unchanged, so AL's /lookup and /resolve behavior is identical for the 489 known codes (in particular, /resolve still won't geocode AL). The only observable change: gap codes now resolve instead of returning 404.
Validation
A golden regression test captures all 489 previously shipped (code → NUTS3) pairs as a fixture and asserts the block resolver reproduces every one, with 0 mismatches. Further tests cover gap codes, service codes (1700/1800), the top open range (9800–9999 → AL035), and malformed or out-of-range input. Docs (README + lookup() docstring + estimates statistics) are updated to match.
Suite: 48 passed across the AL + data_loader tests, ruff clean.
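For readers unfamiliar with the pattern, a golden regression check of this kind might look roughly like the sketch below. The fixture contents, the stand-in resolver, and all names are hypothetical placeholders; the real test loads the 489 shipped pairs and calls the actual block resolver.

```python
from typing import Callable, Dict, List, Optional, Tuple

Mismatch = Tuple[str, str, Optional[str]]


def find_mismatches(
    golden: Dict[str, str],
    resolver: Callable[[str], Optional[str]],
) -> List[Mismatch]:
    """Return (code, expected, actual) for every golden pair the resolver fails to reproduce."""
    mismatches = []
    for code, expected in sorted(golden.items()):
        actual = resolver(code)
        if actual != expected:
            mismatches.append((code, expected, actual))
    return mismatches


# Placeholder fixture and resolver, standing in for the real data and code.
GOLDEN = {"1001": "region-1", "1055": "region-1", "2001": "region-2"}


def toy_resolver(code: str) -> Optional[str]:
    return "region-1" if int(code) < 2000 else "region-2"


def test_golden_pairs_reproduced() -> None:
    # Collecting all mismatches first makes a failure report every bad code at once.
    assert find_mismatches(GOLDEN, toy_resolver) == []
```

Returning the full mismatch list, rather than asserting inside the loop, keeps the "0 mismatches" claim directly checkable from the test output.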