11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,17 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/).

## [Unreleased]

### Added

- **Albania coverage completeness** (#118): AL postal codes now resolve via the
official postal-code block-allocation scheme (`app/albania_blocks.py`) instead
of the incomplete GeoNames estimates. Every well-formed 4-digit AL code maps to
its NUTS3 region by district block — codes GeoNames omitted (e.g. Tirana 1055,
and whole districts like Gramsh 33xx / Peqin 35xx / Tepelenë 63xx / Përmet
64xx) no longer 404. Validated to reproduce all 489 previously-shipped codes
identically. Because the map is code, not data, AL coverage is now immune to
the `PC2NUTS_ESTIMATES_REFRESH_URL` full-replace clobber.

## [1.0.0] - 2026-07-03

### Added
18 changes: 9 additions & 9 deletions README.md
@@ -27,7 +27,7 @@ Albania (AL), North Macedonia (MK), Montenegro (ME), Serbia (RS), Türkiye (TR)

> **Montenegro** is treated by Eurostat as a single nationwide unit at every NUTS level (`ME0` / `ME00` / `ME000`), and GISCO does not currently publish a TERCET file for it. Lookups for ME are served by the single-NUTS3 fallback (Tier 5) configured via `single_nuts3_fallback` in `app/settings.json`, returning `ME000` for any valid 5-digit code starting with `8`.

> **Albania** has a full NUTS hierarchy (`AL0`; `AL01` / `AL02` / `AL03`; 12 NUTS3 counties `AL011`–`AL035`) but Eurostat publishes no GISCO TERCET file for it. Coverage is provided through the Tier-2 estimates layer: each of ~489 Albanian 4-digit postal codes is mapped to its NUTS3 county (qark) via GeoNames' admin1 tagging, which corresponds 1:1 to the NUTS3 regions. Lookups return `match_type="estimated"` with `high` confidence — see [Estimates](#estimates).
> **Albania** has a full NUTS hierarchy (`AL0`; `AL01` / `AL02` / `AL03`; 12 NUTS3 counties `AL011`–`AL035`) but Eurostat publishes no GISCO TERCET file for it. Coverage is provided by an authoritative postal-code **block resolver** (`app/albania_blocks.py`): Albanian codes are block-allocated by district — the first two digits identify one of ~33 postal districts, each belonging to one of the 12 NUTS3 qarks — so **any** well-formed 4-digit code resolves to its qark via the block it falls into. Lookups return `match_type="estimated"` with `high` confidence — see [Estimates](#estimates).

**Other territories** (1):
Faroe Islands (FO) — not part of NUTS; synthetic result.
@@ -738,15 +738,15 @@ These labels map to numerical confidence scores per NUTS level. Coarser levels r

### Current coverage

The estimates file contains **7,632 entries** across 33 countries, with the following confidence distribution:
The estimates file contains **7,143 entries** across 32 countries, with the following confidence distribution:

| Confidence | Count | Share |
|------------|-------|-------|
| high | 5,746 | 75.3% |
| medium | 1,439 | 18.9% |
| low | 447 | 5.9% |
| high | 5,257 | 73.6% |
| medium | 1,439 | 20.1% |
| low | 447 | 6.3% |

Countries with the most estimates: TR (1,778), LT (1,231), FR (526), DE (500), AL (489), EL (387), CZ (361), RO (358).
Countries with the most estimates: TR (1,778), LT (1,231), FR (526), DE (500), EL (387), CZ (361), RO (358).

### Revalidation

@@ -847,7 +847,7 @@ docker build -t postalcode2nuts .
docker run -p 8000:8000 postalcode2nuts
```

On first start the service downloads TERCET data for the 34 countries with GISCO coverage (~2-5 minutes depending on network); Montenegro (single-NUTS3 fallback) and Albania (estimates-only, bundled in `tercet_missing_codes.csv`) need no download. After that everything is cached in a SQLite database for instant restarts.
On first start the service downloads TERCET data for the 34 countries with GISCO coverage (~2-5 minutes depending on network); Montenegro (single-NUTS3 fallback) and Albania (resolved in-code via the postal-code block map) need no download. After that everything is cached in a SQLite database for instant restarts.

### Persistent data volume

@@ -939,7 +939,7 @@ tests/
├── test_nuts_pip.py
├── test_auth.py
├── test_token_db.py
└── ... # full suite also covers estimates refresh, rate limiting, Albania estimates, etc.
└── ... # full suite also covers estimates refresh, rate limiting, the Albania block resolver, etc.
scripts/
├── import_estimates.py # CLI: import pre-computed estimates into SQLite DB
└── tokens.py # CLI: manage trusted-token DB (init/add/list/revoke)
@@ -1025,7 +1025,7 @@ No Python code changes are required.

## Data sources & attribution

**Postal code → NUTS (both tiers).** [GISCO TERCET flat files](https://ec.europa.eu/eurostat/web/gisco/geodata/administrative-units/postal-codes) ([download](https://gisco-services.ec.europa.eu/tercet/flat-files)), © European Union – GISCO, licensed [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). Albanian estimates are derived from [GeoNames](https://www.geonames.org/) admin1 tagging, licensed [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
**Postal code → NUTS (both tiers).** [GISCO TERCET flat files](https://ec.europa.eu/eurostat/web/gisco/geodata/administrative-units/postal-codes) ([download](https://gisco-services.ec.europa.eu/tercet/flat-files)), © European Union – GISCO, licensed [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). Albanian NUTS3 assignments come from the country's official postal-code block-allocation scheme (Posta Shqiptare), cross-validated against [GeoNames](https://www.geonames.org/) admin1 tagging ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)).

The [EU Open Data Portal dataset](https://data.europa.eu/data/datasets/postcodes-and-nuts-nomenclature-of-territorial-units-for-statistics) was also considered as a data source. However, its refresh cycle lags behind the GISCO TERCET flat files, so direct sourcing from GISCO was chosen for more up-to-date coverage.

82 changes: 82 additions & 0 deletions app/albania_blocks.py
@@ -0,0 +1,82 @@
"""Authoritative Albania (AL) NUTS3 resolver from the postal-code BLOCK scheme.

Albania has no Eurostat TERCET file. Its postal codes are block-allocated by
district: the first two digits identify one of ~33 postal districts, and each
district sits in exactly one of the 12 qarks (= NUTS3). A range map keyed on the
district-center codes resolves ANY well-formed 4-digit code to its NUTS3 by the
block it falls into — covering the gaps GeoNames leaves (issue #118) by
construction, at NUTS3 granularity.

Source: official Posta Shqiptare allocation, cross-checked vs. Wikipedia "Postal
codes in Albania" and the UPU addressing PDF. The district->qark->NUTS3 mapping
reuses the GISCO-verified qark codes; the two non-obvious assignments (Kruje
15xx -> AL012, Kavaje 25xx -> AL022) are confirmed by GeoNames' own 15xx/25xx
tagging. Validated 100% against the 489 previously-shipped GeoNames codes (see
tests/test_albania_golden.py).
"""

from __future__ import annotations

from bisect import bisect_right

SUPPORTED: frozenset[str] = frozenset({"AL"})

# (district-center code, NUTS3, district name). Ascending by code. Each code is
# the LOWER bound of that district's block; a block runs to the next code.
# 1700 "Transit" / 1800 "EMS" are non-geographic service codes folded into
# Tirana (AL022), matching how GeoNames tags the 17xx/18xx prefixes.
BLOCKS: list[tuple[int, str, str]] = [
    (1000, "AL022", "Tirana"),
    (1500, "AL012", "Kruje"),
    (1700, "AL022", "Transit (service)"),
    (1800, "AL022", "EMS Office (service)"),
    (2000, "AL012", "Durres"),
    (2500, "AL022", "Kavaje"),
    (3000, "AL021", "Elbasan"),
    (3300, "AL021", "Gramsh"),
    (3400, "AL021", "Librazhd"),
    (3500, "AL021", "Peqin"),
    (4000, "AL015", "Shkoder"),
    (4300, "AL015", "Malesi e Madhe"),
    (4400, "AL015", "Puke"),
    (4500, "AL014", "Lezhe"),
    (4600, "AL014", "Mirdite"),
    (4700, "AL014", "Kurbin"),
    (5000, "AL031", "Berat"),
    (5300, "AL031", "Kucove"),
    (5400, "AL031", "Skrapar"),
    (6000, "AL033", "Gjirokaster"),
    (6300, "AL033", "Tepelene"),
    (6400, "AL033", "Permet"),
    (7000, "AL034", "Korce"),
    (7300, "AL034", "Pogradec"),
    (7400, "AL034", "Kolonje"),
    (8000, "AL011", "Mat"),
    (8300, "AL011", "Diber"),
    (8400, "AL011", "Bulqize"),
    (8500, "AL013", "Kukes"),
    (8600, "AL013", "Has"),
    (8700, "AL013", "Tropoje"),
    (9000, "AL032", "Lushnje"),
    (9300, "AL032", "Fier"),
    (9400, "AL035", "Vlore"),
    (9700, "AL035", "Sarande"),
]

_STARTS = [b[0] for b in BLOCKS]
_NUTS3 = [b[1] for b in BLOCKS]


def resolve_al_block(postal_code: str) -> str | None:
    """NUTS3 code for a well-formed 4-digit AL postal code, else None.

    Any code >= 1000 maps to its enclosing district block (incl. 9800-9999 ->
    Sarande/AL035 as best-effort). Codes < 1000, wrong length, or non-numeric
    return None.
    """
    # isascii() guards against non-ASCII digits such as "³", which pass
    # str.isdigit() but make int() raise ValueError.
    if not (len(postal_code) == 4 and postal_code.isascii() and postal_code.isdigit()):
        return None
    n = int(postal_code)
    if n < _STARTS[0]:
        return None
    # bisect_right finds the first block start > n; the block before it encloses n.
    return _NUTS3[bisect_right(_STARTS, n) - 1]
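
As a rough usage sketch of the range lookup above (not part of the PR), the snippet below rebuilds the same `bisect_right` logic over a six-row subset of `BLOCKS`. The names `SAMPLE_BLOCKS` and `resolve_range_fill` are made up for illustration. Because the subset stops at Kavaje, results for codes above 2999 would differ from the real table.

```python
from bisect import bisect_right
from typing import Optional

# Illustrative subset of BLOCKS: (lower bound of block, NUTS3), ascending.
SAMPLE_BLOCKS = [
    (1000, "AL022"),  # Tirana
    (1500, "AL012"),  # Kruje
    (1700, "AL022"),  # Transit (service)
    (1800, "AL022"),  # EMS Office (service)
    (2000, "AL012"),  # Durres
    (2500, "AL022"),  # Kavaje
]
STARTS = [start for start, _ in SAMPLE_BLOCKS]
REGIONS = [nuts3 for _, nuts3 in SAMPLE_BLOCKS]


def resolve_range_fill(postal_code: str) -> Optional[str]:
    """Hypothetical stand-in for resolve_al_block over the subset above."""
    if not (len(postal_code) == 4 and postal_code.isascii() and postal_code.isdigit()):
        return None
    n = int(postal_code)
    if n < STARTS[0]:
        return None
    # Index of the last block whose lower bound is <= n.
    return REGIONS[bisect_right(STARTS, n) - 1]


print(resolve_range_fill("1055"))  # AL022: falls in the Tirana block
print(resolve_range_fill("1499"))  # AL022: still below the 1500 Kruje bound
print(resolve_range_fill("1900"))  # AL022: range-filled from the 1800 block
print(resolve_range_fill("0999"))  # None: below the first block
```

The `1900` case is the behavior the review thread discusses: no block starts at 1900, yet the continuous interval still returns the preceding block's region.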


P2: Reject unallocated AL prefixes instead of range-filling them

This lookup assigns every numeric value to the preceding district-center code until the next center begins, so an AL input such as 1900 or 9999 is returned as a high-confidence Tirana or Sarandë estimate. That contradicts the module's own description, which identifies a district by its first two digits, and BLOCKS lists only prefixes such as 10, 15, 17, 18 and 20. The new tier therefore turns unallocated 4-digit AL codes in the gaps between listed prefixes into successful NUTS matches. If those codes should keep returning no match, key the lookup on the first two digits (or on explicit valid ranges) rather than on a continuous bisect interval.


Owner Author


You're right that the bisect range-fills: any code ≥ 1000 resolves to its enclosing district block, so inputs in unallocated prefixes, such as 1900 or 9999, return a best-effort region rather than no match. This is a deliberate design choice, not an oversight. The goal of #118 was to stop valid AL codes from returning 404, so the resolver leans toward best-effort coverage (see the docstring note that 9800-9999 resolves to Sarandë as best-effort).

It is worth stating the scope precisely, because it bounds the concern. All 489 real AL codes, and every gap code #118 was about, sit in allocated 2-digit prefixes, so they resolve identically under either approach (the golden regression test covers this). The only inputs where range-fill and prefix-strict differ are the ~5,500 codes in unallocated prefixes (55 of the 90 possible prefixes 10-99, 100 codes each), which almost certainly do not exist. Range-fill returns a plausible neighboring region for those; prefix-strict would return 404.
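
The ~5,500 figure can be checked with a few lines of arithmetic. This is only a sketch: `STARTS` repeats the 35 lower bounds from `BLOCKS`, and it assumes the well-formed code space is 1000-9999, so the candidate prefixes are 10 through 99.

```python
# Lower bounds of the 35 BLOCKS entries, copied from app/albania_blocks.py.
STARTS = [
    1000, 1500, 1700, 1800, 2000, 2500, 3000, 3300, 3400, 3500,
    4000, 4300, 4400, 4500, 4600, 4700, 5000, 5300, 5400,
    6000, 6300, 6400, 7000, 7300, 7400,
    8000, 8300, 8400, 8500, 8600, 8700, 9000, 9300, 9400, 9700,
]

# Each block start contributes one allocated 2-digit prefix.
allocated_prefixes = {start // 100 for start in STARTS}

# Codes below 1000 are rejected, so prefixes run from 10 to 99.
all_prefixes = set(range(10, 100))
unallocated_prefixes = all_prefixes - allocated_prefixes

# Every prefix covers 100 four-digit codes (xx00 through xx99).
unallocated_codes = len(unallocated_prefixes) * 100

print(len(allocated_prefixes))    # 35
print(len(unallocated_prefixes))  # 55
print(unallocated_codes)          # 5500
```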

Your alternative (key on the 35 allocated 2-digit prefixes and 404 the gaps) is a reasonable and arguably more honest tradeoff: it stops asserting confidence 0.9 for codes with no allocated district, and it costs nothing in real coverage. I've flagged it to the maintainer as a follow-up decision, since switching reverses the approved best-effort behavior. If we adopt it, it will be a small follow-up PR against the `_STARTS`/bisect logic. Thanks for the catch.

Owner Author


Addressed in #134: the resolver now keys on the allocated 2-digit district prefix, so codes in unallocated prefixes (1900, 9999, …) return not-found instead of a fabricated region. All 489 real codes and every #118 gap code still resolve identically (golden test unchanged).
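
The #134 change itself is not shown on this page. As a hedged sketch of what keying on the 2-digit prefix might look like, the snippet below uses a hypothetical `PREFIX_TO_NUTS3` table (a small subset derived from `BLOCKS`) and a hypothetical function name `resolve_al_prefix`. Neither is taken from the actual follow-up.

```python
from typing import Optional

# Hypothetical prefix table: first two digits -> NUTS3, subset of BLOCKS.
PREFIX_TO_NUTS3 = {
    "10": "AL022",  # Tirana
    "15": "AL012",  # Kruje
    "33": "AL021",  # Gramsh
    "35": "AL021",  # Peqin
    "63": "AL033",  # Tepelene
    "64": "AL033",  # Permet
    "97": "AL035",  # Sarande
}


def resolve_al_prefix(postal_code: str) -> Optional[str]:
    """Prefix-strict lookup: unallocated prefixes return None."""
    if not (len(postal_code) == 4 and postal_code.isascii() and postal_code.isdigit()):
        return None
    # dict.get returns None for any prefix that has no allocated district.
    return PREFIX_TO_NUTS3.get(postal_code[:2])


print(resolve_al_prefix("1055"))  # AL022
print(resolve_al_prefix("6345"))  # AL033
print(resolve_al_prefix("1900"))  # None: prefix 19 is not allocated
print(resolve_al_prefix("9999"))  # None: prefix 99 is not allocated
```

Compared with the bisect version, a dictionary lookup makes the gaps explicit: a code resolves only if its prefix appears in the table.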

21 changes: 20 additions & 1 deletion app/data_loader.py
@@ -16,6 +16,8 @@

import httpx

from app.albania_blocks import SUPPORTED as AL_SUPPORTED
from app.albania_blocks import resolve_al_block
from app.config import settings

_NUTS3_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{1,3}$")
@@ -96,6 +98,7 @@ def get_loaded_countries() -> set[str]:
        | {cc for cc, _ in _estimates}
        | set(_single_nuts3.keys())
        | set(_synthetic_nuts.keys())
        | set(AL_SUPPORTED)
    )


@@ -1031,9 +1034,10 @@ def _matches_pattern(cc: str, raw: str) -> bool:
def lookup(country_code: str, postal_code: str) -> dict | None:
    """Look up NUTS codes for a given country + postal code.

    Six-tier fall-through:
    Tiered fall-through:
      1. Exact TERCET match → confidence 1.0
      2. Pre-computed estimate → stored confidence per level
      2b. Albania block map → district-block NUTS3, match_type='estimated' (#118)
      3. Runtime prefix-based estimation → calculated confidence
      4. Country-level majority vote → unanimous NUTS1/2, dominant NUTS3 (e.g. MT)
      5. Single-NUTS3 country fallback → confidence 1.0 (e.g. LI, CY, LU)
@@ -1067,6 +1071,21 @@ def lookup(country_code: str, postal_code: str) -> dict | None:
                nuts3_confidence=est["nuts3_confidence"],
            )

    # Tier 2b: Albania authoritative block map (#118). AL has no TERCET and no
    # estimate rows; the official postal-district block scheme resolves any
    # well-formed 4-digit code to its NUTS3.
    if cc in AL_SUPPORTED:
        al_nuts3 = resolve_al_block(extracted)
        if al_nuts3 is not None:
            conf = settings.confidence_map["high"]
            return _build_result(
                "estimated",
                al_nuts3,
                nuts1_confidence=conf["nuts1"],
                nuts2_confidence=conf["nuts2"],
                nuts3_confidence=conf["nuts3"],
            )

    # Tier 3: Runtime prefix-based estimation
    approx = _estimate_by_prefix(cc, extracted)
    if approx is not None:
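
The fall-through pattern in this hunk (try a tier, return on the first hit, otherwise continue) can be illustrated in isolation. The sketch below is not the project's `lookup`: `lookup_tiered`, `estimate_tier`, `al_block_tier` and the toy data are all hypothetical, and the result dicts omit the confidence fields that `_build_result` fills in.

```python
from typing import Callable, Optional, Sequence

# A tier takes (country_code, postal_code) and returns a result dict or None.
Tier = Callable[[str, str], Optional[dict]]


def lookup_tiered(cc: str, code: str, tiers: Sequence[Tier]) -> Optional[dict]:
    """Return the first non-None result, trying tiers in priority order."""
    for tier in tiers:
        result = tier(cc, code)
        if result is not None:
            return result
    return None


# Toy pre-computed estimate row for a made-up country code "XX".
TOY_ESTIMATES = {("XX", "0001"): "XX001"}


def estimate_tier(cc: str, code: str) -> Optional[dict]:
    nuts3 = TOY_ESTIMATES.get((cc, code))
    return None if nuts3 is None else {"match_type": "estimated", "nuts3": nuts3}


def al_block_tier(cc: str, code: str) -> Optional[dict]:
    # Stand-in for resolve_al_block that only knows the Tirana 10xx prefix.
    if cc == "AL" and len(code) == 4 and code.isdigit() and code.startswith("10"):
        return {"match_type": "estimated", "nuts3": "AL022"}
    return None


# Order matters: stored estimates are consulted before the AL block map.
TIERS = [estimate_tier, al_block_tier]

print(lookup_tiered("AL", "1055", TIERS))  # resolved by the block-map stand-in
print(lookup_tiered("AL", "abcd", TIERS))  # None: no tier matched
```

Placing the Albania tier after the estimates tier mirrors the ordering in the hunk, where Tier 2b runs only when no stored estimate matched.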
107 changes: 0 additions & 107 deletions scripts/build_albania_estimates.py

This file was deleted.
