A single-operator data platform that ingests three public APIs (eBird, NOAA, USGS) into one queryable cross-domain warehouse. Zero always-on infra: file-based DuckDB locally, MotherDuck cloud with one environment flag. Every layer — ingest, transform, quality, orchestration, semantic metrics, data dictionary — is wired end-to-end through the same open-source stack.
The project exists to answer one question: "do species distributions shift with same-day weather and streamflow anomalies?" The platform around it exists to answer that question honestly, repeatably, and with receipts.
- Skim the System architecture diagram below.
- Skim the Data flow diagram below.
- Open the data dictionary — every model, columns, types, Soda checks, lineage.
- Read ADR-0001 through ADR-0006 for the six load-bearing architectural decisions.
- Read docs/analytics-examples.md to see what the cross-domain mart actually answers.
graph TB
subgraph External["External APIs"]
ebird_api[eBird API v2]
noaa_api[NOAA CDO v2]
usgs_api[USGS NWIS]
end
subgraph Ingest["Ingestion · dlt"]
ebird_src[ebird source]
noaa_src[noaa source]
usgs_src[usgs source]
end
subgraph Raw["Raw catalogs · DuckDB / MotherDuck"]
raw_ebird[(raw_ebird)]
raw_noaa[(raw_noaa)]
raw_usgs[(raw_usgs)]
end
subgraph Transform["Transform · SQLMesh"]
staging["*_staging views"]
marts["ebird / noaa / usgs marts"]
analytics["analytics.* cross-domain marts"]
metrics["semantic metrics METRIC()"]
end
subgraph Quality["Quality · Soda Core"]
contracts["Soda contracts per model"]
end
subgraph Consumer["Consumers"]
dashboard["MotherDuck Dive dashboards"]
dict["Data dictionary site (MkDocs)"]
end
Orchestrator["Dagster: sole orchestrator<br/>assets · schedules · sensors · asset checks"]
ebird_api --> ebird_src --> raw_ebird
noaa_api --> noaa_src --> raw_noaa
usgs_api --> usgs_src --> raw_usgs
raw_ebird --> staging
raw_noaa --> staging
raw_usgs --> staging
staging --> marts
marts --> analytics
analytics --> metrics
analytics --> contracts
marts --> contracts
analytics --> dashboard
analytics --> dict
Orchestrator -.materializes.-> ebird_src
Orchestrator -.materializes.-> noaa_src
Orchestrator -.materializes.-> usgs_src
Orchestrator -.materializes.-> staging
Orchestrator -.materializes.-> marts
Orchestrator -.materializes.-> analytics
Orchestrator -.asset-checks.-> contracts
flowchart LR
A[API responses<br/>JSON] -->|dlt incremental| B[raw_*<br/>append-only tables]
B -->|SQLMesh stg_* views| C[*_staging<br/>renames + type coercion]
C -->|SQLMesh int_* models| D[int_*<br/>H3 cell + day grain]
D -->|SQLMesh fct_* / dim_*| E[source marts<br/>ebird / noaa / usgs]
E -->|cross-domain joins| F[analytics.fct_species_environment_daily<br/>species × H3 × day]
F -->|METRIC DDL| G[semantic metrics<br/>richness · intensity · anomalies]
F -->|Soda asset check| H{quality gate}
H -->|pass| I[Dive dashboards]
H -->|fail| J[block downstream · surface in Dagster]
Each claim is backed by a ticket and its evidence, not just prose.
| Capability | How it shows up | Evidence |
|---|---|---|
| Cross-domain modeling — joining bird, weather, and streamflow at a shared spatial grain | analytics.fct_species_environment_daily keyed on (species_code, h3_cell, obs_date), H3 cells resolve spatial joins across domains |
ticket:flagship-cross-domain-mart · example queries |
| Semantic metrics layer — one canonical SQL definition per KPI | Seven metrics in transforms/main/metrics/flagship.sql, queryable by name via resolve_metric_query() |
ticket:semantic-metrics-layer · metrics docs · design record |
| Contract-based quality — every model has a Soda contract gated as a Dagster asset check | soda/contracts/ + Soda asset checks that block downstream materialization on failure |
ticket:observability-pass · contracts doc |
| Schema-contract CI gate — breaking changes to contracts require explicit opt-in | scripts/schema_gate.py runs on PRs, blocks column drops / type narrowings without accept-breaking-change marker |
ticket:schema-contract-ci |
| Auto-generated data dictionary — every model's columns, types, checks, and lineage discoverable without cloning | doctacon.github.io/databox, regenerated from sqlmesh.Context + Soda YAML |
ticket:data-dictionary-site |
| Idempotent incremental ingest — reruns do not double-count | write_disposition='merge' on the right resources, primary keys declared, documented contract |
ticket:incremental-load-audit · incremental-loading doc |
| Portable local ↔ cloud — identical SQL, environment-variable switch | DATABOX_BACKEND=local vs =motherduck; gateways mirror; settings return file paths vs md: URIs transparently |
ADR-0006 |
| Single orchestration surface — Dagster owns every run, no CLI drift | Every asset (ingest, transform, quality) visible in one DAG; task targets are thin wrappers |
ADR-0005 |
- dlt — Python-native ingestion, auto-schema inference
- SQLMesh — SQL transforms with virtual environments, native metrics, column-level change detection (see ADR-0002)
- DuckDB — embedded analytical warehouse (see ADR-0001)
- MotherDuck — cloud DuckDB for the portfolio deploy (see ADR-0006)
- Dagster — sole orchestrator; assets, schedules, sensors, asset checks (see ADR-0005)
- Soda Core — contract-based quality, run as asset checks
- MkDocs-Material — auto-generated data dictionary site
Six backfilled ADRs (Nygard format) cover the choices that load-bear on the rest of the design. Each one is under 200 lines.
- ADR-0001 — DuckDB as the primary warehouse
- ADR-0002 — SQLMesh over dbt
- ADR-0003 — Single SQLMesh project across all sources
- ADR-0004 — Per-source raw DuckDB catalogs
- ADR-0005 — Dagster as the sole orchestrator
- ADR-0006 — MotherDuck as the cloud path
# Install (requires uv)
task install
# Configure API keys — EBIRD_API_TOKEN and NOAA_API_TOKEN are free
cp .env.example .env && $EDITOR .env
# Run everything headlessly (ingest → transform → quality)
task full-refresh
# Or interactive: Dagster UI at localhost:3000
task dagster:dev
# Edit a model? Propose in dev, verify, promote to prod
task plan:dev # materialize into ebird__dev, noaa__dev, ...
task verify:dev # Soda contracts run against __dev schemas
task plan:prod # promote verified changes — see docs/environments.md
# Launch the Streamlit data explorer
task streamlitDagster is the one entrypoint — individual pipeline runs, quality checks, schedules, and sensors all happen as assets in the same DAG. See ADR-0005.
Per-mart staleness SLAs are declared in each domain module and validated
by last_update asset checks; a sensor emits a structured warning line
per violation. See docs/freshness.md.
Four production-failure scenarios — blown DuckDB file, partial source backfill, MotherDuck point-in-time recovery, paused-schedule resumption — have copy-pasteable recovery commands in docs/runbook.md.
Databox is designed to be forked. After cloning:
task init -- \
--name Weatherbox --slug weatherbox \
--org your-org --repo weatherbox \
--site-url https://your-org.github.io/weatherbox/task init reads scaffold.yaml, rewrites every project-identity reference across README/mkdocs/docs/LICENSE/pyproject, and is a no-op on second run. The three example sources (eBird / NOAA / USGS) stay wired so the whole pipeline is green on day one — delete or replace them at your own pace.
See docs/template.md for the full list of what's covered, what stays unchanged (Python package name, external deps), and how to extend the scaffold.
Flip between local and cloud via environment variables; the SQL never changes.
# Local (default) — file-based DuckDB, zero infra
DATABOX_BACKEND=local
# MotherDuck cloud — same SQL, md:* URIs
DATABOX_BACKEND=motherduck
MOTHERDUCK_TOKEN=<your_token>databox.config.settings is the single source of truth: DuckDB paths,
MotherDuck URIs, SQLMesh gateway selection, and the Soda datasource
are all derived from DATABOX_BACKEND. See
docs/configuration.md,
docs/secrets.md,
ADR-0001,
ADR-0004,
ADR-0006.
packages/ # uv workspace
├── databox/ # Shared library (config / quality / orchestration)
└── databox-sources/ # dlt ingestion + per-source configs
transforms/main/ # Unified SQLMesh project (ADR-0003)
├── config.yaml # local + motherduck gateways
├── metrics/ # semantic metrics registry
└── models/
├── ebird/ noaa/ usgs/ # staging → intermediate → marts
└── analytics/ # cross-domain marts
soda/contracts/ # One Soda contract per SQLMesh model
docs/
├── adr/ # 6 backfilled ADRs
├── dictionary/ # Auto-generated by scripts/generate_docs.py
└── {metrics,contracts,...}.md
scripts/
├── generate_docs.py # Data-dictionary generator
├── generate_staging.py # Staging-SQL codegen from Soda contracts
├── schema_gate.py # CI breaking-change gate
└── check_source_layout.py # Source-layout convention lint
data/ # DuckDB files (gitignored)
Adding a source? See docs/source-layout.md for the required on-disk shape (enforced in CI by source-layout-lint).
- Data dictionary + lineage: https://doctacon.github.io/databox/ — regenerates on every push to
main - docs/metrics.md — semantic metrics registry and
resolve_metric_queryhelper - docs/analytics-examples.md — representative queries against the flagship mart
- docs/contracts.md — Soda contract conventions + schema-contract gate escape hatch
- docs/incremental-loading.md — per-resource write disposition, keys, backfill
Two surfaces ship in-tree: the Dagster UI (asset graph, asset checks, freshness panel) and the auto-generated data dictionary at https://doctacon.github.io/databox/.
For external lineage catalogs, Dagster attaches an OpenLineage-emitting
sensor when OPENLINEAGE_URL is set in .env. Every major catalog
(Marquez, DataHub, OpenMetadata, Atlan, Astro) speaks OpenLineage, so
forkers drop whichever URL they already have and restart Dagster — no
code change. Install with uv sync --package databox --extra lineage.
Full walkthrough + local Marquez docker-compose in
docs/observability.md.
task install # Install dependencies (uv sync + pre-commit hook)
task verify # Smoke full-refresh via Dagster (DATABOX_SMOKE=1)
task ci # Ruff + mypy + pytest + secret scanRaw lint / format / test / SQLMesh / Dagster CLIs are in
docs/commands.md. Taskfile.yaml deliberately
keeps only targets that compose multiple commands or inject env vars.
MIT — see LICENSE.