dbt-validator is a Python CLI that reads a unified git diff for dbt model
changes and prints a structured code review to stdout. It combines deterministic
SQL heuristics with optional LLM-assisted semantic review.
- Setup
- Launch
- Semantic Modes
- Design Decisions
- Current Deterministic Checks
- Known Limitations
- Samples
- Development
- License
Use uv with Python 3.12.
uv can be downloaded using this link.
The commands below do not assume a global python
or dbt-validator executable is already installed.
uv python install 3.12
uv sync --dev --python 3.12Check the CLI:
uv run --python 3.12 dbt-validator --helpIf you activate .venv manually and dbt-validator is still not found, the
project console script has not been installed into that environment. Run
uv sync --dev --python 3.12, or keep using uv run --python 3.12 dbt-validator ....
The project defines a Python build backend in pyproject.toml, so uv sync
installs the dbt-validator console script into the managed environment.
Review a diff file and print JSON:
uv run --python 3.12 dbt-validator samples/dbt_model_changes.diff --semantic-mode mockTry focused deterministic examples:
uv run --python 3.12 dbt-validator samples/join_risks.diff --semantic-mode off
uv run --python 3.12 dbt-validator samples/relation_select_risks.diff --semantic-mode off
uv run --python 3.12 dbt-validator samples/date_null_risks.diff --semantic-mode offEquivalent direct module launch:
uv run --python 3.12 python main.py samples/dbt_model_changes.diff --semantic-mode mockRead from stdin:
git diff -- models/ | uv run --python 3.12 dbt-validator --semantic-mode autoOptions:
diff_file: optional path to a unified diff. Omit it or pass-to read stdin.--repo-root: repository root used to include current model SQL context in semantic review prompts. Defaults to..--format json|text: stdout format. JSON is the default.--semantic-mode auto|openai|mock|off: controls whether the semantic reviewer runs and which provider it uses. See "Semantic Modes" below.--model: OpenAI model for semantic review. Defaults togpt-5-mini.
For real semantic review, set:
export OPENAI_API_KEY="..."
uv run --python 3.12 dbt-validator samples/dbt_model_changes.diff --semantic-mode openaiThe OpenAI provider calls the Responses API with reasoning.effort set to
medium. No live OpenAI calls are made in tests.
Deterministic checks always run. --semantic-mode only decides whether an
extra semantic reviewer runs after those deterministic checks.
| Goal | Mode | What Runs |
|---|---|---|
| Review a real diff with deterministic rules only | --semantic-mode off |
Deterministic checks only. No LLM, no mock semantic finding, no network. |
| Review a real diff with deterministic rules plus real LLM review | --semantic-mode openai |
Deterministic checks plus OpenAI semantic review. Requires OPENAI_API_KEY. |
| Run a demo or tests without secrets | --semantic-mode mock |
Deterministic checks plus a tiny local semantic stub. It does not call OpenAI and is not a substitute for real semantic review. |
| Let the environment decide | --semantic-mode auto |
Deterministic checks plus OpenAI if OPENAI_API_KEY is set; otherwise deterministic checks plus mock. |
Practical recommendation:
- Use
offwhen you want only deterministic findings on a real diff. - Use
openaiwhen you want both deterministic findings and real semantic review on a real diff. - Use
mockonly for repeatable examples, local demos, or tests. - Avoid
autoin CI if you need strict repeatability, because behavior changes depending on whetherOPENAI_API_KEYis present.
- The CLI defaults to JSON so it can be consumed by CI, scripts, or PR bots.
- The parser only needs a unified diff. When a dbt checkout is available, the
semantic reviewer reads the changed model file through
--repo-rootfor additional context. - Deterministic checks are small composable classes split by concern under
dbt_validator/check_rules/, making new checks easy to add without changing unrelated rules. - The OpenAI provider uses the standard library instead of adding an SDK dependency.
- Runtime dependencies are intentionally empty; formatting, linting, testing, and security scanners live in the dev dependency group.
- Hatchling is used as the build backend so
uv synccan install the package and expose thedbt-validatorconsole script. mocksemantic mode gives stable demo and test behavior without secrets.
| Check | Why It Is Risky In dbt |
|---|---|
SELECT * in added SQL |
dbt models often become upstream contracts. Pulling every column makes downstream models sensitive to unrelated source changes and can accidentally expose or process columns that were not intended. |
Hardcoded schema or database references after from or join |
They bypass dbt's ref() and source() resolution, making environments less portable and increasing the chance that dev, CI, or prod runs point at the wrong relation. |
| Missing join predicates | Joins without ON or USING clauses can accidentally multiply rows and corrupt aggregates. |
Weak join predicates such as ON true or ON 1 = 1 |
These behave like cross joins while looking like ordinary joins. |
Explicit cross join |
It multiplies every row from both sides. That may be intentional for date spines or grids, but it should be reviewed carefully. |
Suspicious fanout heuristics for joins to names such as events, items, lines, transactions, facts, or orders |
Those names often indicate many rows per business entity, so joining them can duplicate base records unless uniqueness is guaranteed or data is aggregated first. |
Suspicious = NULL, != NULL, or <> NULL comparisons |
SQL three-valued logic does not treat NULL like a normal value. These expressions usually do not behave as intended; use IS NULL or IS NOT NULL. |
Hardcoded date filters such as where order_date >= '2025-01-01' |
Fixed dates silently age and can make model logic stale. Prefer a dbt variable, documented source boundary, or explicit incremental predicate. |
Relative clock functions such as current_date, current_timestamp, or now() |
Model output changes depending on run time, which can make backfills, tests, and reproducibility harder. |
- This is a diff reviewer, not a SQL parser. Checks are heuristic and can produce false positives or miss multiline SQL patterns.
- Without
--repo-root-aware extensions, the tool cannot inspect existing model files, dbt metadata, manifests, tests, or upstream uniqueness constraints. - The mock semantic provider is intentionally narrow and only demonstrates the shape of semantic findings.
- The OpenAI provider asks the model to return a small JSON object:
{"findings": [{"title": "...", "message": "...", "severity": "error|warning|info", "path": "...", "line": 12}]}The CLI handles this internally; users do not provide this JSON. If OpenAI returns text that is not valid JSON or does not match this shape, deterministic findings are still printed andsummary.semantic_errorexplains that the semantic review response could not be used. - No proper production testing was done (besides testing on samples and on "dbt-labs/jaffle_shop_duckdb" repo)
- Openai calls are not using json_schema in api calls - that means there can be mopre failures of semantic checks because of schema mismatch with the expected format
- Add to README the repo structure with comments on what each file contains
Sample input:
samples/dbt_model_changes.diffsamples/join_risks.diffsamples/relation_select_risks.diffsamples/date_null_risks.diff
Sample output:
samples/sample_output.jsonsamples/join_risks_output.jsonsamples/relation_select_risks_output.jsonsamples/date_null_risks_output.json
Generate the primary sample with:
uv run --python 3.12 dbt-validator samples/dbt_model_changes.diff --semantic-mode mockThe output shape is:
{
"summary": {
"files_reviewed": 1,
"findings": 7,
"semantic_provider": "mock",
"semantic_error": null
},
"files_reviewed": ["models/marts/customer_order_summary.sql"],
"findings": [
{
"id": "select_star:models/marts/customer_order_summary.sql:8:select-star",
"title": "Avoid SELECT * in dbt models",
"message": "SELECT * makes downstream contracts fragile. Select the required columns explicitly.",
"severity": "warning",
"path": "models/marts/customer_order_summary.sql",
"line": 8,
"check_name": "select_star",
"source": "deterministic",
"metadata": {}
}
]
}Development tools:
| Tool | Why It Is Used |
|---|---|
uv |
Creates the Python 3.12 environment, installs dev dependencies, installs the local package, and runs commands without relying on globally installed Python tools. |
hatchling |
Build backend used by uv to install this project as a package and expose the dbt-validator console script. |
black |
Formats Python code consistently so reviews do not spend time on style differences. |
ruff |
Runs fast lint checks and import sorting; --fix handles safe mechanical cleanup. |
mypy |
Type-checks the codebase and catches interface mismatches that tests may not cover. |
pytest |
Runs the unit and CLI behavior tests. |
pytest-cov |
Available when coverage reporting is useful; not required for the basic verification command. |
bandit |
Scans Python code for common security issues. It is a dev-only check, not a runtime dependency. |
pip-audit |
Audits installed Python dependencies for known vulnerabilities. It is a dev-only check, not a runtime dependency. |
Required verification commands:
uv run --python 3.12 black .
uv run --python 3.12 ruff check . --fix
uv run --python 3.12 mypy .
uv run --python 3.12 python -m pytest
uv run --python 3.12 black --check .
uv run --python 3.12 ruff check .Optional security checks:
uv run --python 3.12 bandit -r dbt_validator
uv run --python 3.12 pip-auditThis repository is licensed under the PolyForm Noncommercial License 1.0.0
(PolyForm-Noncommercial-1.0.0).
Commercial use of this test-task code is not permitted. For commercial use, get a separate written commercial license from the copyright holder.
See LICENSE for the repository notice and the official license URL.