dbt Validator

dbt-validator is a Python CLI that reads a unified git diff for dbt model changes and prints a structured code review to stdout. It combines deterministic SQL heuristics with optional LLM-assisted semantic review.

Setup

Use uv with Python 3.12.
uv can be downloaded using this link.

The commands below do not assume a global python or dbt-validator executable is already installed.

uv python install 3.12
uv sync --dev --python 3.12

Check the CLI:

uv run --python 3.12 dbt-validator --help

If you activate .venv manually and dbt-validator is still not found, the project console script has not been installed into that environment. Run uv sync --dev --python 3.12, or keep using uv run --python 3.12 dbt-validator ....

The project defines a Python build backend in pyproject.toml, so uv sync installs the dbt-validator console script into the managed environment.

Launch

Review a diff file and print JSON:

uv run --python 3.12 dbt-validator samples/dbt_model_changes.diff --semantic-mode mock

Try focused deterministic examples:

uv run --python 3.12 dbt-validator samples/join_risks.diff --semantic-mode off
uv run --python 3.12 dbt-validator samples/relation_select_risks.diff --semantic-mode off
uv run --python 3.12 dbt-validator samples/date_null_risks.diff --semantic-mode off

Equivalent direct module launch:

uv run --python 3.12 python main.py samples/dbt_model_changes.diff --semantic-mode mock

Read from stdin:

git diff -- models/ | uv run --python 3.12 dbt-validator --semantic-mode auto

Options:

diff_file: optional path to a unified diff. Omit it or pass - to read stdin.
--repo-root: repository root used to include current model SQL context in semantic review prompts. Defaults to ..
--format json|text: stdout format. JSON is the default.
--semantic-mode auto|openai|mock|off: controls whether the semantic reviewer runs and which provider it uses. See "Semantic Modes" below.
--model: OpenAI model for semantic review. Defaults to gpt-5-mini.

For real semantic review, set:

export OPENAI_API_KEY="..."
uv run --python 3.12 dbt-validator samples/dbt_model_changes.diff --semantic-mode openai

The OpenAI provider calls the Responses API with reasoning.effort set to medium. No live OpenAI calls are made in tests.

Semantic Modes

Deterministic checks always run. --semantic-mode only decides whether an extra semantic reviewer runs after those deterministic checks.

Goal	Mode	What Runs
Review a real diff with deterministic rules only	`--semantic-mode off`	Deterministic checks only. No LLM, no mock semantic finding, no network.
Review a real diff with deterministic rules plus real LLM review	`--semantic-mode openai`	Deterministic checks plus OpenAI semantic review. Requires `OPENAI_API_KEY`.
Run a demo or tests without secrets	`--semantic-mode mock`	Deterministic checks plus a tiny local semantic stub. It does not call OpenAI and is not a substitute for real semantic review.
Let the environment decide	`--semantic-mode auto`	Deterministic checks plus OpenAI if `OPENAI_API_KEY` is set; otherwise deterministic checks plus `mock`.

Practical recommendation:

Use off when you want only deterministic findings on a real diff.
Use openai when you want both deterministic findings and real semantic review on a real diff.
Use mock only for repeatable examples, local demos, or tests.
Avoid auto in CI if you need strict repeatability, because behavior changes depending on whether OPENAI_API_KEY is present.

Design Decisions

The CLI defaults to JSON so it can be consumed by CI, scripts, or PR bots.
The parser only needs a unified diff. When a dbt checkout is available, the semantic reviewer reads the changed model file through --repo-root for additional context.
Deterministic checks are small composable classes split by concern under dbt_validator/check_rules/, making new checks easy to add without changing unrelated rules.
The OpenAI provider uses the standard library instead of adding an SDK dependency.
Runtime dependencies are intentionally empty; formatting, linting, testing, and security scanners live in the dev dependency group.
Hatchling is used as the build backend so uv sync can install the package and expose the dbt-validator console script.
mock semantic mode gives stable demo and test behavior without secrets.

Current Deterministic Checks

Check	Why It Is Risky In dbt
`SELECT *` in added SQL	dbt models often become upstream contracts. Pulling every column makes downstream models sensitive to unrelated source changes and can accidentally expose or process columns that were not intended.
Hardcoded schema or database references after `from` or `join`	They bypass dbt's `ref()` and `source()` resolution, making environments less portable and increasing the chance that dev, CI, or prod runs point at the wrong relation.
Missing join predicates	Joins without `ON` or `USING` clauses can accidentally multiply rows and corrupt aggregates.
Weak join predicates such as `ON true` or `ON 1 = 1`	These behave like cross joins while looking like ordinary joins.
Explicit `cross join`	It multiplies every row from both sides. That may be intentional for date spines or grids, but it should be reviewed carefully.
Suspicious fanout heuristics for joins to names such as `events`, `items`, `lines`, `transactions`, `facts`, or `orders`	Those names often indicate many rows per business entity, so joining them can duplicate base records unless uniqueness is guaranteed or data is aggregated first.
Suspicious `= NULL`, `!= NULL`, or `<> NULL` comparisons	SQL three-valued logic does not treat `NULL` like a normal value. These expressions usually do not behave as intended; use `IS NULL` or `IS NOT NULL`.
Hardcoded date filters such as `where order_date >= '2025-01-01'`	Fixed dates silently age and can make model logic stale. Prefer a dbt variable, documented source boundary, or explicit incremental predicate.
Relative clock functions such as `current_date`, `current_timestamp`, or `now()`	Model output changes depending on run time, which can make backfills, tests, and reproducibility harder.

Known Limitations and TODOs

This is a diff reviewer, not a SQL parser. Checks are heuristic and can produce false positives or miss multiline SQL patterns.
Without --repo-root-aware extensions, the tool cannot inspect existing model files, dbt metadata, manifests, tests, or upstream uniqueness constraints.
The mock semantic provider is intentionally narrow and only demonstrates the shape of semantic findings.
The OpenAI provider asks the model to return a small JSON object: {"findings": [{"title": "...", "message": "...", "severity": "error|warning|info", "path": "...", "line": 12}]} The CLI handles this internally; users do not provide this JSON. If OpenAI returns text that is not valid JSON or does not match this shape, deterministic findings are still printed and summary.semantic_error explains that the semantic review response could not be used.
No proper production testing was done (besides testing on samples and on "dbt-labs/jaffle_shop_duckdb" repo)
Openai calls are not using json_schema in api calls - that means there can be mopre failures of semantic checks because of schema mismatch with the expected format
Add to README the repo structure with comments on what each file contains

Samples

Sample input:

samples/dbt_model_changes.diff
samples/join_risks.diff
samples/relation_select_risks.diff
samples/date_null_risks.diff

Sample output:

samples/sample_output.json
samples/join_risks_output.json
samples/relation_select_risks_output.json
samples/date_null_risks_output.json

Generate the primary sample with:

uv run --python 3.12 dbt-validator samples/dbt_model_changes.diff --semantic-mode mock

The output shape is:

{
  "summary": {
    "files_reviewed": 1,
    "findings": 7,
    "semantic_provider": "mock",
    "semantic_error": null
  },
  "files_reviewed": ["models/marts/customer_order_summary.sql"],
  "findings": [
    {
      "id": "select_star:models/marts/customer_order_summary.sql:8:select-star",
      "title": "Avoid SELECT * in dbt models",
      "message": "SELECT * makes downstream contracts fragile. Select the required columns explicitly.",
      "severity": "warning",
      "path": "models/marts/customer_order_summary.sql",
      "line": 8,
      "check_name": "select_star",
      "source": "deterministic",
      "metadata": {}
    }
  ]
}

Development

Development tools:

Tool	Why It Is Used
`uv`	Creates the Python 3.12 environment, installs dev dependencies, installs the local package, and runs commands without relying on globally installed Python tools.
`hatchling`	Build backend used by `uv` to install this project as a package and expose the `dbt-validator` console script.
`black`	Formats Python code consistently so reviews do not spend time on style differences.
`ruff`	Runs fast lint checks and import sorting; `--fix` handles safe mechanical cleanup.
`mypy`	Type-checks the codebase and catches interface mismatches that tests may not cover.
`pytest`	Runs the unit and CLI behavior tests.
`pytest-cov`	Available when coverage reporting is useful; not required for the basic verification command.
`bandit`	Scans Python code for common security issues. It is a dev-only check, not a runtime dependency.
`pip-audit`	Audits installed Python dependencies for known vulnerabilities. It is a dev-only check, not a runtime dependency.

Required verification commands:

uv run --python 3.12 black .
uv run --python 3.12 ruff check . --fix
uv run --python 3.12 mypy .
uv run --python 3.12 python -m pytest
uv run --python 3.12 black --check .
uv run --python 3.12 ruff check .

Optional security checks:

uv run --python 3.12 bandit -r dbt_validator
uv run --python 3.12 pip-audit

License

This repository is licensed under the PolyForm Noncommercial License 1.0.0 (PolyForm-Noncommercial-1.0.0).

Commercial use of this test-task code is not permitted. For commercial use, get a separate written commercial license from the copyright holder.

See LICENSE for the repository notice and the official license URL.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
dbt_validator		dbt_validator
samples		samples
tests		tests
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
main.py		main.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

dbt Validator

Contents

Setup

Launch

Semantic Modes

Design Decisions

Current Deterministic Checks

Known Limitations and TODOs

Samples

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

dbt Validator

Contents

Setup

Launch

Semantic Modes

Design Decisions

Current Deterministic Checks

Known Limitations and TODOs

Samples

Development

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages