Skip to content

Teoretic6/dbt-validator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

dbt Validator

dbt-validator is a Python CLI that reads a unified git diff for dbt model changes and prints a structured code review to stdout. It combines deterministic SQL heuristics with optional LLM-assisted semantic review.

Contents

Setup

Use uv with Python 3.12.
uv can be downloaded using this link.

The commands below do not assume a global python or dbt-validator executable is already installed.

uv python install 3.12
uv sync --dev --python 3.12

Check the CLI:

uv run --python 3.12 dbt-validator --help

If you activate .venv manually and dbt-validator is still not found, the project console script has not been installed into that environment. Run uv sync --dev --python 3.12, or keep using uv run --python 3.12 dbt-validator ....

The project defines a Python build backend in pyproject.toml, so uv sync installs the dbt-validator console script into the managed environment.

Launch

Review a diff file and print JSON:

uv run --python 3.12 dbt-validator samples/dbt_model_changes.diff --semantic-mode mock

Try focused deterministic examples:

uv run --python 3.12 dbt-validator samples/join_risks.diff --semantic-mode off
uv run --python 3.12 dbt-validator samples/relation_select_risks.diff --semantic-mode off
uv run --python 3.12 dbt-validator samples/date_null_risks.diff --semantic-mode off

Equivalent direct module launch:

uv run --python 3.12 python main.py samples/dbt_model_changes.diff --semantic-mode mock

Read from stdin:

git diff -- models/ | uv run --python 3.12 dbt-validator --semantic-mode auto

Options:

  • diff_file: optional path to a unified diff. Omit it or pass - to read stdin.
  • --repo-root: repository root used to include current model SQL context in semantic review prompts. Defaults to ..
  • --format json|text: stdout format. JSON is the default.
  • --semantic-mode auto|openai|mock|off: controls whether the semantic reviewer runs and which provider it uses. See "Semantic Modes" below.
  • --model: OpenAI model for semantic review. Defaults to gpt-5-mini.

For real semantic review, set:

export OPENAI_API_KEY="..."
uv run --python 3.12 dbt-validator samples/dbt_model_changes.diff --semantic-mode openai

The OpenAI provider calls the Responses API with reasoning.effort set to medium. No live OpenAI calls are made in tests.

Semantic Modes

Deterministic checks always run. --semantic-mode only decides whether an extra semantic reviewer runs after those deterministic checks.

Goal Mode What Runs
Review a real diff with deterministic rules only --semantic-mode off Deterministic checks only. No LLM, no mock semantic finding, no network.
Review a real diff with deterministic rules plus real LLM review --semantic-mode openai Deterministic checks plus OpenAI semantic review. Requires OPENAI_API_KEY.
Run a demo or tests without secrets --semantic-mode mock Deterministic checks plus a tiny local semantic stub. It does not call OpenAI and is not a substitute for real semantic review.
Let the environment decide --semantic-mode auto Deterministic checks plus OpenAI if OPENAI_API_KEY is set; otherwise deterministic checks plus mock.

Practical recommendation:

  • Use off when you want only deterministic findings on a real diff.
  • Use openai when you want both deterministic findings and real semantic review on a real diff.
  • Use mock only for repeatable examples, local demos, or tests.
  • Avoid auto in CI if you need strict repeatability, because behavior changes depending on whether OPENAI_API_KEY is present.

Design Decisions

  • The CLI defaults to JSON so it can be consumed by CI, scripts, or PR bots.
  • The parser only needs a unified diff. When a dbt checkout is available, the semantic reviewer reads the changed model file through --repo-root for additional context.
  • Deterministic checks are small composable classes split by concern under dbt_validator/check_rules/, making new checks easy to add without changing unrelated rules.
  • The OpenAI provider uses the standard library instead of adding an SDK dependency.
  • Runtime dependencies are intentionally empty; formatting, linting, testing, and security scanners live in the dev dependency group.
  • Hatchling is used as the build backend so uv sync can install the package and expose the dbt-validator console script.
  • mock semantic mode gives stable demo and test behavior without secrets.

Current Deterministic Checks

Check Why It Is Risky In dbt
SELECT * in added SQL dbt models often become upstream contracts. Pulling every column makes downstream models sensitive to unrelated source changes and can accidentally expose or process columns that were not intended.
Hardcoded schema or database references after from or join They bypass dbt's ref() and source() resolution, making environments less portable and increasing the chance that dev, CI, or prod runs point at the wrong relation.
Missing join predicates Joins without ON or USING clauses can accidentally multiply rows and corrupt aggregates.
Weak join predicates such as ON true or ON 1 = 1 These behave like cross joins while looking like ordinary joins.
Explicit cross join It multiplies every row from both sides. That may be intentional for date spines or grids, but it should be reviewed carefully.
Suspicious fanout heuristics for joins to names such as events, items, lines, transactions, facts, or orders Those names often indicate many rows per business entity, so joining them can duplicate base records unless uniqueness is guaranteed or data is aggregated first.
Suspicious = NULL, != NULL, or <> NULL comparisons SQL three-valued logic does not treat NULL like a normal value. These expressions usually do not behave as intended; use IS NULL or IS NOT NULL.
Hardcoded date filters such as where order_date >= '2025-01-01' Fixed dates silently age and can make model logic stale. Prefer a dbt variable, documented source boundary, or explicit incremental predicate.
Relative clock functions such as current_date, current_timestamp, or now() Model output changes depending on run time, which can make backfills, tests, and reproducibility harder.

Known Limitations and TODOs

  • This is a diff reviewer, not a SQL parser. Checks are heuristic and can produce false positives or miss multiline SQL patterns.
  • Without --repo-root-aware extensions, the tool cannot inspect existing model files, dbt metadata, manifests, tests, or upstream uniqueness constraints.
  • The mock semantic provider is intentionally narrow and only demonstrates the shape of semantic findings.
  • The OpenAI provider asks the model to return a small JSON object: {"findings": [{"title": "...", "message": "...", "severity": "error|warning|info", "path": "...", "line": 12}]} The CLI handles this internally; users do not provide this JSON. If OpenAI returns text that is not valid JSON or does not match this shape, deterministic findings are still printed and summary.semantic_error explains that the semantic review response could not be used.
  • No proper production testing was done (besides testing on samples and on "dbt-labs/jaffle_shop_duckdb" repo)
  • Openai calls are not using json_schema in api calls - that means there can be mopre failures of semantic checks because of schema mismatch with the expected format
  • Add to README the repo structure with comments on what each file contains

Samples

Sample input:

  • samples/dbt_model_changes.diff
  • samples/join_risks.diff
  • samples/relation_select_risks.diff
  • samples/date_null_risks.diff

Sample output:

  • samples/sample_output.json
  • samples/join_risks_output.json
  • samples/relation_select_risks_output.json
  • samples/date_null_risks_output.json

Generate the primary sample with:

uv run --python 3.12 dbt-validator samples/dbt_model_changes.diff --semantic-mode mock

The output shape is:

{
  "summary": {
    "files_reviewed": 1,
    "findings": 7,
    "semantic_provider": "mock",
    "semantic_error": null
  },
  "files_reviewed": ["models/marts/customer_order_summary.sql"],
  "findings": [
    {
      "id": "select_star:models/marts/customer_order_summary.sql:8:select-star",
      "title": "Avoid SELECT * in dbt models",
      "message": "SELECT * makes downstream contracts fragile. Select the required columns explicitly.",
      "severity": "warning",
      "path": "models/marts/customer_order_summary.sql",
      "line": 8,
      "check_name": "select_star",
      "source": "deterministic",
      "metadata": {}
    }
  ]
}

Development

Development tools:

Tool Why It Is Used
uv Creates the Python 3.12 environment, installs dev dependencies, installs the local package, and runs commands without relying on globally installed Python tools.
hatchling Build backend used by uv to install this project as a package and expose the dbt-validator console script.
black Formats Python code consistently so reviews do not spend time on style differences.
ruff Runs fast lint checks and import sorting; --fix handles safe mechanical cleanup.
mypy Type-checks the codebase and catches interface mismatches that tests may not cover.
pytest Runs the unit and CLI behavior tests.
pytest-cov Available when coverage reporting is useful; not required for the basic verification command.
bandit Scans Python code for common security issues. It is a dev-only check, not a runtime dependency.
pip-audit Audits installed Python dependencies for known vulnerabilities. It is a dev-only check, not a runtime dependency.

Required verification commands:

uv run --python 3.12 black .
uv run --python 3.12 ruff check . --fix
uv run --python 3.12 mypy .
uv run --python 3.12 python -m pytest
uv run --python 3.12 black --check .
uv run --python 3.12 ruff check .

Optional security checks:

uv run --python 3.12 bandit -r dbt_validator
uv run --python 3.12 pip-audit

License

This repository is licensed under the PolyForm Noncommercial License 1.0.0 (PolyForm-Noncommercial-1.0.0).

Commercial use of this test-task code is not permitted. For commercial use, get a separate written commercial license from the copyright holder.

See LICENSE for the repository notice and the official license URL.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages