Replace soda with ibis by jochenchrist · Pull Request #1279 · datacontract/datacontract-cli

jochenchrist · 2026-06-03T20:25:19Z

Tests pass (uv run pytest)
Code formatted (uv run ruff check --fix && uv run ruff format)
README.md updated (if relevant)
CHANGELOG.md entry added

Remove soda-core entirely as the data quality execution engine and replace it with ibis (https://ibis-project.org/), which compiles one expression API to many SQL dialects via sqlglot and reads local/remote files through DuckDB.

Motivation: soda-core v3 was an unmaintained, string-templated per-dialect SQL generator that forced a `setuptools`/`distutils.strtobool` shim, a `mysql-connector-python` override, and brittle version pinning across ~13 `soda-core-*` extras.

What changed:

- New engine-neutral check IR (`datacontract/engines/checks/`): `CheckSpec` + structured `Threshold`, and `create_checks` that enumerates an ODCS contract into specs, preserving every legacy check key/type/name.
- New ibis engine (`datacontract/engines/ibis/`): batches row/missing/invalid counts into one aggregation per model; runs dedicated queries for duplicates, schema/type, freshness/retention and user SQL; evaluates thresholds in Python. Reproduces soda's invalid_count semantics (NOT missing AND (NOT valid OR in invalid_values)). Counts use `CASE WHEN ... THEN 1 ELSE 0 END` for dialect portability (e.g. Oracle).
- Per-source ibis connection builders reusing the existing DuckDB view builder (files) and Spark/Kafka helpers; Spark sources run via the ibis pyspark backend.
- SodaCL kept but isolated: all SodaCL generation moved into `datacontract/export/sodacl_check_builder.py`, used only by `SodaExporter`. `export sodacl` is unchanged and no longer shares code with the test path.
- Removed `engines/soda/`, the old `data_contract_checks.py`, the soda config-builder tests, the setuptools shim, and all `soda-core-*` deps; pyproject extras now map to `ibis-framework[<backend>]`.
- Raw SodaCL custom checks (quality.engine: soda) now surface a migration warning instead of executing.

Verified end-to-end against testcontainers/local data for DuckDB (parquet/csv/json/s3), Postgres (full quality fixture), Trino, and Oracle; full non-DB suite passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
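
The batched count aggregation described above can be sketched with the standard library alone. This is a minimal illustration, not the engine's code: it uses `sqlite3` in place of ibis, and the `orders` table, its `status` column, the valid/invalid value lists, and the `within_threshold` helper are all made up for the example.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [("open",), ("closed",), (None,), ("bogus",), ("test",)],
)

# One query returns all three counts. Each count is a SUM over
# CASE WHEN ... THEN 1 ELSE 0 END, a form most SQL dialects accept.
# invalid = NOT missing AND (NOT valid OR in invalid_values)
sql = """
SELECT
  COUNT(*) AS row_count,
  SUM(CASE WHEN status IS NULL THEN 1 ELSE 0 END) AS missing_count,
  SUM(CASE WHEN status IS NOT NULL
            AND (status NOT IN ('open', 'closed', 'test')
                 OR status IN ('test'))
           THEN 1 ELSE 0 END) AS invalid_count
FROM orders
"""
row_count, missing_count, invalid_count = conn.execute(sql).fetchone()


def within_threshold(measured: int, must_be_less_than: int) -> bool:
    """Threshold evaluation happens in Python, after the query returns."""
    return measured < must_be_less_than


print(row_count, missing_count, invalid_count)
print(within_threshold(missing_count, 2))
```

Here `bogus` is invalid because it is not in the valid list, and `test` is invalid because it is explicitly listed as an invalid value even though it passes the validity rule.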

…xture

- execute_ibis_checks only disconnects connections the engine created. The pyspark backend wraps a caller-owned SparkSession, and an externally supplied DuckDB connection is owned by the caller; disconnecting either broke subsequent runs (e.g. the session-scoped Spark fixture shared by the two dataframe tests). Skip disposal for the pyspark backend and for caller-provided duckdb/spark resources.
- Migrate tests/fixtures/kafka to the native rowCount quality metric, replacing the removed raw SodaCL custom check, so the kafka/Spark path is exercised end-to-end.

Verified with Java 21: test_test_dataframe (x2), test_import_spark (x3) and test_test_kafka all pass via the ibis pyspark backend.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
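
The ownership rule in the first bullet can be sketched as follows. This is an assumption-laden illustration rather than the actual `execute_ibis_checks` code: `ManagedConnection`, `FakeConnection`, and `run_checks` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


class FakeConnection:
    """Stand-in for a backend connection (hypothetical)."""

    def __init__(self) -> None:
        self.closed = False

    def disconnect(self) -> None:
        self.closed = True


@dataclass
class ManagedConnection:
    """Pairs a connection with a flag recording who created it."""

    connection: FakeConnection
    owned_by_engine: bool


def run_checks(managed: ManagedConnection, checks: List[Callable]) -> list:
    """Run checks, then dispose of the connection only if the engine made it."""
    try:
        return [check(managed.connection) for check in checks]
    finally:
        if managed.owned_by_engine:
            managed.connection.disconnect()


# A caller-supplied connection survives the run; an engine-created one does not.
caller_conn = FakeConnection()
engine_conn = FakeConnection()
run_checks(ManagedConnection(caller_conn, owned_by_engine=False), [lambda c: "ok"])
run_checks(ManagedConnection(engine_conn, owned_by_engine=True), [lambda c: "ok"])
```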

Minor bump (0.x SemVer) for the breaking quality-engine replacement. Document the soda-core -> ibis migration, the dropped raw-SodaCL execution, and the soda-core dependency removal in the CHANGELOG Unreleased section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- ibis's native MySQL backend requires `mysqlclient`, a C extension with no macOS/Linux wheels that fails to build without pkg-config + MySQL client libraries (broke `pip install -e .[dev]`). Connect to MySQL via DuckDB's `mysql` extension instead: ATTACH the database and materialize each contract model into a local DuckDB table, then run checks locally. Keeps the `mysql` extra pure-pip. Materializing avoids DuckDB MySQL-scanner pushdown errors (e.g. the grouped duplicate-count query hit a DuckDB binder assertion).
- Pin `duckdb` to `<1.1.0` to match the bundled `duckdb-extension-*` wheels (httpfs/aws/azure, pinned `<1.1.0`). Without a lockfile, fresh installs resolved duckdb 1.5.3, which mismatched those wheels (S3 "Secret Validation Failure") and changed CSV/JSON/secret behavior and mysql-extension port handling, breaking s3, csv-import, nested-json, and mysql tests.

Full suite (Java 21, duckdb 1.0.0): 744 passed, 14 skipped. Remaining 5 failures are environmental: 4 protobuf (no `protoc`), 1 kafka (pre-existing Spark-session conflict with the dataframe test in non-xdist runs; passes in isolation, skipped under xdist in CI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The runtime MySQL path goes through DuckDB's mysql extension, so no Python MySQL driver is needed at install time. mysql-connector-python is only used by the MySQL test fixture to seed data, so it belongs in `dev`. The `mysql` extra is now just `datacontract-cli[duckdb]`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…wheels

Loosen the duckdb pin off the 1.0.x line. duckdb and the bundled duckdb-extension-* wheels (httpfs/aws/azure) are bumped together to 1.5.x; the 1.5.x extension wheels ship arm64 Linux builds, so the platform skip markers are dropped and air-gapped installs on arm64 Linux now work.

Fixes for DuckDB >=1.5 behavior changes:

- S3 secret: explicit KEY_ID/SECRET now use the default `config` provider; `PROVIDER CREDENTIAL_CHAIN` with explicit credentials is rejected in 1.5.x ("Secret Validation Failure").
- csv import: the uniqueness probe uses `count(DISTINCT ...)` via SQL instead of the relational `.count('DISTINCT ...')` form, which 1.5.x's binder rejects.
- test_duckdb_json: assert on the stable DuckDBPyType `.id` (number->bigint, dict->struct) instead of the old DBAPI type-code strings.

Full suite (Java 21, duckdb 1.5.3): 744 passed, 14 skipped; remaining 5 failures are environmental (4 protobuf: no protoc; 1 kafka: Spark-session conflict in non-xdist runs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Use the TemporaryDirectory path instead of its repr for the Spark warehouse dir in create_spark_session() and import_spark().
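
The difference between the two values is easy to reproduce with the standard library. The snippet below is only an illustration of the bug class; the `spark-warehouse-` prefix is made up, and the real call sites are `create_spark_session()` and `import_spark()`.

```python
import os
import tempfile

tmp_dir = tempfile.TemporaryDirectory(prefix="spark-warehouse-")

# Formatting the object itself yields its repr, e.g.
# "<TemporaryDirectory '/tmp/spark-warehouse-abc123'>", which is not a path.
wrong = str(tmp_dir)

# The .name attribute holds the real directory path.
right = tmp_dir.name

wrong_exists = os.path.isdir(wrong)
right_exists = os.path.isdir(right)
print(wrong_exists, right_exists)

tmp_dir.cleanup()
```

Passing `wrong` as `spark.sql.warehouse.dir` would point Spark at a location named after the repr string instead of the temporary directory.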

Without a lockfile, `pyspark>=3.5.0,<5.0.0` resolved to 4.0.x on fresh installs, but the Kafka/Avro paths load `spark-sql-kafka-0-10_2.12:3.5.5` / `spark-avro_2.12:3.5.5` (Scala 2.12, Spark 3.5) jars, which fail to load on a Spark 4.x (Scala 2.13) runtime, breaking `datacontract test` against Kafka. Cap pyspark to the 3.5.x line in the kafka and databricks extras.

Full suite (Java 21, duckdb 1.5.3, pyspark 3.5.8): 745 passed, 14 skipped, 4 failed; the 4 failures are the protobuf importer tests, which require the `protoc` system binary (documented manual install).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the protoc-based importer with proto-schema-parser (pure Python, only depends on antlr4-python3-runtime). `import protobuf` no longer needs the `protoc` system binary or the protobuf runtime: imports are resolved transitively by reading `import` statements and parsing each file, and message/enum type references are linked across files by simple name (handling package-qualified and subdirectory imports).

Output is preserved byte-for-byte vs the protoc-based importer (scalar physicalType is still the protobuf type number; repeated scalars stay scalar; only repeated messages become arrays). The `protobuf` extra now declares `proto-schema-parser` instead of `protobuf`.

Dockerfile: drop the protobuf-compiler install and the protoc binary/lib copy into the runtime image, which are no longer needed.

tests/test_import_protobuf.py (incl. nested-imports and subdirs): 4 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The ibis Snowflake backend forwards connection kwargs to snowflake-connector-python, which expects `user` (not soda's `username`). Map the documented DATACONTRACT_SNOWFLAKE_USERNAME env var to `user` so it keeps working after the soda -> ibis migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
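
A minimal sketch of the alias mapping, assuming env vars are turned into connector kwargs by stripping the prefix and lowercasing the remainder. That derivation and the `sketch_snowflake_kwargs` name are assumptions made for the example; only the `DATACONTRACT_SNOWFLAKE_USERNAME` to `user` mapping comes from the commit message.

```python
from typing import Dict, Mapping

PREFIX = "DATACONTRACT_SNOWFLAKE_"


def sketch_snowflake_kwargs(env: Mapping[str, str]) -> Dict[str, str]:
    """Build connector kwargs from prefixed env vars (hypothetical helper)."""
    kwargs = {
        key[len(PREFIX):].lower(): value
        for key, value in env.items()
        if key.startswith(PREFIX)
    }
    # Keep the legacy USERNAME variable working by renaming it to `user`,
    # without overriding an explicitly provided `user`.
    username = kwargs.pop("username", None)
    if username is not None and "user" not in kwargs:
        kwargs["user"] = username
    return kwargs


example = sketch_snowflake_kwargs(
    {
        "DATACONTRACT_SNOWFLAKE_USERNAME": "alice",
        "DATACONTRACT_SNOWFLAKE_PASSWORD": "s3cret",
        "PATH": "/usr/bin",
    }
)
print(example)
```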

- protobuf import: drop the protoc system-dependency instructions (now a pure-Python parser).
- engine description: ibis + fastjsonschema instead of soda-core.
- bigquery: ADC/WIF fallback no longer described via soda's use_context_auth.
- snowflake: document env vars as snowflake-connector-python params (USERNAME kept as an alias for `user`).
- redshift: username/password via the Postgres backend; note IAM auth is not currently supported.
- impala: drop 'Soda' wording.

The `export sodacl` format is unchanged and remains documented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the hardcoded spark-sql-kafka / spark-avro coordinates (_2.12:3.5.5) with spark_connector_packages(), which reads the Spark version and Scala binary version (2.12 vs 2.13) from the installed PySpark jars. This lets the kafka extra allow PySpark 4.x (Scala 2.13) without the connector JARs mismatching the runtime. Tests use the same helper.
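
One way to derive the coordinates is to parse the Spark core jar's file name. The sketch below is not the actual `spark_connector_packages()` implementation; it assumes the bundled jar follows the `spark-core_<scala>-<spark>.jar` naming scheme and takes a list of file names as input so it runs without PySpark installed. `sketch_connector_packages` is a hypothetical name.

```python
import re
from typing import Iterable, List

# Assumed naming scheme of the jar shipped in PySpark's jars directory,
# e.g. "spark-core_2.12-3.5.5.jar".
_SPARK_CORE_JAR = re.compile(
    r"^spark-core_(?P<scala>\d+\.\d+)-(?P<spark>\d+\.\d+\.\d+)\.jar$"
)


def sketch_connector_packages(jar_names: Iterable[str]) -> List[str]:
    """Return Kafka/Avro Maven coordinates matching the runtime Spark/Scala."""
    for name in jar_names:
        match = _SPARK_CORE_JAR.match(name)
        if match:
            scala, spark = match["scala"], match["spark"]
            return [
                f"org.apache.spark:spark-sql-kafka-0-10_{scala}:{spark}",
                f"org.apache.spark:spark-avro_{scala}:{spark}",
            ]
    raise ValueError("no spark-core jar found; cannot infer Spark/Scala version")


packages = sketch_connector_packages(["py4j-0.10.9.7.jar", "spark-core_2.13-4.0.2.jar"])
print(",".join(packages))
```

The comma-joined result is the form expected by Spark's `spark.jars.packages` setting.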

The databricks path connects via a caller-provided Spark session (or the databricks SQL connector) and never calls create_spark_session(), so it loads no Kafka/Avro connector jars and isn't tied to a Scala/Spark line. Align its pyspark range with the kafka extra (<5.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…cap)

The kafka/databricks extras allow pyspark<5.0; the Kafka/Avro connector jars are derived from the runtime PySpark (Scala 2.12/2.13), so the earlier "capped to <4.0" note no longer describes the final state.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Previously every ibis check carried a SodaCL-looking pseudo-string in `implementation` with `language: null`, a leftover from the soda model. Now each check records what it actually runs:

- count-style metrics (row_count, missing_count, invalid_count), duplicates, freshness/retention and custom-SQL checks store the backend-dialect SQL (compiled via ibis.to_sql) with `language: "sql"`.
- schema checks (field_is_present, field_type) use schema introspection, so they record a short note with `language: "introspection"`.

The batched count metrics still execute as a single aggregation; each check's recorded SQL is the representative single-metric query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The direct (non-Spark) Databricks test connection only accepted a personal access token. Now that the engine connects via ibis.databricks.connect, the full connector auth surface is available, so resolve the auth method from env vars in priority order:

1. DATACONTRACT_DATABRICKS_TOKEN: personal access token (unchanged default)
2. DATACONTRACT_DATABRICKS_CLIENT_ID + _CLIENT_SECRET: OAuth service principal (M2M), the usual choice for CI/CD
3. DATACONTRACT_DATABRICKS_PROFILE: a local config profile via the Databricks SDK unified auth (parity with the Unity Catalog importer; also Azure CLI/MSI)
4. DATACONTRACT_DATABRICKS_AUTH_TYPE: explicit connector auth_type, e.g. databricks-oauth for the interactive U2M browser flow

The OAuth credential providers build their SDK Config lazily so token exchange happens at connect time, not while reading env.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
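
The priority order can be sketched as a small resolver. The env var names come from the list above; the function name, the returned labels, and the choice to raise when nothing is configured are assumptions made for this illustration, and no credentials are exchanged here.

```python
from typing import Mapping

PREFIX = "DATACONTRACT_DATABRICKS_"


def sketch_resolve_auth(env: Mapping[str, str]) -> str:
    """Pick an auth method label by checking env vars in priority order."""
    if env.get(PREFIX + "TOKEN"):
        return "token"  # 1. personal access token
    if env.get(PREFIX + "CLIENT_ID") and env.get(PREFIX + "CLIENT_SECRET"):
        return "oauth-m2m"  # 2. OAuth service principal
    if env.get(PREFIX + "PROFILE"):
        return "profile"  # 3. local config profile
    if env.get(PREFIX + "AUTH_TYPE"):
        return env[PREFIX + "AUTH_TYPE"]  # 4. explicit connector auth_type
    # Assumption: the real code may behave differently when nothing is set.
    raise ValueError("no Databricks credentials configured")


# A token wins even when OAuth client credentials are also present.
both = sketch_resolve_auth(
    {
        PREFIX + "TOKEN": "dapi-example",
        PREFIX + "CLIENT_ID": "id",
        PREFIX + "CLIENT_SECRET": "secret",
    }
)
print(both)
```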

…igration

The soda->ibis migration reduced the SQL Server connection to username/password + driver, silently dropping the documented auth and SSL env vars. With ODBC Driver 18 (encrypt + verify by default) this broke connections to servers with a self-signed certificate, including the test container (test_test_sqlserver).

Restore the documented behavior in _sqlserver_connection_kwargs, selected by DATACONTRACT_SQLSERVER_AUTHENTICATION:

- sql (default): USERNAME/PASSWORD
- windows: Trusted_Connection (Kerberos/NTLM)
- ActiveDirectoryPassword: Entra ID USERNAME/PASSWORD
- ActiveDirectoryServicePrincipal: Entra ID CLIENT_ID/CLIENT_SECRET
- ActiveDirectoryInteractive: Entra ID browser login
- cli: az login session via ActiveDirectoryDefault (ODBC Driver 18.1+)

Plus the legacy TRUSTED_CONNECTION switch (== windows, takes precedence), ENCRYPTED_CONNECTION (Encrypt=yes/no, default yes), and TRUST_SERVER_CERTIFICATE (TrustServerCertificate=yes). The auth modes that pass no credentials explicitly set Trusted_Connection to avoid ibis's integrated-auth default leaking into Entra ID / cli connections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
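
The selection logic can be sketched as a function from env vars to ODBC connection-string keywords. This is not the real `_sqlserver_connection_kwargs`. Several details are assumptions: that every variable shares the `DATACONTRACT_SQLSERVER_` prefix, that booleans are spelled `true`/`yes`/`1`, that credentials map to `UID`/`PWD`, and that the credential-less modes set `Trusted_Connection=no`.

```python
from typing import Dict, Mapping

PREFIX = "DATACONTRACT_SQLSERVER_"


def _flag(env: Mapping[str, str], name: str, default: str) -> bool:
    """Read a boolean env var; the accepted spellings are an assumption."""
    return env.get(PREFIX + name, default).strip().lower() in ("1", "true", "yes")


def sketch_sqlserver_keywords(env: Mapping[str, str]) -> Dict[str, str]:
    auth = env.get(PREFIX + "AUTHENTICATION", "sql")
    if _flag(env, "TRUSTED_CONNECTION", "false"):
        auth = "windows"  # the legacy switch takes precedence

    kw = {"Encrypt": "yes" if _flag(env, "ENCRYPTED_CONNECTION", "true") else "no"}
    if _flag(env, "TRUST_SERVER_CERTIFICATE", "false"):
        kw["TrustServerCertificate"] = "yes"

    if auth == "windows":
        kw["Trusted_Connection"] = "yes"
    elif auth in ("sql", "ActiveDirectoryPassword"):
        kw["UID"] = env[PREFIX + "USERNAME"]
        kw["PWD"] = env[PREFIX + "PASSWORD"]
        if auth != "sql":
            kw["Authentication"] = auth
    elif auth == "ActiveDirectoryServicePrincipal":
        kw["Authentication"] = auth
        kw["UID"] = env[PREFIX + "CLIENT_ID"]
        kw["PWD"] = env[PREFIX + "CLIENT_SECRET"]
    elif auth in ("ActiveDirectoryInteractive", "cli"):
        kw["Authentication"] = "ActiveDirectoryDefault" if auth == "cli" else auth
        # No credentials are passed, so switch integrated auth off explicitly.
        kw["Trusted_Connection"] = "no"
    else:
        raise ValueError(f"unsupported authentication mode: {auth}")
    return kw


self_signed = sketch_sqlserver_keywords(
    {
        PREFIX + "USERNAME": "sa",
        PREFIX + "PASSWORD": "pw",
        PREFIX + "TRUST_SERVER_CERTIFICATE": "true",
    }
)
print(self_signed)
```

The `self_signed` example corresponds to the test-container case: encryption stays on, but the self-signed certificate is trusted.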

SQL Server auth gets no entry: it restores parity with the last release (0.12.5), so there is no user-visible change relative to a shipped version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dmaresma · 2026-06-03T21:01:17Z

Wow a big roadmap.

jochenchrist · 2026-06-03T21:03:12Z

> Wow a big roadmap.

;-)

It will resolve many blockers (and will certainly cause some issues in the first new versions).

Raise requires-python to >=3.10,<3.14. The core and all non-Spark extras work on 3.13; the Spark extras resolve to PySpark 4.0 there (Spark 3.5 has no 3.13 build), and the connector jars already adapt to the runtime Spark/Scala version. create_spark_session now sets PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON to sys.executable so Spark's Python workers use the same interpreter as the driver (otherwise PySpark fails with PYTHON_VERSION_MISMATCH when PATH's python3 differs, which is common on 3.13). Full suite passes on Python 3.13.12 / PySpark 4.0.2 and on 3.11 / PySpark 3.5.8 (769 passed, 14 skipped).
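
The interpreter pinning amounts to two environment variables set before the session starts. A minimal sketch, with `pin_spark_python` as a hypothetical helper name and a plain dict standing in for the process environment:

```python
import os
import sys
from typing import MutableMapping


def pin_spark_python(environ: MutableMapping[str, str] = os.environ) -> None:
    """Point Spark's driver and workers at the interpreter running this code."""
    environ["PYSPARK_PYTHON"] = sys.executable
    environ["PYSPARK_DRIVER_PYTHON"] = sys.executable


# Demonstrated on a plain dict so the real environment is left untouched.
demo_env: dict = {}
pin_spark_python(demo_env)
print(demo_env)
```

Both variables must be set before the SparkSession is created, since Spark reads them when it launches its Python workers.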

…anges

Raise requires-python to >=3.10,<3.15 and add 3.14 to the CI test matrix. The full dependency graph resolves and installs with native 3.14 wheels within the existing version caps (no cap changes needed); notably duckdb 1.5.x, ibis-framework 12, pyspark 4.0, pydantic-core, pyarrow, numpy, and cryptography all ship 3.14 wheels. No code changes required: on 3.14 the Spark-backed extras resolve to PySpark 4.0 (Spark 3.5 has no 3.14 build), same as 3.13, and the PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON pinning added for 3.13 already keeps Spark's Python workers matched to the driver. Full suite passes on Python 3.14.3 / PySpark 4.0.2 (768 passed, 15 skipped).

… failed-row samples

Extend the ibis quality engine with four ODCS-aware capabilities:

- diagnostics: each check records structured diagnostics (metric, measured value, threshold, row count, failed fraction, and the enforced validity rule) on Check.diagnostics, surfaced in JSON and JUnit output. Removes the unused, never-populated Check.details field.
- percent thresholds: honor ODCS quality.unit: percent for the count-of-bad-rows metrics (nullValues, missingValues, invalidValues), comparing the failed fraction (0-100) of the row count against the threshold. Percent on metrics with no row-fraction meaning (rowCount, duplicateValues) warns and falls back to an absolute count.
- severity: honor ODCS quality.severity; a non-blocking severity (info, warning, low, minor, trivial) downgrades a failing quality check to a warning so it no longer fails the run. Any other severity still fails.
- failed-row samples: new `datacontract test --include-failed-samples` collects a capped (5-row) sample of offending rows for missing/invalid/duplicate checks, restricted to identifier (unique/primaryKey) plus the offending column, omitting columns whose ODCS classification marks them sensitive. Stored on Check.failed_samples and surfaced in JSON and the JUnit failure text. Local-only; needs no Soda Cloud.

Add in-process duckdb tests for diagnostics, percent/severity, and failed samples.
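
The percent and severity rules combine roughly as sketched below. This is a simplified illustration, not the engine's evaluator: it supports only a single "must be less than" threshold, treats an empty table as 0 percent, and the `sketch_evaluate` name and result labels are made up for the example.

```python
from typing import Optional

NON_BLOCKING = {"info", "warning", "low", "minor", "trivial"}
ROW_FRACTION_METRICS = {"nullValues", "missingValues", "invalidValues"}


def sketch_evaluate(
    metric: str,
    count: int,
    row_count: int,
    must_be_less_than: float,
    unit: Optional[str] = None,
    severity: Optional[str] = None,
) -> str:
    """Return 'passed', 'warning', or 'failed' for one quality check."""
    measured: float = count
    if unit == "percent" and metric in ROW_FRACTION_METRICS:
        # Compare the failed fraction (0-100) of the row count.
        # Assumption: an empty table counts as 0 percent.
        measured = 100.0 * count / row_count if row_count else 0.0
    # For other metrics, percent falls back to the absolute count.

    if measured < must_be_less_than:
        return "passed"
    if severity is not None and severity.lower() in NON_BLOCKING:
        return "warning"  # non-blocking severity downgrades the failure
    return "failed"


# 3 missing values out of 200 rows is 1.5 percent, below a 2 percent threshold.
print(sketch_evaluate("missingValues", 3, 200, 2, unit="percent"))
# 10 out of 200 is 5 percent; severity "warning" keeps the run from failing.
print(sketch_evaluate("missingValues", 10, 200, 2, unit="percent", severity="warning"))
```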

…a installation instructions

jochenchrist and others added 18 commits June 3, 2026 19:48

fix(spark): use tmp_dir.name for spark.sql.warehouse.dir

9434e91

docs(changelog): note the new Databricks authentication methods

7e909ac

jochenchrist added 9 commits June 3, 2026 23:34

docs(changelog): surface breaking changes in a section above Added

42a7060

ci: add Python 3.13 to test matrix, update changelog with breaking ch…

ba5292b

…anges

ci: add Python 3.13 to test matrix, update changelog with breaking ch…

b1ca75b

…anges

ci: add Java setup for Spark-based tests, update README

3134c32

chore(deps): remove aiobotocore dependency and update README with Jav…

0708a0c

…a installation instructions

chore(release): 1.0.0

64f4755

jochenchrist merged commit 7675c62 into main Jun 4, 2026
14 checks passed

jochenchrist deleted the replace-soda-with-ibis branch June 4, 2026 08:59

jochenchrist mentioned this pull request Jun 4, 2026

feat: support for relative (percent) metrics in data contract checks. #1248

Closed

4 tasks

jochenchrist merged 27 commits into main from replace-soda-with-ibis
