Replace soda with ibis #1279

Merged
jochenchrist merged 27 commits into main from replace-soda-with-ibis
Jun 4, 2026

@jochenchrist
Contributor

  • Tests pass (uv run pytest)
  • Code formatted (uv run ruff check --fix && uv run ruff format)
  • README.md updated (if relevant)
  • CHANGELOG.md entry added

jochenchrist and others added 18 commits June 3, 2026 19:48
Remove soda-core entirely as the data quality execution engine and replace
it with ibis (https://ibis-project.org/), which compiles one expression API
to many SQL dialects via sqlglot and reads local/remote files through DuckDB.

Motivation: soda-core v3 was an unmaintained, string-templated per-dialect SQL
generator that forced a `setuptools`/`distutils.strtobool` shim, a
`mysql-connector-python` override, and brittle version pinning across ~13
`soda-core-*` extras.

What changed:
- New engine-neutral check IR (`datacontract/engines/checks/`): `CheckSpec` +
  structured `Threshold`, and `create_checks` that enumerates an ODCS contract
  into specs, preserving every legacy check key/type/name.
- New ibis engine (`datacontract/engines/ibis/`): batches row/missing/invalid
  counts into one aggregation per model; runs dedicated queries for duplicates,
  schema/type, freshness/retention and user SQL; evaluates thresholds in Python.
  Reproduces soda's invalid_count semantics
  (NOT missing AND (NOT valid OR in invalid_values)). Counts use
  `CASE WHEN ... THEN 1 ELSE 0 END` for dialect portability (e.g. Oracle).
- Per-source ibis connection builders reusing the existing DuckDB view builder
  (files) and Spark/Kafka helpers; Spark sources run via the ibis pyspark
  backend.
- SodaCL kept but isolated: all SodaCL generation moved into
  `datacontract/export/sodacl_check_builder.py`, used only by `SodaExporter`.
  `export sodacl` is unchanged and no longer shares code with the test path.
- Removed `engines/soda/`, the old `data_contract_checks.py`, the soda
  config-builder tests, the setuptools shim, and all `soda-core-*` deps;
  pyproject extras now map to `ibis-framework[<backend>]`.
- Raw SodaCL custom checks (quality.engine: soda) now surface a migration
  warning instead of executing.
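
As an illustration of the batched count aggregation and the invalid_count rule described above, here is a minimal sketch. It uses the standard-library `sqlite3` module instead of ibis so it runs anywhere; the function name `batched_counts`, the `orders` table, and the sample values are hypothetical and not taken from the PR.

```python
import sqlite3


def batched_counts(conn, table, column, valid_values, invalid_values):
    """Return row, missing and invalid counts from ONE aggregation query.

    `table` and `column` are interpolated directly, so this sketch assumes
    they are trusted identifiers from the contract, not user input.
    """
    valid_marks = ", ".join("?" for _ in valid_values)
    invalid_marks = ", ".join("?" for _ in invalid_values)
    sql = f"""
        SELECT
            COUNT(*) AS row_count,
            SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS missing_count,
            -- NOT missing AND (NOT valid OR in invalid_values)
            SUM(CASE WHEN {column} IS NOT NULL
                      AND ({column} NOT IN ({valid_marks})
                           OR {column} IN ({invalid_marks}))
                     THEN 1 ELSE 0 END) AS invalid_count
        FROM {table}
    """
    row = conn.execute(sql, [*valid_values, *invalid_values]).fetchone()
    return dict(zip(("row_count", "missing_count", "invalid_count"), row))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [("active",), ("inactive",), (None,), ("bogus",), ("deleted",)],
)

counts = batched_counts(
    conn, "orders", "status",
    valid_values=["active", "inactive", "deleted"],
    invalid_values=["deleted"],
)

# Thresholds are evaluated in Python, after the single query has run.
missing_ok = counts["missing_count"] <= 1
invalid_ok = counts["invalid_count"] == 0
```

Here `bogus` counts as invalid because it is not in the valid set, and `deleted` counts as invalid because it is explicitly listed in `invalid_values`; the NULL row is counted as missing only.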

Verified end-to-end against testcontainers/local data for DuckDB (parquet/csv/
json/s3), Postgres (full quality fixture), Trino, and Oracle; full non-DB suite
passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xture

- execute_ibis_checks only disconnects connections the engine created. The
  pyspark backend wraps a caller-owned SparkSession, and an externally supplied
  DuckDB connection is owned by the caller; disconnecting either broke
  subsequent runs (e.g. the session-scoped Spark fixture shared by the two
  dataframe tests). Skip disposal for the pyspark backend and for
  caller-provided duckdb/spark resources.
- Migrate tests/fixtures/kafka to the native rowCount quality metric, replacing
  the removed raw SodaCL custom check, so the kafka/Spark path is exercised
  end-to-end.
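
The ownership rule in the first item can be sketched as a context manager that closes only the connections it opened itself. This is a simplified illustration using `sqlite3`; the name `engine_connection` is hypothetical and the real `execute_ibis_checks` logic may differ.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def engine_connection(external=None):
    """Yield a connection; dispose of it only if this function created it."""
    owned = external is None
    conn = sqlite3.connect(":memory:") if owned else external
    try:
        yield conn
    finally:
        if owned:
            # Engine-created resource: safe to close.
            conn.close()
        # Caller-provided resource: left open for subsequent runs.
```

A caller-supplied connection therefore survives two consecutive runs, which is the situation the session-scoped Spark fixture depends on.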

Verified with Java 21: test_test_dataframe (x2), test_import_spark (x3) and
test_test_kafka all pass via the ibis pyspark backend.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Minor bump (0.x SemVer) for the breaking quality-engine replacement.
Document the soda-core -> ibis migration, the dropped raw-SodaCL execution,
and the soda-core dependency removal in the CHANGELOG Unreleased section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- ibis's native MySQL backend requires `mysqlclient`, a C extension with no
  macOS/Linux wheels that fails to build without pkg-config + MySQL client
  libraries (broke `pip install -e .[dev]`). Connect to MySQL via DuckDB's
  `mysql` extension instead: ATTACH the database and materialize each contract
  model into a local DuckDB table, then run checks locally. Keeps the `mysql`
  extra pure-pip. Materializing avoids DuckDB MySQL-scanner pushdown errors
  (e.g. the grouped duplicate-count query hit a DuckDB binder assertion).
- Pin `duckdb` to `<1.1.0` to match the bundled `duckdb-extension-*` wheels
  (httpfs/aws/azure, pinned `<1.1.0`). Without a lockfile, fresh installs
  resolved duckdb 1.5.3, which mismatched those wheels (S3 "Secret Validation
  Failure") and changed CSV/JSON/secret behavior and mysql-extension port
  handling — breaking s3, csv-import, nested-json, and mysql tests.
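
A rough sketch of the ATTACH-and-materialize approach from the first item, expressed as the DuckDB statements one might issue. The helper name `mysql_materialize_statements`, the `mysql_src` alias, and the sample host and model names are hypothetical; credentials are omitted, and the PR's actual statements may differ.

```python
def mysql_materialize_statements(host, port, user, database, models,
                                 alias="mysql_src"):
    """Build DuckDB SQL that attaches MySQL and copies each model locally."""
    dsn = f"host={host} port={port} user={user} database={database}"
    statements = [
        "INSTALL mysql",
        "LOAD mysql",
        f"ATTACH '{dsn}' AS {alias} (TYPE mysql)",
    ]
    for model in models:
        # Materialize into a local table so checks never push queries
        # down through the MySQL scanner.
        statements.append(
            f'CREATE OR REPLACE TABLE "{model}" AS '
            f'SELECT * FROM {alias}."{model}"'
        )
    return statements


stmts = mysql_materialize_statements(
    "localhost", 3306, "root", "shop", ["orders", "line_items"]
)
```

Each statement would then be passed to a DuckDB connection's `execute` in order, after which the checks run against the local copies.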

Full suite (Java 21, duckdb 1.0.0): 744 passed, 14 skipped. Remaining 5
failures are environmental: 4 protobuf (no `protoc`), 1 kafka (pre-existing
Spark-session conflict with the dataframe test in non-xdist runs; passes in
isolation, skipped under xdist in CI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The runtime MySQL path goes through DuckDB's mysql extension, so no Python
MySQL driver is needed at install time. mysql-connector-python is only used by
the MySQL test fixture to seed data, so it belongs in `dev`. The `mysql` extra
is now just `datacontract-cli[duckdb]`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…wheels

Loosen the duckdb pin off the 1.0.x line. duckdb and the bundled
duckdb-extension-* wheels (httpfs/aws/azure) are bumped together to 1.5.x; the
1.5.x extension wheels ship arm64 Linux builds, so the platform skip markers are
dropped and air-gapped installs on arm64 Linux now work.

Fixes for DuckDB >=1.5 behavior changes:
- S3 secret: explicit KEY_ID/SECRET now use the default `config` provider;
  `PROVIDER CREDENTIAL_CHAIN` with explicit credentials is rejected in 1.5.x
  ("Secret Validation Failure").
- csv import: the uniqueness probe uses `count(DISTINCT ...)` via SQL instead of
  the relational `.count('DISTINCT ...')` form, which 1.5.x's binder rejects.
- test_duckdb_json: assert on the stable DuckDBPyType `.id` (number->bigint,
  dict->struct) instead of the old DBAPI type-code strings.

Full suite (Java 21, duckdb 1.5.3): 744 passed, 14 skipped; remaining 5 failures
are environmental (4 protobuf: no protoc; 1 kafka: Spark-session conflict in
non-xdist runs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Use the TemporaryDirectory path instead of its repr for the Spark warehouse dir
in create_spark_session() and import_spark().
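
The bug class fixed here is easy to reproduce with the standard library alone: converting a `TemporaryDirectory` object to a string yields its repr, not a usable path. The prefix and variable names below are illustrative only.

```python
import os
import tempfile

tmp_dir = tempfile.TemporaryDirectory(prefix="spark-warehouse-")

as_repr = str(tmp_dir)   # looks like "<TemporaryDirectory '/tmp/...'>"
as_path = tmp_dir.name   # the actual directory path

repr_exists = os.path.isdir(as_repr)
path_exists = os.path.isdir(as_path)

# A Spark config would take the real path, e.g.:
spark_conf = {"spark.sql.warehouse.dir": as_path}

tmp_dir.cleanup()
```
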
Without a lockfile, `pyspark>=3.5.0,<5.0.0` resolved to 4.0.x on fresh installs,
but the Kafka/Avro paths load `spark-sql-kafka-0-10_2.12:3.5.5` /
`spark-avro_2.12:3.5.5` (Scala 2.12, Spark 3.5) jars, which fail to load on a
Spark 4.x (Scala 2.13) runtime — breaking `datacontract test` against Kafka.
Cap pyspark to the 3.5.x line in the kafka and databricks extras.

Full suite (Java 21, duckdb 1.5.3, pyspark 3.5.8): 745 passed, 14 skipped, 4
failed; the 4 failures are the protobuf importer tests, which require the
`protoc` system binary (documented manual install).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the protoc-based importer with proto-schema-parser (pure Python, only
depends on antlr4-python3-runtime). `import protobuf` no longer needs the
`protoc` system binary or the protobuf runtime: imports are resolved transitively
by reading `import` statements and parsing each file, and message/enum type
references are linked across files by simple name (handling package-qualified
and subdirectory imports).
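
To illustrate the transitive import resolution described above, here is a minimal sketch that follows `import` statements with a regular expression. It deliberately does not use proto-schema-parser, and the function name `resolve_imports` and the lookup order (contract root first, then the importing file's directory) are assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path

# Matches: import "x.proto";  import public "x.proto";  import weak "x.proto";
IMPORT_RE = re.compile(
    r'^\s*import\s+(?:public\s+|weak\s+)?"([^"]+)"\s*;', re.MULTILINE
)


def resolve_imports(entry: Path, root: Path) -> list[Path]:
    """Return the entry file plus every file it imports, transitively."""
    seen: dict[Path, None] = {}
    stack = [entry.resolve()]
    while stack:
        path = stack.pop()
        if path in seen:
            continue  # already parsed; also guards against import cycles
        seen[path] = None
        for rel in IMPORT_RE.findall(path.read_text()):
            for base in (root, path.parent):
                candidate = (base / rel).resolve()
                if candidate.is_file():
                    stack.append(candidate)
                    break
    return list(seen)


# Small demonstration with a subdirectory import.
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "sub").mkdir()
    (root / "a.proto").write_text(
        'syntax = "proto3";\nimport "sub/b.proto";\nmessage A {}\n'
    )
    (root / "sub" / "b.proto").write_text(
        'syntax = "proto3";\nimport "c.proto";\nmessage B {}\n'
    )
    (root / "sub" / "c.proto").write_text('syntax = "proto3";\nmessage C {}\n')
    found_names = sorted(p.name for p in resolve_imports(root / "a.proto", root))
```
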

Output is preserved byte-for-byte vs the protoc-based importer (scalar
physicalType is still the protobuf type number; repeated scalars stay scalar;
only repeated messages become arrays). The `protobuf` extra now declares
`proto-schema-parser` instead of `protobuf`.

Dockerfile: drop the protobuf-compiler install and the protoc binary/lib copy
into the runtime image — no longer needed.

tests/test_import_protobuf.py (incl. nested-imports and subdirs): 4 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ibis Snowflake backend forwards connection kwargs to
snowflake-connector-python, which expects `user` (not soda's `username`).
Map the documented DATACONTRACT_SNOWFLAKE_USERNAME env var to `user` so it
keeps working after the soda -> ibis migration.
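
The alias can be pictured as a small env-to-kwargs translation. In this sketch the function name `snowflake_connect_kwargs` and the generic "strip prefix, lowercase" rule are assumptions; only the `USERNAME` to `user` mapping comes from the commit message.

```python
PREFIX = "DATACONTRACT_SNOWFLAKE_"
ALIASES = {"username": "user"}  # soda-era name -> connector parameter


def snowflake_connect_kwargs(env):
    """Translate DATACONTRACT_SNOWFLAKE_* variables into connector kwargs."""
    kwargs = {}
    for key, value in env.items():
        if not key.startswith(PREFIX):
            continue
        name = key[len(PREFIX):].lower()
        kwargs[ALIASES.get(name, name)] = value
    return kwargs


example_kwargs = snowflake_connect_kwargs({
    "DATACONTRACT_SNOWFLAKE_USERNAME": "alice",
    "DATACONTRACT_SNOWFLAKE_WAREHOUSE": "COMPUTE_WH",
    "UNRELATED": "ignored",
})
```
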

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- protobuf import: drop the protoc system-dependency instructions (now a
  pure-Python parser).
- engine description: ibis + fastjsonschema instead of soda-core.
- bigquery: ADC/WIF fallback no longer described via soda's use_context_auth.
- snowflake: document env vars as snowflake-connector-python params (USERNAME
  kept as an alias for `user`).
- redshift: username/password via the Postgres backend; note IAM auth is not
  currently supported.
- impala: drop 'Soda' wording.

The `export sodacl` format is unchanged and remains documented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hardcoded spark-sql-kafka / spark-avro coordinates (_2.12:3.5.5)
with spark_connector_packages(), which reads the Spark version and Scala binary
version (2.12 vs 2.13) from the installed PySpark jars. This lets the kafka
extra allow PySpark 4.x (Scala 2.13) without the connector JARs mismatching the
runtime. Tests use the same helper.
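
One way such a helper could derive the coordinates is by parsing the `spark-core` jar file name that PySpark bundles. The sketch below takes a list of jar names so it runs without PySpark installed; `sketch_connector_packages` is a hypothetical stand-in, not the PR's `spark_connector_packages()`.

```python
import re

# Bundled jars are named like spark-core_<scala>-<spark>.jar
JAR_RE = re.compile(
    r"^spark-core_(?P<scala>\d+\.\d+)-(?P<spark>\d+\.\d+\.\d+)\.jar$"
)


def sketch_connector_packages(jar_names):
    """Derive Kafka/Avro connector coordinates matching the runtime jars."""
    for name in jar_names:
        match = JAR_RE.match(name)
        if match:
            scala, spark = match["scala"], match["spark"]
            return [
                f"org.apache.spark:spark-sql-kafka-0-10_{scala}:{spark}",
                f"org.apache.spark:spark-avro_{scala}:{spark}",
            ]
    raise ValueError("no spark-core jar found; cannot infer versions")


# In practice the names would come from listing the `jars` directory
# inside the installed pyspark package.
spark3 = sketch_connector_packages(["py4j.jar", "spark-core_2.12-3.5.5.jar"])
spark4 = sketch_connector_packages(["spark-core_2.13-4.0.2.jar"])
```
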
The databricks path connects via a caller-provided Spark session (or the
databricks SQL connector) and never calls create_spark_session(), so it loads
no Kafka/Avro connector jars and isn't tied to a Scala/Spark line. Align its
pyspark range with the kafka extra (<5.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cap)

The kafka/databricks extras allow pyspark<5.0; the Kafka/Avro connector jars are
derived from the runtime PySpark (Scala 2.12/2.13), so the earlier "capped to
<4.0" note no longer describes the final state.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously every ibis check carried a SodaCL-looking pseudo-string in
`implementation` with `language: null`, a leftover from the soda model. Now each
check records what it actually runs:

- count-style metrics (row_count, missing_count, invalid_count), duplicates,
  freshness/retention and custom-SQL checks store the backend-dialect SQL
  (compiled via ibis.to_sql) with `language: "sql"`.
- schema checks (field_is_present, field_type) use schema introspection, so they
  record a short note with `language: "introspection"`.

The batched count metrics still execute as a single aggregation; each check's
recorded SQL is the representative single-metric query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The direct (non-Spark) Databricks test connection only accepted a personal
access token. Now that the engine connects via ibis.databricks.connect, the
full connector auth surface is available, so resolve the auth method from env
vars in priority order:

1. DATACONTRACT_DATABRICKS_TOKEN — personal access token (unchanged default)
2. DATACONTRACT_DATABRICKS_CLIENT_ID + _CLIENT_SECRET — OAuth service principal
   (M2M), the usual choice for CI/CD
3. DATACONTRACT_DATABRICKS_PROFILE — a local config profile via the Databricks
   SDK unified auth (parity with the Unity Catalog importer; also Azure CLI/MSI)
4. DATACONTRACT_DATABRICKS_AUTH_TYPE — explicit connector auth_type, e.g.
   databricks-oauth for the interactive U2M browser flow

The OAuth credential providers build their SDK Config lazily so token exchange
happens at connect time, not while reading env.
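
The priority order above can be sketched as a plain lookup over the environment. The function name `resolve_databricks_auth`, the returned labels, and raising when nothing is set are assumptions for illustration; the full name `DATACONTRACT_DATABRICKS_CLIENT_SECRET` is inferred from the `_CLIENT_SECRET` shorthand in the list.

```python
def resolve_databricks_auth(env):
    """Pick the auth method from env vars, highest priority first."""
    prefix = "DATACONTRACT_DATABRICKS_"
    if env.get(prefix + "TOKEN"):
        return "token"
    if env.get(prefix + "CLIENT_ID") and env.get(prefix + "CLIENT_SECRET"):
        return "oauth-m2m"
    if env.get(prefix + "PROFILE"):
        return "profile"
    if env.get(prefix + "AUTH_TYPE"):
        # Explicit connector auth_type, e.g. "databricks-oauth".
        return env[prefix + "AUTH_TYPE"]
    raise ValueError("no Databricks credentials configured")
```

Because the token check comes first, setting both a token and a client ID/secret pair still selects the token, which keeps existing setups unchanged.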

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…igration

The soda->ibis migration reduced the SQL Server connection to username/password
+ driver, silently dropping the documented auth and SSL env vars. With ODBC
Driver 18 (encrypt + verify by default) this broke connections to servers with
a self-signed certificate, including the test container (test_test_sqlserver).

Restore the documented behavior in _sqlserver_connection_kwargs, selected by
DATACONTRACT_SQLSERVER_AUTHENTICATION:

- sql (default): USERNAME/PASSWORD
- windows: Trusted_Connection (Kerberos/NTLM)
- ActiveDirectoryPassword: Entra ID USERNAME/PASSWORD
- ActiveDirectoryServicePrincipal: Entra ID CLIENT_ID/CLIENT_SECRET
- ActiveDirectoryInteractive: Entra ID browser login
- cli: az login session via ActiveDirectoryDefault (ODBC Driver 18.1+)

Plus the legacy TRUSTED_CONNECTION switch (== windows, takes precedence),
ENCRYPTED_CONNECTION (Encrypt=yes/no, default yes), and
TRUST_SERVER_CERTIFICATE (TrustServerCertificate=yes). The auth modes that pass
no credentials explicitly set Trusted_Connection to avoid ibis's integrated-auth
default leaking into Entra ID / cli connections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SQL Server auth gets no entry: it restores parity with the last release
(0.12.5), so there is no user-visible change relative to a shipped version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dmaresma
Contributor

dmaresma commented Jun 3, 2026

Wow a big roadmap.

@jochenchrist
Contributor Author

> Wow a big roadmap.

;-)

It will resolve many blockers (and certainly cause some issues with the first new versions).

Raise requires-python to >=3.10,<3.14. The core and all non-Spark extras work
on 3.13; the Spark extras resolve to PySpark 4.0 there (Spark 3.5 has no 3.13
build), and the connector jars already adapt to the runtime Spark/Scala version.

create_spark_session now sets PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON to
sys.executable so Spark's Python workers use the same interpreter as the driver
(otherwise PySpark fails with PYTHON_VERSION_MISMATCH when PATH's python3
differs, which is common on 3.13).
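
The interpreter pinning amounts to two environment variables set before the session starts. A minimal sketch, with a hypothetical helper name `pin_spark_python` and an injectable mapping so it can be tried without Spark:

```python
import os
import sys


def pin_spark_python(env=os.environ):
    """Make Spark's Python workers use the same interpreter as the driver."""
    env["PYSPARK_PYTHON"] = sys.executable
    env["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return env


pinned = pin_spark_python({})
```
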

Full suite passes on Python 3.13.12 / PySpark 4.0.2 and on 3.11 / PySpark 3.5.8
(769 passed, 14 skipped).

Raise requires-python to >=3.10,<3.15 and add 3.14 to the CI test matrix.
The full dependency graph resolves and installs with native 3.14 wheels
within the existing version caps (no cap changes needed); notably duckdb
1.5.x, ibis-framework 12, pyspark 4.0, pydantic-core, pyarrow, numpy, and
cryptography all ship 3.14 wheels.

No code changes required: on 3.14 the Spark-backed extras resolve to
PySpark 4.0 (Spark 3.5 has no 3.14 build), same as 3.13, and the
PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON pinning added for 3.13 already keeps
Spark's Python workers matched to the driver.

Full suite passes on Python 3.14.3 / PySpark 4.0.2 (768 passed, 15 skipped).
… failed-row samples

Extend the ibis quality engine with four ODCS-aware capabilities:

- diagnostics: each check records structured diagnostics (metric, measured
  value, threshold, row count, failed fraction, and the enforced validity
  rule) on Check.diagnostics, surfaced in JSON and JUnit output. Removes the
  unused, never-populated Check.details field.

- percent thresholds: honor ODCS quality.unit: percent for the
  count-of-bad-rows metrics (nullValues, missingValues, invalidValues),
  comparing the failed fraction (0-100) of the row count against the
  threshold. Percent on metrics with no row-fraction meaning (rowCount,
  duplicateValues) warns and falls back to an absolute count.

- severity: honor ODCS quality.severity; a non-blocking severity (info,
  warning, low, minor, trivial) downgrades a failing quality check to a
  warning so it no longer fails the run. Any other severity still fails.

- failed-row samples: new `datacontract test --include-failed-samples`
  collects a capped (5-row) sample of offending rows for missing/invalid/
  duplicate checks, restricted to identifier (unique/primaryKey) plus the
  offending column, omitting columns whose ODCS classification marks them
  sensitive. Stored on Check.failed_samples and surfaced in JSON and the
  JUnit failure text. Local-only; needs no Soda Cloud.

Add in-process duckdb tests for diagnostics, percent/severity, and failed
samples.
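
The percent-threshold and severity rules can be sketched together as one evaluation function. This is an illustration only: `evaluate_check` is a hypothetical name, a single less-than operator stands in for the full set of threshold operators, treating an empty table as 0 percent is an assumption, and the warning emitted for percent on non-fraction metrics is reduced to a comment.

```python
NON_BLOCKING_SEVERITIES = {"info", "warning", "low", "minor", "trivial"}
ROW_FRACTION_METRICS = {"nullValues", "missingValues", "invalidValues"}


def evaluate_check(metric, value, row_count, must_be_less_than,
                   unit=None, severity=None):
    """Return "passed", "warning" or "failed" for one quality check."""
    if unit == "percent" and metric in ROW_FRACTION_METRICS:
        # Compare the failed fraction (0-100) of the row count.
        measured = 100.0 * value / row_count if row_count else 0.0
    else:
        # rowCount / duplicateValues: percent has no row-fraction meaning,
        # so fall back to the absolute count (the engine also warns here).
        measured = value
    if measured < must_be_less_than:
        return "passed"
    if severity in NON_BLOCKING_SEVERITIES:
        return "warning"  # downgraded: does not fail the run
    return "failed"
```

For example, 10 missing values out of 200 rows is 5 percent, which fails a "less than 2 percent" threshold unless the check carries a non-blocking severity.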
@jochenchrist jochenchrist merged commit 7675c62 into main Jun 4, 2026
14 checks passed

2 participants