Replace soda with ibis #1279
Merged
Remove soda-core entirely as the data quality execution engine and replace it with ibis (https://ibis-project.org/), which compiles one expression API to many SQL dialects via sqlglot and reads local/remote files through DuckDB.

Motivation: soda-core v3 was an unmaintained, string-templated per-dialect SQL generator that forced a `setuptools`/`distutils.strtobool` shim, a `mysql-connector-python` override, and brittle version pinning across ~13 `soda-core-*` extras.

What changed:
- New engine-neutral check IR (`datacontract/engines/checks/`): `CheckSpec` + structured `Threshold`, and `create_checks` that enumerates an ODCS contract into specs, preserving every legacy check key/type/name.
- New ibis engine (`datacontract/engines/ibis/`): batches row/missing/invalid counts into one aggregation per model; runs dedicated queries for duplicates, schema/type, freshness/retention and user SQL; evaluates thresholds in Python. Reproduces soda's invalid_count semantics (NOT missing AND (NOT valid OR in invalid_values)). Counts use `CASE WHEN ... THEN 1 ELSE 0 END` for dialect portability (e.g. Oracle).
- Per-source ibis connection builders reusing the existing DuckDB view builder (files) and Spark/Kafka helpers; Spark sources run via the ibis pyspark backend.
- SodaCL kept but isolated: all SodaCL generation moved into `datacontract/export/sodacl_check_builder.py`, used only by `SodaExporter`. `export sodacl` is unchanged and no longer shares code with the test path.
- Removed `engines/soda/`, the old `data_contract_checks.py`, the soda config-builder tests, the setuptools shim, and all `soda-core-*` deps; pyproject extras now map to `ibis-framework[<backend>]`.
- Raw SodaCL custom checks (quality.engine: soda) now surface a migration warning instead of executing.

Verified end-to-end against testcontainers/local data for DuckDB (parquet/csv/json/s3), Postgres (full quality fixture), Trino, and Oracle; full non-DB suite passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
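To make the batched-count idea concrete, here is a minimal sketch using the standard-library `sqlite3` module rather than ibis. The table, column, and value lists are made up for illustration; only the `CASE WHEN ... THEN 1 ELSE 0 END` counting and the invalid_count rule (NOT missing AND (NOT valid OR in invalid_values)) are taken from the description above.

```python
import sqlite3

# Hypothetical model: an "orders" table with a nullable "status" column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "open"), (2, "closed"), (3, None), (4, "bogus"), (5, "deleted")],
)

# Assumed rules for the sketch:
#   valid values   = ('open', 'closed', 'deleted')
#   invalid values = ('deleted')
# All three metrics are computed in a single aggregation over the model.
BATCHED_COUNTS = """
SELECT
  COUNT(*) AS row_count,
  SUM(CASE WHEN status IS NULL THEN 1 ELSE 0 END) AS missing_count,
  SUM(CASE WHEN status IS NOT NULL
            AND (NOT (status IN ('open', 'closed', 'deleted'))
                 OR status IN ('deleted'))
           THEN 1 ELSE 0 END) AS invalid_count
FROM orders
"""

row_count, missing_count, invalid_count = conn.execute(BATCHED_COUNTS).fetchone()

# Thresholds are evaluated in Python, after the query has run.
invalid_check_passed = invalid_count == 0
print(row_count, missing_count, invalid_count, invalid_check_passed)
```

The NULL row counts as missing but not as invalid, while both the unknown value and the explicitly invalid value count as invalid.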
…xture

- execute_ibis_checks only disconnects connections the engine created. The pyspark backend wraps a caller-owned SparkSession, and an externally supplied DuckDB connection is owned by the caller; disconnecting either broke subsequent runs (e.g. the session-scoped Spark fixture shared by the two dataframe tests). Skip disposal for the pyspark backend and for caller-provided duckdb/spark resources.
- Migrate tests/fixtures/kafka to the native rowCount quality metric, replacing the removed raw SodaCL custom check, so the kafka/Spark path is exercised end-to-end.

Verified with Java 21: test_test_dataframe (x2), test_import_spark (x3) and test_test_kafka all pass via the ibis pyspark backend.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
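The ownership rule described in this commit can be sketched roughly as follows. `ManagedConnection`, `run_checks`, and `FakeConnection` are hypothetical names for illustration and are not the actual `execute_ibis_checks` code; the point is only that disposal is tied to who created the connection.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ManagedConnection:
    con: Any
    owned: bool  # True only when the engine opened the connection itself


def run_checks(managed: ManagedConnection, run: Callable[[Any], Any]) -> Any:
    """Run checks, then close the connection only if the engine owns it."""
    try:
        return run(managed.con)
    finally:
        if managed.owned:
            managed.con.close()


class FakeConnection:
    """Stand-in for a DuckDB connection or Spark session in this sketch."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


caller_con = FakeConnection()   # supplied by the caller, must stay open
engine_con = FakeConnection()   # created by the engine, must be closed

run_checks(ManagedConnection(caller_con, owned=False), lambda con: "ok")
run_checks(ManagedConnection(engine_con, owned=True), lambda con: "ok")
```

A caller-provided resource such as a session-scoped Spark fixture therefore survives across several test runs.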
Minor bump (0.x SemVer) for the breaking quality-engine replacement. Document the soda-core -> ibis migration, the dropped raw-SodaCL execution, and the soda-core dependency removal in the CHANGELOG Unreleased section. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- ibis's native MySQL backend requires `mysqlclient`, a C extension with no macOS/Linux wheels that fails to build without pkg-config + MySQL client libraries (broke `pip install -e .[dev]`). Connect to MySQL via DuckDB's `mysql` extension instead: ATTACH the database and materialize each contract model into a local DuckDB table, then run checks locally. Keeps the `mysql` extra pure-pip. Materializing avoids DuckDB MySQL-scanner pushdown errors (e.g. the grouped duplicate-count query hit a DuckDB binder assertion).
- Pin `duckdb` to `<1.1.0` to match the bundled `duckdb-extension-*` wheels (httpfs/aws/azure, pinned `<1.1.0`). Without a lockfile, fresh installs resolved duckdb 1.5.3, which mismatched those wheels (S3 "Secret Validation Failure") and changed CSV/JSON/secret behavior and mysql-extension port handling, breaking s3, csv-import, nested-json, and mysql tests.

Full suite (Java 21, duckdb 1.0.0): 744 passed, 14 skipped. Remaining 5 failures are environmental: 4 protobuf (no `protoc`), 1 kafka (pre-existing Spark-session conflict with the dataframe test in non-xdist runs; passes in isolation, skipped under xdist in CI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
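As a rough illustration of the attach-then-materialize approach, the sketch below only builds the DuckDB statements as strings; it does not execute them and needs no DuckDB install. The helper name, the `mysql_src` alias, and the sample credentials are assumptions made for the example, not the PR's actual code.

```python
def mysql_materialize_statements(host, port, user, password, database,
                                 models, alias="mysql_src"):
    """Build DuckDB SQL that attaches a MySQL database and copies each model locally."""
    dsn = f"host={host} port={port} user={user} password={password} database={database}"
    dsn_literal = dsn.replace("'", "''")  # escape single quotes for the SQL literal
    statements = [
        "INSTALL mysql",
        "LOAD mysql",
        f"ATTACH '{dsn_literal}' AS {alias} (TYPE mysql)",
    ]
    for model in models:
        ident = '"' + model.replace('"', '""') + '"'  # quoted identifier
        # Materialize into a local table so later checks never touch the scanner.
        statements.append(
            f"CREATE OR REPLACE TABLE {ident} AS SELECT * FROM {alias}.{ident}"
        )
    return statements


statements = mysql_materialize_statements(
    "localhost", 3306, "root", "secret", "shop", ["orders"]
)
for statement in statements:
    print(statement)
```

Once the tables are local, the grouped queries run entirely inside DuckDB, which is what sidesteps the pushdown errors mentioned above.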
The runtime MySQL path goes through DuckDB's mysql extension, so no Python MySQL driver is needed at install time. mysql-connector-python is only used by the MySQL test fixture to seed data, so it belongs in `dev`. The `mysql` extra is now just `datacontract-cli[duckdb]`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…wheels
Loosen the duckdb pin off the 1.0.x line. duckdb and the bundled
duckdb-extension-* wheels (httpfs/aws/azure) are bumped together to 1.5.x; the
1.5.x extension wheels ship arm64 Linux builds, so the platform skip markers are
dropped and air-gapped installs on arm64 Linux now work.
Fixes for DuckDB >=1.5 behavior changes:
- S3 secret: explicit KEY_ID/SECRET now use the default `config` provider;
`PROVIDER CREDENTIAL_CHAIN` with explicit credentials is rejected in 1.5.x
("Secret Validation Failure").
- csv import: the uniqueness probe uses `count(DISTINCT ...)` via SQL instead of
the relational `.count('DISTINCT ...')` form, which 1.5.x's binder rejects.
- test_duckdb_json: assert on the stable DuckDBPyType `.id` (number->bigint,
dict->struct) instead of the old DBAPI type-code strings.
Full suite (Java 21, duckdb 1.5.3): 744 passed, 14 skipped; remaining 5 failures
are environmental (4 protobuf: no protoc; 1 kafka: Spark-session conflict in
non-xdist runs).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Use the TemporaryDirectory path instead of its repr for the Spark warehouse dir in create_spark_session() and import_spark().
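The difference between the object's repr and its path is easy to see with the standard library alone. This is a small standalone demonstration, not the code from `create_spark_session()`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as _unused:
    pass  # the context-manager form already yields the path string

tmp = tempfile.TemporaryDirectory()
as_repr = str(tmp)   # something like "<TemporaryDirectory '/tmp/tmpab12cd'>"
as_path = tmp.name   # the real directory path

repr_is_dir = os.path.isdir(as_repr)   # the repr is not a usable path
path_is_dir = os.path.isdir(as_path)   # the .name attribute is

tmp.cleanup()
print(as_repr, repr_is_dir, path_is_dir)
```

Passing the object itself into an f-string or config value silently produces the repr, which is how this kind of bug tends to slip in.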
Without a lockfile, `pyspark>=3.5.0,<5.0.0` resolved to 4.0.x on fresh installs, but the Kafka/Avro paths load `spark-sql-kafka-0-10_2.12:3.5.5` / `spark-avro_2.12:3.5.5` (Scala 2.12, Spark 3.5) jars, which fail to load on a Spark 4.x (Scala 2.13) runtime, breaking `datacontract test` against Kafka. Cap pyspark to the 3.5.x line in the kafka and databricks extras.

Full suite (Java 21, duckdb 1.5.3, pyspark 3.5.8): 745 passed, 14 skipped, 4 failed; the 4 failures are the protobuf importer tests, which require the `protoc` system binary (documented manual install).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the protoc-based importer with proto-schema-parser (pure Python, only depends on antlr4-python3-runtime). `import protobuf` no longer needs the `protoc` system binary or the protobuf runtime: imports are resolved transitively by reading `import` statements and parsing each file, and message/enum type references are linked across files by simple name (handling package-qualified and subdirectory imports).

Output is preserved byte-for-byte vs the protoc-based importer (scalar physicalType is still the protobuf type number; repeated scalars stay scalar; only repeated messages become arrays). The `protobuf` extra now declares `proto-schema-parser` instead of `protobuf`.

Dockerfile: drop the protobuf-compiler install and the protoc binary/lib copy into the runtime image, which are no longer needed.

tests/test_import_protobuf.py (incl. nested-imports and subdirs): 4 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
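The transitive import resolution can be sketched without any parser at all. The example below uses a regular expression in place of proto-schema-parser, so it only illustrates the file-walking step; `collect_proto_files` and the sample files are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Matches: import "x.proto";  import public "x.proto";  import weak "x.proto";
IMPORT_RE = re.compile(
    r'^\s*import\s+(?:public\s+|weak\s+)?"([^"]+)"\s*;', re.MULTILINE
)


def collect_proto_files(entry: Path) -> list[Path]:
    """Return the entry file plus every file reachable through import statements."""
    seen: dict[Path, None] = {}       # dict keeps insertion order
    stack = [entry.resolve()]
    while stack:
        path = stack.pop()
        if path in seen:
            continue                   # also protects against import cycles
        seen[path] = None
        for name in IMPORT_RE.findall(path.read_text(encoding="utf-8")):
            candidate = (path.parent / name).resolve()
            if candidate.is_file():
                stack.append(candidate)
    return list(seen)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "order.proto").write_text('syntax = "proto3";\nimport "sub/item.proto";\n')
    (root / "sub" / "item.proto").write_text('syntax = "proto3";\nimport "money.proto";\n')
    (root / "sub" / "money.proto").write_text('syntax = "proto3";\n')
    found = [p.name for p in collect_proto_files(root / "order.proto")]

print(found)
```

Resolving imports relative to the importing file is a simplifying assumption here; a real importer would also need a notion of include roots.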
The ibis Snowflake backend forwards connection kwargs to snowflake-connector-python, which expects `user` (not soda's `username`). Map the documented DATACONTRACT_SNOWFLAKE_USERNAME env var to `user` so it keeps working after the soda -> ibis migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
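A minimal sketch of the alias mapping, assuming the connection kwargs are derived from `DATACONTRACT_SNOWFLAKE_*` variables by stripping the prefix and lowercasing the rest. The helper name and that derivation rule are assumptions for illustration; only the `USERNAME` to `user` mapping comes from the commit message.

```python
def snowflake_connect_kwargs(env: dict) -> dict:
    """Translate DATACONTRACT_SNOWFLAKE_* variables into connector kwargs."""
    prefix = "DATACONTRACT_SNOWFLAKE_"
    aliases = {"username": "user"}  # legacy soda-style name -> connector name
    kwargs = {}
    for key, value in env.items():
        if key.startswith(prefix):
            name = key[len(prefix):].lower()
            kwargs[aliases.get(name, name)] = value
    return kwargs


kwargs = snowflake_connect_kwargs(
    {
        "DATACONTRACT_SNOWFLAKE_USERNAME": "alice",
        "DATACONTRACT_SNOWFLAKE_PASSWORD": "s3cret",
        "DATACONTRACT_SNOWFLAKE_WAREHOUSE": "COMPUTE_WH",
        "UNRELATED": "ignored",
    }
)
print(kwargs)
```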
- protobuf import: drop the protoc system-dependency instructions (now a pure-Python parser).
- engine description: ibis + fastjsonschema instead of soda-core.
- bigquery: ADC/WIF fallback no longer described via soda's use_context_auth.
- snowflake: document env vars as snowflake-connector-python params (USERNAME kept as an alias for `user`).
- redshift: username/password via the Postgres backend; note IAM auth is not currently supported.
- impala: drop 'Soda' wording.

The `export sodacl` format is unchanged and remains documented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hardcoded spark-sql-kafka / spark-avro coordinates (_2.12:3.5.5) with spark_connector_packages(), which reads the Spark version and Scala binary version (2.12 vs 2.13) from the installed PySpark jars. This lets the kafka extra allow PySpark 4.x (Scala 2.13) without the connector JARs mismatching the runtime. Tests use the same helper.
The databricks path connects via a caller-provided Spark session (or the databricks SQL connector) and never calls create_spark_session(), so it loads no Kafka/Avro connector jars and isn't tied to a Scala/Spark line. Align its pyspark range with the kafka extra (<5.0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
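One way to derive the connector coordinates from the installed jars is to parse the `spark-core` jar file name, which encodes both the Scala binary version and the Spark version. The function below is a hypothetical stand-in for `spark_connector_packages()` that takes jar names as input instead of scanning the PySpark install, so it can run without PySpark.

```python
import re

# e.g. "spark-core_2.12-3.5.5.jar" -> Scala 2.12, Spark 3.5.5
SPARK_CORE_RE = re.compile(r"^spark-core_(\d+\.\d+)-(\d+\.\d+\.\d+)\.jar$")


def connector_packages_from_jars(jar_names: list[str]) -> list[str]:
    """Build Kafka/Avro Maven coordinates matching the runtime Spark and Scala."""
    for name in jar_names:
        match = SPARK_CORE_RE.match(name)
        if match:
            scala, spark = match.groups()
            return [
                f"org.apache.spark:spark-sql-kafka-0-10_{scala}:{spark}",
                f"org.apache.spark:spark-avro_{scala}:{spark}",
            ]
    raise RuntimeError("spark-core jar not found; cannot detect Spark/Scala version")


spark35 = connector_packages_from_jars(["py4j-0.10.9.7.jar", "spark-core_2.12-3.5.5.jar"])
spark40 = connector_packages_from_jars(["spark-core_2.13-4.0.2.jar"])
print(spark35)
print(spark40)
```

Because the coordinates follow the runtime, the same code path serves both the Spark 3.5 (Scala 2.12) and Spark 4.x (Scala 2.13) lines.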
…cap) The kafka/databricks extras allow pyspark<5.0; the Kafka/Avro connector jars are derived from the runtime PySpark (Scala 2.12/2.13), so the earlier "capped to <4.0" note no longer describes the final state. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously every ibis check carried a SodaCL-looking pseudo-string in `implementation` with `language: null`, a leftover from the soda model. Now each check records what it actually runs:
- count-style metrics (row_count, missing_count, invalid_count), duplicates, freshness/retention and custom-SQL checks store the backend-dialect SQL (compiled via ibis.to_sql) with `language: "sql"`.
- schema checks (field_is_present, field_type) use schema introspection, so they record a short note with `language: "introspection"`.

The batched count metrics still execute as a single aggregation; each check's recorded SQL is the representative single-metric query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The direct (non-Spark) Databricks test connection only accepted a personal access token. Now that the engine connects via ibis.databricks.connect, the full connector auth surface is available, so resolve the auth method from env vars in priority order:
1. DATACONTRACT_DATABRICKS_TOKEN: personal access token (unchanged default)
2. DATACONTRACT_DATABRICKS_CLIENT_ID + _CLIENT_SECRET: OAuth service principal (M2M), the usual choice for CI/CD
3. DATACONTRACT_DATABRICKS_PROFILE: a local config profile via the Databricks SDK unified auth (parity with the Unity Catalog importer; also Azure CLI/MSI)
4. DATACONTRACT_DATABRICKS_AUTH_TYPE: explicit connector auth_type, e.g. databricks-oauth for the interactive U2M browser flow

The OAuth credential providers build their SDK Config lazily so token exchange happens at connect time, not while reading env.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
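The priority order above can be sketched as a small resolver. The function name, the returned labels, and the error raised when nothing is configured are assumptions for illustration; the env var names and their precedence follow the list in this commit.

```python
def resolve_databricks_auth(env: dict) -> tuple[str, dict]:
    """Pick the Databricks auth method from env vars, first match wins."""
    p = "DATACONTRACT_DATABRICKS_"
    if env.get(p + "TOKEN"):
        return "token", {"access_token": env[p + "TOKEN"]}
    if env.get(p + "CLIENT_ID") and env.get(p + "CLIENT_SECRET"):
        return "oauth-m2m", {
            "client_id": env[p + "CLIENT_ID"],
            "client_secret": env[p + "CLIENT_SECRET"],
        }
    if env.get(p + "PROFILE"):
        return "profile", {"profile": env[p + "PROFILE"]}
    if env.get(p + "AUTH_TYPE"):
        return "auth-type", {"auth_type": env[p + "AUTH_TYPE"]}
    raise ValueError("no Databricks credentials configured")  # assumed behavior


# A token wins even when OAuth client credentials are also present.
with_token = resolve_databricks_auth(
    {
        "DATACONTRACT_DATABRICKS_TOKEN": "dapi-example",
        "DATACONTRACT_DATABRICKS_CLIENT_ID": "id",
        "DATACONTRACT_DATABRICKS_CLIENT_SECRET": "secret",
    }
)
interactive = resolve_databricks_auth(
    {"DATACONTRACT_DATABRICKS_AUTH_TYPE": "databricks-oauth"}
)
print(with_token[0], interactive)
```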
…igration

The soda->ibis migration reduced the SQL Server connection to username/password + driver, silently dropping the documented auth and SSL env vars. With ODBC Driver 18 (encrypt + verify by default) this broke connections to servers with a self-signed certificate, including the test container (test_test_sqlserver).

Restore the documented behavior in _sqlserver_connection_kwargs, selected by DATACONTRACT_SQLSERVER_AUTHENTICATION:
- sql (default): USERNAME/PASSWORD
- windows: Trusted_Connection (Kerberos/NTLM)
- ActiveDirectoryPassword: Entra ID USERNAME/PASSWORD
- ActiveDirectoryServicePrincipal: Entra ID CLIENT_ID/CLIENT_SECRET
- ActiveDirectoryInteractive: Entra ID browser login
- cli: az login session via ActiveDirectoryDefault (ODBC Driver 18.1+)

Plus the legacy TRUSTED_CONNECTION switch (== windows, takes precedence), ENCRYPTED_CONNECTION (Encrypt=yes/no, default yes), and TRUST_SERVER_CERTIFICATE (TrustServerCertificate=yes). The auth modes that pass no credentials explicitly set Trusted_Connection to avoid ibis's integrated-auth default leaking into Entra ID / cli connections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
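The selection logic might look roughly like the sketch below, which returns ODBC connection-string keywords rather than the real `_sqlserver_connection_kwargs` output. The `DATACONTRACT_SQLSERVER_` prefix on every variable, the truthy-string parsing, and `Trusted_Connection=no` for the credential-less Entra ID modes are assumptions layered on top of the commit message.

```python
def sqlserver_odbc_keywords(env: dict) -> dict:
    """Map DATACONTRACT_SQLSERVER_* variables to ODBC keywords (illustrative only)."""
    p = "DATACONTRACT_SQLSERVER_"

    def get(name, default=None):
        return env.get(p + name, default)

    def truthy(value):
        return str(value).strip().lower() in {"1", "true", "yes"}

    kw = {"Encrypt": "yes" if truthy(get("ENCRYPTED_CONNECTION", "true")) else "no"}
    if truthy(get("TRUST_SERVER_CERTIFICATE", "false")):
        kw["TrustServerCertificate"] = "yes"

    auth = get("AUTHENTICATION", "sql")
    if truthy(get("TRUSTED_CONNECTION", "false")):
        auth = "windows"  # the legacy switch takes precedence

    if auth == "windows":
        kw["Trusted_Connection"] = "yes"
    elif auth == "sql":
        kw.update(UID=get("USERNAME"), PWD=get("PASSWORD"))
    elif auth == "ActiveDirectoryPassword":
        kw.update(Authentication=auth, UID=get("USERNAME"), PWD=get("PASSWORD"))
    elif auth == "ActiveDirectoryServicePrincipal":
        kw.update(Authentication=auth, UID=get("CLIENT_ID"), PWD=get("CLIENT_SECRET"))
    elif auth == "ActiveDirectoryInteractive":
        kw.update(Authentication=auth, Trusted_Connection="no")
    elif auth == "cli":
        kw.update(Authentication="ActiveDirectoryDefault", Trusted_Connection="no")
    else:
        raise ValueError(f"unsupported authentication mode: {auth}")
    return kw


default_kw = sqlserver_odbc_keywords(
    {"DATACONTRACT_SQLSERVER_USERNAME": "sa", "DATACONTRACT_SQLSERVER_PASSWORD": "pw",
     "DATACONTRACT_SQLSERVER_TRUST_SERVER_CERTIFICATE": "true"}
)
cli_kw = sqlserver_odbc_keywords({"DATACONTRACT_SQLSERVER_AUTHENTICATION": "cli"})
print(default_kw, cli_kw)
```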
SQL Server auth gets no entry: it restores parity with the last release (0.12.5), so there is no user-visible change relative to a shipped version. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Wow, a big roadmap.
Contributor
Author
;-) It will resolve many blockers (and certainly cause some issues with the first new versions)
Raise requires-python to >=3.10,<3.14. The core and all non-Spark extras work on 3.13; the Spark extras resolve to PySpark 4.0 there (Spark 3.5 has no 3.13 build), and the connector jars already adapt to the runtime Spark/Scala version. create_spark_session now sets PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON to sys.executable so Spark's Python workers use the same interpreter as the driver (otherwise PySpark fails with PYTHON_VERSION_MISMATCH when PATH's python3 differs, which is common on 3.13). Full suite passes on Python 3.13.12 / PySpark 4.0.2 and on 3.11 / PySpark 3.5.8 (769 passed, 14 skipped).
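The interpreter pinning amounts to two environment variables set before the Spark session is created. A minimal sketch, with a hypothetical helper name and without importing PySpark:

```python
import os
import sys


def pin_spark_python_to_driver() -> None:
    """Make Spark's Python workers use the interpreter that runs the driver."""
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable


# Must run before the SparkSession is built, since workers read these at launch.
pin_spark_python_to_driver()
print(os.environ["PYSPARK_PYTHON"])
```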
Raise requires-python to >=3.10,<3.15 and add 3.14 to the CI test matrix. The full dependency graph resolves and installs with native 3.14 wheels within the existing version caps (no cap changes needed); notably duckdb 1.5.x, ibis-framework 12, pyspark 4.0, pydantic-core, pyarrow, numpy, and cryptography all ship 3.14 wheels. No code changes required: on 3.14 the Spark-backed extras resolve to PySpark 4.0 (Spark 3.5 has no 3.14 build), same as 3.13, and the PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON pinning added for 3.13 already keeps Spark's Python workers matched to the driver. Full suite passes on Python 3.14.3 / PySpark 4.0.2 (768 passed, 15 skipped).
… failed-row samples

Extend the ibis quality engine with four ODCS-aware capabilities:
- diagnostics: each check records structured diagnostics (metric, measured value, threshold, row count, failed fraction, and the enforced validity rule) on Check.diagnostics, surfaced in JSON and JUnit output. Removes the unused, never-populated Check.details field.
- percent thresholds: honor ODCS quality.unit: percent for the count-of-bad-rows metrics (nullValues, missingValues, invalidValues), comparing the failed fraction (0-100) of the row count against the threshold. Percent on metrics with no row-fraction meaning (rowCount, duplicateValues) warns and falls back to an absolute count.
- severity: honor ODCS quality.severity; a non-blocking severity (info, warning, low, minor, trivial) downgrades a failing quality check to a warning so it no longer fails the run. Any other severity still fails.
- failed-row samples: new `datacontract test --include-failed-samples` collects a capped (5-row) sample of offending rows for missing/invalid/duplicate checks, restricted to identifier (unique/primaryKey) plus the offending column, omitting columns whose ODCS classification marks them sensitive. Stored on Check.failed_samples and surfaced in JSON and the JUnit failure text. Local-only; needs no Soda Cloud.

Add in-process duckdb tests for diagnostics, percent/severity, and failed samples.
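The percent and severity rules combine as in the sketch below. It assumes a single "at most" threshold and treats an empty table as 0 percent failed; the function name, those two choices, and the result labels are illustrative rather than the engine's actual API.

```python
NON_BLOCKING_SEVERITIES = {"info", "warning", "low", "minor", "trivial"}


def evaluate_bad_row_check(bad_rows, row_count, max_allowed, unit=None, severity=None):
    """Return 'passed', 'warning', or 'failed' for a count-of-bad-rows metric."""
    if unit == "percent":
        # Failed fraction on a 0-100 scale; empty tables count as 0 (assumption).
        measured = 100.0 * bad_rows / row_count if row_count else 0.0
    else:
        measured = float(bad_rows)

    if measured <= max_allowed:
        return "passed"
    if severity is not None and severity.lower() in NON_BLOCKING_SEVERITIES:
        return "warning"   # downgraded, no longer fails the run
    return "failed"


as_percent = evaluate_bad_row_check(3, 200, 2, unit="percent")    # 1.5% <= 2%
as_count = evaluate_bad_row_check(3, 200, 2)                      # 3 rows > 2 rows
downgraded = evaluate_bad_row_check(3, 200, 1, unit="percent", severity="warning")
blocking = evaluate_bad_row_check(3, 200, 1, unit="percent", severity="critical")
print(as_percent, as_count, downgraded, blocking)
```

The same three bad rows pass under a 2 percent threshold but fail under an absolute threshold of 2, which is why the unit has to be honored explicitly.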
…a installation instructions