1 change: 1 addition & 0 deletions README.md
@@ -3,6 +3,7 @@
## Metaflow API Docs

- [BatchInferencePipeline](docs/metaflow/batch_inference_pipeline.md)
- [create_ownership_registry_view](docs/metaflow/create_ownership_registry_view.md)
- [make_pydantic_parser_fn](docs/metaflow/make_pydantic_parser_fn.md)
- [publish](docs/metaflow/publish.md)
- [publish_pandas](docs/metaflow/publish_pandas.md)
46 changes: 46 additions & 0 deletions docs/metaflow/create_ownership_registry_view.md
@@ -0,0 +1,46 @@
# `create_ownership_registry_view`

Source: `ds_platform_utils.metaflow.registry.create_ownership_registry_view`

Creates (or replaces) the central **table-ownership registry view**,
`PATTERN_DB.DATA_SCIENCE.TABLE_OWNERSHIP_REGISTRY`. The view pivots the object tags
applied by [`publish`](publish.md) / [`publish_pandas`](publish_pandas.md) into one row
per table, exposing `owner`, `team`, `domain`, `project`, `status`, `sla` and `contact`.

This is a one-time admin helper.

## Signature

```python
create_ownership_registry_view(conn: SnowflakeConnection | None = None) -> str
```

| Parameter | Type | Required | Description |
| --------- | ----------------------------- | -------: | ------------------------------------------------------------------------ |
| `conn` | `SnowflakeConnection \| None` | No | Open Snowflake connection. If omitted, one is created via `get_snowflake_connection()`. |

**Returns:** the executed `CREATE OR REPLACE VIEW` SQL string.

## Usage

```python
from ds_platform_utils.metaflow import create_ownership_registry_view

create_ownership_registry_view()
```

Then query it:

```sql
SELECT * FROM PATTERN_DB.DATA_SCIENCE.TABLE_OWNERSHIP_REGISTRY
ORDER BY team, table_name;
```

## Notes

- **No refresh needed.** A view is not materialized: it re-runs its query on every read,
so it always reflects its source data as of query time (subject to the lag below).
- **~2h lag.** The view reads `SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES`, which itself lags
up to ~2 hours. For the current value of a single table's tag, use
`SYSTEM$GET_TAG('PATTERN_DB.DATA_SCIENCE.TABLE_OWNER', '<table>', 'table')` instead.
- **Adoption-based.** Only tables that have at least one ownership tag appear in the view.
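
To make the pivot concrete, here is a minimal pure-Python sketch of the reshaping the view performs. It touches neither the library nor Snowflake: the `(table_name, tag_name, tag_value)` tuples are a hypothetical stand-in for rows of `TAG_REFERENCES`, and `pivot_tags` and `TAG_TO_COLUMN` are illustrative names, not part of `ds_platform_utils`.

```python
from collections import defaultdict

# Registry columns listed above, keyed by the tag assumed to feed each one.
TAG_TO_COLUMN = {
    "TABLE_OWNER": "owner",
    "TABLE_TEAM": "team",
    "TABLE_DOMAIN": "domain",
    "TABLE_PROJECT": "project",
    "TABLE_STATUS": "status",
    "TABLE_SLA": "sla",
    "TABLE_CONTACT": "contact",
}


def pivot_tags(rows):
    """Pivot (table_name, tag_name, tag_value) rows into one dict per table.

    Hypothetical helper: it mirrors the shape of the view's output, not its SQL.
    """
    by_table = defaultdict(dict)
    for table_name, tag_name, tag_value in rows:
        column = TAG_TO_COLUMN.get(tag_name)
        if column is None:
            continue  # skip tags that are not ownership tags
        by_table[table_name][column] = tag_value

    # Tables without any ownership tag never enter by_table, so they are
    # absent from the result, matching the "adoption-based" note.
    return [
        {"table_name": name, **{col: tags.get(col) for col in TAG_TO_COLUMN.values()}}
        for name, tags in sorted(by_table.items())
    ]


rows = [
    ("OUT_OF_STOCK_ADS", "TABLE_OWNER", "ds-advertising-team"),
    ("OUT_OF_STOCK_ADS", "TABLE_TEAM", "data-science"),
    ("OUT_OF_STOCK_ADS", "TABLE_SLA", "daily"),
    ("OTHER_TABLE", "SOME_UNRELATED_TAG", "x"),
]
registry = pivot_tags(rows)
```

In this sketch, tags that were never set come back as `None`, and `OTHER_TABLE` is dropped because it carries no ownership tag.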
59 changes: 59 additions & 0 deletions docs/metaflow/publish.md
@@ -14,6 +14,7 @@ publish(
ctx: dict[str, Any] | None = None,
warehouse: Literal["XS", "MED", "XL"] | None = None,
use_utc: bool = True,
tags: dict[str, str] | None = None,
) -> None
```

@@ -22,6 +23,8 @@ publish(
- Reads SQL from a string or `.sql` path.
- Runs write/audit/publish operations through Snowflake.
- Adds operation details and table links to the Metaflow card when available.
- **Automatically applies ownership object tags to production tables** (see
[Ownership tags](#ownership-tags) below).

## Parameters

@@ -33,6 +36,7 @@ publish(
| `ctx` | `dict[str, Any] \| None` | No | Optional template substitution context for SQL operations. |
| `warehouse` | `Literal["XS", "MED", "XL"] \| None` | No | Snowflake warehouse override for this operation. Supports `XS`/`MED`/`XL` shortcuts or a full warehouse name. |
| `use_utc` | `bool` | No | If `True`, uses UTC timezone for Snowflake session. |
| `tags` | `dict[str, str] \| None` | No | Overrides for the ownership object tags applied to the published table. See [Ownership tags](#ownership-tags).|

**Returns:** `None`

@@ -47,3 +51,58 @@ publish(
audits=["SELECT COUNT(*) > 0 FROM PATTERN_DB.{{schema}}.{{table_name}}"],
)
```

## Ownership tags

When publishing to **production**, `publish()` automatically applies the table-ownership
object tags from the table-ownership RFC. The seven tags are:

| Tag | Source | Always set? |
| --------------- | ------------------------------------------------------- | --------------- |
| `TABLE_OWNER` | `ds.owner` flow tag, else owning-team alias derived from the domain (`ds-<domain>-team`), else `unknown` | yes |
| `TABLE_TEAM` | `data-science` | yes |
| `TABLE_DOMAIN` | `ds.domain` Metaflow tag, else `unknown` | yes |
| `TABLE_PROJECT` | `ds.project` Metaflow tag, else `unknown` | yes |
| `TABLE_STATUS` | `active` (override allows `active`/`development`/`testing`/`deprecated`/`archived`/`retired`) | yes |
| `TABLE_SLA` | override only (`streaming`/`realtime`/`hourly`/`daily`/`weekly`/`monthly`/`quarterly`/`ad_hoc`/`on_demand`) | only if given |
| `TABLE_CONTACT` | override only (Slack channel or email) | only if given |

> **`TABLE_OWNER` is not the run user.** Owner is resolved by priority:
> (1) an explicit `tags={"owner": ...}` override, else
> (2) the **`ds.owner`** Metaflow flow tag (set in CI alongside `ds.domain`/`ds.project`), else
> (3) the owning-team alias `ds-<domain>-team` when the domain is known (e.g. domain
> `advertising` → `ds-advertising-team`), else (4) `unknown`. We don't use `current.username`,
> because on deployed/argo runs it resolves to a service identity (`argo-workflows`) rather
> than a person. Set `ds.owner` on the flow for a per-flow owner, or pass `tags={"owner": ...}`
> per call.

> **`TABLE_DOMAIN` / `TABLE_PROJECT` depend on flow tags.** These are read from the
> `ds.domain` / `ds.project` Metaflow tags. If a flow runs without them, the value falls
> back to the literal string `unknown` and a warning is printed (the same warning used
> for select.dev cost tracking). Make sure your flow carries `--tag "ds.domain:..."` and
> `--tag "ds.project:..."` — these are applied automatically in CI and the standard `poe`
> run commands in the monorepo — or pass `tags={"domain": ..., "project": ...}` explicitly.
> Note: because the owner fallback is derived from the domain, a missing domain also means
> `TABLE_OWNER` falls back to `unknown` unless `ds.owner` or an `owner` override is set.
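
The resolution order described in the two notes above can be sketched as a small pure-Python function. This is an illustration only: `resolve_ownership` is a hypothetical name, `flow_tags` is assumed to be a dict already parsed from Metaflow tags such as `ds.domain:advertising`, and using the resolved domain (override included) for the team alias is an assumption; the library's internals may differ.

```python
def resolve_ownership(flow_tags, overrides=None):
    """Resolve owner/domain/project following the documented priority.

    flow_tags: dict such as {"ds.domain": "advertising"} (assumed shape).
    overrides: the optional tags= dict passed to publish().
    """
    overrides = overrides or {}

    # Domain and project: explicit override, else flow tag, else "unknown".
    domain = overrides.get("domain") or flow_tags.get("ds.domain") or "unknown"
    project = overrides.get("project") or flow_tags.get("ds.project") or "unknown"

    if overrides.get("owner"):            # (1) explicit override
        owner = overrides["owner"]
    elif flow_tags.get("ds.owner"):       # (2) ds.owner flow tag
        owner = flow_tags["ds.owner"]
    elif domain != "unknown":             # (3) owning-team alias
        owner = f"ds-{domain}-team"
    else:                                 # (4) nothing known
        owner = "unknown"

    return {"owner": owner, "domain": domain, "project": project}


with_domain = resolve_ownership({"ds.domain": "advertising", "ds.project": "oos"})
untagged = resolve_ownership({})
```

Here `with_domain["owner"]` resolves to `ds-advertising-team`, while every field of `untagged` is `unknown`, which is the case the warning is meant to catch.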

Pass `tags=` to override any value. Keys may be `owner`/`team`/`domain`/`project`/
`status`/`sla`/`contact` (optionally `TABLE_`-prefixed):

```python
publish(
    table_name="OUT_OF_STOCK_ADS",
    query="sql/create_training_data.sql",
    tags={"sla": "daily", "contact": "#ds-recsys", "status": "active"},
)
```
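
Since keys may be given with or without the `TABLE_` prefix, the two spellings presumably collapse to the same key internally. A minimal sketch of that normalization follows; `normalize_tag_keys` is a hypothetical helper, and both the case-insensitive matching and the rejection of unknown keys are assumptions rather than documented behavior.

```python
ALLOWED_KEYS = {"owner", "team", "domain", "project", "status", "sla", "contact"}


def normalize_tag_keys(tags):
    """Map keys like "TABLE_SLA" or "sla" to the short form "sla".

    Assumptions: keys are matched case-insensitively, and unknown keys raise
    ValueError. The docs only state that the TABLE_ prefix is optional.
    """
    normalized = {}
    for key, value in (tags or {}).items():
        short = key.lower()
        if short.startswith("table_"):
            short = short[len("table_"):]
        if short not in ALLOWED_KEYS:
            raise ValueError(f"Unknown ownership tag key: {key!r}")
        normalized[short] = value
    return normalized


overrides = normalize_tag_keys({"TABLE_SLA": "daily", "contact": "#ds-recsys"})
```

With this sketch, `overrides` holds the short keys `sla` and `contact`, so later steps only need to handle one spelling.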

Notes:

- Tags are applied **only to production tables** (`DATA_SCIENCE`). Non-prod
(`DATA_SCIENCE_STAGE`) runs apply no tags. The publishing role needs `APPLY` on the tags.
- The tag *definitions* must first be created once by a Snowflake admin in `DATA_SCIENCE`
(the RFC `CREATE TAG` setup). Until then, tagging is **skipped with a warning** — the publish
still succeeds.
- Invalid `status`/`sla` values raise `ValueError` before any data is written.
- Tagged tables surface in the `TABLE_OWNERSHIP_REGISTRY` view (see
`create_ownership_registry_view`).
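
The fail-fast behavior for `status`/`sla` can be pictured as validation that runs before the write step. The sketch below uses the allowed values from the table above; `validate_status_and_sla` and `publish_sketch` are hypothetical names, and `write` is a stand-in callable for whatever actually writes the data.

```python
ALLOWED_STATUS = {"active", "development", "testing", "deprecated", "archived", "retired"}
ALLOWED_SLA = {
    "streaming", "realtime", "hourly", "daily", "weekly",
    "monthly", "quarterly", "ad_hoc", "on_demand",
}


def validate_status_and_sla(tags):
    """Raise ValueError if status or sla is outside the documented values."""
    tags = tags or {}
    status = tags.get("status")
    if status is not None and status not in ALLOWED_STATUS:
        raise ValueError(f"Invalid status {status!r}; expected one of {sorted(ALLOWED_STATUS)}")
    sla = tags.get("sla")
    if sla is not None and sla not in ALLOWED_SLA:
        raise ValueError(f"Invalid sla {sla!r}; expected one of {sorted(ALLOWED_SLA)}")


def publish_sketch(tags, write):
    """Validate first, then write, so invalid input never reaches write()."""
    validate_status_and_sla(tags)
    write()


writes = []
try:
    publish_sketch({"sla": "yearly"}, lambda: writes.append("written"))
except ValueError as exc:
    error = str(exc)
```

Because validation precedes the write, `writes` stays empty when the `sla` value is rejected.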
28 changes: 28 additions & 0 deletions docs/metaflow/publish_pandas.md
@@ -22,6 +22,7 @@ publish_pandas(
use_utc: bool = True,
use_s3_stage: bool = False,
table_definition: list[tuple[str, str]] | None = None,
tags: dict[str, str] | None = None,
) -> None
```

@@ -30,6 +31,8 @@ publish_pandas(
- Validates DataFrame input.
- Writes directly via `write_pandas` or via S3 stage flow for large data.
- Adds a Snowflake table URL to Metaflow card output.
- **Automatically applies ownership object tags to production tables** (see
[Ownership tags](#ownership-tags) below).

## Parameters

@@ -49,9 +52,34 @@ publish_pandas(
| `use_utc` | `bool` | No | If `True`, uses UTC timezone for Snowflake session. |
| `use_s3_stage` | `bool` | No | If `True`, publishes via S3 stage flow; otherwise uses direct `write_pandas`. |
| `table_definition` | `list[tuple[str, str]] \| None` | No | Optional Snowflake table schema; used by S3 stage flow when table creation is needed. |
| `tags` | `dict[str, str] \| None` | No | Overrides for the ownership object tags applied to the published table. See [Ownership tags](#ownership-tags).|

**Returns:** `None`

## Ownership tags

When publishing to **production**, `publish_pandas()` automatically applies the same
seven table-ownership object tags as [`publish`](publish.md#ownership-tags):
`TABLE_OWNER`, `TABLE_TEAM`, `TABLE_DOMAIN`, `TABLE_PROJECT`, `TABLE_STATUS` and
(when provided via `tags=`) `TABLE_SLA` / `TABLE_CONTACT`.

```python
publish_pandas(
    table_name="MY_TABLE",
    df=df,
    tags={"sla": "daily", "contact": "#ds-recsys"},
)
```

- Tags are applied **only to production tables** (`DATA_SCIENCE`); non-prod runs apply none.
- `TABLE_DOMAIN` / `TABLE_PROJECT` come from the `ds.domain` / `ds.project` Metaflow tags;
if a flow runs without them they fall back to the literal `unknown` and a warning is
printed. Ensure the flow carries those tags (automatic in CI / standard `poe` commands)
or pass `tags={"domain": ..., "project": ...}`. See [`publish`](publish.md#ownership-tags).
- Tag *definitions* must first be created by a Snowflake admin (RFC `CREATE TAG` setup);
until then tagging is **skipped with a warning** and the publish still succeeds.
- Invalid `status`/`sla` values raise `ValueError` before any data is written.

## Limitations

- When `use_s3_stage=True`, some column data types may not map exactly as expected between pandas/parquet and Snowflake.
5 changes: 3 additions & 2 deletions pyproject.toml
@@ -1,11 +1,12 @@
[project]
name = "ds-platform-utils"
-version = "0.4.2"
+version = "0.5.0"
description = "Utility library for Pattern Data Science."
readme = "README.md"
authors = [
{ name = "Amit Vikram Raj", email = "amit.raj@pattern.com" },
-{ name = "Eric Riddoch", email = "eric.riddoch@pattern.com" }
+{ name = "Eric Riddoch", email = "eric.riddoch@pattern.com" },
+{ name = "Vinay Shende", email = "vinay.shende@pattern.com" }
]
# requires-python = ">=3.7"
dependencies = [