feat(lineage): support all-columns mode and on_node callback (#7575)
* feat(lineage): support all-columns mode and on_node callback
Adds an extension to lineage() so that passing column=None produces a
dict[str, Node] mapping every top-level output column name to its
lineage Node. The single-column form (str | exp.Column) is unchanged
and continues to return a Node. Typing overloads disambiguate the two
return shapes for callers.
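The two return shapes can be sketched with typing.overload. This is a minimal illustration, not the signature in sqlglot/lineage.py: the toy Node class, the `outputs` mapping, and the `build` helper are assumptions made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional, overload


@dataclass
class Node:
    # Toy stand-in for a lineage node; not sqlglot's real Node class.
    name: str
    downstream: list[Node] = field(default_factory=list)


@overload
def lineage(column: str, outputs: dict[str, list[str]]) -> Node: ...
@overload
def lineage(column: None, outputs: dict[str, list[str]]) -> dict[str, Node]: ...


def lineage(column: Optional[str], outputs: dict[str, list[str]]) -> Node | dict[str, Node]:
    # `outputs` is a hypothetical {output column: [source columns]} mapping
    # standing in for a parsed query.
    def build(name: str) -> Node:
        return Node(name, [Node(src) for src in outputs[name]])

    if column is None:
        # All-columns mode: one entry per top-level output column.
        return {name: build(name) for name in outputs}
    # Single-column mode: unchanged, returns a single Node.
    return build(column)


outputs = {"a": ["t.x"], "b": ["t.x", "t.y"]}
single = lineage("b", outputs)      # type checkers infer Node
all_cols = lineage(None, outputs)   # type checkers infer dict[str, Node]
```

With the overloads in place, a type checker picks the return type from the argument, so callers of the single-column form need no isinstance checks.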
A new on_node callback is invoked for every Node created during the
walk, after its downstream is populated. Combined with Node.payload —
a caller-managed dict — this lets callers thread per-node data through
the lineage graph during construction without subclassing Node or
rewalking it after the fact.
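As a rough sketch of the callback contract described above, the toy builder below fires on_node once per node, after that node's downstream list is filled in. Only the names on_node, downstream, and payload come from this message; the Node class, `build`, `deps`, and `record_depth` are hypothetical and do not reflect the actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Node:
    # Toy stand-in; not sqlglot's real Node class.
    name: str
    downstream: list[Node] = field(default_factory=list)
    payload: dict[str, Any] = field(default_factory=dict)


def build(
    name: str,
    deps: dict[str, list[str]],
    on_node: Optional[Callable[[Node], None]] = None,
    cache: Optional[dict[str, Node]] = None,
) -> Node:
    # `deps` is a hypothetical acyclic {column: [source columns]} mapping.
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name]  # shared node: reused, callback not fired again
    node = Node(name)
    cache[name] = node
    for dep in deps.get(name, []):
        node.downstream.append(build(dep, deps, on_node, cache))
    if on_node is not None:
        on_node(node)  # downstream is fully populated at this point
    return node


deps = {"out": ["cte.a", "cte.b"], "cte.a": ["t.x"], "cte.b": ["t.x"]}
order: list[str] = []


def record_depth(node: Node) -> None:
    # Children were finalized first, so their payloads are safe to read.
    order.append(node.name)
    child_depths = (child.payload["depth"] for child in node.downstream)
    node.payload["depth"] = 1 + max(child_depths, default=0)


root = build("out", deps, on_node=record_depth)
```

Because the callback runs after the children, `record_depth` can compute a bottom-up value in a single pass, with no second walk over the finished graph.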
Performance:
* Resolving a column to its select expression scanned
selectable.selects on every to_node call. Wide queries with many
output columns made this O(N^2). Memoize a per-scope
{name: select} map and the selectable.is_star bit on first lookup
instead.
* Compile sqlglot/lineage.py via mypyc by listing it in sqlglotc's
_source_files. Together with the memoization above, this roughly
halves the end-to-end cost of all-columns lineage on large CTE-heavy
queries compared to the unmemoized pure-Python path.
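The memoization in the first bullet can be illustrated with a toy scope that counts how many select entries each lookup touches. This is a sketch under stated assumptions: the Scope class, its `scans` counter, and both method names are hypothetical, and selects are modeled as plain (alias, expression) string pairs rather than sqlglot expressions.

```python
from __future__ import annotations

from typing import Optional


class Scope:
    # Toy stand-in for a query scope; not sqlglot's real Scope class.
    def __init__(self, selects: list[tuple[str, str]]) -> None:
        self.selects = selects
        self._by_name: Optional[dict[str, str]] = None  # built on first lookup
        self.scans = 0  # number of select entries visited so far

    def find_select_linear(self, name: str) -> str:
        # Unmemoized path: scans the select list on every lookup.
        for alias, expr in self.selects:
            self.scans += 1
            if alias == name:
                return expr
        raise KeyError(name)

    def find_select_memoized(self, name: str) -> str:
        # Memoized path: one pass builds the {name: select} map, then O(1) lookups.
        if self._by_name is None:
            self._by_name = {}
            for alias, expr in self.selects:
                self.scans += 1
                self._by_name.setdefault(alias, expr)  # keep first match, like the scan
        return self._by_name[name]


n = 200
selects = [(f"c{i}", f"t.c{i}") for i in range(n)]
linear_scope = Scope(selects)
memo_scope = Scope(selects)
for i in range(n):
    linear_scope.find_select_linear(f"c{i}")
    memo_scope.find_select_memoized(f"c{i}")
```

Resolving all N output columns visits N(N+1)/2 entries on the linear path and N entries on the memoized path, which is the quadratic-to-linear change the bullet describes.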
* test(lineage): cover all-columns mode and on_node invariants
Adds tests for the column=None form of lineage() and the on_node
callback contract:
* column=None returns a dict keyed by every top-level output column,
with each entry shaped like single-column lineage().
* shared upstream Nodes are deduplicated across output columns by
the per-call cache (same source column referenced from multiple
selects yields a single shared downstream Node).
* UNION CTEs fan out correctly — each output column points at one
downstream per branch and bottoms out at every branch's base table.
* passing a pre-built Scope returns the same Node tree as the
no-scope path, with no second qualify pass.
* the on_node callback fires children before parents, so callers
can populate Node.payload bottom-up from already-finalized
children.
* on_node fires exactly once per Node, even when a Node is reached
from multiple parents.