feat(analyzer): Add indirect column lineage tracking #27695

Open
jja725 wants to merge 1 commit into prestodb:master from jja725:wt/column-lineage

@jja725 jja725 commented May 1, 2026

Description

Adds indirect column lineage tracking to the analyzer in addition to the existing direct lineage. Direct (identity) column relationships between an output column and its source columns were introduced in #25913. This PR extends that to also capture indirect relationships — columns that influence an output without being projected — produced by JOIN, WHERE / HAVING filters, GROUP BY, ORDER BY, and conditional expressions.

Concretely:

  • New presto-common types: TransformationType (DIRECT, INDIRECT), TransformationSubtype (IDENTITY, JOIN, FILTER, GROUP_BY, SORT, CONDITIONAL, ...) and ColumnLineageEntry carrying (downstream column, upstream column, relationship metadata).
  • Analysis records per-field indirect sources via propagateLineage(...) / addPerFieldIndirectSources(...) and exposes a global getIndirectSourceColumns() view.
  • StatementAnalyzer collects indirect sources at the analysis sites listed above and propagates lineage through views and (legacy) materialized views, so that withAlias-created Field instances retain their source columns.
  • OutputColumnMetadata (SPI) gains a relationship-metadata field, so the new lineage is observable from event listeners.
  • QueryMonitor populates the new field when emitting query-completed events.
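
To make the shape of the new types concrete, here is a minimal, self-contained sketch. It is not the code in this PR: `LineageSketch`, `Entry`, its field names, and the helper methods are hypothetical stand-ins, only the enum values named above are listed (the real `TransformationSubtype` has more), and the lineage entries are written by hand for one example query rather than produced by the analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineageSketch {
    // Values taken from the description above; the real enums may contain more.
    enum TransformationType { DIRECT, INDIRECT }
    enum TransformationSubtype { IDENTITY, JOIN, FILTER, GROUP_BY, SORT, CONDITIONAL }

    // Hypothetical stand-in for ColumnLineageEntry; field names are assumptions.
    static final class Entry {
        final String downstream;
        final String upstream;
        final TransformationType type;
        final TransformationSubtype subtype;

        Entry(String downstream, String upstream,
              TransformationType type, TransformationSubtype subtype) {
            this.downstream = downstream;
            this.upstream = upstream;
            this.type = type;
            this.subtype = subtype;
        }
    }

    // Hand-written lineage for the query:
    //   SELECT o.total FROM orders o
    //   JOIN customers c ON o.cust_id = c.id
    //   WHERE c.region = 'EU'
    static List<Entry> exampleEntries() {
        return Arrays.asList(
            new Entry("total", "orders.total",
                TransformationType.DIRECT, TransformationSubtype.IDENTITY),
            new Entry("total", "orders.cust_id",
                TransformationType.INDIRECT, TransformationSubtype.JOIN),
            new Entry("total", "customers.id",
                TransformationType.INDIRECT, TransformationSubtype.JOIN),
            new Entry("total", "customers.region",
                TransformationType.INDIRECT, TransformationSubtype.FILTER));
    }

    // Collects the upstream columns that have the given transformation type.
    static List<String> upstreams(List<Entry> entries, TransformationType type) {
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            if (e.type == type) {
                result.add(e.upstream);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> entries = exampleEntries();
        System.out.println("direct:   " + upstreams(entries, TransformationType.DIRECT));
        System.out.println("indirect: " + upstreams(entries, TransformationType.INDIRECT));
    }
}
```

In this sketch, `orders.cust_id`, `customers.id`, and `customers.region` never appear in the output, yet they determine which rows of `total` are returned; that is the relationship the INDIRECT entries are meant to capture.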

Backward compatibility note

This change keeps the existing direct-relationship lineage path from #25913 intact — the new indirect-lineage data structure is added alongside it rather than replacing it, so existing event listeners that only consume direct lineage are unaffected. Longer term, it would be cleaner to unify direct and indirect relationships into the new ColumnLineageEntry / relationship-metadata structure (DIRECT becomes one transformation type among others). That migration is intentionally deferred to a follow-up to avoid breaking downstream consumers of the existing direct-lineage API in this PR.

Motivation and Context

End-to-end data freshness, lineage analytics, and impact analysis tooling need to know not just which columns flow into an output, but how — e.g. that a column appears only as a join key or a filter predicate rather than being projected. Direct lineage from #25913 alone cannot express this, which forces downstream consumers to re-parse SQL to recover the structure. Recording indirect relationships at analysis time makes this information available to any event listener.

Impact

  • SPI: OutputColumnMetadata gains a new field for relationship metadata. Existing constructors and consumers continue to work; the new field is optional.
  • Event listeners: query-completed events now expose indirect source columns and per-column transformation metadata in addition to direct sources.
  • No SQL syntax or query-execution behavior changes; this is metadata only.
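
One common way to add an optional field without breaking existing callers is to keep the old constructor and have it delegate to a new one with an empty default. The sketch below illustrates that pattern only; `OutputColumnSketch`, its field names, and the use of `List<String>` for the relationship metadata are assumptions for illustration and are not the actual `OutputColumnMetadata` signature in the SPI.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OutputColumnSketch {
    private final String columnName;
    private final List<String> directSources;
    // Hypothetical new field; empty when the caller does not supply it.
    private final List<String> relationshipMetadata;

    // Existing-style constructor: callers that know nothing about the
    // new field keep compiling and get an empty list.
    public OutputColumnSketch(String columnName, List<String> directSources) {
        this(columnName, directSources, Collections.emptyList());
    }

    // New constructor that also carries the relationship metadata.
    public OutputColumnSketch(String columnName, List<String> directSources,
                              List<String> relationshipMetadata) {
        this.columnName = columnName;
        this.directSources = directSources;
        this.relationshipMetadata = relationshipMetadata;
    }

    public String getColumnName() { return columnName; }
    public List<String> getDirectSources() { return directSources; }
    public List<String> getRelationshipMetadata() { return relationshipMetadata; }

    public static void main(String[] args) {
        OutputColumnSketch legacy =
            new OutputColumnSketch("total", Arrays.asList("orders.total"));
        OutputColumnSketch extended =
            new OutputColumnSketch("total", Arrays.asList("orders.total"),
                Arrays.asList("INDIRECT/JOIN orders.cust_id"));
        System.out.println(legacy.getRelationshipMetadata().isEmpty());
        System.out.println(extended.getRelationshipMetadata().size());
    }
}
```

A listener written against the older shape only calls `getDirectSources()` and is unaffected; a newer listener can additionally read the metadata and treat an empty list as "no indirect lineage recorded".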

Test Plan

  • New unit test TestIndirectColumnLineage covers patterns observed in production metrics, including JOIN keys, filter predicates, GROUP BY columns, ORDER BY columns, conditional expressions, view propagation, and (legacy) materialized-view propagation.
  • Extended TestOutput to cover the new OutputColumnMetadata field.
  • Existing analyzer / event-listener tests pass unchanged.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== RELEASE NOTES ==

General Changes
* Add indirect column lineage tracking (JOIN, FILTER, GROUP BY, ORDER BY, CONDITIONAL) to query analysis, building on the direct column lineage added in #25913. Indirect relationships are exposed to event listeners via a new relationship-metadata field on ``OutputColumnMetadata``; existing direct-lineage consumers are unaffected.

@jja725 jja725 requested review from a team, elharo, feilong-liu and jaystarshot as code owners May 1, 2026 04:22

linux-foundation-easycla Bot commented May 1, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: jja725 / name: jianjian.xie (fd203ab)

@sourcery-ai sourcery-ai Bot left a comment

Sorry @jja725, your pull request is larger than the review limit of 150000 diff characters

@jja725 jja725 force-pushed the wt/column-lineage branch 2 times, most recently from 86be2ee to e0d8c39 Compare May 1, 2026 04:27
@jja725 jja725 changed the title feat(analyzer): add indirect column lineage tracking feat(analyzer): Add indirect column lineage tracking May 1, 2026

jja725 commented May 1, 2026

@evanvdia @imjalpreet do you mind taking a look as well when you are free? Thanks

Track indirect column relationships (FILTER, JOIN, GROUP_BY, etc.) in
addition to existing direct lineage. Adds TransformationType /
TransformationSubtype enums and ColumnLineageEntry in presto-common,
extends Analysis to record per-field indirect sources, and updates
StatementAnalyzer to collect them at HAVING / GROUP BY / JOIN /
ORDER BY sites and to propagate lineage through views and (legacy)
materialized views. OutputColumnMetadata in the SPI gains a
relationship-metadata field so this information is exposed via the
event listener.

Cherry-picked from internal commit bae2a640df33; the
presto-event-listener/com/uber/* paths from that commit are dropped
since they live in an Uber-internal module that does not exist in OSS.
@jja725 jja725 force-pushed the wt/column-lineage branch from e0d8c39 to fd203ab Compare May 1, 2026 04:37
@jja725 jja725 requested review from imjalpreet and tdcmeehan May 2, 2026 04:21