feat(analyzer): Add indirect column lineage tracking#27695
Open
jja725 wants to merge 1 commit into prestodb:master
Sorry @jja725, your pull request is larger than the review limit of 150000 diff characters
86be2ee to e0d8c39
@evanvdia @imjalpreet do you mind taking a look as well when you are free? Thanks
Track indirect column relationships (FILTER, JOIN, GROUP_BY, etc.) in addition to existing direct lineage. Adds TransformationType / TransformationSubtype enums and ColumnLineageEntry in presto-common, extends Analysis to record per-field indirect sources, and updates StatementAnalyzer to collect them at HAVING / GROUP BY / JOIN / ORDER BY sites and to propagate lineage through views and (legacy) materialized views. OutputColumnMetadata in the SPI gains a relationship-metadata field so this information is exposed via the event listener. Cherry-picked from internal commit bae2a640df33; the presto-event-listener/com/uber/* paths from that commit are dropped since they live in an Uber-internal module that does not exist in OSS.
Description
Adds indirect column lineage tracking to the analyzer in addition to the existing direct lineage. Direct (identity) column relationships between an output column and its source columns were introduced in #25913. This PR extends that to also capture indirect relationships (columns that influence an output without being projected) produced by JOIN, WHERE / HAVING filters, GROUP BY, ORDER BY, and conditional expressions.
Concretely:
- presto-common types: TransformationType (DIRECT, INDIRECT), TransformationSubtype (IDENTITY, JOIN, FILTER, GROUP_BY, SORT, CONDITIONAL, ...) and ColumnLineageEntry carrying (downstream column, upstream column, relationship metadata).
- Analysis records per-field indirect sources via propagateLineage(...) / addPerFieldIndirectSources(...) and exposes a global getIndirectSourceColumns() view.
- StatementAnalyzer collects indirect sources at the analysis sites listed above and propagates lineage through views and (legacy) materialized views, so that withAlias-created Field instances retain their source columns.
- OutputColumnMetadata (SPI) gains a relationship-metadata field, so the new lineage is observable from event listeners.
- QueryMonitor populates the new field when emitting query-completed events.
Backward compatibility note
This change keeps the existing direct-relationship lineage path from #25913 intact: the new indirect-lineage data structure is added alongside it rather than replacing it, so existing event listeners that only consume direct lineage are unaffected. Longer term, it would be cleaner to unify direct and indirect relationships into the new ColumnLineageEntry / relationship-metadata structure (DIRECT becomes one transformation type among others). That migration is intentionally deferred to a follow-up to avoid breaking downstream consumers of the existing direct-lineage API in this PR.
Motivation and Context
End-to-end data freshness, lineage analytics, and impact analysis tooling need to know not just which columns flow into an output, but how — e.g. that a column appears only as a join key or a filter predicate rather than being projected. Direct lineage from #25913 alone cannot express this, which forces downstream consumers to re-parse SQL to recover the structure. Recording indirect relationships at analysis time makes this information available to any event listener.
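To make the direct / indirect distinction concrete, here is a minimal, self-contained sketch of how the lineage for one query might be modeled. The enum names follow the Description; the Entry class is a hypothetical stand-in for ColumnLineageEntry, and its fields, the example table and column names, and the helper methods are assumptions for illustration, not the code in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineageSketch
{
    // Enum names follow the PR description; the subtype list there is open-ended ("...").
    public enum TransformationType { DIRECT, INDIRECT }

    public enum TransformationSubtype { IDENTITY, JOIN, FILTER, GROUP_BY, SORT, CONDITIONAL }

    // Hypothetical stand-in for ColumnLineageEntry:
    // (downstream column, upstream column, relationship metadata).
    public static final class Entry
    {
        public final String downstream;
        public final String upstream;
        public final TransformationType type;
        public final TransformationSubtype subtype;

        public Entry(String downstream, String upstream, TransformationType type, TransformationSubtype subtype)
        {
            this.downstream = downstream;
            this.upstream = upstream;
            this.type = type;
            this.subtype = subtype;
        }
    }

    // Hand-written lineage (not produced by the analyzer) for the assumed query:
    //   SELECT o.total FROM orders o JOIN customers c ON o.cust_id = c.id WHERE c.region = 'EU'
    public static List<Entry> exampleLineage()
    {
        return Arrays.asList(
                new Entry("total", "orders.total", TransformationType.DIRECT, TransformationSubtype.IDENTITY),
                new Entry("total", "orders.cust_id", TransformationType.INDIRECT, TransformationSubtype.JOIN),
                new Entry("total", "customers.id", TransformationType.INDIRECT, TransformationSubtype.JOIN),
                new Entry("total", "customers.region", TransformationType.INDIRECT, TransformationSubtype.FILTER));
    }

    // Upstream columns that influence the output without being projected.
    public static List<String> indirectSources(List<Entry> entries)
    {
        List<String> result = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.type == TransformationType.INDIRECT) {
                result.add(entry.upstream + " (" + entry.subtype + ")");
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        for (String source : indirectSources(exampleLineage())) {
            System.out.println(source);
        }
    }
}
```

In this sketch only orders.total is a direct source of the output column; orders.cust_id, customers.id, and customers.region are reported as JOIN and FILTER influences, which is the information a consumer would otherwise have to recover by re-parsing the SQL.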
Impact
OutputColumnMetadata gains a new field for relationship metadata. Existing constructors and consumers continue to work; the new field is optional.
Test Plan
- TestIndirectColumnLineage covers patterns observed in production metrics, including JOIN keys, filter predicates, GROUP BY columns, ORDER BY columns, conditional expressions, view propagation, and (legacy) materialized-view propagation.
- TestOutput covers the new OutputColumnMetadata field.
Contributor checklist
Release Notes