Skip to content

Conversation

reltuk
Copy link
Contributor

@reltuk reltuk commented Sep 3, 2021

The existing code seems to be expecting rows to arrive in order of the group by expression. But the analyzer does not arrange for that to happen.

Instead, this PR changes the approach so that each aggregation function duplicates its Child expression(s) as part of its per-aggregation state. That way it can Eval against its per-group-by Child and get the correct results out of Distinct for example.

This PR updates AVG, FIRST, LAST, MAX, MIN and SUM to do this. COUNT(DISTINCT ...) is handled by a special expression node instead, and nothing has been changed in Count or CountDistinct. group_concat also seems to handle DISTINCT itself, and so I have not changed anything there. Json aggregation did not look immediately amenable to combining with DISTINCT, because the Update functions seemed to error when the child expression returned nil, so I have not changed them either.

@VinaiRachakonda
Copy link
Contributor

I like this method a lot. Really clever imo

Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, but I'm wondering if there could be a better way using the analyzer.

The analyzer has a rule called flatten_aggretation_expressions that does something very similar, i.e. it turns (SUM(a) / SUM(b)) into a projection of (suma / sumb) onto a new GroupBy node that expands the single select expression (SUM(a) / SUM(b)) into two, SUM(a) as suma, SUM(b) as sumb.

If I'm squinting at this, it seems like something similar could work here, where the distinct expression is itself treated (appropriately) as an aggregate expression and flattened. Then it automatically gets a new buffer for every grouping key, no need to duplicate expression trees. I haven't thought about it that deeply though, might not be workable. Worth a little investigation.

"github.com/dolthub/go-mysql-server/sql/expression"
)

func duplicateExpression(ctx *sql.Context, expr sql.Expression) (sql.Expression, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should be a public Clone method in the expression package, next to TransformUp

@reltuk reltuk force-pushed the aaron/fix-aggregation-functions-with-windows branch from 006eab8 to 180bba5 Compare September 8, 2021 21:49
… Allow Window nodes to have aggregation expressions as well.
@reltuk
Copy link
Contributor Author

reltuk commented Sep 8, 2021

This got a little bit gnarly. Broadly:

  1. We added a validation rule to error out on previously broken queries where we had aggregations that were not correctly getting placed into GroupBy or Window nodes.

  2. We changed the Aggregation interface so that NewBuffer will return an AggregationBuffer. An AggregationBuffer knows how to Update and how to Eval. This lets aggregation implementations easily use proper structs with proper fields for their buffer implementations.

  3. We added Dispose capabilities to AggregationBuffer and threaded through calls for our GroupBy and Window iterator implementations. Some helper methods like expression.Dispose and sql.Dispose added.

  4. We changed the implementation of most of aggregation functions to duplicate their child expression and use the unique duplication for the Eval calls in the buffers' Update implementations. This is the crux of our approach to fix distinct aggregations, at least currently.

  5. We removed unused Merge functionality from Aggregation{,Buffer}. If we define the order in which merges occur, it could be well defined for some aggregations at least, but it was entirely unused throughout the code base.

This is ready for review and I think this is the approach we should broadly go with for now.

@reltuk reltuk changed the title [RFC] sql/expression/function/aggregation: Change aggregation functions to work better with group by expressions. sql/expression/function/aggregation: Change aggregation functions to work better with group by expressions. Sep 8, 2021
Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, no real comments

@@ -143,7 +143,7 @@ func (i *windowIter) Next() (sql.Row, error) {
return nil, err
}
case sql.Aggregation:
row[j], err = expr.Eval(i.ctx, i.buffers[j])
row[j], err = i.buffers[j][0].(sql.AggregationBuffer).Eval(i.ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function could use a TODO, it needs to be given the same treatment as the non-window aggregates

// GroupBy is the only node that can support evaluating an Aggregation.
//
// See https://github.com/dolthub/go-mysql-server/issues/542 for some queries
// that should be supported but that currently trigger this validation.
Copy link
Contributor

@VinaiRachakonda VinaiRachakonda Sep 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there is a typo here

@reltuk reltuk merged commit 809ad9d into master Sep 9, 2021
@Hydrocharged Hydrocharged deleted the aaron/fix-aggregation-functions-with-windows branch December 8, 2021 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants