Conversation

dangotbanned
Member

@dangotbanned dangotbanned commented Jun 15, 2025

Will close #2489

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

If you have comments or can explain your changes, please do so below

  • Hoping to make adding Expr.(first|last) less complicated

Note

If you're a pandas expert, please feel free to hop on and add some commits πŸ™‚

Tasks

  • Get all tests passing using `**named_aggs` (see thread)
  • Clean up all the patchwork fixes to bugs (c2df420)

@dangotbanned dangotbanned added internal pandas-like Issue is related to pandas-like backends labels Jun 15, 2025
@dangotbanned
Member Author

So I've done some git archaeology ⛏️ and I think these PRs are where the complexity slowly crept in

PRs

No criticisms from me, I've got the benefit of time and a change in scope on my side πŸ˜„

This PR

  • I'd like to make the task of adding paths for new aggregations less daunting
  • Currently, there's a lot of context to keep in your head while reading through .agg()
    • but no typing on the native side to ease the burden
  • std and var could probably share more code
    • there's a fair bit of duplication that just needs to parameterize self._grouped.<method_name>
    • the complication is how the aggregations are collected and aggregated in multiple passes
      • which loses the context of the PandasLikeExpr
  • The counts in (294c6de) are just a guide - purely reducing them isn't the goal
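To illustrate the `std`/`var` point above, here's a minimal sketch (not narwhals' actual code; the `spread_agg` helper and the sample frame are made up for illustration) of how the two aggregations could share one path by parameterizing the method name:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 3.0, 5.0]})
grouped = df.groupby("g")

def spread_agg(grouped, method_name: str, ddof: int = 1):
    # `std` and `var` only differ by the method looked up here,
    # so a single code path can serve both.
    return getattr(grouped, method_name)(ddof=ddof)

std_result = spread_agg(grouped, "std")
var_result = spread_agg(grouped, "var")
```

The complication mentioned above (multi-pass collection losing the `PandasLikeExpr` context) is exactly what this toy version sidesteps.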

Related

In the (#2680) version of ArrowGroupBy, there are some changes that might be helpful

ArrowGroupBy

class ArrowGroupBy(EagerGroupBy["ArrowDataFrame", "ArrowExpr", "Aggregation"]):
    _REMAP_AGGS: ClassVar[Mapping[NarwhalsAggregation, Aggregation]] = {
        "sum": "sum",
        "mean": "mean",
        "median": "approximate_median",
        "max": "max",
        "min": "min",
        "std": "stddev",
        "var": "variance",
        "len": "count",
        "n_unique": "count_distinct",
        "count": "count",
        "first": "first",
    }
    _REMAP_UNIQUE: ClassVar[Mapping[UniqueKeepStrategy, Aggregation]] = {
        "any": "min",
        "first": "min",
        "last": "max",
    }
    _OPTION_COUNT_ALL: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(
        ("len", "n_unique")
    )
    _OPTION_COUNT_VALID: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("count",))
    _OPTION_ORDERED: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("first",))
    _OPTION_VARIANCE: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("std", "var"))

    def __init__(
        self,
        df: ArrowDataFrame,
        keys: Sequence[ArrowExpr] | Sequence[str],
        /,
        *,
        drop_null_keys: bool,
    ) -> None:
        self._df = df
        frame, self._keys, self._output_key_names = self._parse_keys(df, keys=keys)
        self._compliant_frame = frame.drop_nulls(self._keys) if drop_null_keys else frame
        self._grouped = pa.TableGroupBy(self.compliant.native, self._keys)
        self._drop_null_keys = drop_null_keys

    def _configure_agg(
        self, grouped: pa.TableGroupBy, expr: ArrowExpr, /
    ) -> tuple[pa.TableGroupBy, Aggregation, AggregateOptions | None]:
        option: AggregateOptions | None = None
        function_name = self._leaf_name(expr)
        if function_name in self._OPTION_VARIANCE:
            ddof = expr._scalar_kwargs.get("ddof", 1)
            option = pc.VarianceOptions(ddof=ddof)
        elif function_name in self._OPTION_COUNT_ALL:
            option = pc.CountOptions(mode="all")
        elif function_name in self._OPTION_COUNT_VALID:
            option = pc.CountOptions(mode="only_valid")
        elif function_name in self._OPTION_ORDERED:
            grouped, option = self._ordered_agg(grouped, function_name)
        return grouped, self._remap_expr_name(function_name), option

    def _ordered_agg(
        self, grouped: pa.TableGroupBy, name: NarwhalsAggregation, /
    ) -> tuple[pa.TableGroupBy, AggregateOptions]:
        """The default behavior of `pyarrow` raises when `first` or `last` are used.

        You'd see an error like:

            ArrowNotImplementedError: Using ordered aggregator in multiple threaded execution is not supported

        We need to **disable** multi-threading to use them, but the ability to do so
        wasn't possible before `14.0.0` ([pyarrow-36709]).

        [pyarrow-36709]: https://github.com/apache/arrow/issues/36709
        """
        backend_version = self.compliant._backend_version
        if backend_version >= (14, 0) and grouped._use_threads:
            native = self.compliant.native
            grouped = pa.TableGroupBy(native, grouped.keys, use_threads=False)
        elif backend_version < (14, 0):  # pragma: no cover
            msg = (
                f"Using `{name}()` in a `group_by().agg(...)` context is only available in 'pyarrow>=14.0.0', "
                f"found version {requires._unparse_version(backend_version)!r}.\n\n"
                f"See https://github.com/apache/arrow/issues/36709"
            )
            raise NotImplementedError(msg)
        return grouped, pc.ScalarAggregateOptions(skip_nulls=False)

    def agg(self, *exprs: ArrowExpr) -> ArrowDataFrame:
        self._ensure_all_simple(exprs)
        aggs: list[tuple[str, Aggregation, AggregateOptions | None]] = []
        expected_pyarrow_column_names: list[str] = self._keys.copy()
        new_column_names: list[str] = self._keys.copy()
        exclude = (*self._keys, *self._output_key_names)
        grouped = self._grouped
        for expr in exprs:
            output_names, aliases = evaluate_output_names_and_aliases(
                expr, self.compliant, exclude
            )
            if expr._depth == 0:
                # e.g. `agg(nw.len())`
                if expr._function_name != "len":  # pragma: no cover
                    msg = "Safety assertion failed, please report a bug to https://github.com/narwhals-dev/narwhals/issues"
                    raise AssertionError(msg)
                new_column_names.append(aliases[0])
                expected_pyarrow_column_names.append(f"{self._keys[0]}_count")
                aggs.append((self._keys[0], "count", pc.CountOptions(mode="all")))
                continue
            grouped, function_name, option = self._configure_agg(grouped, expr)
            new_column_names.extend(aliases)
            expected_pyarrow_column_names.extend(
                [f"{output_name}_{function_name}" for output_name in output_names]
            )
            aggs.extend(
                [(output_name, function_name, option) for output_name in output_names]
            )
        result_simple = grouped.aggregate(aggs)
        # Rename columns, being very careful

@MarcoGorelli
Member

thanks for looking into this

i'd like to rewrite this to use `NamedAgg`, so it's probably worth doing that first: #1661

@dangotbanned
Member Author

dangotbanned commented Jun 15, 2025

thanks for looking into this

i'd like to rewrite this to use `NamedAgg`, so it's probably worth doing that first: #1661

Thanks @MarcoGorelli, I've linked the two issues

No objections from me on how we get there πŸ™‚

Did you know NamedAgg is just a NamedTuple?

Edit

Yeah I've taken more of a look and I get it now. It's the combination of using NamedAgg and **kwargs that we'd aim for
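A minimal sketch of that combination on plain `pandas` (the sample frame is made up): since `NamedAgg` is just a `NamedTuple` of `(column, aggfunc)`, the named aggregations can be built up as a dict and splatted into `.agg()` as `**kwargs`:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})
# NamedAgg is a NamedTuple, so a plain (column, aggfunc) tuple works too
named_aggs = {
    "x_min": pd.NamedAgg(column="x", aggfunc="min"),
    "x_max": ("x", "max"),
}
result = df.groupby("g").agg(**named_aggs)
```

The dict keys become the output column names, which is what makes this attractive for collecting aggregations programmatically.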

@dangotbanned
Member Author

dangotbanned commented Jun 16, 2025

@MarcoGorelli maybe a longshot, but was wondering if you were interested in a fix for the DataFrameGroupBy.aggregate @overload(s)?

Both mypy and pyright aren't happy with them 😞

mypy

narwhals\_pandas_like\group_by.py:176: error: No overload variant matches argument type
"dict[str, tuple[str, Literal['any', 'all', 'count', 'first', 'idxmax', 'idxmin', 'last', 'max', 'mean', 'median', 'min', 'nunique', 'prod', 'quantile', 'sem', 'size', 'std', 'sum', 'var', 'cov', 'skew'] | Callable[..., Any]]]"  [call-overload]
                result = self._grouped.agg(**into_agg)  # type: ignore[arg-type]
                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
narwhals\_pandas_like\group_by.py:176: note: Error code "call-overload" not covered by "type: ignore" comment
narwhals\_pandas_like\group_by.py:176: note: Possible overload variants:
narwhals\_pandas_like\group_by.py:176: note:     def aggregate(self, func: Literal['size']) -> Series[Any]
narwhals\_pandas_like\group_by.py:176: note:     def aggregate(self, func: Callable[..., Any] | str | ufunc | list[Callable[..., Any] | str | ufunc] | Mapping[Any, Callable[..., Any] | str | ufunc | list[Callable[..., Any] | str | ufunc]] | None = ..., *args: Any, engine: Literal['cython', 'numba'] | None = ..., engine_kwargs: _WindowingNumbaKwargs | None = ..., **kwargs: Any) -> DataFrame

pyright

[screenshot: pyright errors on the `aggregate` overloads]

I'm pretty sure the gaps in red are causing the issues.

[screenshot: the `aggregate` overload definitions, with the gaps highlighted in red]

**kwargs could be typed using Unpack and PEP 728 – TypedDict with Typed Extra Items

Example Fix

You can also specify the pyright setting as:

# pyproject.toml
[tool.pyright]
enableExperimentalFeatures = true

But below should be easier to copy/paste πŸ™‚

from __future__ import annotations

# pyright: enableExperimentalFeatures=true
from typing import TYPE_CHECKING

import numpy as np
import pandas as pd

if TYPE_CHECKING:
    from collections.abc import Hashable
    from typing import Any

    import typing_extensions as te
    from pandas._typing import AggFuncTypeFrame, WindowingEngine, WindowingEngineKwargs
    from pandas.core.groupby.generic import AggScalar, NamedAgg

    Aggregation: te.TypeAlias = NamedAgg | tuple[Hashable, AggScalar]

    class AggKwargs(te.TypedDict, extra_items=Aggregation, total=False):
        engine: WindowingEngine
        engine_kwargs: WindowingEngineKwargs


def df_groupby_agg(
    func: AggFuncTypeFrame | None = None, *args: Any, **kwargs: te.Unpack[AggKwargs]
) -> tuple[AggFuncTypeFrame | None, tuple[Any, ...], AggKwargs]:
    return func, args, kwargs


# NOTE: `DataFrameGroupBy._agg_examples_doc`
a = df_groupby_agg("min")
b = df_groupby_agg(["min", "max"])
d = df_groupby_agg(lambda x: sum(x) + 2)
e = df_groupby_agg({"B": ["min", "max"], "C": "sum"})
f = df_groupby_agg(
    b_min=pd.NamedAgg(column="B", aggfunc="min"),
    c_sum=pd.NamedAgg(column="C", aggfunc="sum"),
)
g = df_groupby_agg(lambda x: x.astype(float).min())

# NOTE: More fancy variants
m1 = df_groupby_agg("size")
m2 = df_groupby_agg(engine=None)
m3 = df_groupby_agg(engine_kwargs={"nogil": True}, alias_1=("column_1", "first"))
m4 = df_groupby_agg(
    column_1=("column_1", "last"),
    column_2=pd.NamedAgg("column_1", aggfunc=lambda x: sum(x) + 2),
)
m5 = df_groupby_agg(
    alias_2=pd.NamedAgg("column_2", "max"),
    alias_3=("column_3", np.min),
    engine="cython",
    alias_4=("column_4", "quantile"),
)

And now none of these are causing yelling 😎

[screenshot: no remaining type errors]

@dangotbanned dangotbanned linked an issue Jun 16, 2025 that may be closed by this pull request
@dangotbanned dangotbanned changed the title from "chore(DRAFT): Simplify PandasLikeGroupBy" to "chore: Simplify PandasLikeGroupBy" Jun 16, 2025
- Seems to be the most minimal change to resolve (#2680 (comment))
- Need to review what else is still needed
@dangotbanned dangotbanned mentioned this pull request Jul 9, 2025
@dangotbanned dangotbanned marked this pull request as draft July 11, 2025 10:31
@dangotbanned dangotbanned marked this pull request as ready for review July 11, 2025 12:37
@dangotbanned dangotbanned requested a review from FBruzzesi July 11, 2025 12:37
@dangotbanned
Member Author

@FBruzzesi feel free to delete any docs that seem unnecessary

The API has shrunk a lot in (91b5800), so if what's left is easy enough to understand without docs - then I'm not precious about keeping them πŸ˜„

Member

@FBruzzesi FBruzzesi left a comment


Thanks for all the back and forth @dangotbanned - I have no more objections! This looks πŸ”₯

It covers so many edge cases that a few times I found myself thinking "wait a minute, let me change this line, I don't think we need it", well actually we do πŸ˜‚

@dangotbanned dangotbanned merged commit b2afcea into main Jul 14, 2025
32 checks passed
@dangotbanned dangotbanned deleted the simp-pandas-group-by branch July 14, 2025 09:15
@MarcoGorelli
Member

______________________ test_double_same_aggregation[cudf] ______________________
[XPASS(strict)] 
_________________________ test_all_kind_of_aggs[cudf] __________________________
[XPASS(strict)] 

nice one, well done both!

@dangotbanned
Member Author

______________________ test_double_same_aggregation[cudf] ______________________
[XPASS(strict)] 
_________________________ test_all_kind_of_aggs[cudf] __________________________
[XPASS(strict)] 

nice one, well done both!

Ahh, I was curious whether this was fixed for all of them πŸ₯³

Labels
internal pandas-like Issue is related to pandas-like backends tests
Development

Successfully merging this pull request may close these issues.

Simplify PandasLikeGroupBy
chore: rewrite agg_pandas to use NamedAgg
3 participants