Conversation

dangotbanned
Member

@dangotbanned dangotbanned commented Jun 15, 2025

Will close #2489

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

If you have comments or can explain your changes, please do so below

  • Hoping to make adding Expr.(first|last) less complicated

Note

If you're a pandas expert, please feel free to hop on and add some commits πŸ™‚

Tasks

  • Get all tests passing using `**named_aggs` (see thread)
  • Clean up all the patchwork fixes to bugs (c2df420)

@dangotbanned dangotbanned added internal pandas-like Issue is related to pandas-like backends labels Jun 15, 2025
@dangotbanned
Member Author

So I've done some git archaeology ⛏️ and I think these PRs are where the complexity slowly crept in

PRs

No criticisms from me, I've got the benefit of time and a change in scope on my side πŸ˜„

This PR

  • I'd like to make the task of adding paths for new aggregations less daunting
  • Currently, there's a lot of context to keep in your head while reading through .agg()
    • but no typing on the native side to ease the burden
  • std and var could probably share more code
    • there's a fair bit of duplication that just needs to parameterize self._grouped.<method_name>
    • the complication is how the aggregations are collected and aggregated in multiple passes
      • which loses the context of the PandasLikeExpr
  • The counts in (294c6de) are just a guide - purely reducing them isn't the goal
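To illustrate the `std`/`var` point above, here's a minimal sketch (not narwhals' actual code; the `spread_agg` helper and the sample frame are made up for illustration) of how the two aggregations could share one path by parameterizing the method name:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 3.0, 5.0]})
grouped = df.groupby("g")

def spread_agg(grouped, method_name: str, ddof: int = 1):
    # `std` and `var` only differ by the method looked up here,
    # so a single code path can serve both.
    return getattr(grouped, method_name)(ddof=ddof)

std_result = spread_agg(grouped, "std")
var_result = spread_agg(grouped, "var")
```

The complication mentioned above (multi-pass collection losing the `PandasLikeExpr` context) is exactly what this toy version sidesteps.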

Related

In the (#2680) version of ArrowGroupBy, there are some changes that might be helpful

ArrowGroupBy

class ArrowGroupBy(EagerGroupBy["ArrowDataFrame", "ArrowExpr", "Aggregation"]):
    _REMAP_AGGS: ClassVar[Mapping[NarwhalsAggregation, Aggregation]] = {
        "sum": "sum",
        "mean": "mean",
        "median": "approximate_median",
        "max": "max",
        "min": "min",
        "std": "stddev",
        "var": "variance",
        "len": "count",
        "n_unique": "count_distinct",
        "count": "count",
        "first": "first",
    }
    _REMAP_UNIQUE: ClassVar[Mapping[UniqueKeepStrategy, Aggregation]] = {
        "any": "min",
        "first": "min",
        "last": "max",
    }
    _OPTION_COUNT_ALL: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(
        ("len", "n_unique")
    )
    _OPTION_COUNT_VALID: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("count",))
    _OPTION_ORDERED: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("first",))
    _OPTION_VARIANCE: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("std", "var"))

    def __init__(
        self,
        df: ArrowDataFrame,
        keys: Sequence[ArrowExpr] | Sequence[str],
        /,
        *,
        drop_null_keys: bool,
    ) -> None:
        self._df = df
        frame, self._keys, self._output_key_names = self._parse_keys(df, keys=keys)
        self._compliant_frame = frame.drop_nulls(self._keys) if drop_null_keys else frame
        self._grouped = pa.TableGroupBy(self.compliant.native, self._keys)
        self._drop_null_keys = drop_null_keys

    def _configure_agg(
        self, grouped: pa.TableGroupBy, expr: ArrowExpr, /
    ) -> tuple[pa.TableGroupBy, Aggregation, AggregateOptions | None]:
        option: AggregateOptions | None = None
        function_name = self._leaf_name(expr)
        if function_name in self._OPTION_VARIANCE:
            ddof = expr._scalar_kwargs.get("ddof", 1)
            option = pc.VarianceOptions(ddof=ddof)
        elif function_name in self._OPTION_COUNT_ALL:
            option = pc.CountOptions(mode="all")
        elif function_name in self._OPTION_COUNT_VALID:
            option = pc.CountOptions(mode="only_valid")
        elif function_name in self._OPTION_ORDERED:
            grouped, option = self._ordered_agg(grouped, function_name)
        return grouped, self._remap_expr_name(function_name), option

    def _ordered_agg(
        self, grouped: pa.TableGroupBy, name: NarwhalsAggregation, /
    ) -> tuple[pa.TableGroupBy, AggregateOptions]:
        """The default behavior of `pyarrow` raises when `first` or `last` are used.

        You'd see an error like:

            ArrowNotImplementedError: Using ordered aggregator in multiple threaded execution is not supported

        We need to **disable** multi-threading to use them, but the ability to do so
        wasn't possible before `14.0.0` ([pyarrow-36709]).

        [pyarrow-36709]: https://github.com/apache/arrow/issues/36709
        """
        backend_version = self.compliant._backend_version
        if backend_version >= (14, 0) and grouped._use_threads:
            native = self.compliant.native
            grouped = pa.TableGroupBy(native, grouped.keys, use_threads=False)
        elif backend_version < (14, 0):  # pragma: no cover
            msg = (
                f"Using `{name}()` in a `group_by().agg(...)` context is only available in 'pyarrow>=14.0.0', "
                f"found version {requires._unparse_version(backend_version)!r}.\n\n"
                f"See https://github.com/apache/arrow/issues/36709"
            )
            raise NotImplementedError(msg)
        return grouped, pc.ScalarAggregateOptions(skip_nulls=False)

    def agg(self, *exprs: ArrowExpr) -> ArrowDataFrame:
        self._ensure_all_simple(exprs)
        aggs: list[tuple[str, Aggregation, AggregateOptions | None]] = []
        expected_pyarrow_column_names: list[str] = self._keys.copy()
        new_column_names: list[str] = self._keys.copy()
        exclude = (*self._keys, *self._output_key_names)
        grouped = self._grouped
        for expr in exprs:
            output_names, aliases = evaluate_output_names_and_aliases(
                expr, self.compliant, exclude
            )
            if expr._depth == 0:
                # e.g. `agg(nw.len())`
                if expr._function_name != "len":  # pragma: no cover
                    msg = "Safety assertion failed, please report a bug to https://github.com/narwhals-dev/narwhals/issues"
                    raise AssertionError(msg)
                new_column_names.append(aliases[0])
                expected_pyarrow_column_names.append(f"{self._keys[0]}_count")
                aggs.append((self._keys[0], "count", pc.CountOptions(mode="all")))
                continue
            grouped, function_name, option = self._configure_agg(grouped, expr)
            new_column_names.extend(aliases)
            expected_pyarrow_column_names.extend(
                [f"{output_name}_{function_name}" for output_name in output_names]
            )
            aggs.extend(
                [(output_name, function_name, option) for output_name in output_names]
            )
        result_simple = grouped.aggregate(aggs)
        # Rename columns, being very careful

@MarcoGorelli
Member

thanks for looking into this

i'd like to rewrite this to use `NamedAgg`, so it's probably worth doing that first: #1661

@dangotbanned
Member Author

dangotbanned commented Jun 15, 2025

thanks for looking into this

i'd like to rewrite this to use `NamedAgg`, so it's probably worth doing that first: #1661

Thanks @MarcoGorelli, I've linked the two issues

No objections from me on how we get there πŸ™‚

Did you know NamedAgg is just a NamedTuple?

Edit

Yeah I've taken more of a look and I get it now. It's the combination of using NamedAgg and **kwargs that we'd aim for
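A minimal sketch of that combination on plain `pandas` (the sample frame is made up): since `NamedAgg` is just a `NamedTuple` of `(column, aggfunc)`, the named aggregations can be built up as a dict and splatted into `.agg()` as `**kwargs`:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})
# NamedAgg is a NamedTuple, so a plain (column, aggfunc) tuple works too
named_aggs = {
    "x_min": pd.NamedAgg(column="x", aggfunc="min"),
    "x_max": ("x", "max"),
}
result = df.groupby("g").agg(**named_aggs)
```

The dict keys become the output column names, which is what makes this attractive for collecting aggregations programmatically.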

@dangotbanned
Member Author

dangotbanned commented Jun 16, 2025

@MarcoGorelli maybe a longshot, but was wondering if you were interested in a fix for the DataFrameGroupBy.aggregate @overload(s)?

Both mypy and pyright aren't happy with them 😞

mypy

narwhals\_pandas_like\group_by.py:176: error: No overload variant matches argument type
"dict[str, tuple[str, Literal['any', 'all', 'count', 'first', 'idxmax', 'idxmin', 'last', 'max', 'mean', 'median', 'min', 'nunique', 'prod', 'quantile', 'sem', 'size', 'std', 'sum', 'var', 'cov', 'skew'] | Callable[..., Any]]]"  [call-overload]
                result = self._grouped.agg(**into_agg)  # type: ignore[arg-type]
                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
narwhals\_pandas_like\group_by.py:176: note: Error code "call-overload" not covered by "type: ignore" comment
narwhals\_pandas_like\group_by.py:176: note: Possible overload variants:
narwhals\_pandas_like\group_by.py:176: note:     def aggregate(self, func: Literal['size']) -> Series[Any]
narwhals\_pandas_like\group_by.py:176: note:     def aggregate(self, func: Callable[..., Any] | str | ufunc | list[Callable[..., Any] | str | ufunc] | Mapping[Any, Callable[..., Any] | str | ufunc | list[Callable[..., Any] | str | ufunc]] | None = ..., *args: Any, engine: Literal['cython', 'numba'] | None = ..., engine_kwargs: _WindowingNumbaKwargs | None = ..., **kwargs: Any) -> DataFrame

pyright

[screenshot: pyright errors on the `aggregate` overloads]

I'm pretty sure the gaps in red are causing the issues.

[screenshot: the `aggregate` overload definitions, with the gaps highlighted in red]

**kwargs could be typed using Unpack and PEP 728 – TypedDict with Typed Extra Items

Example Fix

You can also specify the pyright setting as:

# pyproject.toml
[tool.pyright]
enableExperimentalFeatures = true

But below should be easier to copy/paste πŸ™‚

from __future__ import annotations

# pyright: enableExperimentalFeatures=true
from typing import TYPE_CHECKING

import numpy as np
import pandas as pd

if TYPE_CHECKING:
    from collections.abc import Hashable
    from typing import Any

    import typing_extensions as te
    from pandas._typing import AggFuncTypeFrame, WindowingEngine, WindowingEngineKwargs
    from pandas.core.groupby.generic import AggScalar, NamedAgg

    Aggregation: te.TypeAlias = NamedAgg | tuple[Hashable, AggScalar]

    class AggKwargs(te.TypedDict, extra_items=Aggregation, total=False):
        engine: WindowingEngine
        engine_kwargs: WindowingEngineKwargs


def df_groupby_agg(
    func: AggFuncTypeFrame | None = None, *args: Any, **kwargs: te.Unpack[AggKwargs]
) -> tuple[AggFuncTypeFrame | None, tuple[Any, ...], AggKwargs]:
    return func, args, kwargs


# NOTE: `DataFrameGroupBy._agg_examples_doc`
a = df_groupby_agg("min")
b = df_groupby_agg(["min", "max"])
d = df_groupby_agg(lambda x: sum(x) + 2)
e = df_groupby_agg({"B": ["min", "max"], "C": "sum"})
f = df_groupby_agg(
    b_min=pd.NamedAgg(column="B", aggfunc="min"),
    c_sum=pd.NamedAgg(column="C", aggfunc="sum"),
)
g = df_groupby_agg(lambda x: x.astype(float).min())

# NOTE: More fancy variants
m1 = df_groupby_agg("size")
m2 = df_groupby_agg(engine=None)
m3 = df_groupby_agg(engine_kwargs={"nogil": True}, alias_1=("column_1", "first"))
m4 = df_groupby_agg(
    column_1=("column_1", "last"),
    column_2=pd.NamedAgg("column_1", aggfunc=lambda x: sum(x) + 2),
)
m5 = df_groupby_agg(
    alias_2=pd.NamedAgg("column_2", "max"),
    alias_3=("column_3", np.min),
    engine="cython",
    alias_4=("column_4", "quantile"),
)

And now none of these are causing yelling 😎

[screenshot: no remaining type errors]

@dangotbanned dangotbanned linked an issue Jun 16, 2025 that may be closed by this pull request
@dangotbanned dangotbanned changed the title from "chore(DRAFT): Simplify PandasLikeGroupBy" to "chore: Simplify PandasLikeGroupBy" Jun 16, 2025
- Seems to be the most minimal change to resolve (#2680 (comment))
- Need to review what else is still needed
@dangotbanned dangotbanned mentioned this pull request Jul 9, 2025
@dangotbanned dangotbanned marked this pull request as draft July 11, 2025 10:31
@dangotbanned dangotbanned marked this pull request as ready for review July 11, 2025 12:37
@dangotbanned dangotbanned requested a review from FBruzzesi July 11, 2025 12:37
@dangotbanned
Member Author

@FBruzzesi feel free to delete any docs that seem unnecessary

The API has shrunk a lot in (91b5800), so if what's left is easy enough to understand without docs - then I'm not precious about keeping them πŸ˜„

Member

@FBruzzesi FBruzzesi left a comment


Thanks for all the back and forth @dangotbanned - I have no more objections! This looks πŸ”₯

It covers so many edge cases that a few times I found myself thinking "wait a minute, let me change this line, I don't think we need it", well actually we do πŸ˜‚

@dangotbanned dangotbanned merged commit b2afcea into main Jul 14, 2025
32 checks passed
@dangotbanned dangotbanned deleted the simp-pandas-group-by branch July 14, 2025 09:15
@MarcoGorelli
Member

______________________ test_double_same_aggregation[cudf] ______________________
[XPASS(strict)] 
_________________________ test_all_kind_of_aggs[cudf] __________________________
[XPASS(strict)] 

nice one, well done both!

@dangotbanned
Member Author

______________________ test_double_same_aggregation[cudf] ______________________
[XPASS(strict)] 
_________________________ test_all_kind_of_aggs[cudf] __________________________
[XPASS(strict)] 

nice one, well done both!

Ahh, I was curious whether this was fixed for all of them πŸ₯³

Labels
internal pandas-like Issue is related to pandas-like backends tests
Development

Successfully merging this pull request may close these issues.

Simplify PandasLikeGroupBy
chore: rewrite agg_pandas to use NamedAgg
3 participants