-
Notifications
You must be signed in to change notification settings - Fork 161
chore: Simplify PandasLikeGroupBy
#2680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
So I've done some git archaeology βοΈ and I think these PRs are where the complexity slowly crept in PRsNo criticisms from me, I've got the benefit of time and a change in scope on my side π
This PR
RelatedIn the (#2680) version of
|
class ArrowGroupBy(EagerGroupBy["ArrowDataFrame", "ArrowExpr", "Aggregation"]): | |
_REMAP_AGGS: ClassVar[Mapping[NarwhalsAggregation, Aggregation]] = { | |
"sum": "sum", | |
"mean": "mean", | |
"median": "approximate_median", | |
"max": "max", | |
"min": "min", | |
"std": "stddev", | |
"var": "variance", | |
"len": "count", | |
"n_unique": "count_distinct", | |
"count": "count", | |
"first": "first", | |
} | |
_REMAP_UNIQUE: ClassVar[Mapping[UniqueKeepStrategy, Aggregation]] = { | |
"any": "min", | |
"first": "min", | |
"last": "max", | |
} | |
_OPTION_COUNT_ALL: ClassVar[frozenset[NarwhalsAggregation]] = frozenset( | |
("len", "n_unique") | |
) | |
_OPTION_COUNT_VALID: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("count",)) | |
_OPTION_ORDERED: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("first",)) | |
_OPTION_VARIANCE: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("std", "var")) | |
def __init__( | |
self, | |
df: ArrowDataFrame, | |
keys: Sequence[ArrowExpr] | Sequence[str], | |
/, | |
*, | |
drop_null_keys: bool, | |
) -> None: | |
self._df = df | |
frame, self._keys, self._output_key_names = self._parse_keys(df, keys=keys) | |
self._compliant_frame = frame.drop_nulls(self._keys) if drop_null_keys else frame | |
self._grouped = pa.TableGroupBy(self.compliant.native, self._keys) | |
self._drop_null_keys = drop_null_keys | |
def _configure_agg( | |
self, grouped: pa.TableGroupBy, expr: ArrowExpr, / | |
) -> tuple[pa.TableGroupBy, Aggregation, AggregateOptions | None]: | |
option: AggregateOptions | None = None | |
function_name = self._leaf_name(expr) | |
if function_name in self._OPTION_VARIANCE: | |
ddof = expr._scalar_kwargs.get("ddof", 1) | |
option = pc.VarianceOptions(ddof=ddof) | |
elif function_name in self._OPTION_COUNT_ALL: | |
option = pc.CountOptions(mode="all") | |
elif function_name in self._OPTION_COUNT_VALID: | |
option = pc.CountOptions(mode="only_valid") | |
elif function_name in self._OPTION_ORDERED: | |
grouped, option = self._ordered_agg(grouped, function_name) | |
return grouped, self._remap_expr_name(function_name), option | |
def _ordered_agg( | |
self, grouped: pa.TableGroupBy, name: NarwhalsAggregation, / | |
) -> tuple[pa.TableGroupBy, AggregateOptions]: | |
"""The default behavior of `pyarrow` raises when `first` or `last` are used. | |
You'd see an error like: | |
ArrowNotImplementedError: Using ordered aggregator in multiple threaded execution is not supported | |
We need to **disable** multi-threading to use them, but the ability to do so | |
wasn't possible before `14.0.0` ([pyarrow-36709]) | |
[pyarrow-36709]: https://github.com/apache/arrow/issues/36709 | |
""" | |
backend_version = self.compliant._backend_version | |
if backend_version >= (14, 0) and grouped._use_threads: | |
native = self.compliant.native | |
grouped = pa.TableGroupBy(native, grouped.keys, use_threads=False) | |
elif backend_version < (14, 0): # pragma: no cover | |
msg = ( | |
f"Using `{name}()` in a `group_by().agg(...)` context is only available in 'pyarrow>=14.0.0', " | |
f"found version {requires._unparse_version(backend_version)!r}.\n\n" | |
f"See https://github.com/apache/arrow/issues/36709" | |
) | |
raise NotImplementedError(msg) | |
return grouped, pc.ScalarAggregateOptions(skip_nulls=False) | |
def agg(self, *exprs: ArrowExpr) -> ArrowDataFrame: | |
self._ensure_all_simple(exprs) | |
aggs: list[tuple[str, Aggregation, AggregateOptions | None]] = [] | |
expected_pyarrow_column_names: list[str] = self._keys.copy() | |
new_column_names: list[str] = self._keys.copy() | |
exclude = (*self._keys, *self._output_key_names) | |
grouped = self._grouped | |
for expr in exprs: | |
output_names, aliases = evaluate_output_names_and_aliases( | |
expr, self.compliant, exclude | |
) | |
if expr._depth == 0: | |
# e.g. `agg(nw.len())` | |
if expr._function_name != "len": # pragma: no cover | |
msg = "Safety assertion failed, please report a bug to https://github.com/narwhals-dev/narwhals/issues" | |
raise AssertionError(msg) | |
new_column_names.append(aliases[0]) | |
expected_pyarrow_column_names.append(f"{self._keys[0]}_count") | |
aggs.append((self._keys[0], "count", pc.CountOptions(mode="all"))) | |
continue | |
grouped, function_name, option = self._configure_agg(grouped, expr) | |
new_column_names.extend(aliases) | |
expected_pyarrow_column_names.extend( | |
[f"{output_name}_{function_name}" for output_name in output_names] | |
) | |
aggs.extend( | |
[(output_name, function_name, option) for output_name in output_names] | |
) | |
result_simple = grouped.aggregate(aggs) | |
# Rename columns, being very careful |
thanks for looking into this i'd like to rewrite this to use namedagg, so it's probably worth doing that first #1661 |
Thanks @MarcoGorelli, I've linked the two issues No objections from me on how we get there π Did you know
EditYeah I've taken more of a look and I get it now. It's the combination of using |
See docs/notes
@MarcoGorelli maybe a longshot, but was wondering if you were interested in a fix for the Both
|
Need to go through the failures, then add back old stuff that's required (#2680 (comment))
I missed the renaming part, so I've documented that a bit now
We don't need to repeat this in group_by anymore
Fixes the 2 tpch q1 failures #2680 (comment)
PandasLikeGroupBy
PandasLikeGroupBy
- Seems to be the most minimal change to resolve (#2680 (comment)) - Need to review what else is still needed
Maybe addresses (#2680 (comment))
β¦v/narwhals into simp-pandas-group-by
@FBruzzesi feel free to delete any docs that seem unnecessary The API has shrunk a lot in (91b5800), so if what's left is easier enough to understand without docs - then I'm not precious on keeping them π |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the back and forth @dangotbanned - I have no more objections! This looks π₯
It covers so many edge cases that a few times I found myself thinking "wait a minute, let me change this line, I don't think we need it", well actually we do π
nice one, well done both! |
Ahh I was curious about if this was fixed for all of them π₯³ |
Will close #2489
What type of PR is this? (check all applicable)
Related issues
PandasLikeGroupBy
Β #2489{Expr,Series}.{first,last}
Β #2528 (comment)If you have comments or can explain your changes, please do so below
Expr.(first|last)
less complicatedNote
If you're a
pandas
expert, please feel free to hop on and add some commits πTasks
**named_aggs
see thread