-
Notifications
You must be signed in to change notification settings - Fork 162
feat: add DataFrame.top_k
and LazyFrame.top_k
#2977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @raisadz - I left a few comments, only one which I really care about which is about length input validation at the narwhals level
def top_k( | ||
self, k: int, *, by: str | Iterable[str], reverse: bool | Sequence[bool] = False | ||
) -> Self: | ||
flatten_by = flatten([by]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a check that if reverse
is a sequence, and it's length is different than flatten_by
, then an exception is raise? This guarantees that zip(by, reverse)
at the compliant level is same as zip_strict
.
From polars:
df = pl.DataFrame(
{
"a": ["a", "b", "a", "b", "b", "c"],
"b": [2, 1, 1, 3, 2, 1],
}
)
df.top_k(4, by=["b", "a"], reverse=[True])
ValueError: the length of
reverse
(1) does not match the length ofby
(2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raisadz I would still prefer to add a check at this level to also align the error with polars (notice that the output of flatten
is a list anyway), but feel free to merge. We can follow up on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think there's some other places where this would be useful (like sort
) so we could probably make a validation utility for this and use it in multiple places
narwhals/_duckdb/dataframe.py
Outdated
@@ -409,6 +409,24 @@ def sort(self, *by: str, descending: bool | Sequence[bool], nulls_last: bool) -> | |||
) | |||
return self._with_native(self.native.sort(*it)) | |||
|
|||
def top_k(self, k: int, *, by: Iterable[str], reverse: bool | Sequence[bool]) -> Self: | |||
df = self.native # noqa: F841 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you prefix the variable name with an underscore (_df
) you can avoid the # noqa: F841
flag. It's hacky I know
Co-authored-by: Francesco Bruzzesi <[email protected]>
Co-authored-by: Francesco Bruzzesi <[email protected]>
Thanks for the review @FBruzzesi ! I addressed your comments and will add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What type of PR is this? (check all applicable)
Related issues
{DataFrame/LazyFrame}.top_k
Β #2947Checklist
If you have comments or can explain your changes, please do so below