feat: add DataFrame.top_k and LazyFrame.top_k #2977

Open: wants to merge 17 commits into base: main

Conversation

raisadz
Contributor

@raisadz raisadz commented Aug 12, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@raisadz raisadz added the pyspark Issue is related to pyspark backend label Aug 12, 2025
@raisadz raisadz marked this pull request as ready for review August 12, 2025 16:34
Member

@FBruzzesi FBruzzesi left a comment

Thanks @raisadz - I left a few comments. The only one I really care about is the one on length input validation at the narwhals level.

def top_k(
    self, k: int, *, by: str | Iterable[str], reverse: bool | Sequence[bool] = False
) -> Self:
    flatten_by = flatten([by])
Member

Can we add a check so that if reverse is a sequence and its length differs from that of flatten_by, an exception is raised? This guarantees that zip(by, reverse) at the compliant level behaves the same as zip_strict.

From polars:

df = pl.DataFrame(
    {
        "a": ["a", "b", "a", "b", "b", "c"],
        "b": [2, 1, 1, 3, 2, 1],
    }
)

df.top_k(4, by=["b", "a"], reverse=[True])

ValueError: the length of reverse (1) does not match the length of by (2)
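
To illustrate why the check matters, here is a minimal sketch of what happens without it. `zip_strict_sketch` is a hypothetical stand-in written for this example; it is not the `zip_strict` helper from #3003, whose implementation is not shown in this thread.

```python
by = ["b", "a"]
reverse = [True]

# Plain zip stops at the shorter input, so the second column is silently dropped.
pairs = list(zip(by, reverse))


def zip_strict_sketch(left, right):
    """Hypothetical strict zip: raise instead of truncating on a length mismatch."""
    left, right = list(left), list(right)
    if len(left) != len(right):
        msg = f"length mismatch: {len(left)} != {len(right)}"
        raise ValueError(msg)
    return list(zip(left, right))
```

With the inputs above, `pairs` contains only the pairing for `"b"`, while `zip_strict_sketch(by, reverse)` raises a `ValueError`.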

Member

@FBruzzesi FBruzzesi Aug 19, 2025

@raisadz I would still prefer to add a check at this level to also align the error with polars (notice that the output of flatten is a list anyway), but feel free to merge. We can follow up on it

Member

I think there are some other places where this would be useful (like sort), so we could probably make a validation utility for this and use it in multiple places
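
A shared validation utility along these lines might look like the sketch below. The function name `check_flag_length` and its signature are assumptions made for illustration; nothing in this thread specifies how narwhals would actually implement it. The error message mirrors the polars wording quoted earlier.

```python
from typing import List, Sequence, Union


def check_flag_length(
    by: Sequence[str], flag: Union[bool, Sequence[bool]], *, name: str
) -> List[bool]:
    """Hypothetical helper: broadcast a single bool, or verify a sequence matches `by`."""
    if isinstance(flag, bool):
        # A single bool applies to every column in `by`.
        return [flag] * len(by)
    flags = list(flag)
    if len(flags) != len(by):
        msg = (
            f"the length of `{name}` ({len(flags)}) "
            f"does not match the length of `by` ({len(by)})"
        )
        raise ValueError(msg)
    return flags
```

The same helper could then serve `top_k` with `name="reverse"` and `sort` with `name="descending"`, so both report mismatches consistently.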

@@ -409,6 +409,24 @@ def sort(self, *by: str, descending: bool | Sequence[bool], nulls_last: bool) ->
        )
        return self._with_native(self.native.sort(*it))

    def top_k(self, k: int, *, by: Iterable[str], reverse: bool | Sequence[bool]) -> Self:
        df = self.native  # noqa: F841
Member

If you prefix the variable name with an underscore (_df) you can avoid the # noqa: F841 flag. It's hacky I know
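
A minimal sketch of that suggestion, assuming ruff's default `dummy-variable-rgx` setting (which treats underscore-prefixed names as intentionally unused). The function name and body are placeholders for illustration, not the code from this PR.

```python
def top_k_sketch(native_frame):
    # The leading underscore marks the binding as intentionally unused,
    # so F841 (unused local variable) is not reported and no noqa is needed.
    _df = native_frame
    raise NotImplementedError
```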

@FBruzzesi FBruzzesi added the enhancement New feature or request label Aug 16, 2025
@raisadz raisadz mentioned this pull request Aug 17, 2025
10 tasks
@raisadz
Contributor Author

raisadz commented Aug 17, 2025

Thanks for the review @FBruzzesi ! I addressed your comments and will add zip_strict from #3003 after it is merged

Member

@MarcoGorelli MarcoGorelli left a comment

nice, thanks both @raisadz and @FBruzzesi !

happy to ship it if there's no further comments

Labels
enhancement New feature or request pyspark Issue is related to pyspark backend
Development

Successfully merging this pull request may close these issues.

enh?: {DataFrame/LazyFrame}.top_k
3 participants