Skip to content
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -379,6 +379,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.groupby` not offering selection by column name when ``axis=1`` (:issue:`27614`)
- Bug in :meth:`DataFrameGroupby.agg` not able to use lambda function with named aggregation (:issue:`27519`)
- Bug in :meth:`DataFrame.groupby` losing column name information when grouping by a categorical column (:issue:`28787`)
- Bug in :meth:`DataFrameGroupBy.quantile` where NA values in the grouping could cause segfaults or incorrect results (:issue:`28882`)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have removed this before merging; will clean up after backport


Reshaping
^^^^^^^^^
Expand Down
3 changes: 3 additions & 0 deletions pandas/_libs/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -763,6 +763,9 @@ def group_quantile(ndarray[float64_t] out,
with nogil:
for i in range(N):
lab = labels[i]
if lab == -1: # NA group label
continue

counts[lab] += 1
if not mask[i]:
non_na_counts[lab] += 1
Expand Down
23 changes: 23 additions & 0 deletions pandas/tests/groupby/test_function.py
Original file line number Diff line number Diff line change
Expand Up @@ -1373,6 +1373,29 @@ def test_quantile_out_of_bounds_q_raises():
g.quantile(-1)


def test_quantile_missing_group_values_no_segfaults():
# GH 28662
data = np.array([1.0, np.nan, 1.0])
df = pd.DataFrame(dict(key=data, val=range(3)))

# Random segfaults; would have been guaranteed in loop
grp = df.groupby("key")
for _ in range(100):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i take it this is doesn't reliably segfault on the first try?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or i could just read the comment three lines up. never mind

grp.quantile()


def test_quantile_missing_group_values_correct_results():
# GH 28662
data = np.array([1.0, np.nan, 3.0, np.nan])
df = pd.DataFrame(dict(key=data, val=range(4)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to re-use the setup from the other test? could just assert the result after the loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different frame shapes. I think segfaults were occurring with oddly sized frames and bad results with evenly sized. Might be some other conflating factor


result = df.groupby("key").quantile()
expected = pd.DataFrame(
[1.0, 3.0], index=pd.Index([1.0, 3.0], name="key"), columns=["val"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the expected should be [0, 2], but also on 0.24.2 the columns index name was the q value. not sure if this is important.

>>> pd.__version__
'0.24.2'
>>>
>>> import numpy as np
>>>
>>> data = np.array([1.0, np.nan, 3.0, np.nan])
>>> df = pd.DataFrame(dict(key=data, val=range(4)))
>>>
>>>
>>> df.groupby("key").quantile()
0.5  val
key
1.0  0.0
3.0  2.0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind that this is a groupby and each element belongs to its own group. [1, 3] is essentially the identity of the non-NA groupings and should be correct

w.r.t. the column index name that probably goes back to https://github.com/pandas-dev/pandas/pull/20405/files#r208366338 which was an inconsistency in 0.24.2

)
tm.assert_frame_equal(result, expected)


# pipe
# --------------------------------

Expand Down