-
Notifications
You must be signed in to change notification settings - Fork 962
[REVIEW] Support Pandas 1.0+ #4546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 59 commits
7dbc000
094803b
944a992
86bce58
c7d7bea
ab78ba4
2717d32
fc34749
f84078c
b0454f3
7bb9f5e
2acb6c2
236db6b
d5fcb3c
5cf18b0
62d4cd8
89a8155
73c016a
155a4e7
37e7604
f1f2880
581e0b5
421f31e
ef4511a
187c07b
2649b6b
e544dc8
c4756d7
0036f4e
168b276
7039320
494d9cd
2947dd4
7c2782d
a6b5192
afd067f
2780052
17b10f2
0bd93b5
0202b22
e442376
fc2a976
02c169a
75970e1
9368564
f36d216
d55c3cb
6a0583a
7fa3495
cc2c9bb
a56bbfb
98ecb0c
d008e16
fddd84a
d053a01
36e2566
6305c73
ef922c7
b7b8ebd
ecacf11
470e480
0a119c3
e7351a7
3459448
3194645
b896d0e
1286410
cb1b638
6852e67
3b9ff5b
ac81f1f
7adc46e
9c06608
11d59d0
b79f623
7ae1097
9bba3f5
8210d73
47c69a0
8039af5
d7d3fc1
77d3886
b89eec9
5674032
66f6bb7
e7cc0f8
634782d
286ef80
92f5757
932c7ad
3af7173
cdb0fb9
3fe0120
73cca01
ec80735
b27ccd2
9ffd896
d461cba
acd3fad
0842d89
5acdeb3
e283200
dcf294a
6f356b1
8dec7a0
1c979c3
b6d2522
7514d02
c7b87f3
8d1d2d0
2aa5ac9
bd157ef
b8ef425
5bbeaf1
346eb85
55e3399
c16e30a
4b50cd5
423b7a2
80f5866
241b4d6
293415d
bdc05db
108805d
d237a3d
a824a3c
2a6b608
e4d9493
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -501,24 +501,30 @@ def find_and_replace(self, to_replace, replacement, all_nan): | |
""" | ||
replaced = column.as_column(self.cat().codes) | ||
|
||
to_replace_col, replacement_col = [], [] | ||
new_cats = cudf.Series(self.dtype.categories) | ||
for old_val, new_val in zip(to_replace, replacement): | ||
if new_val not in self.dtype.categories: | ||
new_cats = new_cats.replace(old_val, new_val) | ||
else: | ||
to_replace_col.append(self._encode(old_val)) | ||
replacement_col.append(self._encode(new_val)) | ||
|
||
to_replace_col = column.as_column( | ||
np.asarray( | ||
[self._encode(val) for val in to_replace], dtype=replaced.dtype | ||
) | ||
np.array(to_replace_col, dtype=replaced.dtype) | ||
kkraus14 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) | ||
replacement_col = column.as_column( | ||
np.asarray( | ||
[self._encode(val) for val in replacement], | ||
dtype=replaced.dtype, | ||
) | ||
np.array(replacement_col, dtype=replaced.dtype) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we just call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is what I tried initially. The problem I ran into was that we can't currently There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had to rework this a bit. I hadn't realized that pandas categoricals require the codes to sequentially start from zero with no "gaps". This means we can't have the categories
kkraus14 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) | ||
|
||
kkraus14 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
replaced = column.as_column(self.cat().codes) | ||
kkraus14 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
output = libcudf.replace.replace( | ||
replaced, to_replace_col, replacement_col | ||
) | ||
|
||
return column.build_categorical_column( | ||
categories=self.dtype.categories, | ||
categories=new_cats, | ||
kkraus14 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
codes=column.as_column(output.base_data, dtype=output.dtype), | ||
mask=output.base_mask, | ||
offset=output.offset, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a few strange things going on here. The rabbit hole starts with the following change in behavior that started in pandas 1.0+:
pandas-dev/pandas#31302
It seems as though in pandas 0.25.3, the
min_periods
parameter was being ignored forcount
aggregations. The justification for this seems to have been that the designers wantedrolling.count
to behave similarly to othercount
methods elsewhere in the API - e.g. always counting all the non-null values in the group of data in question. It looks like the code here was meant to makecuDF
do the same. Concretely, this is the previous behavior in pandas:Here the
min_periods
parameter ends up getting ignored, evidently in two cases. Firstly, the very first window technically includes the nonexistent value at index position-1
. Sincemin_periods
is meant to be2
, so we should get aNaN
here. Secondly, the last value in the original series isNaN
, and sincecenter=True
the last value in the resulting series should also beNaN
as there's no way the window of length2
can have2
valid values at that point.Great I think, so the issue should be solved with Pandas 1.0 since a PR was merged for this, and I can remove this code in cuDF. In doing so we get the answer we'd expect in this case:
But here's what I actually get in Pandas:
I am not sure if I'm not interpreting what this is supposed to be doing correctly, or what is going on really, I find this quite confusing and am still trying to make sense of it. Perhaps this is still a bug in pandas, but as far as our tests go, we're kind of 'darned if we do, darned if we don't' with these two lines of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @shwina who spent a lot of time in
rolling
originally who may have some thoughts / insightsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused with Pandas behaviour here as well. Let's raise an issue, and xfail current tests for
count
?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raised pandas-dev/pandas#34466, waiting to see if it's simply something I do not understand about what is intended here.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I at least understand the results we're seeing in this repro around why for instance, the last element is considered a valid entry in this case. My understanding is that in pandas 1.0, it was decided that
count
should respectmin_periods
as it relates to the actual window size - meaning that if an element literally does not exist (is 'off the edge' of the series) then it won't count. For example if the window size is 2, the first element will be NaN because we won't have two elements to compare.This is distinct however from the behavior around NaNs for
count
. In pandas,NaN
doesn't contribute to thecount
of anything:The pandas devs make the point that this normalizes
rolling.count
against other kinds ofcount
s across the codebase. Just to be concrete about this, this particular function breaks from the other rolling functions. For instancemean
still requires it getsmin_periods
non-nan values.cc @kkraus14 @shwina