
[REVIEW] Support Pandas 1.0+ #4546


Merged: 128 commits, merged Jun 26, 2020

Commits
7dbc000
Remove FrozenNDArray
brandon-b-miller Mar 16, 2020
094803b
test_categorical_binary_add raises a different error
brandon-b-miller Mar 16, 2020
944a992
allow multiindex codes to come from numpy array
brandon-b-miller Mar 16, 2020
86bce58
no need to materialize categorical in transpose
brandon-b-miller Mar 16, 2020
c7d7bea
test_to_from_pandas casts before roundtripping
brandon-b-miller Mar 16, 2020
ab78ba4
pd.core.indexes.base.Index name is now a property
brandon-b-miller Mar 16, 2020
2717d32
circumvent pandas is_bool_dtype behavior
brandon-b-miller Mar 17, 2020
fc34749
cant delete a property
brandon-b-miller Mar 17, 2020
f84078c
astype() no longer accepts kwargs
brandon-b-miller Mar 18, 2020
b0454f3
create basic test for nullable integer type
brandon-b-miller Mar 20, 2020
7bb9f5e
create basic test for nullable boolean type
brandon-b-miller Mar 20, 2020
2acb6c2
implement _cudf_nullable_pd_dtypes
brandon-b-miller Mar 20, 2020
236db6b
rework NumericalColumn.to_pandas()
brandon-b-miller Mar 20, 2020
d5fcb3c
create basic test for nullable string type
brandon-b-miller Mar 20, 2020
5cf18b0
rework StringColumn.to_pandas()
brandon-b-miller Mar 20, 2020
62d4cd8
Merge branch 'branch-0.14' into fea-support-pandas-1
brandon-b-miller Mar 24, 2020
89a8155
test_avro_reader_basic: drop to pandas before cast
brandon-b-miller Mar 24, 2020
73c016a
only set index if not none in string.to_pandas()
brandon-b-miller Mar 24, 2020
155a4e7
test_column_offset_and_size casts to string if object
brandon-b-miller Mar 24, 2020
37e7604
test_repeat: hacky string special casing
brandon-b-miller Mar 24, 2020
f1f2880
assert_eq normalizes string to object datatypes
brandon-b-miller Mar 26, 2020
581e0b5
move cast to after everything is pandas
brandon-b-miller Mar 26, 2020
421f31e
assert_eq instead of use pandas testing directly
brandon-b-miller Mar 26, 2020
ef4511a
test_dataframe_hash_partition_masked_value expects pd.NA not -1
brandon-b-miller Mar 26, 2020
187c07b
fix more dataframe.py tests
brandon-b-miller Mar 26, 2020
2649b6b
merge 0.14
brandon-b-miller Apr 1, 2020
e544dc8
fix string.py merge
brandon-b-miller Apr 1, 2020
c4756d7
Merge branch 'branch-0.14' into fea-support-pandas-1
brandon-b-miller Apr 3, 2020
0036f4e
pass test_string_export results through assert_eq
brandon-b-miller Apr 6, 2020
168b276
fix stringcolumn imports
brandon-b-miller Apr 9, 2020
7039320
test_string_cat: no lists, also a bug
brandon-b-miller Apr 9, 2020
494d9cd
numpy array is_list_like
brandon-b-miller Apr 9, 2020
2947dd4
merge 0.14 and resolve conflicts, broken
brandon-b-miller May 18, 2020
7c2782d
astype takes kwargs again
brandon-b-miller May 18, 2020
a6b5192
remove new to_pandas() functions for now
brandon-b-miller May 18, 2020
afd067f
remove kwargs from test_series_astype_null_cases
brandon-b-miller May 18, 2020
2780052
handle failure in pandas 1.0.3 here
brandon-b-miller May 19, 2020
17b10f2
cleanup
brandon-b-miller May 19, 2020
0bd93b5
workaround expected format for astype
brandon-b-miller May 19, 2020
0202b22
test_datetime_unique: circumvent new pandas bhvr.
brandon-b-miller May 19, 2020
e442376
workaround pandas no longer coercing str to dt
brandon-b-miller May 19, 2020
fc2a976
remove _cudf_nullable_pd_dtypes for now
brandon-b-miller May 19, 2020
02c169a
fails due to pandas bug in comment
brandon-b-miller May 19, 2020
75970e1
is_bool_dtype(slice) seems to be fixed
brandon-b-miller May 19, 2020
9368564
is_bool_dtype handled tuples probably by mistake
brandon-b-miller May 19, 2020
f36d216
xfail nullable integer and bool types
brandon-b-miller May 19, 2020
d55c3cb
fix parquet tests due to rename
brandon-b-miller May 20, 2020
6a0583a
Revert "fix parquet tests due to rename"
brandon-b-miller May 21, 2020
7fa3495
actually fix parquet tests
brandon-b-miller May 21, 2020
cc2c9bb
actually, really fix parquet tests
brandon-b-miller May 21, 2020
a56bbfb
rolling cython respects min_periods
brandon-b-miller May 22, 2020
98ecb0c
test string astype uses expected default format
brandon-b-miller May 22, 2020
d008e16
match pandas astype function signatureA
brandon-b-miller May 26, 2020
fddd84a
set default datetime to string format if not already set
brandon-b-miller May 26, 2020
d053a01
cleanup and roll back dtype changes
brandon-b-miller May 26, 2020
36e2566
correctly pass kwargs
brandon-b-miller May 26, 2020
6305c73
roll back assert_eq changes wrt new pandas dtypes
brandon-b-miller May 26, 2020
ef922c7
categorical replace works if new category isn't present
brandon-b-miller May 27, 2020
b7b8ebd
roll back superceded changes
brandon-b-miller May 27, 2020
ecacf11
update test_string_cat circumvent pandas bugs/index alignment
brandon-b-miller May 27, 2020
470e480
bump pandas to 1.0.3+
brandon-b-miller May 27, 2020
0a119c3
can no longer create a pd.DatetimeIndex using endpoints
brandon-b-miller May 28, 2020
e7351a7
cleanup
brandon-b-miller May 28, 2020
3459448
Merge branch 'branch-0.15' into fea-support-pandas-1
brandon-b-miller May 28, 2020
3194645
changelog
brandon-b-miller May 28, 2020
b896d0e
remove duplicated code
brandon-b-miller May 29, 2020
1286410
correctly set new categories
brandon-b-miller May 29, 2020
cb1b638
test categorical ordered dtype directly
brandon-b-miller May 29, 2020
6852e67
categorical astype tests still test ordered/unordered to other
brandon-b-miller May 29, 2020
3b9ff5b
update categorical astype tests
brandon-b-miller May 29, 2020
ac81f1f
drop bool column from json test
brandon-b-miller Jun 1, 2020
7adc46e
reference pandas issue directly
brandon-b-miller Jun 1, 2020
9c06608
Update python/cudf/cudf/utils/dtypes.py
brandon-b-miller Jun 1, 2020
11d59d0
Merge branch 'branch-0.15' into fea-support-pandas-1
brandon-b-miller Jun 3, 2020
b79f623
test datetime to string casting
brandon-b-miller Jun 3, 2020
7ae1097
implement dtype dependent string cast formatting
brandon-b-miller Jun 3, 2020
9bba3f5
adjust pandas version requirements
brandon-b-miller Jun 3, 2020
8210d73
update test_isin_dataframe
brandon-b-miller Jun 4, 2020
47c69a0
Merge branch 'branch-0.15' into fea-support-pandas-1
brandon-b-miller Jun 4, 2020
8039af5
fix _categories_equal for all situations hopefully
brandon-b-miller Jun 4, 2020
d7d3fc1
fix categorical column
brandon-b-miller Jun 5, 2020
77d3886
pick a format for test_series_astype_null_cases datetime
brandon-b-miller Jun 5, 2020
b89eec9
fix test_df_astype_datetime_to_other
brandon-b-miller Jun 5, 2020
5674032
cant make pandas df from cupy arrays, worked by accident
brandon-b-miller Jun 5, 2020
66f6bb7
rework and test categorical replace
brandon-b-miller Jun 9, 2020
e7cc0f8
update test_string_numeric_astype
brandon-b-miller Jun 9, 2020
634782d
bump pandas version
brandon-b-miller Jun 9, 2020
286ef80
test sort_values keep_index parameter
brandon-b-miller Jun 9, 2020
92f5757
implement sort_values keep_index parameter
brandon-b-miller Jun 9, 2020
932c7ad
_categories_equal doesn't gather index
brandon-b-miller Jun 9, 2020
3af7173
categorical replace remix
brandon-b-miller Jun 10, 2020
cdb0fb9
remove breakpoint
brandon-b-miller Jun 10, 2020
3fe0120
replace instead of scatter for every element
brandon-b-miller Jun 10, 2020
73cca01
align sort_values func signatures with pandas
brandon-b-miller Jun 10, 2020
ec80735
merge 0.15
brandon-b-miller Jun 10, 2020
b27ccd2
update try/excepts in tests
brandon-b-miller Jun 10, 2020
9ffd896
merge 0.15
brandon-b-miller Jun 11, 2020
d461cba
build dtype_replace without iteration
brandon-b-miller Jun 11, 2020
acd3fad
remove debugging code
brandon-b-miller Jun 11, 2020
0842d89
fix dtypes.py
brandon-b-miller Jun 11, 2020
5acdeb3
Apply suggestions from code review
brandon-b-miller Jun 11, 2020
e283200
do not gather index
brandon-b-miller Jun 11, 2020
dcf294a
Merge branch 'fea-support-pandas-1' of https://github.com/brandon-b-m…
brandon-b-miller Jun 11, 2020
6f356b1
Fix unnecessary dtypes changes
kkraus14 Jun 12, 2020
8dec7a0
Merge branch 'branch-0.15' into fea-support-pandas-1
brandon-b-miller Jun 12, 2020
1c979c3
xfail rolling count tests for nonzero data
brandon-b-miller Jun 12, 2020
b6d2522
remove dtype changes from earlier pr
brandon-b-miller Jun 12, 2020
7514d02
style
brandon-b-miller Jun 12, 2020
c7b87f3
Merge branch 'fea-support-pandas-1' of https://github.com/brandon-b-m…
brandon-b-miller Jun 12, 2020
8d1d2d0
merge 0.15 and fix test_string_numeric_astype
brandon-b-miller Jun 15, 2020
2aa5ac9
fix tests
brandon-b-miller Jun 15, 2020
bd157ef
style
brandon-b-miller Jun 15, 2020
b8ef425
Update python/cudf/cudf/tests/test_dataframe.py
brandon-b-miller Jun 15, 2020
5bbeaf1
move try/excepts around
brandon-b-miller Jun 15, 2020
346eb85
use numpy in string tests
brandon-b-miller Jun 15, 2020
55e3399
style
brandon-b-miller Jun 15, 2020
c16e30a
update build.sh to test pandas 1.0
brandon-b-miller Jun 16, 2020
4b50cd5
-f -> --force
brandon-b-miller Jun 16, 2020
423b7a2
fix dask_cudf tests
brandon-b-miller Jun 17, 2020
80f5866
style
brandon-b-miller Jun 17, 2020
241b4d6
Merge branch 'branch-0.15' into fea-support-pandas-1
brandon-b-miller Jun 17, 2020
293415d
can't make a datetimeindex with start/stop anymore
brandon-b-miller Jun 17, 2020
bdc05db
add comments to CategoricalColumn.find_and_replace
brandon-b-miller Jun 22, 2020
108805d
update test_numerical.py
brandon-b-miller Jun 22, 2020
d237a3d
Remove Pandas 1.0 installation from GPU build script
kkraus14 Jun 25, 2020
a824a3c
Forgot line
kkraus14 Jun 25, 2020
2a6b608
Merge branch 'branch-0.15' of https://github.com/rapidsai/cudf into f…
rgsl888prabhu Jun 26, 2020
e4d9493
fix test case
rgsl888prabhu Jun 26, 2020
Conversations
2 changes: 0 additions & 2 deletions python/cudf/cudf/_lib/rolling.pyx
@@ -75,8 +75,6 @@ def rolling(Column source_column, Column pre_column_window,
agg)
)
else:
if op == "count":
min_periods = 0
Contributor Author

There are a few strange things going on here. The rabbit hole starts with the following change in behavior introduced in pandas 1.0+:

pandas-dev/pandas#31302

It seems as though in pandas 0.25.3, the min_periods parameter was being ignored for count aggregations. The justification for this seems to have been that the designers wanted rolling.count to behave similarly to other count methods elsewhere in the API - e.g. always counting all the non-null values in the group of data in question. It looks like the code here was meant to make cuDF do the same. Concretely, this is the previous behavior in pandas:

>>> pd.__version__
'0.25.3'
>>> pd.Series([1,1,1,None]).rolling(2, min_periods=2, center=True).count()
0    1.0
1    2.0
2    2.0
3    1.0
dtype: float64

Here the min_periods parameter ends up getting ignored, evidently in two cases. Firstly, the very first window technically includes the nonexistent value at index position -1. Since min_periods is meant to be 2, we should get a NaN here. Secondly, the last value in the original series is NaN, and since center=True the last value in the resulting series should also be NaN, as there's no way the window of length 2 can have 2 valid values at that point.

Great, I thought: the issue should be solved with pandas 1.0 since a PR was merged for this, and I can remove this code from cuDF. In doing so we get the answer we'd expect in this case:

>>> cudf.Series.from_pandas(pd.Series([1,1,1,None])).rolling(2, min_periods=2, center=True).count()
0    null
1       2
2       2
3    null
dtype: int32

But here's what I actually get in Pandas:

>>> pd.__version__
'1.0.3'
>>> pd.Series([1,1,1,None]).rolling(2, min_periods=2, center=True).count()
0    NaN
1    2.0
2    2.0
3    1.0
dtype: float64

I am not sure whether I'm misinterpreting what this is supposed to do, or what is really going on; I find this quite confusing and am still trying to make sense of it. Perhaps this is still a bug in pandas, but as far as our tests go, we're 'darned if we do, darned if we don't' with these two lines of code.

Collaborator

cc @shwina who spent a lot of time in rolling originally who may have some thoughts / insights

Contributor

Looking 👍

Contributor

I'm confused by the pandas behaviour here as well. Let's raise an issue and xfail the current tests for count?

Contributor Author
@brandon-b-miller May 29, 2020

Raised pandas-dev/pandas#34466; waiting to see if it's simply something I do not understand about what is intended here.

Contributor Author
@brandon-b-miller Jun 9, 2020

So I at least understand the results we're seeing in this repro now, e.g. why the last element is considered a valid entry in this case. My understanding is that in pandas 1.0 it was decided that count should respect min_periods with respect to the actual window size: if an element literally does not exist (is 'off the edge' of the series), it won't count toward min_periods. For example, if the window size is 2, the first element will be NaN because there aren't two in-bounds elements to compare.

This is distinct, however, from the behavior around NaNs for count. In pandas, NaN doesn't contribute to any count:

>>> pd.Series([None, 1, 2]).count()
2

The pandas devs make the point that this normalizes rolling.count with the other kinds of counts across the codebase. To be concrete, this makes count break from the other rolling functions: mean, for instance, still requires min_periods non-NaN values.

>>> pd.Series([1,1,1,None]).rolling(2, min_periods=2, center=True).mean()
0    NaN
1    1.0
2    1.0
3    NaN
dtype: float64
>>> pd.Series([1,1,1,None]).rolling(2, min_periods=1, center=True).mean()
0    1.0
1    1.0
2    1.0
3    1.0

cc @kkraus14 @shwina
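The two competing interpretations can be sketched in plain Python. This is a toy model, not the pandas or cuDF implementation; `rolling_count` and its `window_based` flag are made up here for illustration, and it hardcodes the `window=2, center=True` repro from above:

```python
import math

def rolling_count(data, min_periods=2, window_based=True):
    """Toy model of pd.Series(data).rolling(2, min_periods, center=True).count().

    window_based=True  -> min_periods counts in-bounds positions, NaN or not
                          (pandas >= 1.0 behaviour for count, per pandas-dev/pandas#31302)
    window_based=False -> min_periods counts non-NaN values only
                          (what cuDF produces once the special case is removed)
    """
    out = []
    for i in range(len(data)):
        # observed window for window=2, center=True: positions [i-1, i]
        window = [data[j] for j in (i - 1, i) if 0 <= j < len(data)]
        valid = sum(1 for v in window if not math.isnan(v))
        met = (len(window) if window_based else valid) >= min_periods
        out.append(float(valid) if met else float("nan"))
    return out

data = [1.0, 1.0, 1.0, float("nan")]
print(rolling_count(data, window_based=True))   # [nan, 2.0, 2.0, 1.0] like pandas 1.0.3
print(rolling_count(data, window_based=False))  # [nan, 2.0, 2.0, nan] like cuDF here
```

Both interpretations agree on the first element (only one in-bounds position), and disagree only on the last one, which is exactly the discrepancy in the repro above.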

    c_min_periods = min_periods
if center:
c_window = (window // 2) + 1
22 changes: 14 additions & 8 deletions python/cudf/cudf/core/column/categorical.py
@@ -501,24 +501,30 @@ def find_and_replace(self, to_replace, replacement, all_nan):
"""
replaced = column.as_column(self.cat().codes)

to_replace_col, replacement_col = [], []
new_cats = cudf.Series(self.dtype.categories)
for old_val, new_val in zip(to_replace, replacement):
if new_val not in self.dtype.categories:
new_cats = new_cats.replace(old_val, new_val)
else:
to_replace_col.append(self._encode(old_val))
replacement_col.append(self._encode(new_val))

to_replace_col = column.as_column(
np.asarray(
[self._encode(val) for val in to_replace], dtype=replaced.dtype
)
np.array(to_replace_col, dtype=replaced.dtype)
)
replacement_col = column.as_column(
np.asarray(
[self._encode(val) for val in replacement],
dtype=replaced.dtype,
)
np.array(replacement_col, dtype=replaced.dtype)
Collaborator

Can we just call as_column on to_replace and replacement and then batch encode all the values at once?

Contributor Author

I think this is what I tried initially. The problem I ran into was that we can't currently _encode a value that doesn't already exist as one of the categories in the dtype. So we need to adjust the dtype along the way, and I figured the easiest way of doing that was to just adjust the label that corresponds to those underlying integers, so no actual replacement has to happen.
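The relabelling idea can be shown with plain pandas (a sketch only; `replace_categories` is a hypothetical helper, not the cuDF code, and it only covers the branch where the new value is not an existing category):

```python
import pandas as pd

def replace_categories(cat: pd.Categorical, old, new) -> pd.Categorical:
    """Replace `old` with `new` by renaming the category label.

    The integer codes are untouched; only the dtype's labels change,
    so no element-wise replacement has to happen.
    """
    if new in cat.categories:
        # target label already exists: would need a real code-level replace
        raise NotImplementedError("code-level replace not sketched here")
    return cat.rename_categories({old: new})

c = pd.Categorical(["a", "b", "a"])
print(replace_categories(c, "a", "z"))  # values ['z', 'b', 'z'], same codes
```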

Contributor Author

I had to rework this a bit. I hadn't realized that pandas categoricals require the codes to start sequentially from zero with no "gaps". This means we can't have the categories ['one', 'two'] represented by the integers 0 and 2; it has to be 0 and 1. I believe cuDF doesn't actually have the same restriction internally, but since we interact with pandas for things like repr, it will start to throw errors if we aren't compatible.
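A quick demonstration of that restriction in plain pandas (not cuDF internals): codes must index contiguously into `categories`, with -1 reserved for nulls.

```python
import pandas as pd

# valid: codes are in the range -1 .. len(categories) - 1
ok = pd.Categorical.from_codes([0, 1, 0, -1], categories=["one", "two"])
print(ok)

# invalid: representing ["one", "two"] with codes 0 and 2 is rejected
try:
    pd.Categorical.from_codes([0, 2, 0], categories=["one", "two"])
except ValueError as exc:
    print("rejected:", exc)
```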

)

replaced = column.as_column(self.cat().codes)

output = libcudf.replace.replace(
replaced, to_replace_col, replacement_col
)

return column.build_categorical_column(
categories=self.dtype.categories,
categories=new_cats,
codes=column.as_column(output.base_data, dtype=output.dtype),
mask=output.base_mask,
offset=output.offset,
4 changes: 3 additions & 1 deletion python/cudf/cudf/core/column/column.py
@@ -1271,11 +1271,13 @@ def as_column(arbitrary, nan_as_null=None, dtype=None, length=None):
)

elif isinstance(arbitrary, pa.NullArray):
new_dtype = pd.api.types.pandas_dtype(dtype)
if (type(dtype) == str and dtype == "empty") or dtype is None:
new_dtype = pd.api.types.pandas_dtype(
arbitrary.type.to_pandas_dtype()
)
else:
new_dtype = pd.api.types.pandas_dtype(dtype)


if is_categorical_dtype(new_dtype):
arbitrary = arbitrary.dictionary_encode()
3 changes: 2 additions & 1 deletion python/cudf/cudf/core/column/datetime.py
@@ -137,7 +137,8 @@ def as_numerical_column(self, dtype, **kwargs):

def as_string_column(self, dtype, **kwargs):
from cudf.core.column import string

if not kwargs.get("format"):
kwargs["format"] = "%Y-%m-%d %H:%M:%S.%f"
if len(self) > 0:
return string._numeric_to_str_typecast_functions[
np.dtype(self.dtype)
7 changes: 4 additions & 3 deletions python/cudf/cudf/core/multiindex.py
@@ -68,7 +68,6 @@ def __init__(
names,
(
Sequence,
pd.core.indexes.frozen.FrozenNDArray,
pd.core.indexes.frozen.FrozenList,
),
):
@@ -85,7 +84,7 @@
raise ValueError("Must pass non-zero number of levels/codes")

if not isinstance(codes, DataFrame) and not isinstance(
codes[0], (Sequence, pd.core.indexes.frozen.FrozenNDArray)
codes[0], (Sequence, np.ndarray)
):
raise TypeError("Codes is not a Sequence of sequences")

@@ -470,7 +469,9 @@ def _index_and_downcast(self, result, index, index_key):
def _get_row_major(self, df, row_tuple):
from cudf import Series

if pd.api.types.is_bool_dtype(row_tuple):
if pd.api.types.is_bool_dtype(
list(row_tuple) if isinstance(row_tuple, tuple) else row_tuple
):
return df[row_tuple]

valid_indices = self._get_valid_indices_by_tuple(
5 changes: 2 additions & 3 deletions python/cudf/cudf/core/series.py
@@ -1417,7 +1417,7 @@ def as_mask(self):
"""
return self._column.as_mask()

def astype(self, dtype, copy=False, errors="raise", **kwargs):
def astype(self, dtype, copy=False, errors="raise"):
"""
Cast the Series to the given dtype

@@ -1439,7 +1439,6 @@ def astype(self, dtype, copy=False, errors="raise", **kwargs):
object.
- ``warn`` : prints last exceptions as warnings and
return original object.
**kwargs : extra arguments to pass on to the constructor

Returns
-------
@@ -1461,7 +1460,7 @@ def astype(self, dtype, copy=False, errors="raise", **kwargs):
if pd.api.types.is_dtype_equal(dtype, self.dtype):
return self.copy(deep=copy)
try:
data = self._column.astype(dtype, **kwargs)
data = self._column.astype(dtype)

return self._copy_construct(
data=data.copy(deep=True) if copy else data, index=self.index
2 changes: 1 addition & 1 deletion python/cudf/cudf/tests/test_categorical.py
@@ -145,7 +145,7 @@ def test_categorical_binary_add():

with pytest.raises(TypeError) as raises:
pdsr + pdsr
raises.match(r"Series cannot perform the operation \+")
raises.match("unsupported operand")

with pytest.raises(TypeError) as raises:
sr + sr
2 changes: 1 addition & 1 deletion python/cudf/cudf/tests/test_csv.py
@@ -395,7 +395,7 @@ def test_csv_reader_usecols_int_char(tmpdir):
assert len(out.columns) == len(df_out.columns)
assert len(out) == len(df_out)
pd.util.testing.assert_frame_equal(
df_out, out.to_pandas(), check_names=False
df_out, out.to_pandas(), check_names=False
)


60 changes: 24 additions & 36 deletions python/cudf/cudf/tests/test_dataframe.py
@@ -826,7 +826,6 @@ def test_dataframe_hash_partition_masked_value(nrows):
got_value = row.val
assert expected_value == got_value


@pytest.mark.parametrize("nrows", [3, 10, 50])
def test_dataframe_hash_partition_masked_keys(nrows):
gdf = DataFrame()
@@ -1439,13 +1438,6 @@ def test_dataframe_transpose_category(num_cols, num_rows):
got_function = gdf.transpose()
got_property = gdf.T

# materialize our categoricals because pandas
for name, col in got_function._data.items():
got_function[name] = col.astype(col.dtype.type)

for name, col in got_property._data.items():
got_property[name] = col.astype(col.dtype.type)

expect = pdf.transpose()

assert_eq(expect, got_function.to_pandas())
@@ -3314,13 +3306,11 @@ def test_series_astype_numeric_to_other(dtype, as_dtype):
def test_series_astype_string_to_other(as_dtype):
if "datetime64" in as_dtype:
data = ["2001-01-01", "2002-02-02", "2000-01-05"]
kwargs = {"format": "%Y-%m-%d"}
else:
data = ["1", "2", "3"]
kwargs = {}
psr = pd.Series(data)
gsr = gd.from_pandas(psr)
assert_eq(psr.astype(as_dtype), gsr.astype(as_dtype, **kwargs))
assert_eq(psr.astype(as_dtype), gsr.astype(as_dtype))


@pytest.mark.parametrize(
@@ -3338,7 +3328,7 @@ def test_series_astype_datetime_to_other(as_dtype):
data = ["2001-01-01", "2002-02-02", "2001-01-05"]
psr = pd.Series(data)
gsr = gd.from_pandas(psr)
assert_eq(psr.astype(as_dtype), gsr.astype(as_dtype, format="%Y-%m-%d"))
assert_eq(psr.astype(as_dtype), gsr.astype(as_dtype))


@pytest.mark.parametrize(
@@ -3357,22 +3347,20 @@ def test_series_astype_datetime_to_other(as_dtype):
def test_series_astype_categorical_to_other(as_dtype):
if "datetime64" in as_dtype:
data = ["2001-01-01", "2002-02-02", "2000-01-05", "2001-01-01"]
kwargs = {"format": "%Y-%m-%d"}
else:
data = [1, 2, 3, 1]
kwargs = {}
psr = pd.Series(data, dtype="category")
gsr = gd.from_pandas(psr)
assert_eq(psr.astype(as_dtype), gsr.astype(as_dtype, **kwargs))
assert_eq(psr.astype(as_dtype), gsr.astype(as_dtype))


@pytest.mark.parametrize("ordered", [True, False])
def test_series_astype_to_categorical_ordered(ordered):
psr = pd.Series([1, 2, 3, 1], dtype="category")
gsr = gd.from_pandas(psr)
assert_eq(
psr.astype("int32", ordered=ordered),
gsr.astype("int32", ordered=ordered),
psr.astype("int32"),
gsr.astype("int32"),
)


@@ -3393,7 +3381,7 @@ def test_series_astype_null_cases():

assert_eq(
gd.Series(data, dtype="datetime64[ms]"),
gd.Series(data).astype("datetime64[ms]", format="%Y-%m-%d"),
gd.Series(data).astype("datetime64[ms]"),
)

# categorical to other
@@ -3410,7 +3398,7 @@ def test_series_astype_null_cases():
assert_eq(
gd.Series(data, dtype="datetime64[ms]"),
gd.Series(data, dtype="category").astype(
"datetime64[ms]", format="%Y-%m-%d"
"datetime64[ms]"
),
)

@@ -3426,7 +3414,7 @@ def test_series_astype_null_cases():
dtype="datetime64[ms]",
),
gd.Series(["2001-01-01", "2001-02-01", None, "2001-03-01"]).astype(
"datetime64[ms]", format="%Y-%m-%d"
"datetime64[ms]"
),
)

@@ -3436,12 +3424,11 @@ def test_series_astype_null_cases():
)

# datetime to other
data = ["2001-01-01", "2001-02-01", None, "2001-03-01"]

data = ["2001-01-01 00:00:00.000000", "2001-02-01 00:00:00.000000", None, "2001-03-01 00:00:00.000000"]
assert_eq(
gd.from_pandas(pd.Series(data)),
gd.from_pandas(pd.Series(data, dtype="datetime64[ns]")).astype(
"str", format="%Y-%m-%d"
"str"
),
)

@@ -4023,10 +4010,14 @@ def test_isin_dataframe(data, values):
with pytest.raises(TypeError):
gdf.isin(values)
else:
expected = pdf.isin(values)
if isinstance(values, (pd.DataFrame, pd.Series)):
values = gd.from_pandas(values)
got = gdf.isin(values)
try:
expected = pdf.isin(values)
if isinstance(values, (pd.DataFrame, pd.Series)):
values = gd.from_pandas(values)
got = gdf.isin(values)
except ValueError as e:
if str(e) == "Lengths must match.":
pytest.xfail(reason='xref https://github.com/pandas-dev/pandas/issues/34256')

assert_eq(got, expected)

@@ -4173,13 +4164,11 @@ def test_df_astype_string_to_other(as_dtype):
# change None to "NaT" after this issue is fixed:
# https://github.com/rapidsai/cudf/issues/5117
data = ["2001-01-01", "2002-02-02", "2000-01-05", None]
kwargs = {"format": "%Y-%m-%d"}
elif as_dtype == "int32":
data = [1, 2, 3]
kwargs = {}
elif as_dtype == "category":
data = ["1", "2", "3", None]
kwargs = {}
elif "float" in as_dtype:
data = [1.0, 2.0, 3.0, np.nan]
kwargs = {}
@@ -4196,7 +4185,7 @@ def test_df_astype_string_to_other(as_dtype):
expect["foo"] = expect_data
expect["bar"] = expect_data

got = gdf.astype(as_dtype, **kwargs)
got = gdf.astype(as_dtype)
assert_eq(expect, got)


@@ -4212,7 +4201,7 @@
],
)
def test_df_astype_datetime_to_other(as_dtype):
data = ["1991-11-20", "2004-12-04", "2016-09-13", None]
data = ["1991-11-20 00:00:00.000000", "2004-12-04 00:00:00.000000", "2016-09-13 00:00:00.000000", None]

gdf = DataFrame()
expect = DataFrame()
Expand All @@ -4237,7 +4226,7 @@ def test_df_astype_datetime_to_other(as_dtype):
expect["foo"] = Series(data, dtype=as_dtype)
expect["bar"] = Series(data, dtype=as_dtype)

got = gdf.astype(as_dtype, format="%Y-%m-%d")
got = gdf.astype(as_dtype)

assert_eq(expect, got)

@@ -4237,7 +4226,7 @@ def test_df_astype_datetime_to_other(as_dtype):
def test_df_astype_categorical_to_other(as_dtype):
if "datetime64" in as_dtype:
data = ["2001-01-01", "2002-02-02", "2000-01-05", "2001-01-01"]
kwargs = {"format": "%Y-%m-%d"}
else:
data = [1, 2, 3, 1]
kwargs = {}
Expand All @@ -4267,7 +4255,7 @@ def test_df_astype_categorical_to_other(as_dtype):
pdf["foo"] = psr
pdf["bar"] = psr
gdf = DataFrame.from_pandas(pdf)
assert_eq(pdf.astype(as_dtype), gdf.astype(as_dtype, **kwargs))
assert_eq(pdf.astype(as_dtype), gdf.astype(as_dtype))


@pytest.mark.parametrize("ordered", [True, False])
@@ -4267,7 +4255,7 @@ def test_df_astype_categorical_to_other(as_dtype):
gdf = DataFrame.from_pandas(pdf)

assert_eq(
gdf.astype("int32", ordered=ordered),
gdf.astype("int32", ordered=ordered),
gdf.astype("int32"),
gdf.astype("int32"),
)


7 changes: 2 additions & 5 deletions python/cudf/cudf/tests/test_datetime.py
@@ -362,7 +362,7 @@ def test_typecast_from_datetime_to_datetime(data, from_dtype, to_dtype):
@pytest.mark.parametrize("data", [numerical_data()])
@pytest.mark.parametrize("nulls", ["some", "all"])
def test_to_from_pandas_nulls(data, nulls):
pd_data = pd.Series(data.copy())
pd_data = pd.Series(data.copy().astype('datetime64[ns]'))
if nulls == "some":
# Fill half the values with NaT
pd_data[list(range(0, len(pd_data), 2))] = np.datetime64("nat", "ns")
@@ -419,10 +419,7 @@ def test_datetime_unique(data, nulls):
expected = psr.unique()
got = gsr.unique()

# convert to int64 for equivalence testing
np.testing.assert_array_almost_equal(
got.to_pandas().astype(int), expected.astype(int)
)
assert_eq(pd.Series(expected), got.to_pandas())


@pytest.mark.parametrize(
2 changes: 1 addition & 1 deletion python/cudf/cudf/tests/test_feather.py
@@ -37,7 +37,7 @@ def pdf(request):
nrows=nrows, ncols=ncols, data_gen_f=lambda r, c: r, r_idx_type="i"
)
# Delete the name of the column index, and rename the row index
del test_pdf.columns.name
test_pdf.columns.name = None
test_pdf.index.name = "index"

# Cast all the column dtypes to objects, rename them, and then cast to
10 changes: 10 additions & 0 deletions python/cudf/cudf/tests/test_indexing.py
@@ -870,6 +870,16 @@ def test_series_setitem_datetime():
psr = pd.Series(["2001", "2002", "2003"], dtype="datetime64[ns]")
gsr = cudf.from_pandas(psr)

psr[0] = np.datetime64("2005")
gsr[0] = np.datetime64("2005")

assert_eq(psr, gsr)

@pytest.mark.xfail(reason='Pandas will coerce to object datatype here')
def test_series_setitem_datetime_coerced():
psr = pd.Series(["2001", "2002", "2003"], dtype="datetime64[ns]")
gsr = cudf.from_pandas(psr)

psr[0] = "2005"
gsr[0] = "2005"
