differences in Series.map with defaultdict with different dtypes #49011

Merged
merged 12 commits into from
Oct 14, 2022
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v2.0.0.rst
@@ -222,7 +222,7 @@ Indexing
 Missing
 ^^^^^^^
 - Bug in :meth:`Index.equals` raising ``TypeError`` when :class:`Index` consists of tuples that contain ``NA`` (:issue:`48446`)
--
+- Bug in :meth:`Series.map` caused incorrect result when data has NaNs and defaultdict mapping was used (:issue:`48813`)

 MultiIndex
 ^^^^^^^^^^
4 changes: 3 additions & 1 deletion pandas/core/base.py
@@ -831,7 +831,9 @@ def _map_values(self, mapper, na_action=None):
                 # If a dictionary subclass defines a default value method,
                 # convert mapper to a lookup function (GH #15999).
                 dict_with_default = mapper
-                mapper = lambda x: dict_with_default[x]
+                mapper = lambda x: dict_with_default[
+                    np.nan if isinstance(x, float) and np.isnan(x) else x
+                ]
             else:
                 # Dictionary does not have a default. Thus it's safe to
                 # convert to an Series for efficiency.
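The effect of the patched lambda can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pandas implementation: `NAN` stands in for `np.nan` (a single shared NaN object used as the dictionary key), `math.isnan` stands in for `np.isnan`, and `make_nan_aware_lookup` is a hypothetical helper name.

```python
import math
from collections import defaultdict

# Assumption: NAN plays the role of np.nan, one shared NaN object used as a key.
NAN = float("nan")


def make_nan_aware_lookup(mapping, nan_key=NAN):
    """Hypothetical helper: route every float NaN to one canonical key."""

    def lookup(x):
        if isinstance(x, float) and math.isnan(x):
            x = nan_key  # same object as the stored key, so the lookup hits
        return mapping[x]

    return lookup


mapping = defaultdict(lambda: "missing", {1: "a", 2: "b", NAN: "c"})
fresh_nan = float("nan")  # a different NaN object, like a value read from an array

# Naive lookup on a copy: the fresh NaN matches no key and gets the default.
naive = mapping.copy()
naive_result = naive[fresh_nan]

# NaN-aware lookup: the fresh NaN is replaced by the canonical key first.
lookup = make_nan_aware_lookup(mapping)
aware_result = [lookup(v) for v in (1.0, 2.0, fresh_nan)]

print(naive_result)
print(aware_result)
```

The design choice here is to normalize the key before the lookup rather than to change how the mapping compares keys, which leaves the caller's `defaultdict` untouched when the NaN key already exists.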
32 changes: 31 additions & 1 deletion pandas/tests/apply/test_series_apply.py
@@ -598,6 +598,36 @@ def test_map_dict_na_key():
     tm.assert_series_equal(result, expected)


+@pytest.mark.parametrize("na_action", [None, "ignore"])
+def test_map_defaultdict_na_key(na_action):
+    # GH 48813
+    s = Series([1, 2, np.nan])
+    default_map = defaultdict(lambda: "missing", {1: "a", 2: "b", np.nan: "c"})
+    result = s.map(default_map, na_action=na_action)
+    expected = Series({0: "a", 1: "b", 2: "c" if na_action is None else np.nan})
+    tm.assert_series_equal(result, expected)
+
+
+@pytest.mark.parametrize("na_action", [None, "ignore"])
+def test_map_defaultdict_missing_key(na_action):
+    # GH 48813
+    s = Series([1, 2, np.nan])
+    default_map = defaultdict(lambda: "missing", {1: "a", 2: "b", 3: "c"})
+    result = s.map(default_map, na_action=na_action)
+    expected = Series({0: "a", 1: "b", 2: "missing" if na_action is None else np.nan})
+    tm.assert_series_equal(result, expected)
+
+
+@pytest.mark.parametrize("na_action", [None, "ignore"])
+def test_map_defaultdict_unmutated(na_action):
+    # GH 48813
+    s = Series([1, 2, np.nan])
+    default_map = defaultdict(lambda: "missing", {1: "a", 2: "b", np.nan: "c"})
+    expected_default_map = default_map.copy()
+    s.map(default_map, na_action=na_action)
+    assert default_map == expected_default_map
+
+
 @pytest.mark.parametrize("arg_func", [dict, Series])
 def test_map_dict_ignore_na(arg_func):
     # GH#47527
@@ -613,7 +643,7 @@ def test_map_defaultdict_ignore_na():
     mapping = defaultdict(int, {1: 10, np.nan: 42})
     ser = Series([1, np.nan, 2])
     result = ser.map(mapping)
-    expected = Series([10, 0, 0])
+    expected = Series([10, 42, 0])

Contributor Author

This older test case seemed incorrect and needed to be corrected with this change.
However, I am unsure whether the older behavior was intentional.

Member

According to the discussion we had in #47585, the old behavior is correct.

Contributor Author

Ahh, I see.
Do we keep it as it is now, or is this enhancement fine, given the dtype-dependent discrepancies in map behavior highlighted in the bug description (#48813)?
TIA

Member

cc @rhshadrach can you weigh in?

Member

+1 on changing the behavior in this test. The 0 in question arises because np.nan does not equal itself, and indexing a NumPy array returns a new scalar object rather than the original np.nan, so the ids are not equal either; e.g.

from collections import defaultdict
import numpy as np

mapping = defaultdict(int, {1: 10, np.nan: 42})
arr = np.array([1, np.nan, 2])
print(mapping[arr[1]])

# 0

I think introducing a better lookup for NaN values makes sense, and brings this in line with the Series case:

import numpy as np
import pandas as pd

mapping = pd.Series({1: 10, np.nan: 42})
ser = pd.Series([1, np.nan, 2])
print(ser.map(mapping))

# 0    10.0
# 1    42.0
# 2     NaN
# dtype: float64
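The identity-versus-equality point can be checked without NumPy. The following is a minimal sketch in plain Python, where `float("nan")` stands in for a NaN read from an array (an assumption made for illustration); the names `nan_key`, `other_nan`, and `table` are hypothetical.

```python
# A dict lookup accepts a key if it is the same object OR compares equal.
nan_key = float("nan")
table = {nan_key: 42}

# Same object: the identity check succeeds even though nan_key != nan_key.
same_object = table[nan_key]

# Different NaN object: identity fails and NaN never compares equal,
# so the key is treated as absent.
other_nan = float("nan")
try:
    table[other_nan]
    found = True
except KeyError:
    found = False

print(same_object, found)
```

With a `defaultdict` in place of the plain `dict`, the second lookup would return the default value instead of raising `KeyError`.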

tm.assert_series_equal(result, expected)
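One way to read `test_map_defaultdict_unmutated` from the diff above: indexing a `defaultdict` with a key it does not hold stores the default under that key, so a NaN lookup that fails to match an existing NaN key would add an entry to the caller's mapping. A minimal sketch of that standard-library behavior, with illustrative variable names:

```python
from collections import defaultdict

mapping = defaultdict(lambda: "missing", {1: "a", 2: "b"})
before = dict(mapping)

# Indexing a missing key calls __missing__, which stores the default value.
value = mapping[3]
after_getitem = dict(mapping)

# .get() bypasses __missing__ and leaves the mapping unchanged.
via_get = mapping.get(4, "missing")

print(before)
print(after_getitem)
print(4 in mapping)
```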