REF: Dispatch string methods to ExtensionArray #36357
Merged: jreback merged 35 commits into pandas-dev:master from TomAugspurger:dispatch-string-methods on Sep 30, 2020
Changes from 29 commits
Commits (35)
9e90d4e Implement BaseDtypeTests for ArrowStringDtype (xhochy)
92f1d26 Refactor to use parametrized StringDtype (TomAugspurger)
00096f0 wip (TomAugspurger)
5a89dbf Merge remote-tracking branch 'upstream/master' into arrow-string-arra… (TomAugspurger)
89f8e6a annoyed (TomAugspurger)
3f82225 Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
fabc01e wip (TomAugspurger)
a4d4ad5 remove old (TomAugspurger)
e76a3c1 fixup (TomAugspurger)
49dff8a Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
75831b3 fixup (TomAugspurger)
1cf54cc doctest (TomAugspurger)
fc81ebe docstrings (TomAugspurger)
6be1af6 typing (TomAugspurger)
95b3310 typing (TomAugspurger)
20a8705 wip (TomAugspurger)
136831a Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
38c1611 wip (TomAugspurger)
ea27e57 Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
8d3aecd Move to arrays (TomAugspurger)
d11c2ba Fixup types (TomAugspurger)
349e281 test coverage (TomAugspurger)
c6b99cb Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
b7ab130 fixup (TomAugspurger)
3b837d1 Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
28cf7e6 Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
6dcd44e update docstring (TomAugspurger)
efb3e3d document current implementation (TomAugspurger)
0da7031 typo (TomAugspurger)
35a97ab Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
d681f99 fixup (TomAugspurger)
cc5ceed Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
457c112 fixup (TomAugspurger)
58e1bb9 Merge remote-tracking branch 'upstream/master' into dispatch-string-m… (TomAugspurger)
cb2fb24 simplify inheritance (TomAugspurger)
@@ -51,6 +51,7 @@
 from pandas.core.missing import interpolate_2d
 from pandas.core.ops.common import unpack_zerodim_and_defer
 from pandas.core.sorting import nargsort
+from pandas.core.strings.object_array import ObjectStringArrayMixin

 from pandas.io.formats import console

@@ -176,7 +177,7 @@ def contains(cat, key, container):
         return any(loc_ in container for loc_ in loc)


-class Categorical(NDArrayBackedExtensionArray, PandasObject):
+class Categorical(NDArrayBackedExtensionArray, PandasObject, ObjectStringArrayMixin):
     """
     Represent a categorical variable in classic R / S-plus fashion.

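The hunk above adds `ObjectStringArrayMixin` to the bases of `Categorical`. As a rough illustration of the dispatch pattern the PR title describes, the following sketch uses hypothetical classes (`StringMixinSketch`, `ListBackedArray`, `StrAccessorSketch` and `CountingArray` are invented here and are not pandas classes): an accessor forwards each string method to a `_str_*` method on the array, a mixin supplies element-wise fallbacks, and an array type may override individual methods.

```python
class StringMixinSketch:
    """Hypothetical mixin: generic element-wise fallbacks for string methods."""

    def _str_map(self, f):
        # Apply f to every non-missing element; None stands in for NA here.
        return [None if x is None else f(x) for x in self._values]

    def _str_upper(self):
        return self._str_map(str.upper)

    def _str_len(self):
        return self._str_map(len)


class ListBackedArray(StringMixinSketch):
    """Hypothetical array type that simply inherits the fallbacks."""

    def __init__(self, values):
        self._values = list(values)


class CountingArray(ListBackedArray):
    """Hypothetical array type that overrides one method with its own version."""

    def _str_len(self):
        # A specialised implementation could live here; this one just
        # reports -1 for missing values instead of None.
        return [-1 if x is None else len(x) for x in self._values]


class StrAccessorSketch:
    """Hypothetical accessor: no string logic of its own, it only dispatches."""

    def __init__(self, array):
        self._array = array

    def upper(self):
        return self._array._str_upper()

    def len(self):
        return self._array._str_len()


plain = StrAccessorSketch(ListBackedArray(["ab", None, "c"]))
custom = StrAccessorSketch(CountingArray(["ab", None, "c"]))
print(plain.upper())   # uses the mixin fallback
print(plain.len())     # uses the mixin fallback
print(custom.len())    # uses the array's own override
```

The point of the sketch is only that the accessor stays unchanged when a new array type is added; how pandas wires its real accessor to these methods is not shown in this diff.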
@@ -2312,6 +2313,25 @@ def replace(self, to_replace, value, inplace: bool = False):
         if not inplace:
             return cat

+    # ------------------------------------------------------------------------
+    # String methods interface
+    def _str_map(self, f, na_value=np.nan, dtype=np.dtype(object)):
+        # Optimization to apply the callable `f` to the categories once
+        # and rebuild the result by `take`ing from the result with the codes.
+        # Returns the same type as the object-dtype implementation though.
+        from pandas.core.arrays import PandasArray
+
+        categories = self.categories
+        codes = self.codes
+        result = PandasArray(categories.to_numpy())._str_map(f, na_value, dtype)
+        return take_1d(result, codes, fill_value=na_value)
+
+    def _str_get_dummies(self, sep="|"):
+        # sep may not be in categories. Just bail on this.
+        from pandas.core.arrays import PandasArray
+
+        return PandasArray(self.astype(str))._str_get_dummies(sep)
+

 # The Series.cat accessor

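The comment in `Categorical._str_map` describes applying `f` to the categories once and rebuilding the full result by taking with the codes. A minimal NumPy-only sketch of that idea follows; `map_via_categories` is a hypothetical helper written for illustration, not the pandas implementation, and it assumes that a code of `-1` marks a missing value.

```python
import numpy as np


def map_via_categories(categories, codes, f, na_value=None):
    """Apply f once per category, then expand to full length via the codes.

    Hypothetical helper: assumes codes is an integer array and -1 means missing.
    """
    codes = np.asarray(codes)

    # Step 1: call f once per unique category, not once per element.
    mapped = np.empty(len(categories), dtype=object)
    for i, cat in enumerate(categories):
        mapped[i] = f(cat)

    # Step 2: "take" from the mapped categories using the codes,
    # filling positions with code -1 with na_value.
    result = np.empty(len(codes), dtype=object)
    result[:] = na_value
    valid = codes >= 0
    result[valid] = mapped[codes[valid]]
    return result


calls = []


def upper_and_count(s):
    # Record each call so the number of invocations of f can be inspected.
    calls.append(s)
    return s.upper()


out = map_via_categories(["a", "b"], np.array([0, 1, 0, 0, -1, 1]), upper_and_count)
print(list(out), len(calls))
```

With six elements but only two categories, `f` runs twice, which is where the saving comes from when the categories are few and the array is long.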
@@ -6,8 +6,14 @@
 from pandas._libs import lib, missing as libmissing

 from pandas.core.dtypes.base import ExtensionDtype, register_extension_dtype
-from pandas.core.dtypes.common import pandas_dtype
-from pandas.core.dtypes.inference import is_array_like
+from pandas.core.dtypes.common import (
+    is_array_like,
+    is_bool_dtype,
+    is_integer_dtype,
+    is_object_dtype,
+    is_string_dtype,
+    pandas_dtype,
+)

 from pandas import compat
 from pandas.core import ops
@@ -16,6 +22,7 @@
 from pandas.core.construction import extract_array
 from pandas.core.indexers import check_array_indexer
 from pandas.core.missing import isna
+from pandas.core.strings.object_array import ObjectStringArrayMixin

 if TYPE_CHECKING:
     import pyarrow  # noqa: F401
@@ -96,7 +103,7 @@ def __from_arrow__(
         return StringArray._concat_same_type(results)


-class StringArray(PandasArray):
+class StringArray(PandasArray, ObjectStringArrayMixin):
     """
     Extension array for string data.

@@ -347,6 +354,59 @@ def _add_arithmetic_ops(cls):
         cls.__rmul__ = cls._create_arithmetic_method(ops.rmul)

     _create_comparison_method = _create_arithmetic_method
+    # ------------------------------------------------------------------------
+    # String methods interface
+    _str_na_value = StringDtype.na_value
+
+    def _str_map(self, f, na_value=None, dtype=None):
+        from pandas.arrays import BooleanArray, IntegerArray, StringArray
+        from pandas.core.arrays.string_ import StringDtype
+
+        if dtype is None:
+            dtype = StringDtype()
+        if na_value is None:
+            na_value = self.dtype.na_value
+
+        mask = isna(self)
+        arr = self
Review comment: is this needed?
+        arr = np.asarray(self)
+
+        if is_integer_dtype(dtype) or is_bool_dtype(dtype):
+            constructor: Union[Type[IntegerArray], Type[BooleanArray]]
+            if is_integer_dtype(dtype):
+                constructor = IntegerArray
+            else:
+                constructor = BooleanArray
+
+            na_value_is_na = isna(na_value)
+            if na_value_is_na:
+                na_value = 1
+            result = lib.map_infer_mask(
+                arr,
+                f,
+                mask.view("uint8"),
+                convert=False,
+                na_value=na_value,
+                dtype=np.dtype(dtype),
+            )
+
+            if not na_value_is_na:
+                mask[:] = False
+
+            return constructor(result, mask)
+
+        elif is_string_dtype(dtype) and not is_object_dtype(dtype):
+            # i.e. StringDtype
+            result = lib.map_infer_mask(
+                arr, f, mask.view("uint8"), convert=False, na_value=na_value
+            )
+            return StringArray(result)
+        else:
+            # This is when the result type is object. We reach this when
+            # -> We know the result type is truly object (e.g. .encode returns bytes
+            # or .findall returns a list).
+            # -> We don't know the result type. E.g. `.get` can return anything.
+            return lib.map_infer_mask(arr, f, mask.view("uint8"))


 StringArray._add_arithmetic_ops()
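In the integer/boolean branch of `StringArray._str_map` above, a missing `na_value` is swapped for the placeholder `1` so the result fits in a plain NumPy buffer, and the mask is what records which positions are really missing. If the caller supplies a concrete `na_value`, the mask is cleared instead. The sketch below imitates that logic with NumPy only; `masked_map` is a hypothetical function, `None` stands in for the NA value, and a plain Python loop replaces `lib.map_infer_mask`.

```python
import numpy as np


def masked_map(values, f, na_value=None, dtype="int64"):
    """Map f over non-missing values, returning (data, mask).

    Hypothetical sketch: None plays the role of NA, and the loop below
    stands in for the compiled helper used in the diff.
    """
    arr = np.asarray(values, dtype=object)
    mask = np.array([v is None for v in arr], dtype=bool)

    na_value_is_na = na_value is None
    # A typed buffer cannot hold NA, so use a placeholder when NA was requested.
    fill = 1 if na_value_is_na else na_value

    data = np.empty(len(arr), dtype=dtype)
    for i, v in enumerate(arr):
        data[i] = fill if mask[i] else f(v)

    if not na_value_is_na:
        # The caller asked for a concrete value, so nothing is missing any more.
        mask[:] = False

    return data, mask


data_na, mask_na = masked_map(["ab", None, "c"], len)
data_zero, mask_zero = masked_map(["ab", None, "c"], len, na_value=0)
print(data_na, mask_na)
print(data_zero, mask_zero)
```

In the first call the placeholder at the missing position is meaningless on its own and is hidden by the mask; in the second call the `0` is a real value and the mask is all `False`.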