Closed
Labels
bug: Something isn't working
Description
Describe the bug
Using the "string[pyarrow]" string alias for both the DataFrame and the DataFrameSchema results in a failing validation. This is very unintuitive.
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandera.
- (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
import pandas as pd
import pandera as pa

df = pd.DataFrame({"col1": ["a", "b"]}, dtype="string[pyarrow]")
schema = pa.DataFrameSchema(columns={"col1": pa.Column("string[pyarrow]")})
schema.validate(df)
# SchemaError: expected series 'col1' to have type string[pyarrow], got string
Expected behavior
I expect the validation to pass.
Desktop (please complete the following information):
- OS: Windows
- Browser: NA
- Version: 0.24.0
Additional context
My guess is that in Pandera the "string[pyarrow]" alias resolves to the actual Arrow data type, whereas in pandas "string[pyarrow]" maps to pd.StringDtype("pyarrow"). See this passage from the pandas documentation (https://pandas.pydata.org/docs/user_guide/pyarrow.html):
The string alias "string[pyarrow]" maps to pd.StringDtype("pyarrow") which is not equivalent to specifying dtype=pd.ArrowDtype(pa.string()). Generally, operations on the data will behave similarly except pd.StringDtype("pyarrow") can return NumPy-backed nullable types while pd.ArrowDtype(pa.string()) will return [ArrowDtype](https://pandas.pydata.org/docs/reference/api/pandas.ArrowDtype.html#pandas.ArrowDtype).