Validation of pandas dataframe column of "string[pyarrow]" dtype fails #2017

@ClauPet

Describe the bug
Using the "string[pyarrow]" string alias for both the DataFrame and the DataFrameSchema results in a failing validation. This is very unintuitive.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa

df = pd.DataFrame({"col1": ["a", "b"]}, dtype="string[pyarrow]")
schema = pa.DataFrameSchema(columns={"col1": pa.Column("string[pyarrow]")})
schema.validate(df)
# SchemaError: expected series 'col1' to have type string[pyarrow], got string

Expected behavior

I expect the validation to pass.

Desktop (please complete the following information):

  • OS: Windows
  • Browser: NA
  • Version: 0.24.0

Additional context

My guess is that in pandera the "string[pyarrow]" alias points at the actual Arrow datatype, whereas in pandas "string[pyarrow]" maps to pd.StringDtype("pyarrow"). See this excerpt from the pandas documentation (https://pandas.pydata.org/docs/user_guide/pyarrow.html):

The string alias "string[pyarrow]" maps to pd.StringDtype("pyarrow") which is not equivalent to specifying dtype=pd.ArrowDtype(pa.string()). Generally, operations on the data will behave similarly except pd.StringDtype("pyarrow") can return NumPy-backed nullable types while pd.ArrowDtype(pa.string()) will return [ArrowDtype](https://pandas.pydata.org/docs/reference/api/pandas.ArrowDtype.html#pandas.ArrowDtype).

Labels: bug