Custom Hypothesis Strategies #648
ghost started this conversation in Show and tell
Update: I was able to remove the need to use an example to infer the column type. I now record each column's original dtype before handing the schema over to Pandera.

    from functools import partial
    from typing import Any, Optional

    import dask.dataframe as dd
    import hypothesis.extra.pandas
    import hypothesis.strategies as st
    import pandera as pa
    from pandera import Column


    class DaskSchemaModel(pa.SchemaModel):
        @classmethod
        def to_schema(cls) -> 'DaskDataFrameSchema':
            # Record each field's annotated dtype before Pandera processes it.
            column_dtypes = {
                name: column[0].arg
                for name, column in cls._collect_fields().items()
            }
            schema = super().to_schema()
            schema = schema.remove_columns(cls._remove_columns)
            return DaskDataFrameSchema.from_schema(schema, column_dtypes)


    class DaskDataFrameSchema(pa.DataFrameSchema):
        def __init__(
            self,
            *args: Any,
            column_dtypes: dict[str, Any],
            **kwargs: Any,
        ) -> None:
            super().__init__(*args, **kwargs)
            self._column_dtypes = column_dtypes

        @classmethod
        def from_schema(
            cls,
            schema: pa.DataFrameSchema,
            column_dtypes: dict[str, Any],
        ) -> 'DaskDataFrameSchema':
            # Copy the Pandera schema and attach the recorded dtypes.
            return cls(
                columns=schema.columns,
                checks=schema.checks,
                index=schema.index,
                dtype=schema.dtype,
                coerce=schema.coerce,
                strict=schema.strict,
                name=schema.name,
                ordered=schema.ordered,
                unique=schema.unique,
                column_dtypes=column_dtypes,
            )

        def strategy(
            self,
            *,
            min_size: int = 0,
            max_size: Optional[int] = None,
        ) -> st.SearchStrategy:
            def build_column(column: Column) -> hypothesis.extra.pandas.column:
                # Each column must carry exactly one check named 'strategy'.
                try:
                    [column_strategy] = [
                        check.strategy()
                        for check in column.checks
                        if check.name == 'strategy'
                    ]
                except ValueError:
                    raise ValueError('Schema fields must specify strategy.')
                dtype = self._column_dtypes[column.name]
                return hypothesis.extra.pandas.column(
                    name=column.name,
                    elements=column_strategy,
                    dtype=dtype,
                )

            columns = [build_column(column) for column in self.columns.values()]
            index = hypothesis.extra.pandas.range_indexes(
                min_size=min_size,
                max_size=max_size,
            )
            data_frame = hypothesis.extra.pandas.data_frames(
                columns=columns,
                index=index,
            )
            # Wrap each generated pandas DataFrame in a single-partition Dask DataFrame.
            return data_frame.map(partial(dd.from_pandas, npartitions=1))
This is related to discussion #647, but I wanted to separate the two distinct use cases. This discussion addresses synthesizing data using a "bottom-up" approach rather than Pandera's provided "top-down" approach (please correct me if that's not an accurate description). Many of the objects in the code below reference Dask because of the other use case in the related discussion.
Custom Hypothesis Strategies
I was able to support custom hypothesis strategies by registering the following custom check method.
This allows me to define complex data structures using the hypothesis library. A schema like this would most often be used in the first step of a data pipeline, before we have had a chance to flatten and normalize the schema.
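To make "complex data structures" concrete, here is a minimal sketch of a nested record described purely with Hypothesis strategies. The field names (`order_id`, `items`, `sku`, `quantity`), the bounds, and the `check_order` helper are all hypothetical and chosen only for illustration; they are not taken from the schema discussed here.

```python
import hypothesis.strategies as st
from hypothesis import given, settings

# Hypothetical nested record: one order holding a short list of line items.
line_item = st.fixed_dictionaries({
    "sku": st.text(alphabet="ABCDEF0123456789", min_size=4, max_size=8),
    "quantity": st.integers(min_value=1, max_value=100),
})
order = st.fixed_dictionaries({
    "order_id": st.integers(min_value=1),
    "items": st.lists(line_item, min_size=1, max_size=3),
})


@settings(max_examples=25, deadline=None)
@given(order)
def check_order(record):
    # Every generated record respects the bounds declared in the strategies.
    assert record["order_id"] >= 1
    assert 1 <= len(record["items"]) <= 3
    for item in record["items"]:
        assert 4 <= len(item["sku"]) <= 8
        assert 1 <= item["quantity"] <= 100
```

Calling `check_order()` runs the property against generated records. A nested shape like this is the kind of raw input that would later be flattened and normalized in a pipeline.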
To make this approach work with Pandera, I needed to override the schema model's strategy generation logic to use the Hypothesis data_frames strategy rather than Pandera's own strategy. The DaskDataFrameSchema.strategy method is very limited because it needs to generate an example to guess the dtype. I have not yet found a way to specify a dtype that survives Pandera's internals, but I think it may be possible by improving the custom check method defined above.
This approach allows me to build DataFrames that adhere to the specific column-level Hypothesis strategies.
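As a rough illustration of building DataFrames from column-level strategies, the sketch below uses only `hypothesis.extra.pandas`, with no Pandera or Dask involved. The column names, bounds, and the `check_frame` helper are assumptions made up for this example. Each column pairs an element strategy with an explicit dtype, so nothing has to be sampled to guess the dtype.

```python
import hypothesis.strategies as st
from hypothesis import given, settings
from hypothesis.extra import pandas as hpd

# Hypothetical columns: an element strategy plus an explicit dtype for each.
columns = [
    hpd.column(
        name="user_id",
        elements=st.integers(min_value=1, max_value=10**6),
        dtype="int64",
    ),
    hpd.column(
        name="score",
        elements=st.floats(min_value=0.0, max_value=1.0),
        dtype="float64",
    ),
]

# Frames with between one and five rows on a default range index.
frames = hpd.data_frames(
    columns=columns,
    index=hpd.range_indexes(min_size=1, max_size=5),
)


@settings(max_examples=25, deadline=None)
@given(frames)
def check_frame(df):
    # Generated frames follow the per-column strategies and size limits.
    assert list(df.columns) == ["user_id", "score"]
    assert 1 <= len(df) <= 5
    assert df["user_id"].between(1, 10**6).all()
    assert df["score"].between(0.0, 1.0).all()
```

Calling `check_frame()` runs the property over generated frames. To get Dask DataFrames instead, the same strategy can be mapped through `dd.from_pandas`, which is what the update in this thread does.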