Custom Hypothesis Strategies #648

ghost · 2021-10-07T16:46:34Z

This is related to discussion #647, but I wanted to separate the two distinct use cases. This discussion addresses synthesizing data using a "bottom-up" approach rather than Pandera's provided "top-down" approach (please correct me if that's not an accurate description). Many of the objects in the code below reference Dask because of the other use case in the related discussion.

Custom Hypothesis Strategies

I was able to support custom hypothesis strategies by registering the following custom check method.

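from typing import Any

import hypothesis.strategies as st
from pandera.extensions import register_check_method


# Strategy hook: return the user-supplied strategy unchanged;
# any positional arguments are ignored.
def pass_through_strategy(
    *_: Any,
    passed_strategy: st.SearchStrategy,
) -> st.SearchStrategy:
    return passed_strategy


@register_check_method(
    statistics=['passed_strategy'],
    strategy=pass_through_strategy,
)
def strategy(*_: Any, passed_strategy: st.SearchStrategy) -> bool:
    # Validation is a no-op: this check exists only to carry the strategy.
    return True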

This allows me to define complex data structures using the hypothesis library. A schema like this would most often be used in the first step of a data pipeline, before we have had a chance to flatten and normalize the schema.

class Schema(DaskSchemaModel):
    symbols: Series[object] = pa.Field(
        strategy=st.lists(
            st.fixed_dictionaries(
                {
                    'symbol': st.text(string.ascii_uppercase),
                    'cusip': st.text(string.ascii_uppercase + string.digits),
                    'type': st.sampled_from(['common', 'units', 'warrants']),
                    'isin': st.text(string.ascii_uppercase + string.digits),
                },
            ),
            min_size=1,
        )
    )

In order to make this approach work with Pandera, I needed to override the schema model's strategy generation logic to use the hypothesis data_frames strategy, rather than Pandera's strategy. The DaskDataFrameSchema.strategy method is very limited because of its need to generate an example to guess the dtype. I have not found a way to specify a dtype that will survive Pandera's internals yet, but I think there may be a way by improving the custom check method defined above.

class DaskSchemaModel(pa.SchemaModel):
    @classmethod
    def strategy(
        cls,
        *,
        min_size: int = 0,
        max_size: Optional[int] = None,
    ) -> st.SearchStrategy:
        return cls.to_schema().strategy(min_size=min_size, max_size=max_size)

class DaskDataFrameSchema(pa.DataFrameSchema):
    def strategy(
        self,
        *,
        min_size: int = 0,
        max_size: Optional[int] = None,
    ) -> st.SearchStrategy:
        def build_column(column: Column) -> hypothesis.extra.pandas.column:
            try:
                [column_strategy] = [
                    check.strategy()
                    for check in column.checks
                    if check.name == 'strategy'
                ]
            except ValueError:
                raise ValueError('Schema fields must specify strategy.')

            with warnings.catch_warnings():
                warnings.filterwarnings(
                    action='ignore',
                    category=NonInteractiveExampleWarning,
                )
                example = column_strategy.example()

            if isinstance(example, (list, dict)):
                dtype = object
            elif isinstance(example, bool):
                dtype = bool
            elif isinstance(example, int):
                dtype = int
            elif isinstance(example, str):
                dtype = pd.StringDtype
            else:
                dtype = None

            return hypothesis.extra.pandas.column(
                name=column.name,
                elements=column_strategy,
                dtype=dtype,
            )

        columns = [build_column(column) for column in self.columns.values()]
        index = hypothesis.extra.pandas.range_indexes(
            min_size=min_size,
            max_size=max_size,
        )
        data_frame = hypothesis.extra.pandas.data_frames(
            columns=columns,
            index=index,
        )
        return data_frame.map(partial(dd.from_pandas, npartitions=1))
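
The example-based dtype guess in `build_column` can be pulled out into a small standalone function, which makes one detail easier to see: the `bool` test has to come before the `int` test, because `bool` is a subclass of `int` in Python. The sketch below is an illustration only. `guess_dtype` is a hypothetical name, and it returns the string alias `'string'` where the code above uses `pd.StringDtype`, purely so the snippet runs without pandas.

```python
from typing import Any, Optional, Union


# Hypothetical helper (not part of pandera or hypothesis): mirrors the
# isinstance chain used in build_column above.
def guess_dtype(example: Any) -> Optional[Union[type, str]]:
    if isinstance(example, (list, dict)):
        return object
    # bool must be tested before int: isinstance(True, int) is True.
    if isinstance(example, bool):
        return bool
    if isinstance(example, int):
        return int
    if isinstance(example, str):
        return 'string'  # stand-in for pd.StringDtype (assumption)
    return None  # unknown type: leave the dtype unspecified


guesses = [guess_dtype(value) for value in ([{'symbol': 'A'}], True, 7, 'abc', 1.5)]
```

Swapping the `bool` and `int` branches would classify boolean examples as `int`, which is the kind of silent mismatch that makes guessing from a single example fragile.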

This approach allows me to build DataFrames that adhere to the specific column level hypothesis strategies.

@hypothesis.given(df=Schema.strategy())
def test_transform(df: DaskDataFrame[Schema]) -> None:
    ...

ghost · 2021-10-07T19:11:49Z

Update

I was able to remove the need to use an example to infer the column type. I just record the original dtype before handing it over to Pandera.

class DaskSchemaModel(pa.SchemaModel):
    @classmethod
    def to_schema(cls) -> DaskDataFrameSchema:
        column_dtypes = {
            name: column[0].arg
            for name, column in cls._collect_fields().items()
        }

        schema = super().to_schema()
        schema = schema.remove_columns(cls._remove_columns)
        return DaskDataFrameSchema.from_schema(schema, column_dtypes)

class DaskDataFrameSchema(pa.DataFrameSchema):
    def __init__(
        self,
        *args: Any,
        column_dtypes: dict[str, Any],
        **kwargs: Any,
    ) -> None:
        super().__init__(*args, **kwargs)
        self._column_dtypes = column_dtypes

    @classmethod
    def from_schema(
        cls,
        schema: pa.DataFrameSchema,
        column_dtypes: dict[str, Any],
    ) -> 'DaskDataFrameSchema':
        return cls(
            columns=schema.columns,
            checks=schema.checks,
            index=schema.index,
            dtype=schema.dtype,
            coerce=schema.coerce,
            strict=schema.strict,
            name=schema.name,
            ordered=schema.ordered,
            unique=schema.unique,
            column_dtypes=column_dtypes,
        )

    def strategy(
        self,
        *,
        min_size: int = 0,
        max_size: Optional[int] = None,
    ) -> st.SearchStrategy:
        def build_column(column: Column) -> hypothesis.extra.pandas.column:
            try:
                [column_strategy] = [
                    check.strategy()
                    for check in column.checks
                    if check.name == 'strategy'
                ]
            except ValueError:
                raise ValueError('Schema fields must specify strategy.')

            dtype = self._column_dtypes[column.name]

            return hypothesis.extra.pandas.column(
                name=column.name,
                elements=column_strategy,
                dtype=dtype,
            )

        columns = [build_column(column) for column in self.columns.values()]
        index = hypothesis.extra.pandas.range_indexes(
            min_size=min_size,
            max_size=max_size,
        )
        data_frame = hypothesis.extra.pandas.data_frames(
            columns=columns,
            index=index,
        )
        return data_frame.map(partial(dd.from_pandas, npartitions=1))
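
The dtype bookkeeping in `to_schema` relies on pandera internals (`_collect_fields` and the annotation's `arg`). As a rough, library-free illustration of the same idea, the sketch below reads the type argument out of each `Series[...]` annotation with the standard `typing` helpers. `Series` here is a hypothetical stand-in class, not pandera's, and `collect_column_dtypes` is an illustrative name.

```python
from typing import Any, Dict, Generic, TypeVar, get_args, get_type_hints

T = TypeVar('T')


class Series(Generic[T]):
    """Hypothetical stand-in for pandera's Series annotation type."""


class Schema:
    symbols: Series[object]
    user_id: Series[str]


def collect_column_dtypes(schema_cls: type) -> Dict[str, Any]:
    # get_type_hints resolves the class annotations; get_args extracts
    # the parameter of Series[...] for each column.
    return {
        name: get_args(hint)[0]
        for name, hint in get_type_hints(schema_cls).items()
    }


column_dtypes = collect_column_dtypes(Schema)
```

Recording the annotated type up front avoids drawing an example from the strategy at schema-construction time.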

6 replies

cosmicBboy Oct 8, 2021
Maintainer

ok, I think I get it... strategy is a check that provides a way of defining custom strategies! this is really elegant! 😮

cosmicBboy Oct 8, 2021
Maintainer

thinking about the use case of columns with json objects (dicts containing json-serializable data), I think the pandera-specific solution to this would be to support a Json data type where users can specify some set of expected keys and values... ultimately pandera is a parsing/validation library, which takes the approach of taking a set of constraints and converting that into data synthesis strategies (what you call the top-down approach). Probably supporting pydantic types, dataclasses, and/or JSONSchema would be a nice way of handling the complexity of arbitrarily nested json-like objects.

However, I have been thinking of adding enough flexibility to the data synthesis API so that a strategy keyword argument in Column / Field totally overrides the base strategy used by pandera to construct the data_frame hypothesis strategy. This way, you could do:

class Schema(DaskSchemaModel):
    symbols: Series[object] = pa.Field(
        # strategy, or base_strategy to be more explicit
        strategy=st.lists(  # strategy where each value in a column is a list of dictionaries of particular types.
            st.fixed_dictionaries(
                {
                    'symbol': st.text(string.ascii_uppercase),
                    'cusip': st.text(string.ascii_uppercase + string.digits),
                    'type': st.sampled_from(['common', 'units', 'warrants']),
                    'isin': st.text(string.ascii_uppercase + string.digits),
                },
            ),
            min_size=1,
        )
    )

but in this case strategy would be passed into the field_element_strategy where it currently raises a BaseStrategyOnlyError. By lifting this restriction, the user-provided base_strategy would act as the first strategy in a potentially long chain of check strategies.
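
To make the "base strategy followed by a chain of check strategies" idea concrete, here is a toy model written with the standard library only. It is not how pandera or Hypothesis is implemented: a "strategy" is modelled as a plain function that draws a value from a `random.Random`, and each check is applied by rejection sampling, which is roughly what `.filter` does in Hypothesis. All names (`Draw`, `chain_checks`) are hypothetical.

```python
import random
from typing import Callable, List

# Toy stand-in for a strategy: a function that draws one value.
Draw = Callable[[random.Random], int]


def chain_checks(
    base: Draw,
    checks: List[Callable[[int], bool]],
    max_tries: int = 1000,
) -> Draw:
    """Return a draw function that only yields values passing every check."""

    def draw(rng: random.Random) -> int:
        for _ in range(max_tries):
            value = base(rng)  # the user-provided base strategy goes first
            if all(check(value) for check in checks):
                return value
        raise RuntimeError('checks are too restrictive for the base strategy')

    return draw


def base_strategy(rng: random.Random) -> int:
    return rng.randint(0, 100)


constrained = chain_checks(base_strategy, [lambda v: v % 2 == 0, lambda v: v > 10])
rng = random.Random(0)
samples = [constrained(rng) for _ in range(20)]
```

The point of the sketch is the ordering: the user's strategy decides what kind of values exist at all, and the checks only narrow them down afterwards.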

This way, the only extension you need for synthesizing dask dataframes is Schema.strategy().map(partial(dd.from_pandas, npartitions=1))

ghost Oct 10, 2021

However, I have been thinking of adding enough flexibility to the data synthesis API so that a strategy keyword argument in Column / Field totally overrides the base strategy used by pandera to construct the data_frame hypothesis strategy.

I think this is a great idea! Another example where this might be helpful is a UUID column. I don't think this could be achieved nearly as easily with the custom check API.

class Schema(SchemaModel):
    user_id: Series[str] = pa.Field(strategy=st.uuids().map(str))

On a separate note, I found that the pattern I used above to override column strategies was also able to store arbitrary metadata in a field. Previously I had been maintaining a separate pyarrow schema for each dataset in addition to the Pandera one, but this pattern allowed me to move the pyarrow information into the Pandera schema.

class MySchema(SchemaModel):
    float_col: Series[float] = Field(
        strategy=st.floats(),
        schema=pyarrow.float64(),
    )

I added a to_pyarrow_schema method to my DaskDataFrameSchema and DaskSchemaModel classes to extract this information from each field and construct the pyarrow schema for a DataFrame. This allows me to have a single schema for validation, synthesis, and serialization, all defined in the familiar pydantic style. While serialization is likely out of scope for the pandera project, I do think a flexible API to add metadata to a field would support this use case and others. Perhaps even the addition of an optional metadata dict parameter to the Field constructor would be sufficient.

class MySchema(SchemaModel):
    float_col: Series[float] = Field(
        strategy=st.floats(),
        metadata={'pyarrow_schema': pyarrow.float64(), 'description': '...'},
    )
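
For comparison, the standard library's `dataclasses.field` already accepts a `metadata` mapping, which gives a feel for how a per-field metadata dict could be collected into a second schema. This is only an analogy, not pandera's API: the string type names stand in for pyarrow types so the snippet has no third-party dependencies, and `collect_metadata` is a hypothetical helper.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class MySchema:
    # Strings such as 'float64' are placeholders for pyarrow types (assumption).
    float_col: float = field(
        default=0.0,
        metadata={'pyarrow_schema': 'float64', 'description': 'price'},
    )
    name_col: str = field(default='', metadata={'pyarrow_schema': 'string'})


def collect_metadata(schema_cls: type, key: str) -> Dict[str, Any]:
    # Gather one metadata key across all fields, skipping fields without it.
    return {
        f.name: f.metadata[key]
        for f in fields(schema_cls)
        if key in f.metadata
    }


arrow_types = collect_metadata(MySchema, 'pyarrow_schema')
descriptions = collect_metadata(MySchema, 'description')
```

A `to_pyarrow_schema`-style method would then only need to turn a mapping like `arrow_types` into the target schema object.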

cosmicBboy Oct 20, 2021
Maintainer

On a separate note, I found that the pattern I used above to override column strategies was also able to store arbitrary metadata in a field.

for sure pandera would benefit from storing arbitrary metadata, similar to the role argument brought up by @jeffzi in #374 (comment) (tho I think role also functions as a way to select columns). A couple of things to make metadata useful would be:

some way to configure the serialization format of metadata in the case of writing out to yaml
create extensions.io functionality to standardize the API for io utilities

Of course people can always subclass DataFrameSchema or SchemaModel and access it with schema.metadata if they want to implement custom methods as you have.

to_pyarrow_schema

Interoperability and converting to different schema/serialization formats are definitely in line with the pandera design philosophy. Currently the io module contains certain functionality:

to/from_yaml
to_script
from_frictionless_schema (to_frictionless_schema would need a contribution)

That module probably needs to be turned into a sub-package and documented so that it's easier to contribute.

Some issues related to this:

cosmicBboy Oct 20, 2021
Maintainer

@bphillips-exos would you be interested in writing up feature request issues for:

metadata in schemas/schema components
the strategy override issue? maybe you can edit "Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy" (#561), which articulates a similar idea but is pretty vague.
