Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy #561

@cosmicBboy

Is your feature request related to a problem? Please describe.

Currently, strategies are limited by the hypothesis.extra.pandas convention for defining a dataframe: the strategies used to generate data values operate at the element level. This makes it hard to create strategies for a whole column, or strategies that model dependencies between columns.

For previous context on the problem with strategies, see #1605, #1220, #1275.

Describe the solution you'd like

We need a re-write! 🔥

As described in #1605, the requirements for a pandera pandas strategy rewrite are:

  • Strategies that work for all pandera schemas (this is a really high bar, but I think it's possible), with reasonable escape hatches when pandera cannot automatically figure out how to generate a dataframe.
  • Generating entire columns instead of individual elements
  • Incorporating cross-column dependencies
  • A user-friendly way of overriding the strategies derived from pre-existing Checks, or of supplying custom strategies
  • Columns with multiple checks should not chain strategies with filter; instead, each additional check should maybe override the generated data with the new constraint.
  • ... (others?)
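
To make a few of these requirements concrete, here is a minimal stdlib-only sketch. It is not pandera's or hypothesis's API: `Bounds`, `generate_column`, and `generate_frame` are hypothetical names, and the `start`/`end` columns are made up for illustration. It shows two ideas from the list: multiple checks on one column are merged into a single constraint before any data is drawn (instead of filtering afterwards), and a dependent column is derived from another column so the cross-column rule holds by construction.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Bounds:
    """Hypothetical merged constraint: an inclusive integer range."""

    low: int
    high: int

    def merge(self, other: "Bounds") -> "Bounds":
        # Intersect the two ranges up front instead of filtering later.
        merged = Bounds(max(self.low, other.low), min(self.high, other.high))
        if merged.low > merged.high:
            raise ValueError("checks are mutually unsatisfiable")
        return merged


def generate_column(bounds: Bounds, size: int, rng: random.Random) -> List[int]:
    """Generate a whole column at once; every value lies within the bounds."""
    return [rng.randint(bounds.low, bounds.high) for _ in range(size)]


def generate_frame(size: int, seed: int = 0) -> Dict[str, List[int]]:
    rng = random.Random(seed)
    # Two checks on "start" (range 0..100, then range 10..1000)
    # collapse to the single range 10..100.
    start_bounds = Bounds(0, 100).merge(Bounds(10, 1_000))
    start = generate_column(start_bounds, size, rng)
    # Cross-column dependency: "end" is built from "start", so
    # end >= start holds for every row without any rejection.
    end = [s + rng.randint(0, 50) for s in start]
    return {"start": start, "end": end}


frame = generate_frame(size=5)
```

A real implementation would need a constraint representation that covers far more than integer ranges, plus the escape hatch mentioned above for checks that cannot be merged.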

More context on the current state

At a high level, this is how pandera currently translates a schema to a hypothesis strategy:

  • For each column and index, obtain the following metadata:
    • Column name, datatype, and checks
  • If the column name is a regular expression, generate column names that match it
  • Define a hypothesis column. This contains the dtype, elements strategy, and other properties of the column.
  • Forward the pa.Column dtype, properties (e.g. unique), and the first check in the list of checks to the hypothesis column. This creates an element strategy for a single value in that column.
  • For each subsequent Check in the list, get its check statistics (constraint values) and chain it to the element strategy with filter (this really sucks: rejected draws slow generation down).
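
The cost of that last step can be illustrated with a small rejection-sampling sketch. This is only an analogy written with the standard library, not how pandera or hypothesis is implemented: `draw_with_filters` is a hypothetical helper, and the three checks (non-negative, at most 1000, even) are invented for the example. The first check shapes the draw directly, while the later ones can only throw values away.

```python
import random
from typing import Callable, List, Tuple


def draw_with_filters(
    draw: Callable[[random.Random], int],
    filters: List[Callable[[int], bool]],
    rng: random.Random,
    max_attempts: int = 100_000,
) -> Tuple[int, int]:
    """Draw one element, retrying until every chained filter passes.

    Returns the accepted value and the number of attempts it took.
    """
    for attempt in range(1, max_attempts + 1):
        value = draw(rng)
        if all(f(value) for f in filters):
            return value, attempt
    raise RuntimeError("filters rejected every draw")


rng = random.Random(0)


def draw(r: random.Random) -> int:
    # The first check (value >= 0) is built into the element draw itself.
    return r.randint(0, 10_000)


# Later checks are only applied as filters on the drawn value.
filters = [lambda v: v <= 1000, lambda v: v % 2 == 0]

results = [draw_with_filters(draw, filters, rng) for _ in range(20)]
mean_attempts = sum(attempts for _, attempts in results) / len(results)

# Folding all three checks into one draw needs exactly one attempt per value.
direct = 2 * rng.randint(0, 500)
```

Only about 5% of the unfiltered draws pass both filters here, so most of the work is discarded, whereas the `direct` draw satisfies all three checks by construction.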

Labels: enhancement, help wanted