
Commit 6c6eb57

kykyi and cosmicBboy authored
Enhancement: drop invalid rows on validate with new param (#1189)
* Basic ArraySchema default for str series
* Add parameterised test cases for various data types
* Ensure column has a default
* Add some tests asserting Column.default works as expected
* Add tests asserting default causes an error when there is a dtype mismatch
* Remove inplace=True hardcoding, add default as kwarg across various classes and functions
* Simplify Column tests to avoid using DataFrameSchema
* Add test to raise error if inplace is False and default is non-null
* any -> Any
* Clean up PR
* Remove codecov
* xfail pyspark tests
* Simplify drop_invalid into a kwarg for schema.validate()
* Update docstrings
* Add a couple more test cases
* Re-raise error on drop_invalid=False, move some logic into a private method
* Add drop_invalid for SeriesSchema
* Add drop_invalid to MultiIndex
* Small changes to fix mypy
* More mypy fixes
* Move run_checks_and_handle_errors into its own method with core checks within
* Remove try/catch
* Move drop logic into its own method for array.py and container.py
* drop_invalid -> drop_invalid_data
* Remove main() block from test_schemas.py
* Fix typo
* Add test for ColumnBackend
* Move drop_invalid from validation to schema init; add drop_invalid attr to BaseConfig
* Stylistic changes
* Remove incorrect rescue logic in ColumnBackend
* Add draft docs
* Add functionality for drop_invalid on DataFrameModel schemas
* Standardise tests
* Update docs for DataFrameModel
* Add docstrings
* Rename to `drop_invalid_rows`, exception handling, update docs

---------

Signed-off-by: kykyi <[email protected]>
Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Baden Ashford <[email protected]>
Co-authored-by: Niels Bantilan <[email protected]>
1 parent 33c3671 commit 6c6eb57
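In brief, the new ``drop_invalid_rows`` keyword is set on a schema at construction time and takes effect when ``validate()`` is called with ``lazy=True``. A minimal usage sketch based on the docs added in this commit (the printed result describes the expected behavior, not captured output):

    import pandas as pd
    from pandera import Check, Column, DataFrameSchema

    # the flag lives on the schema, not on validate()
    schema = DataFrameSchema(
        {"counter": Column(int, checks=[Check(lambda x: x >= 3)])},
        drop_invalid_rows=True,
    )

    # lazy=True collects all failures so the offending rows can be dropped together
    valid = schema.validate(pd.DataFrame({"counter": [1, 2, 3, 4]}), lazy=True)
    print(valid)  # expected: only the rows where counter >= 3 remain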

File tree

15 files changed: +414, -57 lines changed


docs/source/drop_invalid_rows.rst

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
+.. currentmodule:: pandera
+
+.. _drop_invalid_rows:
+
+Dropping Invalid Rows
+=====================
+
+*New in version 0.16.0*
+
+If you wish to use the validation step to remove invalid data, you can pass the
+``drop_invalid_rows=True`` argument to the ``schema`` object on creation. On ``schema.validate()``,
+if a data-level check fails, the rows that caused the failure are removed from the dataframe
+before it is returned.
+
+``drop_invalid_rows`` will prevent data-level schema errors from being raised and will instead
+remove the rows that cause the failures.
+
+This functionality is available on ``DataFrameSchema``, ``SeriesSchema``, and ``Column``,
+as well as on ``DataFrameModel`` schemas.
+
+Dropping invalid rows with :class:`~pandera.api.pandas.container.DataFrameSchema`:
+
+.. testcode:: drop_invalid_rows_data_frame_schema
+
+    import pandas as pd
+    import pandera as pa
+
+    from pandera import Check, Column, DataFrameSchema
+
+    df = pd.DataFrame({"counter": ["1", "2", "3"]})
+    schema = DataFrameSchema(
+        {"counter": Column(int, checks=[Check(lambda x: x >= 3)])},
+        drop_invalid_rows=True,
+    )
+
+    schema.validate(df, lazy=True)
+
+Dropping invalid rows with :class:`~pandera.api.pandas.array.SeriesSchema`:
+
+.. testcode:: drop_invalid_rows_series_schema
+
+    import pandas as pd
+    import pandera as pa
+
+    from pandera import Check, SeriesSchema
+
+    series = pd.Series(["1", "2", "3"])
+    schema = SeriesSchema(
+        int,
+        checks=[Check(lambda x: x >= 3)],
+        drop_invalid_rows=True,
+    )
+
+    schema.validate(series, lazy=True)
+
+Dropping invalid rows with :class:`~pandera.api.pandas.components.Column`:
+
+.. testcode:: drop_invalid_rows_column
+
+    import pandas as pd
+    import pandera as pa
+
+    from pandera import Check, Column
+
+    df = pd.DataFrame({"counter": ["1", "2", "3"]})
+    schema = Column(
+        int,
+        name="counter",
+        drop_invalid_rows=True,
+        checks=[Check(lambda x: x >= 3)]
+    )
+
+    schema.validate(df, lazy=True)
+
+Dropping invalid rows with :class:`~pandera.api.pandas.model.DataFrameModel`:
+
+.. testcode:: drop_invalid_rows_data_frame_model
+
+    import pandas as pd
+    import pandera as pa
+
+    from pandera import Check, DataFrameModel, Field
+
+    class MySchema(DataFrameModel):
+        counter: int = Field(in_range={"min_value": 3, "max_value": 5})
+
+        class Config:
+            drop_invalid_rows = True
+
+
+    MySchema.validate(
+        pd.DataFrame({"counter": [1, 2, 3, 4, 5, 6]}), lazy=True
+    )
+
+.. note::
+    In order to use ``drop_invalid_rows=True``, ``lazy=True`` must
+    be passed to ``schema.validate()``. :ref:`lazy_validation` enables all schema
+    errors to be collected and raised together, so all invalid rows can be dropped together.
+    This provides a clear API for ensuring the validated dataframe contains only valid data.
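The ``lazy=True`` requirement in the note above is enforced at runtime: as the change to ``pandera/backends/pandas/array.py`` further down shows, validating with ``drop_invalid_rows=True`` but without ``lazy=True`` raises a ``SchemaDefinitionError`` instead of silently skipping the drop. A short sketch of that failure mode (assuming the exception propagates unchanged from ``schema.validate()``):

    import pandas as pd
    from pandera import Check, SeriesSchema
    from pandera.errors import SchemaDefinitionError

    schema = SeriesSchema(int, checks=[Check(lambda x: x >= 3)], drop_invalid_rows=True)

    try:
        schema.validate(pd.Series([1, 2, 3]))  # lazy defaults to False
    except SchemaDefinitionError as exc:
        print(exc)  # "When drop_invalid_rows is True, lazy must be set to True."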

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -360,6 +360,7 @@ page or reach out to the maintainers and pandera community on
    hypothesis
    dtypes
    decorators
+   drop_invalid_rows
    schema_inference
    lazy_validation
    data_synthesis_strategies

pandera/api/base/schema.py

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,7 @@ def __init__(
         name=None,
         title=None,
         description=None,
+        drop_invalid_rows=False,
     ):
         """Abstract base schema initializer."""
         self.dtype = dtype
@@ -40,6 +41,7 @@ def __init__(
         self.name = name
         self.title = title
         self.description = description
+        self.drop_invalid_rows = drop_invalid_rows

     def validate(
         self,

pandera/api/pandas/array.py

Lines changed: 7 additions & 0 deletions
@@ -37,6 +37,7 @@ def __init__(
         title: Optional[str] = None,
         description: Optional[str] = None,
         default: Optional[Any] = None,
+        drop_invalid_rows: bool = False,
     ) -> None:
         """Initialize array schema.

@@ -63,6 +64,8 @@ def __init__(
         :param title: A human-readable label for the series.
         :param description: An arbitrary textual description of the series.
         :param default: The default value for missing values in the series.
+        :param drop_invalid_rows: if True, drop invalid rows on validation.
+
         """

         super().__init__(
@@ -72,6 +75,7 @@ def __init__(
             name=name,
             title=title,
             description=description,
+            drop_invalid_rows=drop_invalid_rows,
         )

         if checks is None:
@@ -300,6 +304,7 @@ def __init__(
         title: Optional[str] = None,
         description: Optional[str] = None,
         default: Optional[Any] = None,
+        drop_invalid_rows: bool = False,
     ) -> None:
         """Initialize series schema base object.

@@ -327,6 +332,7 @@ def __init__(
         :param title: A human-readable label for the series.
         :param description: An arbitrary textual description of the series.
         :param default: The default value for missing values in the series.
+        :param drop_invalid_rows: if True, drop invalid rows on validation.

         """
         super().__init__(
@@ -340,6 +346,7 @@ def __init__(
             title,
             description,
             default,
+            drop_invalid_rows,
         )
         self.index = index

pandera/api/pandas/components.py

Lines changed: 3 additions & 0 deletions
@@ -30,6 +30,7 @@ def __init__(
         title: Optional[str] = None,
         description: Optional[str] = None,
         default: Optional[Any] = None,
+        drop_invalid_rows: bool = False,
     ) -> None:
         """Create column validator object.

@@ -54,6 +55,7 @@ def __init__(
         :param title: A human-readable label for the column.
         :param description: An arbitrary textual description of the column.
         :param default: The default value for missing values in the column.
+        :param drop_invalid_rows: if True, drop invalid rows on validation.

         :raises SchemaInitError: if impossible to build schema from parameters

@@ -85,6 +87,7 @@ def __init__(
             title=title,
             description=description,
             default=default,
+            drop_invalid_rows=drop_invalid_rows,
         )
         if (
             name is not None

pandera/api/pandas/container.py

Lines changed: 3 additions & 0 deletions
@@ -46,6 +46,7 @@ def __init__(
         unique_column_names: bool = False,
         title: Optional[str] = None,
         description: Optional[str] = None,
+        drop_invalid_rows: bool = False,
     ) -> None:
         """Initialize DataFrameSchema validator.

@@ -77,6 +78,7 @@ def __init__(
         :param unique_column_names: whether or not column names must be unique.
         :param title: A human-readable label for the schema.
         :param description: An arbitrary textual description of the schema.
+        :param drop_invalid_rows: if True, drop invalid rows on validation.

         :raises SchemaInitError: if impossible to build schema from parameters

@@ -152,6 +154,7 @@ def __init__(
         self._unique = unique
         self.report_duplicates = report_duplicates
         self.unique_column_names = unique_column_names
+        self.drop_invalid_rows = drop_invalid_rows

         # this attribute is not meant to be accessed by users and is explicitly
         # set to True in the case that a schema is created by infer_schema.

pandera/api/pandas/model.py

Lines changed: 1 addition & 0 deletions
@@ -268,6 +268,7 @@ def to_schema(cls) -> DataFrameSchema:
             "title": cls.__config__.title,
             "description": cls.__config__.description or cls.__doc__,
             "unique_column_names": cls.__config__.unique_column_names,
+            "drop_invalid_rows": cls.__config__.drop_invalid_rows,
         }
         cls.__schema__ = DataFrameSchema(
             columns,
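Since ``to_schema()`` now forwards the config flag, a ``DataFrameModel`` with ``drop_invalid_rows`` in its ``Config`` produces a ``DataFrameSchema`` carrying the same attribute. A quick check (a sketch; it assumes the attribute is readable on the generated schema, as the ``container.py`` change above adds):

    from pandera import DataFrameModel, Field


    class MySchema(DataFrameModel):
        counter: int = Field(ge=3)

        class Config:
            drop_invalid_rows = True


    print(MySchema.to_schema().drop_invalid_rows)  # expected: True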

pandera/api/pandas/model_config.py

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ class BaseConfig(BaseModelConfig): # pylint:disable=R0903
     title: Optional[str] = None  #: human-readable label for schema
     description: Optional[str] = None  #: arbitrary textual description
     coerce: bool = False  #: coerce types of all schema components
+    drop_invalid_rows: bool = False  #: drop invalid rows on validation

     #: make sure certain column combinations are unique
     unique: Optional[Union[str, List[str]]] = None

pandera/backends/base/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -124,6 +124,10 @@ def failure_cases_metadata(
         """Get failure cases metadata for lazy validation."""
         raise NotImplementedError

+    def drop_invalid_rows(self, check_obj, error_handler):
+        """Remove invalid elements in a `check_obj` according to failures caught by the `error_handler`."""
+        raise NotImplementedError
+

 class BaseCheckBackend(ABC):
     """Abstract base class for a check backend implementation."""

pandera/backends/pandas/array.py

Lines changed: 47 additions & 10 deletions
@@ -20,6 +20,7 @@
     SchemaError,
     SchemaErrors,
     SchemaErrorReason,
+    SchemaDefinitionError,
 )


@@ -45,6 +46,11 @@ def validate(
         error_handler = SchemaErrorHandler(lazy)
         check_obj = self.preprocess(check_obj, inplace)

+        if getattr(schema, "drop_invalid_rows", False) and not lazy:
+            raise SchemaDefinitionError(
+                "When drop_invalid_rows is True, lazy must be set to True."
+            )
+
         # fill nans with `default` if it's present
         if hasattr(schema, "default") and pd.notna(schema.default):
             check_obj.fillna(schema.default, inplace=True)
@@ -55,6 +61,42 @@
         except SchemaError as exc:
             error_handler.collect_error(exc.reason_code, exc)

+        # run the core checks
+        error_handler = self.run_checks_and_handle_errors(
+            error_handler,
+            schema,
+            check_obj,
+            head,
+            tail,
+            sample,
+            random_state,
+        )
+
+        if lazy and error_handler.collected_errors:
+            if getattr(schema, "drop_invalid_rows", False):
+                check_obj = self.drop_invalid_rows(check_obj, error_handler)
+                return check_obj
+            else:
+                raise SchemaErrors(
+                    schema=schema,
+                    schema_errors=error_handler.collected_errors,
+                    data=check_obj,
+                )
+
+        return check_obj
+
+    def run_checks_and_handle_errors(
+        self,
+        error_handler,
+        schema,
+        check_obj,
+        head,
+        tail,
+        sample,
+        random_state,
+    ):
+        """Run checks on schema"""
+        # pylint: disable=too-many-locals
         field_obj_subsample = self.subsample(
             check_obj if is_field(check_obj) else check_obj[schema.name],
             head,
@@ -71,14 +113,15 @@ def validate(
             random_state,
         )

-        # run the core checks
-        for core_check, args in (
+        core_checks = [
             (self.check_name, (field_obj_subsample, schema)),
             (self.check_nullable, (field_obj_subsample, schema)),
             (self.check_unique, (field_obj_subsample, schema)),
             (self.check_dtype, (field_obj_subsample, schema)),
             (self.run_checks, (check_obj_subsample, schema)),
-        ):
+        ]
+
+        for core_check, args in core_checks:
             results = core_check(*args)
             if isinstance(results, CoreCheckResult):
                 results = [results]
@@ -106,13 +149,7 @@ def validate(
                     original_exc=result.original_exc,
                 )

-        if lazy and error_handler.collected_errors:
-            raise SchemaErrors(
-                schema=schema,
-                schema_errors=error_handler.collected_errors,
-                data=check_obj,
-            )
-        return check_obj
+        return error_handler

     def coerce_dtype(
         self,
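With the refactor above, the lazy path either hands the collected errors to ``drop_invalid_rows`` and returns the filtered data, or raises ``SchemaErrors`` exactly as before. A small sketch of the two outcomes (expected behavior, not output copied from the test suite):

    import pandas as pd
    import pandera as pa
    from pandera import Check, SeriesSchema

    series = pd.Series([1, 2, 3, 4])
    checks = [Check(lambda x: x >= 3)]

    # drop_invalid_rows=True: failing rows are removed, nothing is raised
    dropping = SeriesSchema(int, checks=checks, drop_invalid_rows=True)
    print(dropping.validate(series, lazy=True))  # expected to keep only 3 and 4

    # default behavior: lazy validation still raises SchemaErrors with all failure cases
    strict = SeriesSchema(int, checks=checks)
    try:
        strict.validate(series, lazy=True)
    except pa.errors.SchemaErrors as exc:
        print(exc.failure_cases)  # collected failure cases for the values 1 and 2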
