Commit 735e7fe

implement dataframe types (#672)
* implement dataframe types
  - added submodules in pandera.typing module for dask, modin, koalas
  - new documentation for mypy integration, other dataframe library support
  - update copy on existing documentation
  - expand scope
* fix lint
* fix lint, docs tests
1 parent d75298f commit 735e7fe

29 files changed: +1229 −226 lines

README.md

Lines changed: 17 additions & 9 deletions

@@ -3,7 +3,7 @@
 
 <hr>
 
-*A data validation library for scientists, engineers, and analysts seeking
+*A dataframe validation library for scientists, engineers, and analysts seeking
 correctness.*
 
 <br>
@@ -22,10 +22,18 @@ correctness.*
 [![Downloads](https://pepy.tech/badge/pandera/month)](https://pepy.tech/project/pandera)
 [![Downloads](https://pepy.tech/badge/pandera)](https://pepy.tech/project/pandera)
 
-`pandas` data structures contain information that `pandera` explicitly
-validates at runtime. This is useful in production-critical or reproducible
-research settings. With `pandera`, you can:
+`pandera` provides a flexible and expressive API for performing data
+validation on dataframes to make data processing pipelines more readable and
+robust.
 
+Dataframes contain information that `pandera` explicitly validates at runtime.
+This is useful in production-critical or reproducible research settings. With
+`pandera`, you can:
+
+1. Define a schema once and use it to validate
+   [different dataframe types](https://pandera.readthedocs.io/en/stable/supported_libraries.html)
+   including [pandas](http://pandas.pydata.org), [dask](https://dask.org),
+   [modin](https://modin.readthedocs.io/), and [koalas](https://koalas.readthedocs.io).
 1. [Check](https://pandera.readthedocs.io/en/stable/checks.html) the types and
    properties of columns in a `DataFrame` or values in a `Series`.
 1. Perform more complex statistical validation like
@@ -37,11 +45,11 @@ research settings. With `pandera`, you can:
    with pydantic-style syntax and validate dataframes using the typing syntax.
 1. [Synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#data-synthesis-strategies)
    from schema objects for property-based testing with pandas data structures.
-
-`pandera` provides a flexible and expressive API for performing data validation
-on tidy (long-form) and wide data to make data processing pipelines more
-readable and robust.
-
+1. [Lazily Validate](https://pandera.readthedocs.io/en/stable/lazy_validation.html)
+   dataframes so that all validation checks are executed before raising an error.
+1. [Integrate](https://pandera.readthedocs.io/en/stable/integrations.html) with
+   a rich ecosystem of python tools like [pydantic](https://pydantic-docs.helpmanual.io)
+   and [mypy](http://mypy-lang.org/).
 
 ## Documentation
 
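
For orientation, here is a minimal sketch of the "define a schema once" workflow the updated README describes, written for plain pandas with the class-based API and the `@pa.check_types` decorator. The column names and the `in_range` bounds are illustrative, not taken from the commit; the same kind of `Schema` is what the new `pandera.typing.dask`/`modin`/`koalas` submodules let you reuse on other dataframe types, and the `DataFrame[Schema]` annotations are what the mypy integration mentioned above is meant to check statically.

    import pandas as pd
    import pandera as pa
    from pandera.typing import DataFrame, Series


    class Schema(pa.SchemaModel):
        # illustrative columns and check; not part of this commit
        state: Series[str]
        price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


    @pa.check_types
    def filter_ca(df: DataFrame[Schema]) -> DataFrame[Schema]:
        # runtime validation happens on the way into and out of this function
        return df[df["state"] == "CA"]


    df = pd.DataFrame({"state": ["FL", "CA"], "price": [8, 16]})
    print(filter_ca(df))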

docs/source/conf.py

Lines changed: 7 additions & 1 deletion

@@ -50,6 +50,7 @@
 ]
 
 doctest_global_setup = """
+import platform
 import sys
 import pandas as pd
 import numpy as np
@@ -76,6 +77,8 @@
 SKIP_PANDAS_LT_V1 = version.parse(pd.__version__).release < (1, 0) or PY36
 SKIP_SCALING = True
 SKIP_SCHEMA_MODEL = SKIP_PANDAS_LT_V1 or KOALAS_INSTALLED
+SKIP_MODIN = platform.system() == "Windows"
+
 """
 
 doctest_default_flags = (
@@ -175,7 +178,10 @@
 intersphinx_mapping = {
     "python": ("https://docs.python.org/3/", None),
     "numpy": ("https://docs.scipy.org/doc/numpy/", None),
-    "pandas": ("http://pandas.pydata.org/pandas-docs/stable/", None),
+    "pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
+    "dask": ("https://docs.dask.org/en/latest/", None),
+    "koalas": ("https://koalas.readthedocs.io/en/latest/", None),
+    "modin": ("https://modin.readthedocs.io/en/latest/", None),
 }
 
 # strip prompts
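
The new `SKIP_MODIN` flag follows the same pattern as the existing `SKIP_SCALING` flag: a docs page can skip doctests on platforms where the backing library is unsupported (here, modin examples are skipped on Windows). A hedged sketch of how such a flag is typically consumed in the docs; the directive group name `scaling_modin` and the example body are illustrative, not quoted from this commit:

    .. testcode:: scaling_modin
        :skipif: SKIP_MODIN

        import modin.pandas as mpd
        import pandera as pa

        # skipped on Windows, where SKIP_MODIN evaluates to True
        schema = pa.DataFrameSchema({"price": pa.Column(int)})
        schema.validate(mpd.DataFrame({"price": [8, 12]}))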

docs/source/dask.rst

Lines changed: 134 additions & 0 deletions

@@ -0,0 +1,134 @@
+.. currentmodule:: pandera
+
+.. _scaling_dask:
+
+Data Validation with Dask
+=========================
+
+*new in 0.8.0*
+
+`Dask <https://docs.dask.org/en/latest/dataframe.html>`__ is a distributed
+compute framework that offers a pandas-like dataframe API.
+You can use pandera to validate :py:func:`~dask.dataframe.DataFrame`
+and :py:func:`~dask.dataframe.Series` objects directly. First, install
+``pandera`` with the ``dask`` extra:
+
+.. code:: bash
+
+   pip install pandera[dask]
+
+
+Then you can use pandera schemas to validate dask dataframes. In the example
+below we'll use the :ref:`class-based API <schema_models>` to define a
+:py:class:`SchemaModel` for validation.
+
+.. testcode:: scaling_dask
+
+    import dask.dataframe as dd
+    import pandas as pd
+    import pandera as pa
+
+    from pandera.typing.dask import DataFrame, Series
+
+
+    class Schema(pa.SchemaModel):
+        state: Series[str]
+        city: Series[str]
+        price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})
+
+
+    ddf = dd.from_pandas(
+        pd.DataFrame(
+            {
+                'state': ['FL','FL','FL','CA','CA','CA'],
+                'city': [
+                    'Orlando',
+                    'Miami',
+                    'Tampa',
+                    'San Francisco',
+                    'Los Angeles',
+                    'San Diego',
+                ],
+                'price': [8, 12, 10, 16, 20, 18],
+            }
+        ),
+        npartitions=2
+    )
+    pandera_ddf = Schema(ddf)
+
+    print(pandera_ddf)
+
+
+.. testoutput:: scaling_dask
+
+    Dask DataFrame Structure:
+                    state    city  price
+    npartitions=2
+    0              object  object  int64
+    3                 ...     ...    ...
+    5                 ...     ...    ...
+    Dask Name: validate, 4 tasks
+
+
+As you can see, passing the dask dataframe into ``Schema`` will produce
+another dask dataframe which hasn't been evaluated yet. What this means is
+that pandera will only validate when the dask graph is evaluated.
+
+.. testcode:: scaling_dask
+
+    print(pandera_ddf.compute())
+
+
+.. testoutput:: scaling_dask
+
+      state           city  price
+    0    FL        Orlando      8
+    1    FL          Miami     12
+    2    FL          Tampa     10
+    3    CA  San Francisco     16
+    4    CA    Los Angeles     20
+    5    CA      San Diego     18
+
+
+You can also use the :py:func:`~pandera.check_types` decorator to validate
+dask dataframes at runtime:
+
+.. testcode:: scaling_dask
+
+    @pa.check_types
+    def function(ddf: DataFrame[Schema]) -> DataFrame[Schema]:
+        return ddf[ddf["state"] == "CA"]
+
+    print(function(ddf).compute())
+
+
+.. testoutput:: scaling_dask
+
+      state           city  price
+    3    CA  San Francisco     16
+    4    CA    Los Angeles     20
+    5    CA      San Diego     18
+
+
+And of course, you can use the object-based API to validate dask dataframes:
+
+
+.. testcode:: scaling_dask
+
+    schema = pa.DataFrameSchema({
+        "state": pa.Column(str),
+        "city": pa.Column(str),
+        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20))
+    })
+    print(schema(ddf).compute())
+
+
+.. testoutput:: scaling_dask
+
+      state           city  price
+    0    FL        Orlando      8
+    1    FL          Miami     12
+    2    FL          Tampa     10
+    3    CA  San Francisco     16
+    4    CA    Los Angeles     20
+    5    CA      San Diego     18
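
The new page stresses that passing a dask dataframe through ``Schema`` stays lazy, so a failing check only surfaces once the graph is evaluated. Below is a small sketch (not part of the commit) of what that implies for out-of-range data, reusing the ``Schema`` defined above; it assumes the ``pa.errors.SchemaError`` raised during validation propagates unchanged through ``.compute()``:

    import dask.dataframe as dd
    import pandas as pd
    import pandera as pa
    from pandera.typing.dask import Series


    class Schema(pa.SchemaModel):
        state: Series[str]
        city: Series[str]
        price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


    bad_ddf = dd.from_pandas(
        pd.DataFrame({"state": ["FL"], "city": ["Orlando"], "price": [99]}),
        npartitions=1,
    )

    # constructing the validated dataframe does not raise: the graph is still lazy
    lazy_result = Schema(bad_ddf)

    try:
        # validation runs here, when the dask graph is evaluated
        lazy_result.compute()
    except pa.errors.SchemaError as exc:
        print(f"validation failed at compute time: {exc}")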

docs/source/dataframe_schemas.rst

Lines changed: 4 additions & 0 deletions

@@ -39,6 +39,10 @@ The :class:`~pandera.schemas.DataFrameSchema` object consists of |column|_\s and
         coerce=True,
     )
 
+You can refer to :ref:`schema_models` to see how to define dataframe schemas
+using the alternative pydantic/dataclass-style syntax.
+
+
 .. _column:
 
 Column Validation
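
The added cross-reference points readers from the object-based API to the class-based syntax. For orientation, a brief sketch (not from the commit) of the same schema written both ways; the column names and check are illustrative:

    import pandera as pa
    from pandera.typing import Series

    # object-based API
    schema = pa.DataFrameSchema(
        {
            "state": pa.Column(str),
            "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)),
        },
        coerce=True,
    )

    # equivalent class-based, pydantic/dataclass-style API
    class Schema(pa.SchemaModel):
        state: Series[str]
        price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})

        class Config:
            coerce = True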

docs/source/dtypes.rst

Lines changed: 2 additions & 2 deletions

@@ -4,8 +4,8 @@
 
 .. _dtypes:
 
-Pandera Data Types (new)
-========================
+Pandera Data Types
+==================
 
 *new in 0.7.0*
 
Lines changed: 22 additions & 17 deletions

@@ -1,9 +1,9 @@
 .. currentmodule:: pandera
 
-.. _scaling:
+.. _scaling_fugue:
 
-Scaling Pandera to Big Data
-=================================
+Data Validation with Fugue
+==========================
 
 Validation on big data comes in two forms. The first is performing one set of
 validations on data that doesn't fit in memory. The second happens when a large dataset
@@ -17,8 +17,8 @@ code can be used on top of ``Spark`` or ``Dask`` engines with
 to be performed in a distributed setting. ``Fugue`` is an open source abstraction layer that
 ports ``Python``, ``pandas``, and ``SQL`` code to ``Spark`` and ``Dask``.
 
-Fugue
------
+What is Fugue?
+--------------
 
 ``Fugue`` serves as an interface to distributed computing. Because of its non-invasive design,
 existing ``Python`` code can be scaled to a distributed setting without significant changes.
@@ -40,17 +40,22 @@ In this example, a pandas ``DataFrame`` is created with ``state``, ``city`` and
 columns. ``Pandera`` will be used to validate that the ``price`` column values are within
 a certain range.
 
-.. testcode:: scaling_pandera
+.. testcode:: scaling_fugue
 
     import pandas as pd
 
-    data = pd.DataFrame({'state': ['FL','FL','FL','CA','CA','CA'],
-                         'city': ['Orlando', 'Miami', 'Tampa',
-                                  'San Francisco', 'Los Angeles', 'San Diego'],
-                         'price': [8, 12, 10, 16, 20, 18]})
+    data = pd.DataFrame(
+        {
+            'state': ['FL','FL','FL','CA','CA','CA'],
+            'city': [
+                'Orlando', 'Miami', 'Tampa', 'San Francisco', 'Los Angeles', 'San Diego'
+            ],
+            'price': [8, 12, 10, 16, 20, 18],
+        }
+    )
     print(data)
 
-.. testoutput:: scaling_pandera
+.. testoutput:: scaling_fugue
 
       state           city  price
     0    FL        Orlando      8
@@ -64,7 +69,7 @@ a certain range.
 Validation is then applied using pandera. A ``price_validation`` function is
 created that runs the validation. None of this will be new.
 
-.. testcode:: scaling_pandera
+.. testcode:: scaling_fugue
 
     from pandera import Column, DataFrameSchema, Check
 
@@ -85,7 +90,7 @@ to run the code on top of ``Spark``. ``Fugue`` also has a ``DaskExecutionEngine`
 the default pandas-based ``ExecutionEngine``. Because the ``SparkExecutionEngine`` is used, the result
 becomes a ``Spark DataFrame``.
 
-.. testcode:: scaling_pandera
+.. testcode:: scaling_fugue
     :skipif: SKIP_SCALING
 
     from fugue import transform
@@ -94,7 +99,7 @@ becomes a ``Spark DataFrame``.
     spark_df = transform(data, price_validation, schema="*", engine=SparkExecutionEngine)
     spark_df.show()
 
-.. testoutput:: scaling_pandera
+.. testoutput:: scaling_fugue
     :skipif: SKIP_SCALING
 
     +-----+-------------+-----+
@@ -118,7 +123,7 @@ price range for the records with ``state`` FL is lower than the range for the ``
 Two :class:`~pandera.schemas.DataFrameSchema` will be created to reflect this. Notice their ranges
 for the :class:`~pandera.checks.Check` differ.
 
-.. testcode:: scaling_pandera
+.. testcode:: scaling_fugue
 
     price_check_FL = DataFrameSchema({
         "price": Column(int, Check.in_range(min_value=7,max_value=13)),
@@ -139,7 +144,7 @@ To partition our data by ``state``, all we need to do is pass it into the ``tran
 through the ``partition`` argument. This splits up the data across different workers before they
 each run the ``price_validation`` function. Again, this is like a groupby-validation.
 
-.. testcode:: scaling_pandera
+.. testcode:: scaling_fugue
     :skipif: SKIP_SCALING
 
    def price_validation(df:pd.DataFrame) -> pd.DataFrame:
@@ -156,7 +161,7 @@ each run the ``price_validation`` function. Again, this is like a groupby-valida
 
     spark_df.show()
 
-.. testoutput:: scaling_pandera
+.. testoutput:: scaling_fugue
     :skipif: SKIP_SCALING
 
     SparkDataFrame
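
The partitioned run described in the last hunks comes down to one extra argument to ``transform``. Here is a sketch of what that call typically looks like; the ``partition=dict(by="state")`` form and the ``fugue_spark`` import path are assumptions about Fugue's API based on the surrounding text, and ``price_check_FL``/``price_check_CA``/``data`` are the objects defined earlier on the page:

    import pandas as pd
    from fugue import transform
    from fugue_spark import SparkExecutionEngine

    def price_validation(df: pd.DataFrame) -> pd.DataFrame:
        # each partition holds a single state, so pick the matching schema
        location = df["state"].iloc[0]
        check = price_check_FL if location == "FL" else price_check_CA
        check.validate(df)
        return df

    # partition by state so each schema validates only its own group,
    # like a groupby-validation
    spark_df = transform(
        data,
        price_validation,
        schema="*",
        partition=dict(by="state"),
        engine=SparkExecutionEngine,
    )
    spark_df.show()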
