
Commit 8c513b6

update docs for polars (#1613)
Signed-off-by: cosmicBboy <[email protected]>
1 parent dcb58c5 commit 8c513b6

11 files changed: +266 additions, −63 deletions


docs/source/conf.py

Lines changed: 1 addition & 0 deletions
````diff
@@ -292,4 +292,5 @@ def linkcode_resolve(domain, info):
 myst_heading_anchors = 3
 
 nb_execution_mode = "auto"
+nb_execution_timeout = 60
 nb_execution_excludepatterns = ["_contents/try_pandera.ipynb"]
````

docs/source/configuration.md

Lines changed: 19 additions & 6 deletions
````diff
@@ -4,16 +4,29 @@
 
 *New in version 0.17.3*
 
-`pandera` provides a global config `~pandera.config.PanderaConfig`.
+`pandera` provides a global config `~pandera.config.PanderaConfig`. The
+global configuration is available through `pandera.config.CONFIG`. It can also
+be modified with a configuration context `~pandera.config.config_context` and
+fetched with `~pandera.config.get_config_context` in custom code.
 
-This configuration can also be set using environment variables. For instance:
+This configuration can also be set using environment variables.
+
+## Validation depth
+
+Validation depth determines whether pandera only runs schema-level validations
+(column names and datatypes), data-level validations (checks on actual values),
+or both:
 
 ```
 export PANDERA_VALIDATION_ENABLED=False
 export PANDERA_VALIDATION_DEPTH=DATA_ONLY # SCHEMA_AND_DATA, SCHEMA_ONLY, DATA_ONLY
 ```
 
-Runtime data validation incurs a performance overhead. To mitigate this, you have
-the option to disable validation globally. This can be achieved by setting the
-environment variable `PANDERA_VALIDATION_ENABLED=False`. When validation is
-disabled, any `validate` call will return `None`.
+## Enabling/disabling validation
+
+Runtime data validation incurs a performance overhead. To mitigate this in the
+appropriate contexts, you have the option to disable validation globally.
+
+This can be achieved by setting the environment variable
+`PANDERA_VALIDATION_ENABLED=False`. When validation is disabled, any
+`validate` call will not actually run any validation checks.
````

docs/source/dataframe_schemas.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -472,6 +472,8 @@ df = pd.DataFrame({"a": [1, 2, 3]})
 schema.validate(df)
 ```
 
+(index-validation)=
+
 ## Index Validation
 
 You can also specify an {class}`~pandera.api.pandas.components.Index` in the {class}`~pandera.api.pandas.container.DataFrameSchema`.
@@ -509,6 +511,8 @@ except pa.errors.SchemaError as exc:
     print(exc)
 ```
 
+(multiindex-validation)=
+
 ## MultiIndex Validation
 
 `pandera` also supports multi-index column and index validation.
````

docs/source/index.md

Lines changed: 53 additions & 0 deletions
````diff
@@ -326,6 +326,59 @@ extra column and the `None` value.
 This error report can be useful for debugging, with each item in the various
 lists corresponding to a `SchemaError`
 
+
+(supported-features)=
+
+## Supported Features by DataFrame Backend
+
+Currently, pandera provides three validation backends: `pandas`, `pyspark`, and
+`polars`. The table below shows which of pandera's features are available for the
+{ref}`supported dataframe libraries <dataframe-libraries>`:
+
+:::{table}
+:widths: auto
+:align: left
+
+| feature | pandas | pyspark | polars |
+| :------ | ------ | ------- | ------ |
+| {ref}`DataFrameSchema validation <dataframeschemas>` | ✅ | ✅ | ✅ |
+| {ref}`DataFrameModel validation <dataframe-models>` | ✅ | ✅ | ✅ |
+| {ref}`SeriesSchema validation <seriesschemas>` | ✅ | 🚫 | ❌ |
+| {ref}`Index/MultiIndex validation <index-validation>` | ✅ | 🚫 | 🚫 |
+| {ref}`Built-in and custom Checks <checks>` | ✅ | ✅ | ✅ |
+| {ref}`Groupby checks <column-check-groups>` | ✅ | ❌ | ❌ |
+| {ref}`Custom check registration <extensions>` | ✅ | ✅ | ✅ |
+| {ref}`Hypothesis testing <hypothesis>` | ✅ | ❌ | ❌ |
+| {ref}`Built-in <dtype-validation>` and {ref}`custom <dtypes>` `DataType`s | ✅ | ✅ | ✅ |
+| {ref}`Preprocessing with Parsers <parsers>` | ✅ | ❌ | ❌ |
+| {ref}`Data synthesis strategies <data-synthesis-strategies>` | ✅ | ❌ | ❌ |
+| {ref}`Validation decorators <decorators>` | ✅ | ✅ | ✅ |
+| {ref}`Lazy validation <lazy-validation>` | ✅ | ✅ | ✅ |
+| {ref}`Dropping invalid rows <drop-invalid-rows>` | ✅ | ❌ | ✅ |
+| {ref}`Pandera configuration <configuration>` | ✅ | ✅ | ✅ |
+| {ref}`Schema Inference <schema-inference>` | ✅ | ❌ | ❌ |
+| {ref}`Schema persistence <schema-persistence>` | ✅ | ❌ | ❌ |
+| {ref}`Data Format Conversion <data-format-conversion>` | ✅ | ❌ | ❌ |
+| {ref}`Pydantic type support <pydantic-integration>` | ✅ | ❌ | ❌ |
+| {ref}`FastAPI support <fastapi-integration>` | ✅ | ❌ | ❌ |
+
+:::
+
+:::{admonition} Legend
+:class: important
+
+- ✅: Supported
+- ❌: Not supported
+- 🚫: Not applicable
+:::
+
+
+:::{note}
+The `dask`, `modin`, `geopandas`, and `pyspark.pandas` support in pandera all
+leverage the pandas validation backend.
+:::
+
+
 ## Contributing
 
 All contributions, bug reports, bug fixes, documentation improvements,
````

docs/source/parsers.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -18,6 +18,10 @@ series objects before running the validation checks. This is useful when you wan
 to normalize, clip, or otherwise clean data values before applying validation
 checks.
 
+:::{important}
+This feature is only available in the pandas validation backend.
+:::
+
 ## Parsing versus validation
 
 Pandera distinguishes between data validation and parsing. Validation is the act
````

docs/source/polars.md

Lines changed: 83 additions & 13 deletions
````diff
@@ -27,6 +27,14 @@ pip install 'pandera[polars]'
 :::{important}
 If you're on an Apple Silicon machine, you'll need to install polars via
 `pip install polars-lts-cpu`.
+
+You may have to uninstall `polars` if it's already installed:
+
+```
+pip uninstall polars
+pip install polars-lts-cpu
+```
+
 :::
 
 Then you can use pandera schemas to validate polars dataframes. In the example
````
````diff
@@ -89,14 +97,18 @@ schema.validate(lf).collect()
 
 You can also validate {py:class}`polars.DataFrame` objects, which are objects that
 execute computations eagerly. Under the hood, `pandera` will convert
-the `polars.DataFrame` to a `polars.LazyFrame` before validating it:
+the `polars.DataFrame` to a `polars.LazyFrame` before validating it. This is done
+so that the internal validation routine that pandera implements can take
+advantage of the optimizations that the polars lazy API provides.
 
 ```{code-cell} python
-df = lf.collect()
+df: pl.DataFrame = lf.collect()
 schema.validate(df)
 ```
 
-:::{note}
+## Synthesizing data for testing
+
+:::{warning}
 The {ref}`data-synthesis-strategies` functionality is not yet supported in
 the polars integration. At this time you can use the polars-native
 [parametric testing](https://docs.pola.rs/py-polars/html/reference/testing.html#parametric-testing)
````
````diff
@@ -107,7 +119,7 @@ functions to generate test data for polars.
 
 Compared to the way `pandera` handles `pandas` dataframes, `pandera`
 attempts to leverage the `polars` [lazy API](https://docs.pola.rs/user-guide/lazy/using/)
-as much as possible to leverage its performance optimization benefits.
+as much as possible to take advantage of its query optimization benefits.
 
 At a high level, this is what happens during schema validation:
 
````
````diff
@@ -130,19 +142,19 @@ informative error messages since all failure cases can be reported.
 :::
 
 `pandera`'s validation behavior aligns with the way `polars` handles lazy
-vs. eager operations. When you can `schema.validate()` on a `polars.LazyFrame`,
+vs. eager operations. When you call `schema.validate()` on a `polars.LazyFrame`,
 `pandera` will apply all of the parsers and checks that can be done without
 any `collect()` operations. This means that it only does validations
 at the schema-level, e.g. column names and data types.
 
-However, if you validate a `polars.DataFrame`, `pandera` perform
+However, if you validate a `polars.DataFrame`, `pandera` performs
 schema-level and data-level validations.
 
 :::{note}
-Under the hood, `pandera` will convert ``` polars.DataFrame``s to a
-``polars.LazyFrame``s before validating them. This is done to leverage the
+Under the hood, `pandera` will convert `polars.DataFrame`s to
+`polars.LazyFrame`s before validating them. This is done to leverage the
 polars lazy API during the validation process. While this feature isn't
-fully optimized in the ``pandera ``` library, this design decision lays the
+fully optimized in the `pandera` library, this design decision lays the
 ground-work for future performance improvements.
 :::
 
````
````diff
@@ -411,6 +423,7 @@ pandera.errors.SchemaErrors: {
 
 ::::
 
+(supported-polars-dtypes)=
 
 ## Supported Data Types
 
````
````diff
@@ -491,6 +504,53 @@ class ModelWithDtypeKwargs(pa.DataFrameModel):
 
 ::::
 
+### Time-agnostic DateTime
+
+In some use cases, it may not matter whether a column containing `pl.DateTime`
+data has a timezone or not. In that case, you can use the pandera-native
+polars datatype:
+
+::::{tab-set}
+
+:::{tab-item} DataFrameSchema
+
+```{testcode} polars
+from pandera.engines.polars_engine import DateTime
+
+
+schema = pa.DataFrameSchema({
+    "created_at": pa.Column(DateTime(time_zone_agnostic=True)),
+})
+```
+
+:::
+
+:::{tab-item} DataFrameModel (Annotated)
+
+```{testcode} polars
+from pandera.engines.polars_engine import DateTime
+
+
+class DateTimeModel(pa.DataFrameModel):
+    created_at: Annotated[DateTime, True]
+```
+
+:::
+
+:::{tab-item} DataFrameModel (Field)
+
+```{testcode} polars
+from pandera.engines.polars_engine import DateTime
+
+
+class DateTimeModel(pa.DataFrameModel):
+    created_at: DateTime = pa.Field(dtype_kwargs={"time_zone_agnostic": True})
+```
+
+:::
+
+::::
+
 
 ## Custom checks
 
````
````diff
@@ -620,7 +680,7 @@ For column-level checks, the custom check function should return a
 
 ### DataFrame-level Checks
 
-If you need to validate values on an entire dataframe, you can specify at check
+If you need to validate values on an entire dataframe, you can specify a check
 at the dataframe level. The expected output is a `polars.LazyFrame` containing
 multiple boolean columns, a single boolean column, or a scalar boolean.
 
````
````diff
@@ -737,11 +797,11 @@ lf: pl.LazyFrame = (
 ```
 
 This syntax is nice because it's clear what's happening just from reading the
-code. Pandera schemas serve as an apparent point in the method chain that
-materializes data.
+code. Pandera schemas serve as a clear point in the method chain where the data
+is materialized.
 
 However, if you don't mind a little magic 🪄, you can set the
-`PANDERA_VALIDATION_DEPTH` variable to `SCHEMA_AND_DATA` to
+`PANDERA_VALIDATION_DEPTH` environment variable to `SCHEMA_AND_DATA` to
 validate data-level properties on a `polars.LazyFrame`. This will be equivalent
 to the explicit code above:
 
````
````diff
@@ -761,3 +821,13 @@ lf: pl.LazyFrame = (
 Under the hood, the validation process will make `.collect()` calls on the
 LazyFrame in order to run data-level validation checks, and it will still
 return a `pl.LazyFrame` after validation is done.
+
+## Supported and Unsupported Functionality
+
+Since the pandera-polars integration is less mature than pandas support, some
+of the functionality offered by pandera with pandas DataFrames is not yet
+supported with polars DataFrames.
+
+You can refer to the {ref}`supported features matrix <supported-features>` to
+see which features are implemented in the polars validation backend.
````

docs/source/pyspark_sql.md

Lines changed: 11 additions & 0 deletions
````diff
@@ -338,3 +338,14 @@ nature. It only works with `Config`.
 
 Use with caution.
 :::
+
+
+## Supported and Unsupported Functionality
+
+Since the pandera-pyspark-sql integration is less mature than pandas support,
+some of the functionality offered by pandera with pandas DataFrames is not yet
+supported with pyspark sql DataFrames.
+
+You can refer to the {ref}`supported features matrix <supported-features>` to
+see which features are implemented in the pyspark-sql validation backend.
````

docs/source/reference/core.rst

Lines changed: 14 additions & 0 deletions
````diff
@@ -51,3 +51,17 @@ Data Objects
 
    pandera.api.polars.types.PolarsData
    pandera.api.pyspark.types.PysparkDataframeColumnObject
+
+Configuration
+-------------
+
+.. autosummary::
+   :toctree: generated
+   :template: class.rst
+   :nosignatures:
+
+   pandera.config.PanderaConfig
+   pandera.config.ValidationDepth
+   pandera.config.ValidationScope
+   pandera.config.config_context
+   pandera.config.get_config_context
````

docs/source/schema_inference.md

Lines changed: 4 additions & 2 deletions
````diff
@@ -7,14 +7,16 @@ file_format: mystnb
 
 (schema-inference)=
 
-# Schema Inference
+# Schema Inference and Persistence
 
 *New in version 0.4.0*
 
 With simple use cases, writing a schema definition manually is pretty
 straight-forward with pandera. However, it can get tedious to do this with
 dataframes that have many columns of various data types.
 
+## Inferring a schema from data
+
 To help you handle these cases, the {func}`~pandera.schema_inference.pandas.infer_schema` function enables
 you to quickly infer a draft schema from a pandas dataframe or series. Below
 is a simple example:
@@ -52,7 +54,7 @@ inferred schema.
 
 (schema-persistence)=
 
-## Schema Persistence
+## Persisting a schema
 
 The schema persistence feature requires a pandera installation with the `io`
 extension. See the {ref}`installation<installation>` instructions for more
````
