
Commit 8c513b6

update docs for polars (#1613)
Signed-off-by: cosmicBboy <[email protected]>
1 parent dcb58c5 commit 8c513b6

11 files changed: +266 additions, −63 deletions


docs/source/conf.py

Lines changed: 1 addition & 0 deletions
````diff
@@ -292,4 +292,5 @@ def linkcode_resolve(domain, info):
 myst_heading_anchors = 3
 
 nb_execution_mode = "auto"
+nb_execution_timeout = 60
 nb_execution_excludepatterns = ["_contents/try_pandera.ipynb"]
````

docs/source/configuration.md

Lines changed: 19 additions & 6 deletions
````diff
@@ -4,16 +4,29 @@
 
 *New in version 0.17.3*
 
-`pandera` provides a global config `~pandera.config.PanderaConfig`.
+`pandera` provides a global config `~pandera.config.PanderaConfig`. The
+global configuration is available through `pandera.config.CONFIG`. It can also
+be modified with a configuration context `~pandera.config.config_context` and
+fetched with `~pandera.config.get_config_context` in custom code.
 
-This configuration can also be set using environment variables. For instance:
+This configuration can also be set using environment variables.
+
+## Validation depth
+
+Validation depth determines whether pandera only runs schema-level validations
+(column names and datatypes), data-level validations (checks on actual values),
+or both:
 
 ```
 export PANDERA_VALIDATION_ENABLED=False
 export PANDERA_VALIDATION_DEPTH=DATA_ONLY # SCHEMA_AND_DATA, SCHEMA_ONLY, DATA_ONLY
 ```
 
-Runtime data validation incurs a performance overhead. To mitigate this, you have
-the option to disable validation globally. This can be achieved by setting the
-environment variable `PANDERA_VALIDATION_ENABLED=False`. When validation is
-disabled, any `validate` call will return `None`.
+## Enabling/disabling validation
+
+Runtime data validation incurs a performance overhead. To mitigate this in the
+appropriate contexts, you have the option to disable validation globally.
+
+This can be achieved by setting the environment variable
+`PANDERA_VALIDATION_ENABLED=False`. When validation is disabled, any
+`validate` call will not actually run any validation checks.
````

docs/source/dataframe_schemas.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -472,6 +472,8 @@ df = pd.DataFrame({"a": [1, 2, 3]})
 schema.validate(df)
 ```
 
+(index-validation)=
+
 ## Index Validation
 
 You can also specify an {class}`~pandera.api.pandas.components.Index` in the {class}`~pandera.api.pandas.container.DataFrameSchema`.
@@ -509,6 +511,8 @@ except pa.errors.SchemaError as exc:
     print(exc)
 ```
 
+(multiindex-validation)=
+
 ## MultiIndex Validation
 
 `pandera` also supports multi-index column and index validation.
````

docs/source/index.md

Lines changed: 53 additions & 0 deletions
````diff
@@ -326,6 +326,59 @@ extra column and the `None` value.
 This error report can be useful for debugging, with each item in the various
 lists corresponding to a `SchemaError`
 
+
+(supported-features)=
+
+## Supported Features by DataFrame Backend
+
+Currently, pandera provides three validation backends: `pandas`, `pyspark`, and
+`polars`. The table below shows which of pandera's features are available for the
+{ref}`supported dataframe libraries <dataframe-libraries>`:
+
+:::{table}
+:widths: auto
+:align: left
+
+| feature | pandas | pyspark | polars |
+| :------ | ------ | ------- | ------ |
+| {ref}`DataFrameSchema validation <dataframeschemas>` | ✅ | ✅ | ✅ |
+| {ref}`DataFrameModel validation <dataframe-models>` | ✅ | ✅ | ✅ |
+| {ref}`SeriesSchema validation <seriesschemas>` | ✅ | 🚫 | ❌ |
+| {ref}`Index/MultiIndex validation <index-validation>` | ✅ | 🚫 | 🚫 |
+| {ref}`Built-in and custom Checks <checks>` | ✅ | ✅ | ✅ |
+| {ref}`Groupby checks <column-check-groups>` | ✅ | ❌ | ❌ |
+| {ref}`Custom check registration <extensions>` | ✅ | ✅ | ✅ |
+| {ref}`Hypothesis testing <hypothesis>` | ✅ | ❌ | ❌ |
+| {ref}`Built-in <dtype-validation>` and {ref}`custom <dtypes>` `DataType`s | ✅ | ✅ | ✅ |
+| {ref}`Preprocessing with Parsers <parsers>` | ✅ | ❌ | ❌ |
+| {ref}`Data synthesis strategies <data-synthesis-strategies>` | ✅ | ❌ | ❌ |
+| {ref}`Validation decorators <decorators>` | ✅ | ✅ | ✅ |
+| {ref}`Lazy validation <lazy-validation>` | ✅ | ✅ | ✅ |
+| {ref}`Dropping invalid rows <drop-invalid-rows>` | ✅ | ❌ | ✅ |
+| {ref}`Pandera configuration <configuration>` | ✅ | ✅ | ✅ |
+| {ref}`Schema Inference <schema-inference>` | ✅ | ❌ | ❌ |
+| {ref}`Schema persistence <schema-persistence>` | ✅ | ❌ | ❌ |
+| {ref}`Data Format Conversion <data-format-conversion>` | ✅ | ❌ | ❌ |
+| {ref}`Pydantic type support <pydantic-integration>` | ✅ | ❌ | ❌ |
+| {ref}`FastAPI support <fastapi-integration>` | ✅ | ❌ | ❌ |
+
+:::
+
+:::{admonition} Legend
+:class: important
+
+- ✅: Supported
+- ❌: Not supported
+- 🚫: Not applicable
+:::
+
+
+:::{note}
+The `dask`, `modin`, `geopandas`, and `pyspark.pandas` support in pandera all
+leverage the pandas validation backend.
+:::
+
+
 ## Contributing
 
 All contributions, bug reports, bug fixes, documentation improvements,
````

docs/source/parsers.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -18,6 +18,10 @@ series objects before running the validation checks. This is useful when you wan
 to normalize, clip, or otherwise clean data values before applying validation
 checks.
 
+:::{important}
+This feature is only available in the pandas validation backend.
+:::
+
 ## Parsing versus validation
 
 Pandera distinguishes between data validation and parsing. Validation is the act
````

docs/source/polars.md

Lines changed: 83 additions & 13 deletions
````diff
@@ -27,6 +27,14 @@ pip install 'pandera[polars]'
 :::{important}
 If you're on an Apple Silicon machine, you'll need to install polars via
 `pip install polars-lts-cpu`.
+
+You may have to uninstall `polars` if it's already installed:
+
+```
+pip uninstall polars
+pip install polars-lts-cpu
+```
+
 :::
 
 Then you can use pandera schemas to validate polars dataframes. In the example
````
````diff
@@ -89,14 +97,18 @@ schema.validate(lf).collect()
 
 You can also validate {py:class}`polars.DataFrame` objects, which are objects that
 execute computations eagerly. Under the hood, `pandera` will convert
-the `polars.DataFrame` to a `polars.LazyFrame` before validating it:
+the `polars.DataFrame` to a `polars.LazyFrame` before validating it. This is done
+so that the internal validation routine that pandera implements can take
+advantage of the optimizations that the polars lazy API provides.
 
 ```{code-cell} python
-df = lf.collect()
+df: pl.DataFrame = lf.collect()
 schema.validate(df)
 ```
 
-:::{note}
+## Synthesizing data for testing
+
+:::{warning}
 The {ref}`data-synthesis-strategies` functionality is not yet supported in
 the polars integration. At this time you can use the polars-native
 [parametric testing](https://docs.pola.rs/py-polars/html/reference/testing.html#parametric-testing)
````
````diff
@@ -107,7 +119,7 @@ functions to generate test data for polars.
 
 Compared to the way `pandera` handles `pandas` dataframes, `pandera`
 attempts to leverage the `polars` [lazy API](https://docs.pola.rs/user-guide/lazy/using/)
-as much as possible to leverage its performance optimization benefits.
+as much as possible to take advantage of its query optimization benefits.
 
 At a high level, this is what happens during schema validation:
 
````
````diff
@@ -130,19 +142,19 @@ informative error messages since all failure cases can be reported.
 :::
 
 `pandera`'s validation behavior aligns with the way `polars` handles lazy
-vs. eager operations. When you can `schema.validate()` on a `polars.LazyFrame`,
+vs. eager operations. When you call `schema.validate()` on a `polars.LazyFrame`,
 `pandera` will apply all of the parsers and checks that can be done without
 any `collect()` operations. This means that it only does validations
 at the schema-level, e.g. column names and data types.
 
-However, if you validate a `polars.DataFrame`, `pandera` perform
+However, if you validate a `polars.DataFrame`, `pandera` performs
 schema-level and data-level validations.
 
 :::{note}
-Under the hood, `pandera` will convert ``` polars.DataFrame``s to a
-``polars.LazyFrame``s before validating them. This is done to leverage the
+Under the hood, `pandera` will convert `polars.DataFrame`s to
+`polars.LazyFrame`s before validating them. This is done to leverage the
 polars lazy API during the validation process. While this feature isn't
-fully optimized in the ``pandera ``` library, this design decision lays the
+fully optimized in the `pandera` library, this design decision lays the
 ground-work for future performance improvements.
 :::
 
````
````diff
@@ -411,6 +423,7 @@ pandera.errors.SchemaErrors: {
 
 ::::
 
+(supported-polars-dtypes)=
 
 ## Supported Data Types
 
````
````diff
@@ -491,6 +504,53 @@ class ModelWithDtypeKwargs(pa.DataFrameModel):
 
 ::::
 
+### Time-agnostic DateTime
+
+In some use cases, it may not matter whether a column containing `pl.DateTime`
+data has a timezone or not. In that case, you can use the pandera-native
+polars datatype:
+
+::::{tab-set}
+
+:::{tab-item} DataFrameSchema
+
+```{testcode} polars
+from pandera.engines.polars_engine import DateTime
+
+
+schema = pa.DataFrameSchema({
+    "created_at": pa.Column(DateTime(time_zone_agnostic=True)),
+})
+```
+
+:::
+
+:::{tab-item} DataFrameModel (Annotated)
+
+```{testcode} polars
+from pandera.engines.polars_engine import DateTime
+
+
+class DateTimeModel(pa.DataFrameModel):
+    created_at: Annotated[DateTime, True]
+```
+
+:::
+
+:::{tab-item} DataFrameModel (Field)
+
+```{testcode} polars
+from pandera.engines.polars_engine import DateTime
+
+
+class DateTimeModel(pa.DataFrameModel):
+    created_at: DateTime = pa.Field(dtype_kwargs={"time_zone_agnostic": True})
+```
+
+:::
+
+::::
+
 
 ## Custom checks
 
````
````diff
@@ -620,7 +680,7 @@ For column-level checks, the custom check function should return a
 
 ### DataFrame-level Checks
 
-If you need to validate values on an entire dataframe, you can specify at check
+If you need to validate values on an entire dataframe, you can specify a check
 at the dataframe level. The expected output is a `polars.LazyFrame` containing
 multiple boolean columns, a single boolean column, or a scalar boolean.
 
````
````diff
@@ -737,11 +797,11 @@ lf: pl.LazyFrame = (
 ```
 
 This syntax is nice because it's clear what's happening just from reading the
-code. Pandera schemas serve as an apparent point in the method chain that
-materializes data.
+code. Pandera schemas serve as a clear point in the method chain where the data
+is materialized.
 
 However, if you don't mind a little magic 🪄, you can set the
-`PANDERA_VALIDATION_DEPTH` variable to `SCHEMA_AND_DATA` to
+`PANDERA_VALIDATION_DEPTH` environment variable to `SCHEMA_AND_DATA` to
 validate data-level properties on a `polars.LazyFrame`. This will be equivalent
 to the explicit code above:
 
````
````diff
@@ -761,3 +821,13 @@ lf: pl.LazyFrame = (
 Under the hood, the validation process will make `.collect()` calls on the
 LazyFrame in order to run data-level validation checks, and it will still
 return a `pl.LazyFrame` after validation is done.
+
+## Supported and Unsupported Functionality
+
+Since the pandera-polars integration is less mature than pandas support, some
+of the functionality offered by pandera with pandas DataFrames is not yet
+supported with polars DataFrames.
+
+You can refer to the {ref}`supported features matrix <supported-features>` to
+see which features are implemented in the polars validation backend.
````

docs/source/pyspark_sql.md

Lines changed: 11 additions & 0 deletions
````diff
@@ -338,3 +338,14 @@ nature. It only works with `Config`.
 
 Use with caution.
 :::
+
+
+## Supported and Unsupported Functionality
+
+Since the pandera-pyspark-sql integration is less mature than pandas support,
+some of the functionality offered by pandera with pandas DataFrames is not yet
+supported with pyspark sql DataFrames.
+
+You can refer to the {ref}`supported features matrix <supported-features>` to
+see which features are implemented in the pyspark-sql validation backend.
````

docs/source/reference/core.rst

Lines changed: 14 additions & 0 deletions
````diff
@@ -51,3 +51,17 @@ Data Objects
 
    pandera.api.polars.types.PolarsData
    pandera.api.pyspark.types.PysparkDataframeColumnObject
+
+Configuration
+-------------
+
+.. autosummary::
+   :toctree: generated
+   :template: class.rst
+   :nosignatures:
+
+   pandera.config.PanderaConfig
+   pandera.config.ValidationDepth
+   pandera.config.ValidationScope
+   pandera.config.config_context
+   pandera.config.get_config_context
````

docs/source/schema_inference.md

Lines changed: 4 additions & 2 deletions
````diff
@@ -7,14 +7,16 @@ file_format: mystnb
 
 (schema-inference)=
 
-# Schema Inference
+# Schema Inference and Persistence
 
 *New in version 0.4.0*
 
 With simple use cases, writing a schema definition manually is pretty
 straight-forward with pandera. However, it can get tedious to do this with
 dataframes that have many columns of various data types.
 
+## Inferring a schema from data
+
 To help you handle these cases, the {func}`~pandera.schema_inference.pandas.infer_schema` function enables
 you to quickly infer a draft schema from a pandas dataframe or series. Below
 is a simple example:
@@ -52,7 +54,7 @@ inferred schema.
 
 (schema-persistence)=
 
-## Schema Persistence
+## Persisting a schema
 
 The schema persistence feature requires a pandera installation with the `io`
 extension. See the {ref}`installation<installation>` instructions for more
````
