@@ -27,6 +27,14 @@ pip install 'pandera[polars]'
 :::{important}
 If you're on an Apple Silicon machine, you'll need to install polars via
 `pip install polars-lts-cpu`.
+
+You may have to delete `polars` if it's already installed:
+
+```
+pip uninstall polars
+pip install polars-lts-cpu
+```
+
 :::
 
 Then you can use pandera schemas to validate polars dataframes. In the example
@@ -89,14 +97,18 @@ schema.validate(lf).collect()
 
 You can also validate {py:class}`polars.DataFrame` objects, which are objects that
 execute computations eagerly. Under the hood, `pandera` will convert
-the `polars.DataFrame` to a `polars.LazyFrame` before validating it:
+the `polars.DataFrame` to a `polars.LazyFrame` before validating it. This is done
+so that the internal validation routine that pandera implements can take
+advantage of the optimizations that the polars lazy API provides.
 
 ```{code-cell} python
-df = lf.collect()
+df: pl.DataFrame = lf.collect()
 schema.validate(df)
 ```
 
-:::{note}
+## Synthesizing data for testing
+
+:::{warning}
 The {ref}`data-synthesis-strategies` functionality is not yet supported in
 the polars integration. At this time you can use the polars-native
 [parametric testing](https://docs.pola.rs/py-polars/html/reference/testing.html#parametric-testing)
@@ -107,7 +119,7 @@ functions to generate test data for polars.
 
 Compared to the way `pandera` handles `pandas` dataframes, `pandera`
 attempts to leverage the `polars` [lazy API](https://docs.pola.rs/user-guide/lazy/using/)
-as much as possible to leverage its performance optimization benefits.
+as much as possible to leverage its query optimization benefits.
 
 At a high level, this is what happens during schema validation:
 
@@ -130,19 +142,19 @@ informative error messages since all failure cases can be reported.
 :::
 
 `pandera`'s validation behavior aligns with the way `polars` handles lazy
-vs. eager operations. When you can `schema.validate()` on a `polars.LazyFrame`,
+vs. eager operations. When you call `schema.validate()` on a `polars.LazyFrame`,
 `pandera` will apply all of the parsers and checks that can be done without
 any `collect()` operations. This means that it only does validations
 at the schema-level, e.g. column names and data types.
 
-However, if you validate a `polars.DataFrame`, `pandera` perform
+However, if you validate a `polars.DataFrame`, `pandera` performs
 schema-level and data-level validations.
 
 :::{note}
-Under the hood, `pandera` will convert ``` polars.DataFrame` `s to a
-`` polars.LazyFrame ` ` s before validating them. This is done to leverage the
+Under the hood, `pandera` will convert `polars.DataFrame`s to
+`polars.LazyFrame`s before validating them. This is done to leverage the
 polars lazy API during the validation process. While this feature isn't
-fully optimized in the `` pandera `` ` library, this design decision lays the
+fully optimized in the `pandera` library, this design decision lays the
 ground-work for future performance improvements.
 :::
 
@@ -411,6 +423,7 @@ pandera.errors.SchemaErrors: {
 
 ::::
 
+(supported-polars-dtypes)=
 
 ## Supported Data Types
 
@@ -491,6 +504,53 @@ class ModelWithDtypeKwargs(pa.DataFrameModel):
 
 ::::
 
+### Time-agnostic DateTime
+
+In some use cases, it may not matter whether a column containing `pl.Datetime`
+data has a timezone or not. In that case, you can use the pandera-native
+polars datatype:
+
+::::{tab-set}
+
+:::{tab-item} DataFrameSchema
+
+```{testcode} polars
+from pandera.engines.polars_engine import DateTime
+
+
+schema = pa.DataFrameSchema({
+    "created_at": pa.Column(DateTime(time_zone_agnostic=True)),
+})
+```
+
+:::
+
+:::{tab-item} DataFrameModel (Annotated)
+
+```{testcode} polars
+from pandera.engines.polars_engine import DateTime
+
+
+class DateTimeModel(pa.DataFrameModel):
+    created_at: Annotated[DateTime, True]
+```
+
+:::
+
+:::{tab-item} DataFrameModel (Field)
+
+```{testcode} polars
+from pandera.engines.polars_engine import DateTime
+
+
+class DateTimeModel(pa.DataFrameModel):
+    created_at: DateTime = pa.Field(dtype_kwargs={"time_zone_agnostic": True})
+```
+
+:::
+
+::::
+
 
 ## Custom checks
 
@@ -620,7 +680,7 @@ For column-level checks, the custom check function should return a
 
 ### DataFrame-level Checks
 
-If you need to validate values on an entire dataframe, you can specify at check
+If you need to validate values on an entire dataframe, you can specify a check
 at the dataframe level. The expected output is a `polars.LazyFrame` containing
 multiple boolean columns, a single boolean column, or a scalar boolean.
 
@@ -737,11 +797,11 @@ lf: pl.LazyFrame = (
 ```
 
 This syntax is nice because it's clear what's happening just from reading the
-code. Pandera schemas serve as an apparent point in the method chain that
-materializes data.
+code. Pandera schemas serve as a clear point in the method chain where the data
+is materialized.
 
 However, if you don't mind a little magic 🪄, you can set the
-`PANDERA_VALIDATION_DEPTH` variable to `SCHEMA_AND_DATA` to
+`PANDERA_VALIDATION_DEPTH` environment variable to `SCHEMA_AND_DATA` to
 validate data-level properties on a `polars.LazyFrame`. This will be equivalent
 to the explicit code above:
 
@@ -761,3 +821,13 @@ lf: pl.LazyFrame = (
 Under the hood, the validation process will make `.collect()` calls on the
 LazyFrame in order to run data-level validation checks, and it will still
 return a `pl.LazyFrame` after validation is done.
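If exporting the variable in a shell is inconvenient (for example in a notebook or a test module), one option is to set it from Python. This is a minimal sketch, assuming pandera reads its configuration from the environment when it is first imported, so the assignment has to happen before any pandera import:

```python
import os

# Assumption: pandera picks up PANDERA_VALIDATION_DEPTH at import time,
# so the variable is set before pandera is imported anywhere.
os.environ["PANDERA_VALIDATION_DEPTH"] = "SCHEMA_AND_DATA"

# import pandera.polars as pa  # import pandera only after this point
```

Setting the variable this way affects the whole process, so the explicit `.collect()` and `.lazy()` chain may be preferable when only a single pipeline needs data-level validation.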
+
+## Supported and Unsupported Functionality
+
+Since the pandera-polars integration is less mature than pandas support, some
+of the functionality that pandera offers for pandas DataFrames is
+not yet supported with polars DataFrames.
+
+For a list of supported and unsupported features, you can
+refer to the {ref}`supported features matrix <supported-features>`, which shows
+which features are implemented in the polars validation backend.