
Commit f0ddcbf

Authored by: cosmicBboy, Jean-Francois Zinque, tfwillems, abyz0123, fkroll8
bugfix release 0.7.1 (#615)
* Unique keyword arg (#580)
* add copy button to docs (#448)
* Add missing inplace arg to SchemaModel's validate (#450)
* link documentation to github (#449)

  Co-authored-by: Niels Bantilan <[email protected]>

* intermediate commit for review by @cosmicBboy
* link documentation to github (#449)

  Co-authored-by: Niels Bantilan <[email protected]>

* intermediate commit for review by @cosmicBboy
* WIP
* fix test errors, re-factor allow_duplicates handling
* fix io tests
* fix docs, remove _allow_duplicates private var
* update unique type signature in strategies
* completing tests for setters and lazy evaluation of unique kw
* small fix for the linting errors
* support dataframe-level uniqueness in strategies
* add docs, fix error formatting, add multiindex support

  Co-authored-by: Jean-Francois Zinque <[email protected]>
  Co-authored-by: tfwillems <[email protected]>
  Co-authored-by: fkroll8 <[email protected]>
  Co-authored-by: fkroll8 <[email protected]>

* Add support for timezone-aware datetime strategies (#595)
* add support for Any annotation in schema model (#594)

  The motivation behind this feature is to support column annotations that
  can have any type, for use cases like the one described in #592, where
  custom checks can be applied to any column except for ones that are
  explicitly defined in the schema model class attributes.

* update pylint, fix lint
* Docs/scaling - Bring Pandera to Spark and Dask (#588)
* scaling.rst
* edited conf
* finished first pass
* removing FugueWorkflow
* Update index.rst
* Update docs/source/scaling.rst

  Co-authored-by: Niels Bantilan <[email protected]>

* add support for timezone-aware datetime strategies
* fix le/ge strategies with datetime
* fix mypy errors

  Co-authored-by: Niels Bantilan <[email protected]>
  Co-authored-by: Kevin Kho <[email protected]>

* schemas with multi-index columns correctly report errors (#600): fixes #589
* strategies module supports undefined checks in regex columns (#599)
* Add support for empty data type annotation in SchemaModel (#602)
* remove artifacts of py3.6 support
* add support for empty data type annotation in SchemaModel
* fix frictionless version in dev dependencies
* fix setuptools version instead of frictionless
* fix setuptools pinning
* remove frictionless from core pandera deps (#609)
* support frictionless primary keys with multiple fields (#608)
* fix validation of check raising error without message (#613)
* docs/requirements.txt pin setuptools (#611)
* bump version 0.7.1

Co-authored-by: Jean-Francois Zinque <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: Kevin Kho <[email protected]>
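The headline change in this release is the dataframe-level `unique` keyword, which flags rows that repeat a combination of values across the listed columns. Conceptually this is a `duplicated(subset=...)` check; a minimal pandas sketch of the idea (the helper name is illustrative, not pandera's implementation):

```python
import pandas as pd

def joint_unique_violations(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return the rows that violate joint uniqueness over `columns`."""
    # keep=False marks every member of a duplicated group, mirroring how
    # pandera reports all offending rows as failure cases
    return df[df.duplicated(subset=columns, keep=False)]

df = pd.DataFrame({"a": [1, 1, 2], "b": [2, 2, 2], "c": [3, 3, 3]})
# rows 0 and 1 share the combination (a, c) == (1, 3)
violations = joint_unique_violations(df, ["a", "c"])
```

The new keyword wires this kind of check into `DataFrameSchema`, `SchemaModel`, the strategies module, and YAML/script serialization, as the diffs below show.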
1 parent 84ea3c2 commit f0ddcbf

26 files changed: +676 −197 lines

.readthedocs.yml

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ formats: all
 python:
   version: 3.7
 install:
+  - requirements: docs/requirements.txt
   - requirements: requirements-dev.txt
   - method: pip
     path: .

docs/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# pin this due to issue described here: https://github.com/pandera-dev/pandera/pull/602#issuecomment-915622823
+setuptools < 58.0.0

docs/source/dataframe_schemas.rst

Lines changed: 32 additions & 0 deletions
@@ -467,6 +467,38 @@ To validate the order of the Dataframe columns, specify ``ordered=True``:
 
 .. _index:
 
+Validating the joint uniqueness of columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In some cases you might want to ensure that a group of columns are unique:
+
+.. testcode:: joint_column_uniqueness
+
+    import pandas as pd
+    import pandera as pa
+
+    schema = pa.DataFrameSchema(
+        columns={col: pa.Column(int) for col in ["a", "b", "c"]},
+        unique=["a", "c"],
+    )
+    df = pd.DataFrame.from_records([
+        {"a": 1, "b": 2, "c": 3},
+        {"a": 1, "b": 2, "c": 3},
+    ])
+    schema.validate(df)
+
+.. testoutput:: joint_column_uniqueness
+
+    Traceback (most recent call last):
+    ...
+    SchemaError: columns '('a', 'c')' not unique:
+      column  index  failure_case
+    0      a      0             1
+    1      a      1             1
+    2      c      0             3
+    3      c      1             3
+
 Index Validation
 ----------------

docs/source/schema_inference.rst

Lines changed: 7 additions & 6 deletions
@@ -107,7 +107,7 @@ You can also write your schema to a python script with :func:`~pandera.io.to_script`:
             Check.less_than_or_equal_to(max_value=20.0),
         ],
         nullable=False,
-        allow_duplicates=True,
+        unique=False,
         coerce=False,
         required=True,
         regex=False,
@@ -116,7 +116,7 @@ You can also write your schema to a python script with :func:`~pandera.io.to_script`:
         dtype=pandera.engines.numpy_engine.Object,
         checks=None,
         nullable=False,
-        allow_duplicates=True,
+        unique=False,
         coerce=False,
         required=True,
         regex=False,
@@ -132,7 +132,7 @@ You can also write your schema to a python script with :func:`~pandera.io.to_script`:
             ),
         ],
         nullable=False,
-        allow_duplicates=True,
+        unique=False,
         coerce=False,
         required=True,
         regex=False,
@@ -185,15 +185,15 @@ is a convenience method for this functionality.
       checks:
         greater_than_or_equal_to: 5.0
         less_than_or_equal_to: 20.0
-      allow_duplicates: true
+      unique: false
       coerce: false
       required: true
       regex: false
     column2:
       dtype: object
       nullable: false
       checks: null
-      allow_duplicates: true
+      unique: false
       coerce: false
       required: true
       regex: false
@@ -203,7 +203,7 @@ is a convenience method for this functionality.
       checks:
         greater_than_or_equal_to: '2010-01-01 00:00:00'
         less_than_or_equal_to: '2012-01-01 00:00:00'
-      allow_duplicates: true
+      unique: false
       coerce: false
       required: true
       regex: false
@@ -218,6 +218,7 @@ is a convenience method for this functionality.
       coerce: false
     coerce: true
     strict: false
+    unique: null
 
 You can edit this yaml file by specifying column names under the ``column``
 key. The respective values map onto key-word arguments in the

environment.yml

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ dependencies:
   - pytest-xdist
   - pytest-asyncio
   - xdoctest
-  - setuptools >= 52.0.0
+  - setuptools < 58.0.0
   - nox = 2020.12.31  # pinning due to UnicodeDecodeError, see https://github.com/pandera-dev/pandera/pull/504/checks?check_run_id=2841360122
   - importlib_metadata  # required if python < 3.8
3838

pandera/engines/pandas_engine.py

Lines changed: 7 additions & 1 deletion
@@ -157,7 +157,13 @@ def numpy_dtype(cls, pandera_dtype: dtypes.DataType) -> np.dtype:
             alias = "bool"
         elif alias.startswith("string"):
             alias = "str"
-        return np.dtype(alias)
+
+        try:
+            return np.dtype(alias)
+        except TypeError as err:
+            raise TypeError(
+                f"Data type '{pandera_dtype}' cannot be cast to a numpy dtype."
+            ) from err
 
 
 ###############################################################################
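The `try/except` added here exists because `np.dtype` raises a terse `TypeError` for aliases it cannot interpret, and the wrapper re-raises with the offending pandera dtype named. A quick sketch of the failure mode being wrapped (the helper name is illustrative):

```python
import numpy as np

def numpy_dtype_or_error(alias: str) -> str:
    """Resolve a dtype alias, reporting the alias when it cannot be cast."""
    try:
        # valid aliases like "int64" resolve to a numpy dtype
        return np.dtype(alias).name
    except TypeError:
        # unknown aliases raise a bare TypeError; name the culprit instead
        return f"cannot cast '{alias}'"

print(numpy_dtype_or_error("int64"))
print(numpy_dtype_or_error("decimal128"))
```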

pandera/errors.py

Lines changed: 3 additions & 1 deletion
@@ -171,7 +171,9 @@ def _parse_schema_errors(schema_errors: List[Dict[str, Any]]):
                 schema_context=err.schema.__class__.__name__,
                 check=check_identifier,
                 check_number=err.check_index,
-                column=column,
+                # explicitly wrap `column` in a list if the column key is
+                # a tuple in the case of MultiIndex column names.
+                column=[column] if isinstance(column, tuple) else column,
             )
             check_failure_cases.append(failure_cases[column_order])

pandera/io.py

Lines changed: 31 additions & 16 deletions
@@ -108,7 +108,7 @@ def _serialize_component_stats(component_stats):
         key: component_stats.get(key)
         for key in [
             "name",
-            "allow_duplicates",
+            "unique",
             "coerce",
             "required",
             "regex",
@@ -148,6 +148,7 @@ def _serialize_schema(dataframe_schema):
         "index": index,
         "coerce": dataframe_schema.coerce,
         "strict": dataframe_schema.strict,
+        "unique": dataframe_schema.unique,
     }
@@ -195,6 +196,9 @@ def _deserialize_component_stats(serialized_component_stats):
         for key in [
             "name",
             "nullable",
+            "unique",
+            # deserialize allow_duplicates property for backwards
+            # compatibility. Remove this for 0.8.0 release
             "allow_duplicates",
             "coerce",
             "required",
@@ -255,6 +259,7 @@ def _deserialize_schema(serialized_schema):
         index=index,
         coerce=serialized_schema.get("coerce", False),
         strict=serialized_schema.get("strict", False),
+        unique=serialized_schema.get("unique", None),
     )
@@ -310,7 +315,7 @@ def _write_yaml(obj, stream):
         dtype={dtype},
         checks={checks},
         nullable={nullable},
-        allow_duplicates={allow_duplicates},
+        unique={unique},
         coerce={coerce},
         required={required},
         regex={regex},
@@ -397,7 +402,7 @@ def to_script(dataframe_schema, path_or_buf=None):
         ),
         checks=_format_checks(properties["checks"]),
         nullable=properties["nullable"],
-        allow_duplicates=properties["allow_duplicates"],
+        unique=properties["unique"],
         coerce=properties["coerce"],
         required=properties["required"],
         regex=properties["regex"],
@@ -418,6 +423,7 @@ def to_script(dataframe_schema, path_or_buf=None):
         coerce=dataframe_schema.coerce,
         strict=dataframe_schema.strict,
         name=dataframe_schema.name.__repr__(),
+        unique=dataframe_schema.unique,
     ).strip()
 
     # add pandas imports to handle datetime and timedelta.
@@ -445,15 +451,15 @@ class FrictionlessFieldParser:
     formats, titles, descriptions).
 
     :param field: a field object from a frictionless schema.
-    :param primary_keys: the primary keys from a frictionless schema. These are used
-        to ensure primary key fields are treated properly - no duplicates,
-        no missing values etc.
+    :param primary_keys: the primary keys from a frictionless schema. These
+        are used to ensure primary key fields are treated properly - no
+        duplicates, no missing values etc.
     """
 
     def __init__(self, field, primary_keys) -> None:
         self.constraints = field.constraints or {}
+        self.primary_keys = primary_keys
         self.name = field.name
-        self.is_a_primary_key = self.name in primary_keys
         self.type = field.get("type", "string")
 
     @property
@@ -544,18 +550,22 @@ def nullable(self) -> bool:
         """Determine whether this field can contain missing values.
 
         If a field is a primary key, this will return ``False``."""
-        if self.is_a_primary_key:
+        if self.name in self.primary_keys:
             return False
         return not self.constraints.get("required", False)
 
     @property
-    def allow_duplicates(self) -> bool:
+    def unique(self) -> bool:
         """Determine whether this field can contain duplicate values.
 
-        If a field is a primary key, this will return ``False``."""
-        if self.is_a_primary_key:
-            return False
-        return not self.constraints.get("unique", False)
+        If a field is a primary key, this will return ``True``.
+        """
+        # only set column-level uniqueness property if `primary_keys`
+        # contains more than one field name.
+        if len(self.primary_keys) == 1 and self.name in self.primary_keys:
+            return True
+        return self.constraints.get("unique", False)
 
     @property
     def coerce(self) -> bool:
@@ -587,10 +597,10 @@ def regex(self) -> bool:
     def to_pandera_column(self) -> Dict:
         """Export this field to a column spec dictionary."""
         return {
-            "allow_duplicates": self.allow_duplicates,
             "checks": self.checks,
             "coerce": self.coerce,
             "nullable": self.nullable,
+            "unique": self.unique,
             "dtype": self.dtype,
             "required": self.required,
             "name": self.name,
@@ -645,8 +655,8 @@ def from_frictionless_schema(
         [<Check in_range: in_range(10, 99)>]
         >>> schema.columns["column_1"].required
         True
-        >>> schema.columns["column_1"].allow_duplicates
-        False
+        >>> schema.columns["column_1"].unique
+        True
         >>> schema.columns["column_2"].checks
         [<Check str_length: str_length(None, 10)>, <Check str_matches: str_matches(re.compile('^\\\\S+$'))>]
         """
@@ -664,5 +674,10 @@ def from_frictionless_schema(
         "checks": None,
         "coerce": True,
         "strict": True,
+        # only set dataframe-level uniqueness if the frictionless primary
+        # key property specifies more than one field
+        "unique": (
+            None if len(schema.primary_key) == 1 else list(schema.primary_key)
+        ),
     }
     return _deserialize_schema(assembled_schema)
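The frictionless changes encode a small rule: a single-field primary key maps to column-level uniqueness, while a multi-field primary key becomes a dataframe-level `unique` list. A plain-Python sketch of that mapping (function names are illustrative, not pandera's internals):

```python
from typing import List, Optional

def dataframe_level_unique(primary_key: List[str]) -> Optional[List[str]]:
    # a multi-field primary key becomes the schema-level `unique` list;
    # a single-field key is handled at the column level instead
    return None if len(primary_key) == 1 else list(primary_key)

def column_level_unique(name: str, primary_key: List[str]) -> bool:
    # a field gets unique=True only when it is the sole primary key field
    return len(primary_key) == 1 and name in primary_key
```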

pandera/model.py

Lines changed: 14 additions & 16 deletions
@@ -34,21 +34,9 @@
     FieldInfo,
 )
 from .schemas import DataFrameSchema
-from .typing import LEGACY_TYPING, AnnotationInfo, DataFrame, Index, Series
+from .typing import AnnotationInfo, DataFrame, Index, Series
 
-if LEGACY_TYPING:
-
-    def get_type_hints(
-        obj: Callable[..., Any],
-        globalns: Optional[Dict[str, Any]] = None,
-        localns: Optional[Dict[str, Any]] = None,
-        include_extras: bool = False,
-    ) -> Dict[str, Any]:
-        # pylint:disable=function-redefined, missing-function-docstring, unused-argument
-        return typing.get_type_hints(obj, globalns, localns)
-
-
-elif sys.version_info[:2] < (3, 9):
+if sys.version_info[:2] < (3, 9):
     from typing_extensions import get_type_hints
 else:
     from typing import get_type_hints
@@ -82,6 +70,9 @@ class BaseConfig:  # pylint:disable=R0903
     name: Optional[str] = None  #: name of schema
     coerce: bool = False  #: coerce types of all schema components
 
+    #: make sure certain column combinations are unique
+    unique: Optional[Union[str, List[str]]] = None
+
     #: make sure all specified columns are in the validated dataframe -
     #: if ``"filter"``, removes columns not specified in the schema
     strict: Union[bool, str] = False
@@ -218,6 +209,7 @@ def to_schema(cls) -> DataFrameSchema:
             strict=cls.__config__.strict,
             name=cls.__config__.name,
             ordered=cls.__config__.ordered,
+            unique=cls.__config__.unique,
         )
         if cls not in MODEL_CACHE:
             MODEL_CACHE[cls] = cls.__schema__  # type: ignore
@@ -300,7 +292,10 @@ def _build_columns_index(  # pylint:disable=too-many-locals
 
         dtype = None if dtype is Any else dtype
 
-        if annotation.origin is Series:
+        if (
+            annotation.origin is Series
+            or annotation.raw_annotation is Series
+        ):
            col_constructor = (
                field.to_column if field else schema_components.Column
            )
@@ -316,7 +311,10 @@ def _build_columns_index(  # pylint:disable=too-many-locals
                checks=field_checks,
                name=field_name,
            )
-        elif annotation.origin is Index:
+        elif (
+            annotation.origin is Index
+            or annotation.raw_annotation is Index
+        ):
            if annotation.optional:
                raise SchemaInitError(
                    f"Index '{field_name}' cannot be Optional."

0 commit comments
