@derinwalters
Contributor

@derinwalters derinwalters commented Oct 21, 2023

For #1108 and related GeoPandas discussion #693

Adds support for GeoDataFrame, coercion of several input formats (WKT, WKB, GeoJSON dict) into GeometryArray/GeoSeries, and extends the pandas engine Geometry dtype with a crs attribute. The crs attribute makes it possible to validate that data is in the expected CRS and, when coercion is enabled, to transform coordinates into that CRS. A GeoDataFrame with multiple GeoSeries columns in different CRSs is supported. Designating 'geometry' and 'crs' at the GeoDataFrame level is not implemented because I wasn't sure of the best way to go about it. Note that the prior behavior of Geometry/GeoSeries is unchanged, so existing code needs no changes.

I understand that CRS isn't technically part of GeometryDtype, but implementing it in the Geometry dtype makes a lot of sense to me in the context of GeoPandas. I'm open to discussion if there is a better way.
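As a rough outline of the validate-or-coerce behavior described above, here is a minimal standard-library sketch. `Column` and `check_or_coerce_crs` are hypothetical names made up for illustration and are not part of pandera; the error text mirrors the message shown in the example further down, and treating `crs=None` as "skip the check" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    """Hypothetical stand-in for a geometry column: a name plus a CRS label."""
    name: str
    crs: Optional[str]


def check_or_coerce_crs(col: Column, expected_crs: Optional[str], coerce: bool) -> Column:
    """Return a column in the expected CRS, or raise if it differs and coerce is off."""
    # Assumption: a schema without a crs (None) performs no CRS check at all.
    if expected_crs is None or col.crs == expected_crs:
        return col
    if not coerce:
        raise ValueError(f"CRS mismatch; actual {col.crs}, expected {expected_crs}")
    # A real implementation would reproject the coordinates here;
    # this sketch only relabels the column.
    return Column(col.name, expected_crs)


col = Column("geometry", "EPSG:3395")
print(check_or_coerce_crs(col, "EPSG:4326", coerce=True))
try:
    check_or_coerce_crs(col, "EPSG:4326", coerce=False)
except ValueError as exc:
    print(exc)
```

Because the check is per column, the same function could be applied to several geometry columns, each with its own expected CRS.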

import pandera as pa
from pandera.typing.geopandas import GeoDataFrame
from pandera.engines.pandas_engine import Geometry
from shapely.geometry import Point


class MySchemaWithCoerce(pa.DataFrameModel):
    geometry: Geometry(crs="EPSG:4326")  # 'crs' argument is optional (default: None)
    count: int

    class Config:
        coerce = True

class MySchemaWithoutCoerce(pa.DataFrameModel):
    geometry: Geometry(crs="EPSG:4326")
    count: int

    class Config:
        coerce = False

data = {"geometry": [Point([450000, 900000])], "count": [100]}

# Coerce from GeoDataFrame constructor-designated EPSG:3857 to schema EPSG:4326
gdf = GeoDataFrame[MySchemaWithCoerce](data, crs="EPSG:3857")
print(f'name={gdf.geometry.name}, crs={gdf.geometry.crs}, data={gdf.geometry.to_numpy()}')
# name=geometry, crs=EPSG:4326, data=[<POINT (4.042 8.058)>]

# Do some coordinate transformations outside the comfort of Pandera
gdf = gdf.to_crs("EPSG:3395")
print(f'name={gdf.geometry.name}, crs={gdf.geometry.crs}, data={gdf.geometry.to_numpy()}')
# name=geometry, crs=EPSG:3395, data=[<POINT (450000 894014.468)>]

# Loading geometry whose CRS differs from the schema's fails when coerce is disabled
try:
    gdf = GeoDataFrame[MySchemaWithoutCoerce](gdf)
except Exception as e:
    print(e)
print(f'name={gdf.geometry.name}, crs={gdf.geometry.crs}, data={gdf.geometry.to_numpy()}')
# CRS mismatch; actual EPSG:3395, expected EPSG:4326
# name=geometry, crs=EPSG:3395, data=[<POINT (450000 894014.468)>]

# Coerce back to schema
gdf = GeoDataFrame[MySchemaWithCoerce](gdf)
print(f'name={gdf.geometry.name}, crs={gdf.geometry.crs}, data={gdf.geometry.to_numpy()}')
# name=geometry, crs=EPSG:4326, data=[<POINT (4.042 8.058)>]
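
To make the numbers in the example above easier to check, here is a standard-library sketch of the arithmetic behind the two transformations. It uses the closed-form spherical Web Mercator inverse (EPSG:3857 to EPSG:4326) and the ellipsoidal Mercator forward projection (EPSG:4326 to EPSG:3395) on the WGS84 ellipsoid. This is not how pandera or GeoPandas performs the transformation (they delegate to pyproj); the function names are made up for illustration.

```python
import math

WGS84_A = 6378137.0                  # semi-major axis in metres
WGS84_F = 1 / 298.257223563          # flattening
WGS84_E = math.sqrt(WGS84_F * (2 - WGS84_F))  # first eccentricity


def web_mercator_to_lonlat(x: float, y: float) -> tuple[float, float]:
    """Inverse spherical Mercator: EPSG:3857 metres to EPSG:4326 degrees."""
    lon = math.degrees(x / WGS84_A)
    lat = math.degrees(math.atan(math.sinh(y / WGS84_A)))
    return lon, lat


def lonlat_to_world_mercator(lon: float, lat: float) -> tuple[float, float]:
    """Forward ellipsoidal Mercator: EPSG:4326 degrees to EPSG:3395 metres."""
    phi = math.radians(lat)
    e_sin = WGS84_E * math.sin(phi)
    x = WGS84_A * math.radians(lon)
    y = WGS84_A * math.log(
        math.tan(math.pi / 4 + phi / 2)
        * ((1 - e_sin) / (1 + e_sin)) ** (WGS84_E / 2)
    )
    return x, y


lon, lat = web_mercator_to_lonlat(450000, 900000)
print(round(lon, 3), round(lat, 3))  # 4.042 8.058

x, y = lonlat_to_world_mercator(lon, lat)
print(round(x, 3), round(y, 3))
```

The first result matches the coerced point printed above, and the second should land within a fraction of a metre of the EPSG:3395 point (450000, 894014.468). The x coordinate is unchanged because both projections scale longitude by the same semi-major axis; only y differs, since EPSG:3395 accounts for the ellipsoid's flattening.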

@codecov

codecov bot commented Oct 21, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (fb680ac) 93.92% compared to head (7e54845) 93.99%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1392      +/-   ##
==========================================
+ Coverage   93.92%   93.99%   +0.06%     
==========================================
  Files          91       91              
  Lines        6787     6960     +173     
==========================================
+ Hits         6375     6542     +167     
- Misses        412      418       +6     
Files                              Coverage Δ
pandera/engines/pandas_engine.py   97.23% <100.00%> (+0.42%) ⬆️
pandera/typing/geopandas.py        94.00% <93.33%> (-6.00%) ⬇️

@cosmicBboy
Collaborator

Thanks @derinwalters !

One note though: we should use the parameterized dtypes syntax, either using Annotated or pa.Field(dtype_kwargs={...})

…eterized Geometry, and a bunch of tests to improve coverage (unionai-oss#1108)

Signed-off-by: Derin Walters <[email protected]>
…ove the GeoDataFrame from_records due to not understanding its purpose well (unionai-oss#1108)

Signed-off-by: Derin Walters <[email protected]>
@derinwalters
Contributor Author

derinwalters commented Oct 22, 2023

@cosmicBboy Are you referring to parameterized GeoSeries? If I understand correctly, I'm assuming you mean that we should avoid the direct use of the argument in an instantiated type.

# Avoid this declaration
class Schema(pa.DataFrameModel):
    geometry: Geometry(crs="EPSG:4326")

# These declarations are okay
class Schema(pa.DataFrameModel):
    geometry: Geometry = pa.Field(dtype_kwargs={"crs": "EPSG:4326"})

class Schema(pa.DataFrameModel):
    geometry: GeoSeries[Annotated[Geometry, "EPSG:4326"]]

class Schema(pa.DataFrameModel):
    geometry: GeoSeries = pa.Field(dtype_kwargs={"crs": "EPSG:4326"})
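
For readers unfamiliar with the Annotated form, the sketch below shows with the standard library alone how extra arguments placed inside `Annotated[...]` can be recovered from a class annotation and turned into a parameterized dtype instance. `Geometry`, `GeoSeries`, and `resolve_dtype` here are hypothetical stand-ins written for this illustration; pandera's actual resolution logic may differ.

```python
from dataclasses import dataclass
from typing import Annotated, Generic, Optional, TypeVar, get_args, get_origin

T = TypeVar("T")


@dataclass(frozen=True)
class Geometry:
    """Hypothetical stand-in for a parameterized dtype (not pandera's class)."""
    crs: Optional[str] = None


class GeoSeries(Generic[T]):
    """Hypothetical stand-in for a generic column type."""


def resolve_dtype(annotation):
    """Build a dtype instance from GeoSeries[Dtype] or GeoSeries[Annotated[Dtype, *args]]."""
    (inner,) = get_args(annotation)  # the single type parameter of GeoSeries[...]
    if get_origin(inner) is Annotated:
        # get_args on an Annotated alias yields (base type, *metadata).
        dtype_cls, *dtype_args = get_args(inner)
        return dtype_cls(*dtype_args)
    return inner()


class Schema:
    geometry: GeoSeries[Annotated[Geometry, "EPSG:4326"]]


resolved = resolve_dtype(Schema.__annotations__["geometry"])
print(resolved)  # Geometry(crs='EPSG:4326')
```

The point of this style is that the annotation stays a type rather than an instance, which is what the "avoid" declaration above violates by calling `Geometry(crs=...)` directly in the annotation.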

@derinwalters
Contributor Author

@cosmicBboy I looked through the implementation again and believe this PR is okay to merge. Do you have any timeline for the next expected release?

@cosmicBboy
Collaborator

If I understand correctly, I'm assuming you mean that we should avoid the direct use of the argument in an instantiated type.

correct!

@cosmicBboy
Collaborator

will be cutting a release candidate for 0.18.0 at the end of this week

@cosmicBboy
Collaborator

Thanks for the contribution @derinwalters !

@cosmicBboy cosmicBboy merged commit a54d4db into unionai-oss:main Oct 31, 2023
@derinwalters
Contributor Author

Thank you!
