ENH: improve support for datetime columns #486

Status: Open — wants to merge 65 commits into main
Conversation

@theroggy (Member) commented Oct 17, 2024

This PR improves support for datetime columns, mainly in read_dataframe and write_dataframe.

In general, the PR tries to accomplish the following:

  • datetime column data from a file can be read into a GeoDataFrame without data loss. For this, a parameter datetimes has been added to read_dataframe, with the following possible values (see the sketch after this list):
    • "UTC": always return datetime columns as pandas datetime64 columns. If a column contains e.g. data with mixed timezone offsets, the datetimes are converted to UTC, as pandas datetime64 columns don't support such data. This was the behaviour before this PR and stays the default.
    • "DATETIME": return the datetime column values with the timezone information as it was read from the file. In this case, mixed timezone columns are returned as object columns with pandas.Timestamp values, so the timezone information is not lost. Use this option if you want datetime data to be roundtripped correctly in most situations.
    • "STRING": return the datetime columns as ISO8601 strings.
  • (try to) make the treatment of datetimes consistent between the arrow and non-arrow code paths. For use_arrow=True there are several situations where GDAL 3.11 is needed to get correct results.
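
A usage sketch of the proposed keyword (values as described above; the exact naming is still discussed further down in this thread, and "data.gpkg" is a placeholder):

    import pyogrio

    # Default behaviour (as before this PR): mixed-offset columns are
    # converted to UTC datetime64 columns.
    gdf = pyogrio.read_dataframe("data.gpkg", datetimes="UTC")

    # Preserve offsets: mixed-offset columns come back as object columns
    # holding pandas.Timestamp values; other datetime columns stay datetime64.
    gdf = pyogrio.read_dataframe("data.gpkg", datetimes="DATETIME")

    # Return datetime columns as ISO8601 strings.
    gdf = pyogrio.read_dataframe("data.gpkg", datetimes="STRING")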

More specifically:

  • Fix: when a GPKG was read with use_arrow, naive datetimes (no timezone) were interpreted as being UTC, so a naive time of 05:00 was interpreted as 05:00 UTC.
  • Fix: when a .fgb was read with use_arrow, the timezone of datetime columns with a timezone was dropped, so 05:00+5:00 was read as 05:00.
  • Fix: when a file was written with use_arrow, the timezone of datetime columns with any timezone but UTC was dropped, so 05:00+5:00 was written as 05:00 (a naive datetime), for all file types.
  • When reading datetimes with use_arrow, don't convert/represent them as being in UTC time if they have another timezone offset in the dataset.
  • Add support to write columns with mixed timezones (see the sketch after this list). Typically the column needs to be of the object type with pandas.Timestamp or datetime objects in it, as "standard" pandas datetime64 columns don't support mixed timezone offsets in a column.
  • Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.
  • For the cases with use_arrow, the fixes typically require GDAL >= 3.11 (OGRLayer::GetArrowStream(): add a DATETIME_AS_STRING=YES/NO option OSGeo/gdal#11213).
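
A minimal sketch of the mixed-timezone write case from the list above (column and file names are illustrative):

    import geopandas as gpd
    import pandas as pd
    import pyogrio
    from shapely.geometry import Point

    # datetime64 cannot hold mixed offsets, so a mixed-timezone column must be
    # an object column of pandas.Timestamp (or datetime.datetime) values.
    gdf = gpd.GeoDataFrame(
        {
            "dt": [
                pd.Timestamp("2024-01-15 05:00:00+01:00"),
                pd.Timestamp("2024-07-15 05:00:00+02:00"),  # other offset (DST)
            ]
        },
        geometry=[Point(0, 0), Point(1, 1)],
        crs="EPSG:4326",
    )

    # With this PR, the offsets are preserved in the written file.
    pyogrio.write_dataframe(gdf, "out.gpkg")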

Resolves #487

@theroggy theroggy changed the title ENH: deal properly with naive datetimes with arrow TST: add tests exposing some issues with datetimes with arrow? Oct 18, 2024
@jorisvandenbossche (Member) left a comment:

Thanks for diving into this and improving the test coverage!

@theroggy theroggy changed the title TST: add tests exposing some issues with datetimes with arrow? ENH: improve datetime support with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve datetime support with arrow for GDAL >= 3.11 ENH: improve read support for naive and mixed datetimes with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve read support for naive and mixed datetimes with arrow for GDAL >= 3.11 ENH: improve read support for datetimes with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve read support for datetimes with arrow for GDAL >= 3.11 ENH: improve read support for datetime columns with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve read support for datetime columns with arrow for GDAL >= 3.11 ENH: improve support for datetime columns with mixed or naive times Jan 17, 2025
@theroggy theroggy marked this pull request as ready for review January 18, 2025 08:43
@jorisvandenbossche (Member) left a comment:

@theroggy thanks for further looking into this!

I do have some doubts about how much effort we should put into covering corner cases, and about what the desired default behaviour should be; see my comments below.

Comment on lines 54 to 66
    # if object dtype, try parse as utc instead
    if res.dtype == "object":
        try:
            res = pd.to_datetime(ser, utc=True, **datetime_kwargs)
        except Exception:
            pass
@jorisvandenbossche (Member) commented:
From your top post explanation:

Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.

First, I don't think this will work with the upcoming pandas 3.x (we are suppressing the pandas warning above, which says that parsing mixed timezones is going to raise unless utc=True is passed, and that you have to use apply with datetime.datetime.strptime instead to get mixed-offset objects)
(but the tests are also passing, so maybe I am missing something)

Second, a column of mixed-offset objects is in general not particularly useful... So changing this behaviour feels like a regression to me. I understand that we might want to provide the user the option to get this, but by default, I am not sure.

@theroggy (Member, Author) commented Jan 19, 2025:

From your top post explanation:

Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.

First, I don't think this will work with the upcoming pandas 3.x (we are suppressing the pandas warning above, which says that parsing mixed timezones is going to raise unless utc=True is passed, and that you have to use apply with datetime.datetime.strptime instead to get mixed-offset objects) (but the tests are also passing, so maybe I am missing something)

Yes, I saw. Do you know the rationale for pandas 3 forcing people to use a more inefficient way (apply) to get to their data? I did some performance tests, and especially the way advised in the warning is really slow (see the sketch after this list):

  1. with to_datetime() it takes 0.6 sec to convert 1.5 million strings
  2. with apply using datetime.fromisoformat() it takes 1 sec to convert 1.5 million strings, but you need to call to_datetime() first so it can throw an error, so up to 0.6 seconds has to be added to that time.
  3. with apply using datetime.strptime() it takes 56 sec to convert 1.5 million strings
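
For reference, a minimal sketch of the three approaches compared above (timings are from the tests mentioned; the sample strings are illustrative):

    import datetime

    import pandas as pd

    ser = pd.Series(["2024-01-15T05:00:00+01:00", "2024-07-15T05:00:00+02:00"])

    # 1. Vectorized, but converts everything to UTC.
    res_utc = pd.to_datetime(ser, utc=True)

    # 2. Keeps the per-value offsets, as datetime.datetime objects.
    res_obj = ser.map(datetime.datetime.fromisoformat)

    # 3. The approach advised in the pandas warning (by far the slowest).
    res_strptime = ser.map(
        lambda v: datetime.datetime.strptime(v, "%Y-%m-%dT%H:%M:%S%z")
    )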

Second, a column of mixed-offset objects is in general not particularly useful... So changing this behaviour feels like a regression to me. I understand that we might want to provide the user the option to get this, but by default, I am not sure.

For starters, to be clear: this is only relevant for mixed timezone data. Data saved as naive or UTC timestamps, or where all datetimes have the same timezone offset, should/will just stay "regular" pandas datetime columns.
Note: a column with datetimes in a timezone with daylight saving time will also typically lead to mixed offsets, as it will typically contain two different timezone offsets.

For the case of mixed timezone data, it depends on what you want to do with the datetime data. If it is just to look at/show/keep it as part of the table data, the Timestamps look just fine to me. If you really want to do "fancy stuff" with the datetimes, it will in pandas indeed be more convenient for some things to transform them into e.g. UTC datetimes, to get a datetime column instead of an object column.

Regarding the default behaviour, it feels quite odd to me to transform data by default into a form where information (the original timezone) is lost. Also, when you save the data again it will be saved as UTC, so there too the timezone information is lost.

To me, the other way around is more logical: by default you don't lose data. If you want to do "fancy stuff" with a datetime column that contains mixed timezone data, you convert it to e.g. UTC, typically in an extra column (see the sketch below), because most likely you will want to keep the original timezone information when saving.
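
A sketch of that last pattern (assuming a mixed-offset object column; the "dt" and "dt_utc" names are illustrative):

    import pandas as pd

    df = pd.DataFrame(
        {
            "dt": [
                pd.Timestamp("2024-01-15 05:00:00+01:00"),
                pd.Timestamp("2024-07-15 05:00:00+02:00"),
            ]
        }
    )

    # Keep the original mixed-offset column for lossless roundtripping and
    # derive a UTC datetime64 column for computations.
    df["dt_utc"] = pd.to_datetime(df["dt"], utc=True)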

Comment on lines 504 to 507
    elif col.dtype == "object":
        # Column of Timestamp objects, also split in naive datetime and tz offset
        col_na = df[col.notna()][name]
        if len(col_na) and all(isinstance(x, pd.Timestamp) for x in col_na):
@jorisvandenbossche (Member) commented:

I am a bit hesitant to add custom support for this, exactly because it is not really supported by pandas itself; do we then need to add special support for it?

Right now, if you have an object dtype column with Timestamp values, they already get written as strings, which in the end preserves the offset information (in the string representation).
It might read back as strings (depending on the file format), but at that point the user can handle this column as they see fit.

@theroggy (Member, Author) commented Jan 20, 2025:

Without this support, it is impossible to read and write mixed timezone data such that the written file is equivalent to the original file, for quite a few file formats.

If the values are written as strings to the output files without the proper metadata, it depends on the format whether they will be recognized as datetimes when read back. For text files they will typically be recognized as datetimes, because the data types are "guessed" when the file is read (e.g. geojson); for files like .fgb and .gpkg they won't be, because the file metadata will be wrong.

That's not very clean, and as it is very easy to solve, I don't quite see the point of not supporting it properly.

Comment on lines 382 to 383
    if use_arrow and ext == ".gpkg" and __gdal_version__ < (3, 11, 0):
        pytest.skip("Arrow datetime handling improved in GDAL >= 3.11")
@jorisvandenbossche (Member) commented:

What is not yet working for the case with no tz for GPKG?

@theroggy (Member, Author) commented:

Datetimes in a GPKG without timezone are now interpreted as being UTC, so a naive time of 05:00 is interpreted as 05:00 UTC.

This is one of the issues listed in #487 (comment)

@theroggy theroggy marked this pull request as ready for review August 3, 2025 23:45
@theroggy (Member, Author) commented Aug 3, 2025:

@jorisvandenbossche Tests with pandas 3.0 were failing because pd.to_datetime now indeed raises an error for mixed timezones.
-> I added a fallback for this case.

I tested several different options performance-wise and took the least bad one (for 3.3 million rows, conversion now takes between 5 and 12 seconds per datetime column versus 0.5 seconds before, but the option advised in the warning took 50 seconds).

It is mainly creating the pd.Timestamp objects that is slow: datetime.fromisoformat() takes just 1 second, but Timestamp() takes 4-11 seconds. An option is also to return datetime.datetime objects, but they are not as "rich" as pd.Timestamp feature-wise, which is a bit of a pity...

Is it a realistic option that a feature would be added to pandas to e.g. efficiently convert a list of ISO-formatted strings to a list of pd.Timestamp objects? It should be possible, because in pd.to_datetime it was fast.
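
A rough micro-benchmark sketch of the two object-producing parsers discussed above (absolute numbers will differ per machine):

    import datetime
    import timeit

    import pandas as pd

    iso = "2024-07-15T05:00:00+02:00"

    t_dt = timeit.timeit(lambda: datetime.datetime.fromisoformat(iso), number=100_000)
    t_ts = timeit.timeit(lambda: pd.Timestamp(iso), number=100_000)
    print(f"fromisoformat: {t_dt:.2f}s, Timestamp: {t_ts:.2f}s per 100k parses")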

@jorisvandenbossche (Member) commented:

@theroggy thanks a lot for the updates here! The implementation is generally looking good; I mostly have some remaining concerns about the exact user-facing API.

  • "UTC" for the default seems a bit misleading, because it is only UTC for tz-aware columns. If you have a tz-naive column, that roundtrips as tz-naive. Also for tz-aware columns, if there is only one offset, you will get the fixed offset, not UTC (unless that behaviour changed with this PR?)
  • "DATETIME": the important aspect here is that these are datetime/timestamp objects, because "datetime" could also mean the dtype, and then it could be interpreted as "use a datetime dtype" (which is datetime64). Also, the current PR returns pd.Timestamp and not datetime.datetime, so that is also a bit confusing (although maybe we should return datetime.datetime?)

We already have the datetime_as_string=True/False keyword (in the raw IO, just not exposed in the dataframe IO), so I am wondering whether we could also expose that in read_dataframe and then add a datetime_as_object=True/False keyword.
Of course, two separate boolean keywords that cannot both be True at the same time, because they essentially control the same choice, is not great API design either, but from a user's point of view I think I would find it easier to read in my code (read_dataframe(..., datetime_as_string=True) vs read_dataframe(..., datetimes="string")).

Comment on lines 270 to 314
    if not use_arrow:
        # For arrow, datetimes are read as is.
        # For numpy IO, datetimes are read as string values to preserve timezone
        # info, as numpy does not directly support timezones.
        kwargs["datetime_as_string"] = True

    # Always read datetimes as string values to preserve (mixed) timezone info
    # as numpy does not directly support timezones and arrow datetime columns
    # don't support mixed timezones.
@jorisvandenbossche (Member) commented:

Do you know what the performance impact is of also setting datetime_as_string to True for the Arrow code path?
I would think that this is going to be quite a bit slower, and in that case we should maybe only set it if the user asked for datetime objects or strings, and not in the default case?

@theroggy (Member, Author) commented:

In several situations this leads to wrong data being returned, e.g. timezones being dropped, ... Especially for .fgb files, sometimes for GPKG files.

I did a quick test with the 3.3 million buildings in New Zealand on my laptop. The times fluctuate somewhat, but during a more stable period, reading them with datetime_as_string=True took 10.6 seconds and with datetime_as_string=False around 11.4 seconds, so it is a bit faster, but not a huge difference.

Comment on lines 682 to 686
    elif isinstance(dtype, pd.DatetimeTZDtype) and str(dtype.tz) != "UTC":
        # When it is a datetime column with a timezone different than UTC, it
        # needs to be converted to string, otherwise the timezone info is lost.
        df[name] = col.astype("string")
        datetime_cols.append(name)
@jorisvandenbossche (Member) commented:

Do you know why GDAL preserves the tz-awareness for UTC, but not for other offsets (even though the values written to the file are in UTC) in the Arrow write path?

@theroggy (Member, Author) commented:

I didn't test it again explicitly, but this follows the general logic that most timezones use daylight saving time, leading to (potentially) different offsets in a column... and with different offsets in one column, the timezone information gets lost... I updated the inline comment to clarify this.

@jorisvandenbossche (Member) commented:

I mostly have some remaining concerns about the exact user-facing API.

  • "UTC" for the default seems a bit misleading, because it is only UTC for tz-aware columns. If you have a tz-naive column, that roundtrips as tz-naive. Also for tz-aware columns, if there is only one offset, you will get the fixed offset, not UTC (unless that behaviour changed with this PR?)

  • "DATETIME": the important aspect here is that it are datetime/timestamp objects, because "datetime" could also mean the dtype, and then it could be interpreted as "use a datetime dtype" (which is datetime64). Also the current PR returns pd.Timestamp and not datetime.datetime, so that is also a bit confusing (although we should maybe return datetime.datetime?)

I clearly hadn't yet looked in detail at the code changes and the docstring explanation of the keyword... because I see now that the names of the keyword options indeed more or less match the behaviour, I just wasn't expecting that behaviour.

First, as also mentioned in the inline comments, I would like to preserve the current default behaviour of roundtripping tz-naive datetimes. The default of "UTC" no longer does that...
For "DATETIME", I had naively interpreted it as always returning objects, but your docstring explanation of the keyword is clear about still returning datetime64 when that does not lose mixed-offset information.

Comment on lines 260 to 262
from the data source. Columns with values in a single timezone or
without timezone information will be returned as pandas datetime64
columns. Columns with mixed timezone data are returned as object
@jorisvandenbossche (Member) commented:

Small nitpick on the wording, but I think we should be clear that these are "timezone-aware columns with mixed offsets", because the general case of mixed offsets (due to DST changes throughout the year) is still a single time zone.
The issue is that GDAL does not actually support the concept of time zones, only offsets.
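
A small illustration of that distinction (the zone name is just an example):

    import pandas as pd

    # One named time zone, but two different UTC offsets across a DST change;
    # GDAL can only store the offsets.
    idx = pd.to_datetime(["2024-01-15 05:00", "2024-07-15 05:00"])
    aware = idx.tz_localize("Europe/Brussels")
    print([ts.isoformat() for ts in aware])
    # ['2024-01-15T05:00:00+01:00', '2024-07-15T05:00:00+02:00']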

@theroggy (Member, Author) commented:

True, I tried to clarify this.

@jorisvandenbossche (Member) commented:

It is mainly creating the pd.Timestamp objects that is slow: datetime.fromisoformat() takes just 1 second, but Timestamp() takes 4-11 seconds. An option is also to return datetime.datetime objects, but they are not as "rich" as pd.Timestamp feature-wise, which is a bit of a pity...

Given that, I think it would also be fine to return the datetime.datetime objects (that's also what pandas did in the past (< 2.0) when it could not parse to datetime64). It's true they have fewer attributes/methods, but the user can still do that conversion themselves if they need those.

I am not sure whether pandas still has that functionality for vectorized conversion to pd.Timestamp under the hood (because it no longer needs it itself).

@theroggy theroggy marked this pull request as draft August 7, 2025 02:37
@theroggy (Member, Author) commented Aug 7, 2025:

I mostly have some remaining concerns about the exact user-facing API.

  • "UTC" for the default seems a bit misleading, because it is only UTC for tz-aware columns. If you have a tz-naive column, that roundtrips as tz-naive. Also for tz-aware columns, if there is only one offset, you will get the fixed offset, not UTC (unless that behaviour changed with this PR?)
  • "DATETIME": the important aspect here is that it are datetime/timestamp objects, because "datetime" could also mean the dtype, and then it could be interpreted as "use a datetime dtype" (which is datetime64). Also the current PR returns pd.Timestamp and not datetime.datetime, so that is also a bit confusing (although we should maybe return datetime.datetime?)

I clearly hadn't yet looked in detail at the code changes and the docstring explanation of the keyword... because I see now that the names of the keyword options indeed more or less match the behaviour, I just wasn't expecting that behaviour.

First, as also mentioned in the inline comments, I would like to preserve the current default behaviour of roundtripping tz-naive datetimes. The default of "UTC" no longer does that...

@jorisvandenbossche Yes, that was unintentionally changed behaviour. It should be fixed now.

For "DATETIME", I had naively interpreted it as always returning objects, but the your docstring explanation of the keyword is clear about still returning datetime64 when that does not loose mixed offset information.

I just want to expose the default behaviour of pandas.to_datetime (before it was crippled in pandas 3): return the "richest" format possible depending on the data, without changing the data... Hence also a small preference for pd.Timestamp... An option is also to have both a DATETIME and a TIMESTAMP option, to be able to choose between the "fastest" and the "most convenient" option?

API-wise, multiple bools are possible, but I think it indeed feels quite awkward. I get what you mean about reading code, as you only see one option there, but when reading documentation it is also not super intuitive, I think, that you need to combine the docs of multiple parameters to learn what the options are for a single concept.

If we stick to string options, using something like "MIXED_TO_UTC", "MIXED_AS_DATETIME" (and possibly "MIXED_AS_TIMESTAMP") might explain better what behaviour is actually to be expected.

@theroggy theroggy marked this pull request as ready for review August 7, 2025 13:48
Successfully merging this pull request may close issue #487: Differences in how datetime columns are treated with arrow=True.