ENH: improve support for datetime columns #486

Status: Open — wants to merge 65 commits into main
Conversation

@theroggy (Member) commented Oct 17, 2024

This PR improves support for datetime columns, mainly in read_dataframe and write_dataframe.

In general, the PR tries to accomplish the following:

  • datetime column data from a file can be read into a GeoDataFrame without data loss. For this, a parameter datetimes has been added to read_dataframe, with the following possible values (see the sketch after this list):
    • "UTC": always return datetime columns as pandas datetime64 columns. If a column contains e.g. data with mixed timezone offsets, the datetimes are converted to UTC, as pandas datetime64 columns don't support such data. This was the behaviour before this PR and stays the default.
    • "DATETIME": return the datetime column values with the timezone information as it was read from the file. In this case, mixed timezone columns are returned as object columns with pandas.Timestamp values, so the timezone information is not lost. Use this option if you want datetime data to be roundtripped correctly in most situations.
    • "STRING": return the datetime columns as ISO8601 strings.
  • (try to) make the treatment of datetimes consistent between the arrow and non-arrow code paths. For use_arrow=True there are several situations where GDAL 3.11 is needed to get correct results.
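
A usage sketch of the proposed keyword (values as described above; the exact naming is still discussed further down in this thread, and "data.gpkg" is a placeholder):

    import pyogrio

    # Default behaviour (as before this PR): mixed-offset columns are
    # converted to UTC datetime64 columns.
    gdf = pyogrio.read_dataframe("data.gpkg", datetimes="UTC")

    # Preserve offsets: mixed-offset columns come back as object columns
    # holding pandas.Timestamp values; other datetime columns stay datetime64.
    gdf = pyogrio.read_dataframe("data.gpkg", datetimes="DATETIME")

    # Return datetime columns as ISO8601 strings.
    gdf = pyogrio.read_dataframe("data.gpkg", datetimes="STRING")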

More specifically:

  • Fix: when a GPKG was read with use_arrow, naive datetimes (no timezone) were interpreted as being UTC, so a naive time of 05:00 was interpreted as 05:00 UTC.
  • Fix: when a .fgb was read with use_arrow, the timezone of datetime columns with a timezone was dropped, so 05:00+5:00 was read as 05:00.
  • Fix: when a file was written with use_arrow, the timezone of datetime columns with any timezone but UTC was dropped, so 05:00+5:00 was written as 05:00 (a naive datetime), for all file types.
  • When reading datetimes with use_arrow, don't convert/represent them as being in UTC time if they have another timezone offset in the dataset.
  • Add support to write columns with mixed timezones (see the sketch after this list). Typically the column needs to be of the object type with pandas.Timestamp or datetime objects in it, as "standard" pandas datetime64 columns don't support mixed timezone offsets in a column.
  • Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.
  • For the cases with use_arrow, the fixes typically require GDAL >= 3.11 (OGRLayer::GetArrowStream(): add a DATETIME_AS_STRING=YES/NO option OSGeo/gdal#11213).
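
A minimal sketch of the mixed-timezone write case from the list above (column and file names are illustrative):

    import geopandas as gpd
    import pandas as pd
    import pyogrio
    from shapely.geometry import Point

    # datetime64 cannot hold mixed offsets, so a mixed-timezone column must be
    # an object column of pandas.Timestamp (or datetime.datetime) values.
    gdf = gpd.GeoDataFrame(
        {
            "dt": [
                pd.Timestamp("2024-01-15 05:00:00+01:00"),
                pd.Timestamp("2024-07-15 05:00:00+02:00"),  # other offset (DST)
            ]
        },
        geometry=[Point(0, 0), Point(1, 1)],
        crs="EPSG:4326",
    )

    # With this PR, the offsets are preserved in the written file.
    pyogrio.write_dataframe(gdf, "out.gpkg")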

Resolves #487

@theroggy theroggy changed the title ENH: deal properly with naive datetimes with arrow TST: add tests exposing some issues with datetimes with arrow? Oct 18, 2024
@jorisvandenbossche (Member) left a comment:

Thanks for diving into this and improving the test coverage!

@theroggy theroggy changed the title TST: add tests exposing some issues with datetimes with arrow? ENH: improve datetime support with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve datetime support with arrow for GDAL >= 3.11 ENH: improve read support for naive and mixed datetimes with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve read support for naive and mixed datetimes with arrow for GDAL >= 3.11 ENH: improve read support for datetimes with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve read support for datetimes with arrow for GDAL >= 3.11 ENH: improve read support for datetime columns with arrow for GDAL >= 3.11 Jan 16, 2025
@theroggy theroggy changed the title ENH: improve read support for datetime columns with arrow for GDAL >= 3.11 ENH: improve support for datetime columns with mixed or naive times Jan 17, 2025
@theroggy theroggy marked this pull request as ready for review January 18, 2025 08:43
@jorisvandenbossche (Member) left a comment:

@theroggy thanks for further looking into this!

I do have some doubts about how much effort we should put into covering corner cases, and about what the desired default behaviour should be; see my comments below.

Comment on lines 54 to 66
    # if object dtype, try parse as utc instead
    if res.dtype == "object":
        try:
            res = pd.to_datetime(ser, utc=True, **datetime_kwargs)
        except Exception:
            pass
@jorisvandenbossche (Member) commented:
From your top post explanation:

Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.

First, I don't think this will work with the upcoming pandas 3.x (we are suppressing the pandas warning above, which says that parsing mixed timezones is going to raise unless utc=True is passed, and that you have to use apply with datetime.datetime.strptime instead to get mixed-offset objects)
(but the tests are also passing, so maybe I am missing something)

Second, a column of mixed-offset objects is in general not particularly useful... So changing this behaviour feels like a regression to me. I understand that we might want to provide the user the option to get this, but by default, I am not sure.

@theroggy (Member, Author) commented Jan 19, 2025:

From your top post explanation:

Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.

First, I don't think this will work with the upcoming pandas 3.x (we are suppressing the pandas warning above, which says that parsing mixed timezones is going to raise unless utc=True is passed, and that you have to use apply with datetime.datetime.strptime instead to get mixed-offset objects) (but the tests are also passing, so maybe I am missing something)

Yes, I saw. Do you know the rationale for pandas 3 forcing people to use a more inefficient way (apply) to get to their data? I did some performance tests, and especially the way advised in the warning is really slow (see the sketch after this list):

  1. with to_datetime() it takes 0.6 sec to convert 1.5 million strings
  2. with apply using datetime.fromisoformat() it takes 1 sec to convert 1.5 million strings, but you need to call to_datetime() first so it can throw an error, so up to 0.6 seconds has to be added to that time.
  3. with apply using datetime.strptime() it takes 56 sec to convert 1.5 million strings
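
For reference, a minimal sketch of the three approaches compared above (timings are from the tests mentioned; the sample strings are illustrative):

    import datetime

    import pandas as pd

    ser = pd.Series(["2024-01-15T05:00:00+01:00", "2024-07-15T05:00:00+02:00"])

    # 1. Vectorized, but converts everything to UTC.
    res_utc = pd.to_datetime(ser, utc=True)

    # 2. Keeps the per-value offsets, as datetime.datetime objects.
    res_obj = ser.map(datetime.datetime.fromisoformat)

    # 3. The approach advised in the pandas warning (by far the slowest).
    res_strptime = ser.map(
        lambda v: datetime.datetime.strptime(v, "%Y-%m-%dT%H:%M:%S%z")
    )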

Second, a column of mixed-offset objects is in general not particularly useful... So changing this behaviour feels like a regression to me. I understand that we might want to provide the user the option to get this, but by default, I am not sure.

For starters, to be clear: this is only relevant for mixed timezone data. Data saved as naive or UTC timestamps, or where all datetimes have the same timezone offset, should/will just stay "regular" pandas datetime columns.
Note: a column with datetimes in a timezone with daylight saving time will also typically lead to mixed offsets, as it will typically contain two different timezone offsets.

For the case of mixed timezone data, it depends on what you want to do with the datetime data. If it is just to look at/show/keep it as part of the table data, the Timestamps look just fine to me. If you really want to do "fancy stuff" with the datetimes, it will in pandas indeed be more convenient for some things to transform them into e.g. UTC datetimes, to get a datetime column instead of an object column.

Regarding the default behaviour, it feels quite odd to me to transform data by default into a form where information (the original timezone) is lost. Also, when you save the data again it will be saved as UTC, so there too the timezone information is lost.

To me, the other way around is more logical: by default you don't lose data. If you want to do "fancy stuff" with a datetime column that contains mixed timezone data, you convert it to e.g. UTC, typically in an extra column (see the sketch below), because most likely you will want to keep the original timezone information when saving.
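
A sketch of that last pattern (assuming a mixed-offset object column; the "dt" and "dt_utc" names are illustrative):

    import pandas as pd

    df = pd.DataFrame(
        {
            "dt": [
                pd.Timestamp("2024-01-15 05:00:00+01:00"),
                pd.Timestamp("2024-07-15 05:00:00+02:00"),
            ]
        }
    )

    # Keep the original mixed-offset column for lossless roundtripping and
    # derive a UTC datetime64 column for computations.
    df["dt_utc"] = pd.to_datetime(df["dt"], utc=True)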

Comment on lines 504 to 507
    elif col.dtype == "object":
        # Column of Timestamp objects, also split in naive datetime and tz offset
        col_na = df[col.notna()][name]
        if len(col_na) and all(isinstance(x, pd.Timestamp) for x in col_na):
@jorisvandenbossche (Member) commented:

I am a bit hesitant to add custom support for this, exactly because it is not really supported by pandas itself; do we then need to add special support for it?

Right now, if you have an object dtype column with Timestamp values, they already get written as strings, which in the end preserves the offset information (in the string representation).
It might read back as strings (depending on the file format), but at that point the user can handle this column as they see fit.

@theroggy (Member, Author) commented Jan 20, 2025:

Without this support, it is impossible to read and write mixed timezone data such that the written file is equivalent to the original file, for quite a few file formats.

If the values are written as strings to the output files without the proper metadata, it depends on the format whether they will be recognized as datetimes when read back. For text files they will typically be recognized as datetimes, because the data types are "guessed" when the file is read (e.g. geojson); for files like .fgb and .gpkg they won't be, because the file metadata will be wrong.

That's not very clean, and as it is very easy to solve, I don't quite see the point of not supporting it properly.

Comment on lines 382 to 383
    if use_arrow and ext == ".gpkg" and __gdal_version__ < (3, 11, 0):
        pytest.skip("Arrow datetime handling improved in GDAL >= 3.11")
@jorisvandenbossche (Member) commented:

What is not yet working for the case with no tz for GPKG?

@theroggy (Member, Author) commented:

Datetimes in a GPKG without timezone are now interpreted as being UTC, so a naive time of 05:00 is interpreted as 05:00 UTC.

This is one of the issues listed in #487 (comment)

@theroggy theroggy marked this pull request as ready for review August 3, 2025 23:45
@theroggy (Member, Author) commented Aug 3, 2025:

@jorisvandenbossche Tests with pandas 3.0 were failing because pd.to_datetime now indeed raises an error for mixed timezones.
-> I added a fallback for this case.

I tested several different options performance-wise and took the least bad one (for 3.3 million rows, conversion now takes between 5 and 12 seconds per datetime column versus 0.5 seconds before, but the option advised in the warning took 50 seconds).

It is mainly creating the pd.Timestamp objects that is slow: datetime.fromisoformat() takes just 1 second, but Timestamp() takes 4-11 seconds. An option is also to return datetime.datetime objects, but they are not as "rich" as pd.Timestamp feature-wise, which is a bit of a pity...

Is it a realistic option that a feature would be added to pandas to e.g. efficiently convert a list of ISO-formatted strings to a list of pd.Timestamp objects? It should be possible, because in pd.to_datetime it was fast.
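
A rough micro-benchmark sketch of the two object-producing parsers discussed above (absolute numbers will differ per machine):

    import datetime
    import timeit

    import pandas as pd

    iso = "2024-07-15T05:00:00+02:00"

    t_dt = timeit.timeit(lambda: datetime.datetime.fromisoformat(iso), number=100_000)
    t_ts = timeit.timeit(lambda: pd.Timestamp(iso), number=100_000)
    print(f"fromisoformat: {t_dt:.2f}s, Timestamp: {t_ts:.2f}s per 100k parses")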

@jorisvandenbossche (Member) commented:

@theroggy thanks a lot for the updates here! The implementation is generally looking good; I mostly have some remaining concerns about the exact user-facing API.

  • "UTC" for the default seems a bit misleading, because it is only UTC for tz-aware columns. If you have a tz-naive column, that roundtrips as tz-naive. Also for tz-aware columns, if there is only one offset, you will get the fixed offset, not UTC (unless that behaviour changed with this PR?)
  • "DATETIME": the important aspect here is that these are datetime/timestamp objects, because "datetime" could also mean the dtype, and then it could be interpreted as "use a datetime dtype" (which is datetime64). Also, the current PR returns pd.Timestamp and not datetime.datetime, so that is also a bit confusing (although maybe we should return datetime.datetime?)

We already have the datetime_as_string=True/False keyword (in the raw IO, just not exposed in the dataframe IO), so I am wondering whether we could also expose that in read_dataframe and then add a datetime_as_object=True/False keyword.
Of course, two separate boolean keywords that cannot both be True at the same time, because they essentially control the same choice, is not great API design either, but from a user's point of view I think I would find it easier to read in my code (read_dataframe(..., datetime_as_string=True) vs read_dataframe(..., datetimes="string")).

Comment on lines 270 to 314
    if not use_arrow:
        # For arrow, datetimes are read as is.
        # For numpy IO, datetimes are read as string values to preserve timezone
        # info, as numpy does not directly support timezones.
        kwargs["datetime_as_string"] = True

    # Always read datetimes as string values to preserve (mixed) timezone info
    # as numpy does not directly support timezones and arrow datetime columns
    # don't support mixed timezones.
@jorisvandenbossche (Member) commented:

Do you know what the performance impact is of also setting datetime_as_string to True for the Arrow code path?
I would think that this is going to be quite a bit slower, and in that case we should maybe only set it if the user asked for datetime objects or strings, and not in the default case?

@theroggy (Member, Author) commented:

In several situations this leads to wrong data being returned, e.g. timezones being dropped, ... Especially for .fgb files, sometimes for GPKG files.

I did a quick test with the 3.3 million buildings in New Zealand on my laptop. The times fluctuate somewhat, but during a more stable period, reading them with datetime_as_string=True took 10.6 seconds and with datetime_as_string=False around 11.4 seconds, so it is a bit faster, but not a huge difference.

Comment on lines 682 to 686
    elif isinstance(dtype, pd.DatetimeTZDtype) and str(dtype.tz) != "UTC":
        # When it is a datetime column with a timezone different than UTC, it
        # needs to be converted to string, otherwise the timezone info is lost.
        df[name] = col.astype("string")
        datetime_cols.append(name)
@jorisvandenbossche (Member) commented:

Do you know why GDAL preserves the tz-awareness for UTC, but not for other offsets (even though the values written to the file are in UTC) in the Arrow write path?

@theroggy (Member, Author) commented:

I didn't test it again explicitly, but this follows the general logic that most timezones use daylight saving time, leading to (potentially) different offsets in a column... and with different offsets in one column, the timezone information gets lost... I updated the inline comment to clarify this.

@jorisvandenbossche (Member) commented:

I mostly have some remaining concerns about the exact user-facing API.

  • "UTC" for the default seems a bit misleading, because it is only UTC for tz-aware columns. If you have a tz-naive column, that roundtrips as tz-naive. Also for tz-aware columns, if there is only one offset, you will get the fixed offset, not UTC (unless that behaviour changed with this PR?)

  • "DATETIME": the important aspect here is that it are datetime/timestamp objects, because "datetime" could also mean the dtype, and then it could be interpreted as "use a datetime dtype" (which is datetime64). Also the current PR returns pd.Timestamp and not datetime.datetime, so that is also a bit confusing (although we should maybe return datetime.datetime?)

I clearly hadn't yet looked in detail at the code changes and the docstring explanation of the keyword... because I see now that the names of the keyword options indeed more or less match the behaviour, I just wasn't expecting that behaviour.

First, as also mentioned in the inline comments, I would like to preserve the current default behaviour of roundtripping tz-naive datetimes. The default of "UTC" no longer does that...
For "DATETIME", I had naively interpreted it as always returning objects, but your docstring explanation of the keyword is clear about still returning datetime64 when that does not lose mixed-offset information.

Comment on lines 260 to 262
from the data source. Columns with values in a single timezone or
without timezone information will be returned as pandas datetime64
columns. Columns with mixed timezone data are returned as object
@jorisvandenbossche (Member) commented:

Small nitpick on the wording, but I think we should be clear that these are "timezone-aware columns with mixed offsets", because the general case of mixed offsets (due to DST changes throughout the year) is still a single time zone.
The issue is that GDAL does not actually support the concept of time zones, only offsets.
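
A small illustration of that distinction (the zone name is just an example):

    import pandas as pd

    # One named time zone, but two different UTC offsets across a DST change;
    # GDAL can only store the offsets.
    idx = pd.to_datetime(["2024-01-15 05:00", "2024-07-15 05:00"])
    aware = idx.tz_localize("Europe/Brussels")
    print([ts.isoformat() for ts in aware])
    # ['2024-01-15T05:00:00+01:00', '2024-07-15T05:00:00+02:00']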

@theroggy (Member, Author) commented:

True, I tried to clarify this.

@jorisvandenbossche (Member) commented:

It is mainly creating the pd.Timestamp objects that is slow: datetime.fromisoformat() takes just 1 second, but Timestamp() takes 4-11 seconds. An option is also to return datetime.datetime objects, but they are not as "rich" as pd.Timestamp feature-wise, which is a bit of a pity...

Given that, I think it would also be fine to return the datetime.datetime objects (that's also what pandas did in the past (< 2.0) when it could not parse to datetime64). It's true they have fewer attributes/methods, but the user can still do that conversion themselves if they need those.

I am not sure whether pandas still has that functionality for vectorized conversion to pd.Timestamp under the hood (because it no longer needs it itself).

@theroggy theroggy marked this pull request as draft August 7, 2025 02:37
@theroggy (Member, Author) commented Aug 7, 2025:

I mostly have some remaining concerns about the exact user-facing API.

  • "UTC" for the default seems a bit misleading, because it is only UTC for tz-aware columns. If you have a tz-naive column, that roundtrips as tz-naive. Also for tz-aware columns, if there is only one offset, you will get the fixed offset, not UTC (unless that behaviour changed with this PR?)
  • "DATETIME": the important aspect here is that it are datetime/timestamp objects, because "datetime" could also mean the dtype, and then it could be interpreted as "use a datetime dtype" (which is datetime64). Also the current PR returns pd.Timestamp and not datetime.datetime, so that is also a bit confusing (although we should maybe return datetime.datetime?)

I clearly hadn't yet looked in detail at the code changes and the docstring explanation of the keyword... because I see now that the names of the keyword options indeed more or less match the behaviour, I just wasn't expecting that behaviour.

First, as also mentioned in the inline comments, I would like to preserve the current default behaviour of roundtripping tz-naive datetimes. The default of "UTC" no longer does that...

@jorisvandenbossche Yes, that was unintentionally changed behaviour. It should be fixed now.

For "DATETIME", I had naively interpreted it as always returning objects, but the your docstring explanation of the keyword is clear about still returning datetime64 when that does not loose mixed offset information.

I just want to expose the default behaviour of pandas.to_datetime (before it was crippled in pandas 3): return the "richest" format possible depending on the data, without changing the data... Hence also a small preference for pd.Timestamp... An option is also to have both a DATETIME and a TIMESTAMP option, to be able to choose between the "fastest" and the "most convenient" option?

API-wise, multiple bools are possible, but I think it indeed feels quite awkward. I get what you mean about reading code, as you only see one option there, but when reading documentation it is also not super intuitive, I think, that you need to combine the docs of multiple parameters to learn what the options are for a single concept.

If we stick to string options, using something like "MIXED_TO_UTC", "MIXED_AS_DATETIME" (and possibly "MIXED_AS_TIMESTAMP") might explain better what behaviour is actually to be expected.

@theroggy theroggy marked this pull request as ready for review August 7, 2025 13:48
Successfully merging this pull request may close issue #487: Differences in how datetime columns are treated with arrow=True.