CFTimeIndex #1252


Merged: 75 commits, May 13, 2018

Commits (75 total, all by spencerkclark):

e1e8223  Start on implementing and testing NetCDFTimeIndex (Feb 5, 2017)
6496458  TST Move to using pytest fixtures to structure tests (Feb 6, 2017)
675b2f7  Address initial review comments (Feb 10, 2017)
7beddc1  Address second round of review comments (Feb 11, 2017)
3cf03bc  Fix failing python3 tests (Feb 11, 2017)
53b085c  Match test method name to method name (Feb 11, 2017)
738979b  Merge branch 'master' of https://github.com/pydata/xarray into NetCDF… (Apr 16, 2017)
a177f89  First attempts at integrating NetCDFTimeIndex into xarray (May 10, 2017)
48ec519  Cleanup (May 11, 2017)
9e76df6  Merge branch 'master' into NetCDFTimeIndex (May 11, 2017)
2a7b439  Fix DataFrame and Series test failures for NetCDFTimeIndex (May 11, 2017)
b942724  First pass at making NetCDFTimeIndex compatible with #1356 (May 11, 2017)
7845e6d  Merge branch 'master' into NetCDFTimeIndex (Jun 20, 2017)
a9ed3c8  Address initial review comments (Jun 26, 2017)
3e23ed5  Merge branch 'master' into NetCDFTimeIndex (Aug 25, 2017)
a9f3548  Merge branch 'master' into NetCDFTimeIndex (Jan 22, 2018)
f00f59a  Restore test_conventions.py (Jan 22, 2018)
b34879d  Fix failing test in test_utils.py (Jan 22, 2018)
e93b62d  flake8 (Jan 22, 2018)
61e8bc6  Merge branch 'master' into NetCDFTimeIndex (Feb 20, 2018)
0244f58  Merge branch 'master' into NetCDFTimeIndex (Mar 1, 2018)
32d7986  Update for standalone netcdftime (Mar 1, 2018)
9855176  Address stickler-ci comments (Mar 1, 2018)
8d61fdb  Skip test_format_netcdftime_datetime if netcdftime not installed (Mar 1, 2018)
6b87da7  A start on documentation (Mar 9, 2018)
812710c  Merge branch 'master' into NetCDFTimeIndex (Mar 9, 2018)
3610e6e  Fix failing zarr tests related to netcdftime encoding (Mar 9, 2018)
8f69a90  Simplify test_decode_standard_calendar_single_element_non_ns_range (Mar 9, 2018)
cec909c  Address a couple review comments (Mar 10, 2018)
422792b  Use else clause in _maybe_cast_to_netcdftimeindex (Mar 10, 2018)
de74037  Start on adding enable_netcdftimeindex option (Mar 10, 2018)
2993e3c  Continue parametrizing tests in test_coding_times.py (Mar 10, 2018)
f3438fd  Update time-series.rst for enable_netcdftimeindex option (Mar 10, 2018)
c35364e  Use :py:func: in rst for xarray.set_options (Mar 10, 2018)
08f72dc  Merge branch 'master' into NetCDFTimeIndex (Mar 10, 2018)
62ce0ae  Add a what's new entry and test that resample raises a TypeError (Mar 11, 2018)
ff05005  Merge branch 'master' of https://github.com/pydata/xarray into NetCDF… (Mar 12, 2018)
20fea63  Merge branch 'master' into NetCDFTimeIndex (Mar 16, 2018)
d5a3cef  Move what's new entry to the version 0.10.3 section (Mar 16, 2018)
e721d26  Add version-dependent pathway for importing netcdftime.datetime (Mar 17, 2018)
5e1c4a8  Make NetCDFTimeIndex and date decoding/encoding compatible with datet… (Mar 20, 2018)
257f086  Merge branch 'master' into NetCDFTimeIndex (Mar 20, 2018)
00e8ada  Merge branch 'master' into NetCDFTimeIndex (Apr 12, 2018)
c9d0454  Remove logic to make NetCDFTimeIndex compatible with datetime.datetime (Apr 12, 2018)
f678714  Documentation edits (Apr 12, 2018)
b03e38e  Ensure proper enable_netcdftimeindex option is used under lazy decoding (Apr 13, 2018)
890dde0  Add fix and test for concatenating variables with a NetCDFTimeIndex (Apr 13, 2018)
80e05ba  Merge branch 'master' into NetCDFTimeIndex (Apr 16, 2018)
13c8358  Further namespace changes due to netcdftime/cftime renaming (Apr 16, 2018)
ab46798  NetCDFTimeIndex -> CFTimeIndex (Apr 16, 2018)
67fd335  Documentation updates (Apr 16, 2018)
7041a8d  Only allow use of CFTimeIndex when using the standalone cftime (Apr 16, 2018)
9df4e11  Fix errant what's new changes (Apr 16, 2018)
9391463  flake8 (Apr 16, 2018)
da12ecd  Fix skip logic in test_cftimeindex.py (Apr 16, 2018)
a6997ec  Use only_use_cftime_datetimes option in num2date (Apr 26, 2018)
7302d7e  Merge branch 'master' into NetCDFTimeIndex (Apr 26, 2018)
9dc5539  Require standalone cftime library for all new functionality (Apr 28, 2018)
1aa8d86  Improve skipping logic in test_cftimeindex.py (Apr 28, 2018)
ef3f2b1  Fix skipping logic in test_cftimeindex.py for when cftime or netcdftime (Apr 28, 2018)
4fb5a90  Fix skip logic in Python 3.4 build for test_cftimeindex.py (Apr 28, 2018)
1fd205a  Improve error messages when for when the standalone cftime is not ins… (Apr 28, 2018)
58a0715  Tweak skip logic in test_accessors.py (Apr 28, 2018)
ca4d7dd  flake8 (Apr 28, 2018)
3947aac  Address review comments (Apr 30, 2018)
a395db0  Temporarily remove cftime from py27 build environment on windows (Apr 30, 2018)
1b00bde  flake8 (Apr 30, 2018)
5fdcd20  Install cftime via pip for Python 2.7 on Windows (Apr 30, 2018)
459211c  Merge branch 'master' into NetCDFTimeIndex (Apr 30, 2018)
7e9bb20  flake8 (Apr 30, 2018)
247c9eb  Remove unnecessary new lines; simplify _maybe_cast_to_cftimeindex (May 1, 2018)
e66abe9  Restore test case for #2002 in test_coding_times.py (May 1, 2018)
f25b0b6  Tweak dates out of range warning logic slightly to preserve current d… (May 2, 2018)
b10cc73  Merge branch 'master' into NetCDFTimeIndex (May 2, 2018)
c318755  Address review comments (May 12, 2018)

180 changes: 180 additions & 0 deletions xarray/core/netcdftimeindex.py
@@ -0,0 +1,180 @@
import re
Member: This shouldn't go in core, since there's nothing tying it to core xarray internals. Instead, it should probably go in a new top-level module, maybe a new directory alongside the contents of the existing conventions module (rename it to xarray.conventions.coding?).

from datetime import timedelta

import numpy as np
import pandas as pd

from pandas.lib import isscalar
Collaborator: V minor, but there is an xarray version of this.

Member: And the pandas version isn't public API :)
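
If the xarray helper were used instead, the import might look roughly like the line below (a hedged assumption; the exact name and location of xarray's scalar check at the time of this PR is not shown in the diff):

from xarray.core.utils import is_scalar  # assumed location of xarray's scalar check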



def named(name, pattern):
    return '(?P<' + name + '>' + pattern + ')'
Collaborator: I think .format is faster (as well as idiomatic), because this way will build n strings.

Member: This should only be called once, probably at module import time, so it should not matter for performance. I would just go with whatever is most readable.
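
For reference, a minimal sketch of the str.format variant being discussed (an illustration, not code from this PR):

def named(name, pattern):
    # Same result as the concatenation above, built from a single template string,
    # e.g. named('year', '\d{4}') -> '(?P<year>\d{4})'.
    return '(?P<{name}>{pattern})'.format(name=name, pattern=pattern)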



def optional(x):
    return '(?:' + x + ')?'


def trailing_optional(xs):
    if not xs:
        return ''
    return xs[0] + optional(trailing_optional(xs[1:]))


def build_pattern(date_sep='\-', datetime_sep='T', time_sep='\:'):
    pieces = [(None, 'year', '\d{4}'),
Member: Do you need negative or five-digit years?

Member (author): Personally, I don't see myself needing it in the near future, but I'm not necessarily opposed to adding that support if others would find it useful. It would make writing simple positive four-digit year dates more complicated though, right? Would you always need the leading zero and the sign?

Member: Then let's not bother until someone asks. Per Wikipedia's ISO 8601 article, you can optionally use an expanded year representation with + and -. I don't think they would always be necessary, but I haven't read the original document (which unfortunately I think is not available online).

Contributor: FYI, NCAR's TraCE simulation project is a 21k-year paleoclimate simulation. Not sure how they handle calendars/times. I know somebody who has analyzed data from this simulation; will ask what it looks like.

              (date_sep, 'month', '\d{2}'),
              (date_sep, 'day', '\d{2}'),
              (datetime_sep, 'hour', '\d{2}'),
              (time_sep, 'minute', '\d{2}'),
              (time_sep, 'second', '\d{2}' + optional('\.\d+'))]
    pattern_list = []
    for sep, name, sub_pattern in pieces:
        pattern_list.append((sep if sep else '') + named(name, sub_pattern))
    # TODO: allow timezone offsets?
    return '^' + trailing_optional(pattern_list) + '$'


def parse_iso8601(datetime_string):
    basic_pattern = build_pattern(date_sep='', time_sep='')
    extended_pattern = build_pattern()
    patterns = [basic_pattern, extended_pattern]
Member: Save this as a global variable (see the sketch after this function).

    for pattern in patterns:
        match = re.match(pattern, datetime_string)
        if match:
            return match.groupdict()
    raise ValueError('no ISO-8601 match for string: %s' % datetime_string)
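
A minimal sketch of the "global variable" suggestion above (illustrative only; the constant names are hypothetical and not part of this diff):

_BASIC_PATTERN = build_pattern(date_sep='', time_sep='')
_EXTENDED_PATTERN = build_pattern()
_PATTERNS = [_BASIC_PATTERN, _EXTENDED_PATTERN]


def parse_iso8601(datetime_string):
    # Reuse the prebuilt module-level pattern strings instead of rebuilding
    # them on every call.
    for pattern in _PATTERNS:
        match = re.match(pattern, datetime_string)
        if match:
            return match.groupdict()
    raise ValueError('no ISO-8601 match for string: %s' % datetime_string)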


def _parse_iso8601_with_reso(date_type, timestr):
    default = date_type(1, 1, 1)
    result = parse_iso8601(timestr)
    replace = {}

    for attr in ['year', 'month', 'day', 'hour', 'minute', 'second']:
        value = result.get(attr, None)
        if value is not None:
            replace[attr] = int(value)
Member: Note that seconds can be fractional (see the sketch after this function).

            resolution = attr

    return default.replace(**replace), resolution
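
One possible way to act on the fractional-seconds note above, purely as an illustration (the helper name is hypothetical and not part of this PR):

def _parse_second(value):
    # Split a seconds string such as '15.25' into whole seconds and
    # microseconds: '15.25' -> (15, 250000); '15' -> (15, 0).
    seconds = float(value)
    whole = int(seconds)
    microseconds = int(round((seconds - whole) * 1e6))
    return whole, microseconds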


def _parsed_string_to_bounds(date_type, resolution, parsed):
Member: Note this is based on a pandas function.

    if resolution == 'year':
        return (date_type(parsed.year, 1, 1),
                date_type(parsed.year + 1, 1, 1) - timedelta(microseconds=1))
    if resolution == 'month':
        if parsed.month == 12:
            end = date_type(parsed.year + 1, 1, 1) - timedelta(microseconds=1)
        else:
            end = (date_type(parsed.year, parsed.month + 1, 1) -
                   timedelta(microseconds=1))
        return date_type(parsed.year, parsed.month, 1), end
    if resolution == 'day':
        start = date_type(parsed.year, parsed.month, parsed.day)
        return start, start + timedelta(days=1, microseconds=-1)
    if resolution == 'hour':
        start = date_type(parsed.year, parsed.month, parsed.day, parsed.hour)
        return start, start + timedelta(hours=1, microseconds=-1)
    if resolution == 'minute':
        start = date_type(parsed.year, parsed.month, parsed.day, parsed.hour,
                          parsed.minute)
        return start, start + timedelta(minutes=1, microseconds=-1)
    if resolution == 'second':
        start = date_type(parsed.year, parsed.month, parsed.day, parsed.hour,
                          parsed.minute, parsed.second)
        return start, start + timedelta(seconds=1, microseconds=-1)
    else:
        raise KeyError


def get_date_field(datetimes, field):
    return [getattr(date, field) for date in datetimes]


def _field_accessor(name, docstring=None):
    def f(self):
        return get_date_field(self._data, name)

    f.__name__ = name
    f.__doc__ = docstring
    return property(f)


def get_date_type(self):
    return type(self._data[0])


class NetCDFTimeIndex(pd.Index):
    def __new__(cls, data):
        result = object.__new__(cls)
        result._data = np.array(data)
        return result
Member: Consider validating that every array element has the correct type.
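
A hedged sketch of what such validation might look like (not part of this diff; it assumes a netcdftime import at module level):

    def __new__(cls, data):
        result = object.__new__(cls)
        data = np.array(data)
        # Hypothetical check: require every element to be a netcdftime datetime.
        if not all(isinstance(value, netcdftime.datetime) for value in data):
            raise TypeError('NetCDFTimeIndex requires netcdftime.datetime objects')
        result._data = data
        return result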


    year = _field_accessor('year', 'The year of the datetime')
    month = _field_accessor('month', 'The month of the datetime')
    day = _field_accessor('day', 'The days of the datetime')
    hour = _field_accessor('hour', 'The hours of the datetime')
    minute = _field_accessor('minute', 'The minutes of the datetime')
    second = _field_accessor('second', 'The seconds of the datetime')
    microsecond = _field_accessor('microsecond',
                                  'The microseconds of the datetime')
    date_type = property(get_date_type)

    def _partial_date_slice(self, resolution, parsed,
                            use_lhs=True, use_rhs=True):
        start, end = _parsed_string_to_bounds(self.date_type, resolution,
                                              parsed)
        lhs_mask = (self._data >= start) if use_lhs else True
        rhs_mask = (self._data <= end) if use_rhs else True
        return (lhs_mask & rhs_mask).nonzero()[0]

    def _get_string_slice(self, key, use_lhs=True, use_rhs=True):
Member: Do we actually need the use_lhs and use_rhs arguments?

        parsed, resolution = _parse_iso8601_with_reso(self.date_type, key)
        loc = self._partial_date_slice(resolution, parsed, use_lhs, use_rhs)
        return loc

    def get_loc(self, key, method=None, tolerance=None):
        if isinstance(key, pd.compat.string_types):
Member: Use xarray's pycompat module instead of pandas's.

            result = self._get_string_slice(key)
            # Prevents problem with __contains__ if key corresponds to only
Member: I would solve this problem by checking for boolean dtype instead. (A sketch of one reading of this suggestion appears after __contains__ below.)

Member (author): I lifted the __contains__ function from pandas. It appears that for a non-monotonic DatetimeIndex, this problem also exists (I would expect line 4 to return True):

In [1]: import pandas as pd

In [2]: from datetime import datetime

In [3]: non_monotonic = pd.DatetimeIndex([datetime(2000, 1, 1), datetime(1999, 1, 1), datetime(2001, 1, 1)])

In [4]: '2000-01-01' in non_monotonic
Out[4]: False

In [5]: '1999-01-01' in non_monotonic
Out[5]: True

In [6]: '2001-01-01' in non_monotonic
Out[6]: True

In [7]: monotonic = pd.DatetimeIndex([datetime(1999, 1, 1), datetime(2000, 1, 1), datetime(2001, 1, 1)])

In [8]: '1999-01-01' in monotonic
Out[8]: True

In [9]: '2000-01-01' in monotonic
Out[9]: True

In [10]: '2001-01-01' in monotonic
Out[10]: True

Is it worth discussing with them to see what their recommended fix is?

Member (author): As a side note, this issue, and the behavior I described below, #1252 (comment), seem to be influenced by the fact that, for simplicity, I omitted logic in _partial_date_slice that was specific to monotonic indexes.

For example, if I override _partial_date_slice in DatetimeIndex, omitting the is_monotonic logic, I end up with the behavior that prompted me to try this hack (see this gist). Should I consider trying to mimic the is_monotonic logic in DatetimeIndex's version of _partial_date_slice, or would you recommend keeping things simple?

            # the first element in index (if we leave things as a list,
            # np.any([0]) is False).
            # Also coerces things to scalar coords in xarray if possible,
            # which is consistent with the behavior with a DatetimeIndex.
            if len(result) == 1:
                return result[0]
Member: Does DatetimeIndex actually do exactly this? It's pretty messy behavior.

Member (author): You're correct; DatetimeIndex doesn't do exactly this:

In [1]: import xarray as xr

In [2]: import pandas as pd

In [3]: rng = pd.date_range('2000-01-01', '2001-01-01', freq='M')

In [4]: da = xr.DataArray(rng, coords=[rng], dims=['time'])

In [5]: da.sel(time='2000-01')
Out[5]:
<xarray.DataArray (time: 1)>
array(['2000-01-31T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-31

In [6]: da.sel(time='2000-01-31')
Out[6]:
<xarray.DataArray ()>
array(949276800000000000L, dtype='datetime64[ns]')
Coordinates:
    time     datetime64[ns] 2000-01-31

I'll try and simplify the logic here using your suggestion above.

            else:
                return result
        else:
            return pd.Index.get_loc(self, key, method=method,
                                    tolerance=tolerance)

    def _maybe_cast_slice_bound(self, label, side, kind):
        if isinstance(label, pd.compat.string_types):
            parsed, resolution = _parse_iso8601_with_reso(self.date_type,
                                                          label)
            start, end = _parsed_string_to_bounds(self.date_type, resolution,
                                                  parsed)
            if self.is_monotonic_decreasing and len(self):
                return end if side == 'left' else start
            return start if side == 'left' else end
        else:
            return label

    # TODO: Add ability to use integer range outside of iloc?
Member: This is messy indeed!

    # e.g. series[1:5].
    def get_value(self, series, key):
        if not isinstance(key, slice):
            return series.iloc[self.get_loc(key)]
        else:
            return series.iloc[self.slice_indexer(
                key.start, key.stop, key.step)]

    def __contains__(self, key):
        try:
            result = self.get_loc(key)
            return isscalar(result) or type(result) == slice or np.any(result)
        except (KeyError, TypeError, ValueError):
            return False
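
Relating back to the boolean-dtype suggestion in the get_loc discussion above, a hedged sketch of how __contains__ could avoid the np.any([0]) pitfall (illustrative only, not the code merged in this PR):

    def __contains__(self, key):
        try:
            result = self.get_loc(key)
        except (KeyError, TypeError, ValueError):
            return False
        if isinstance(result, np.ndarray):
            # A boolean mask matches if any element is True; an array of
            # integer positions matches if it is non-empty (even if it
            # contains only position 0).
            return result.any() if result.dtype == bool else result.size > 0
        return True  # scalar position or slice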