Allow setting (or skipping) new indexes in open_dataset #8051
Conversation
@pydata/xarray any thoughts on which option among those above (top comment) would be best?
I vote for this.
Is there any way for packages to say they support V1 or V2 of an entrypoint? I just realized that if we turned off indexes by default, it would be a big win for the open_mfdataset case in many cases.
Doesn't help with the problem at this moment, but could we add `**kwargs` to the standard?
Agreed, adding `**kwargs` to the standard would help! However, to be honest I find it already a bit confusing how kwargs are handled in `open_dataset`.
This looks great to me! I agree with adding this. For what it's worth, I think it's OK to require backend developers to update their code more frequently -- we don't need the same level of stability that we need for user-level APIs.
I would love to see this merged so that I can try this out!
Be aware that merging now will break compatibility with any 3rd-party backend, which I believe is not something we should do, even if we think that the transition window can be shorter than usual. In my eyes the easiest way forward would be:
We don't have an easy way to contact all backend developers, unfortunately. Edit: let's discuss in the meeting today
In the meeting just now we decided to inspect the signature of the backend's `open_dataset`. We should still change the spec to require the new keyword.
xarray/backends/store.py (outdated)
```python
if set_indexes:
    coords = coord_vars
else:
    # explicit Coordinates object with no index passed
    coords = Coordinates(coord_vars)
```
we discussed this in the raster index meeting a bit with @dcherian and @benbovy, and if I remember correctly we agreed that instead of passing through an option to the backend it might be cleaner to always create the backend dataset without indexes, and create the indexes in `open_dataset`:
```diff
-if set_indexes:
-    coords = coord_vars
-else:
-    # explicit Coordinates object with no index passed
-    coords = Coordinates(coord_vars)
+# explicit Coordinates object with no index passed
+coords = Coordinates(coord_vars)
```
Most known backends use the `BackendStoreEntrypoint` to assemble the backend dataset, so these backends will work automatically; for all other backends we'll have to figure out a way to warn.
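As a minimal sketch of what "a backend dataset without indexes" means, using the public `xr.Coordinates` API (this is an illustration, not the PR's actual implementation): passing `indexes={}` keeps the dimension coordinate as a plain variable without a default `PandasIndex`.

```python
import numpy as np
import xarray as xr

# coordinate variable for dimension "x"
coord_vars = {"x": xr.Variable("x", np.arange(4))}

# indexes={} keeps "x" as a plain coordinate variable,
# without creating a default PandasIndex for it
coords = xr.Coordinates(coord_vars, indexes={})
ds = xr.Dataset({"a": ("x", np.zeros(4))}, coords=coords)
```

Selection by label (`ds.sel(x=...)`) would then fail until an index is set, which is exactly what makes index creation a separate, later step.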
If we also change the data model to cleanly separate I/O from decoding variables (what zarr
calls "ArrayToArray" codecs, and what our current decoders do) and datasets (this would be where you'd create indexes), then we could warn about backend datasets with indexes. I think this would be a breaking change, although we could add a deprecation cycle to make the transition smoother.
> cleaner to always create the backend dataset without indexes

I fully support this. I think it's the natural way to generalize. Creating an in-memory index from a lazily-loaded array is totally unrelated to creating the lazy arrays in the first place.
I've changed my mind. In reality, we are seeing that people are writing backends not just for "reading arrays": they are creating datasets with extra metadata (e.g. rioxarray might choose to add a RasterIndex, xwrf might do something else). So perhaps we add a `create_default_indexes` step that sets the default PandasIndex for any dimension coordinate that does not have an index, when returned from the backend?

EDIT: I see that's what you've done. Nice!
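A hypothetical `create_default_indexes` helper along these lines (the name comes from the comment above; the PR's real implementation may differ) could be sketched with the public `set_xindex` API:

```python
import numpy as np
import xarray as xr


def create_default_indexes(ds: xr.Dataset) -> xr.Dataset:
    """Add a default PandasIndex for every dimension coordinate
    that does not already have an index."""
    for name in ds.dims:
        if name in ds.coords and name not in ds.xindexes:
            # set_xindex defaults to a PandasIndex when no index class is given
            ds = ds.set_xindex(name)
    return ds


# a dataset whose dimension coordinate "x" starts out index-less
coords = xr.Coordinates({"x": xr.Variable("x", np.arange(3))}, indexes={})
ds = create_default_indexes(xr.Dataset(coords=coords))
```

Because the helper only touches coordinates that lack an index, a backend that already attached a domain-specific index (e.g. a RasterIndex) would keep it untouched.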
if you look at the current state of the PR, that's exactly my intention
Edit: although in an ideal world we'd add support for "dataset decoders", and then people would use that to create the indexes
> So perhaps we add a `create_default_indexes` that sets the default PandasIndex for any dimension coordinate that does not have an index, when returned from the backend?

The general version of this would be to make the index-adding step independently pluggable, i.e. pass in a callable, which defaults to your `create_default_indexes`.
> Edit: although in an ideal world we'd add support for "dataset decoders", and then people would use that to create the indexes

This is basically what I'm suggesting: I think you should make the API something like
```python
from typing import Protocol, runtime_checkable

import xarray as xr


@runtime_checkable
class IndexSetter(Protocol):
    def __call__(self, backend_ds: xr.Dataset) -> xr.Dataset: ...


def open_dataset(
    ...,  # all the usual open_dataset parameters
    index_setter: IndexSetter = create_default_indexes,  # defined elsewhere
) -> xr.Dataset:
    if not isinstance(index_setter, IndexSetter):
        raise TypeError("index_setter must be callable")
    # all the normal backend logic
    backend_ds = ...
    return index_setter(backend_ds)
```
Either adding a kwarg to `open_dataset` like that, or alternatively you could allow the backends to be the place where people choose which `IndexSetter` callable they want to use (e.g. to always add domain-specific indexes at read time).
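One detail worth noting about the proposal above: since `IndexSetter` only declares `__call__`, an `isinstance` check against a `@runtime_checkable` Protocol merely verifies that the object is callable (signatures are not checked at runtime), which is enough for the proposed `TypeError` guard. A small self-contained sketch:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class IndexSetter(Protocol):
    def __call__(self, backend_ds): ...


def no_op_setter(backend_ds):
    # trivial setter: return the backend dataset unchanged
    return backend_ds


# runtime_checkable only verifies that __call__ exists, not the signature
callable_ok = isinstance(no_op_setter, IndexSetter)      # any callable passes
non_callable_ok = isinstance(42, IndexSetter)            # non-callables fail
```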
that would be an option, but instead of a separate parameter I think it might be better to figure out how a dataset decoder API could look, then reuse that for the default indexes
That makes sense, but then would we deprecate the kwarg you're about to add?
Pluggable and composable Dataset (de)coders would be nice indeed for setting indexes among other things. I like the proposal made in #8548.
Even with a decoder API it might still make sense to have a special case (kwarg) for default indexes, though, since those are also created in `Dataset.__init__()`, `DataArray.__init__()`, `Coordinates.__init__()`, `assign_coords()`, etc.
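For reference, the constructors mentioned above already create default indexes implicitly; a quick check of that standard xarray behavior:

```python
import xarray as xr

# a dimension coordinate passed to Dataset() gets a default PandasIndex
ds = xr.Dataset(coords={"x": [1, 2, 3]})

# assign_coords() likewise creates an index for a new dimension coordinate
ds2 = xr.Dataset().assign_coords(y=[10, 20])
```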
```python
data_vars[name] = var
...
# explicit Coordinates object with no index passed
coords = Coordinates(coord_vars, indexes={})
```
I think this is fine given our rules for propagating coordinate variables when extracting DataArrays, but it is potentially confusing with `create_default_indexes=False, decode_coordinates=False`.
Very nice!
whats-new.rst
api.rst
This PR introduces a new boolean parameter `set_indexes=True` to `xr.open_dataset()`, which may be used to skip the creation of default (pandas) indexes when opening a dataset. Currently works with the Zarr backend; I'll add it to the other Xarray backends as well, but I'd like to get your thoughts about the API first.

1. Do we want to add yet another keyword argument to `xr.open_dataset()`? There are already many...
2. Do we want to add it to the `BackendEntrypoint.open_dataset()` API? If `xr.open_dataset()` requires `set_indexes` in the signature in addition to the `drop_variables` parameter, this is a breaking change for all existing 3rd-party backends.
3. Or should we group `set_indexes` with the other xarray decoder kwargs? This would feel a bit odd to me, as setting indexes is different from decoding data.

Currently 1 and 2 are implemented in this PR, although as I write this comment I think that I would prefer 3. I guess this depends on whether we prefer `open_***` vs. `xr.open_dataset(engine="***")`, and unless I missed something there is still no real consensus about that? (e.g., #7496).