Allow setting (or skipping) new indexes in open_dataset #8051


Merged (27 commits), Jul 8, 2025

Conversation


@benbovy benbovy commented Aug 7, 2023

This PR introduces a new boolean parameter set_indexes (default: True) to xr.open_dataset(), which can be set to False to skip the creation of default (pandas) indexes when opening a dataset.

Currently works with the Zarr backend:

import numpy as np
import xarray as xr

# example dataset (a real dataset may be much larger)
arr = np.random.random(size=1_000_000)
xr.Dataset(coords={"x": arr}).to_zarr("dataset.zarr")


xr.open_dataset("dataset.zarr", set_indexes=False, engine="zarr")
# <xarray.Dataset>
# Dimensions:  (x: 1000000)
# Coordinates:
#     x        (x) float64 ...
# Data variables:
#     *empty*


xr.open_zarr("dataset.zarr", set_indexes=False)
# <xarray.Dataset>
# Dimensions:  (x: 1000000)
# Coordinates:
#     x        (x) float64 ...
# Data variables:
#     *empty*

I'll add it to the other Xarray backends as well, but I'd like to get your thoughts about the API first.

  1. Do we want to add yet another keyword parameter to xr.open_dataset()? There are already many...
  2. Do we want to add this parameter to the BackendEntrypoint.open_dataset() API?
  • I'm afraid we must do it if we want this parameter in xr.open_dataset()
  • this would also make it possible to skip the creation of custom indexes (if any) in custom IO backends
  • con: if we require set_indexes in the signature in addition to the drop_variables parameter, this is a breaking change for all existing 3rd-party backends. Or should we group set_indexes with the other xarray decoder kwargs? This would feel a bit odd to me, as setting indexes is different from decoding data.
  3. Or should we leave this up to the backends?
  • pros: no breaking change, and more flexible (3rd-party backends may want to offer more control, like choosing between custom indexes and default pandas indexes, or skipping the creation of indexes by default)
  • cons: less discoverable, consistency is not enforced across 3rd-party backends (although for such an advanced use case this is probably OK), and not available by default in every backend.

Options 1 and 2 are currently implemented in this PR, although as I write this comment I think I would prefer option 3. I guess this depends on whether we prefer open_***() vs. xr.open_dataset(engine="***"), and unless I missed something there is still no real consensus about that (e.g., #7496).

@github-actions github-actions bot added topic-backends topic-zarr Related to zarr storage library io labels Aug 7, 2023

benbovy commented Aug 22, 2023

@pydata/xarray any thoughts on which option among those above (top comment) would be best?

@rabernat (Contributor)

  1. Do we want to add yet another keyword parameter to xr.open_dataset()?

I vote for this.

@dcherian (Contributor)

Is there any way for packages to say they support V1 or V2 of an entrypoint?

I just realized that if we turned off indexes by default, it would be a big win for open_mfdataset in many cases.

@max-sixty (Collaborator)

  • con: if we require set_indexes in the signature in addition to the drop_variables parameter, this is a breaking change for all existing 3rd-party backends

This doesn't help with the problem at this moment, but could we add **kwargs to the standard, so we can add parameters in the future without breaking existing backends?
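A sketch of what that could look like in a third-party backend (the class name is illustrative, and a real backend would subclass xarray.backends.BackendEntrypoint):

```python
class MyBackendEntrypoint:  # a real backend would subclass xarray.backends.BackendEntrypoint
    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # **kwargs silently absorbs options added to the spec later
        # (such as set_indexes), so newer xarray releases that pass
        # extra keywords do not break this backend.
        ...
```

With this shape, a call like `MyBackendEntrypoint().open_dataset("dataset.zarr", set_indexes=False)` succeeds even though the backend predates the new option.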


benbovy commented Nov 14, 2023

Agreed, adding **kwargs to the standard would help! However, to be honest I already find the way kwargs are handled in xarray.backends.api a bit confusing. #8447 may eventually help provide a clearer separation between common and backend-specific options.


shoyer commented Nov 22, 2023

This looks great to me!

I agree with adding this into xarray.open_dataset() and BackendEntrypoint.open_dataset().

For what it's worth, I think it's OK to require backend developers to update their code more frequently -- we don't need the same level of stability that we need for user-level APIs.

@TomNicholas (Member)

I just realized that if we turned off indexes by default, it would be a big win for open_mfdataset in many cases.

I would love to see this merged so that I can try this out!


keewis commented Dec 5, 2023

be aware that merging now will break compatibility with any 3rd party backend, which I believe is not something we should do, even if we think that the transition window can be shorter than usual.

In my eyes the easiest way forward would be:

  1. extend Backend.open_dataset with **kwargs as suggested by @max-sixty
  2. wait a couple of versions (2?) to give backends the time to update and release
  3. add the new option

We don't have an easy way to contact all backend developers, unfortunately.

Edit: let's discuss in the meeting today


keewis commented Dec 6, 2023

In the meeting just now we decided to inspect the signature of the backend's open_dataset method and not pass the new option if the backend neither supports it nor accepts **kwargs.

We should still change the spec to require **kwargs (in a new PR), and maybe emit a deprecation warning for all backends that don't have it already.
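The signature check decided on in the meeting could be sketched like this (a simplified illustration, not the actual logic merged into xarray.backends.api):

```python
import inspect


def backend_accepts_kwarg(open_dataset, name):
    # True if the backend's open_dataset declares the keyword explicitly,
    # or takes **kwargs and can therefore absorb it.
    params = inspect.signature(open_dataset).parameters.values()
    return any(
        p.name == name or p.kind is inspect.Parameter.VAR_KEYWORD
        for p in params
    )


# a legacy backend that predates the new option:
def legacy_open_dataset(filename_or_obj, *, drop_variables=None):
    ...

# a backend that already takes **kwargs:
def flexible_open_dataset(filename_or_obj, *, drop_variables=None, **kwargs):
    ...

backend_accepts_kwarg(legacy_open_dataset, "set_indexes")    # → False
backend_accepts_kwarg(flexible_open_dataset, "set_indexes")  # → True
```

When the check returns False, xarray can simply not forward the new option (and possibly emit a deprecation warning) instead of breaking the backend.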

Comment on lines 70 to 74
if set_indexes:
    coords = coord_vars
else:
    # explicit Coordinates object with no index passed
    coords = Coordinates(coord_vars)
Collaborator:

we discussed this in the raster index meeting a bit with @dcherian and @benbovy, and if I remember correctly we agreed that instead of passing through an option to the backend it might be cleaner to always create the backend dataset without indexes, and create the indexes in open_dataset:

Suggested change:

if set_indexes:
    coords = coord_vars
else:
    # explicit Coordinates object with no index passed
    coords = Coordinates(coord_vars)

becomes:

# explicit Coordinates object with no index passed
coords = Coordinates(coord_vars)

Most known backends use the BackendStoreEntrypoint to assemble the backend dataset, so these backends will work automatically, and for all other backends we'll have to figure out a way to warn.

If we also change the data model to cleanly separate I/O from decoding variables (what zarr calls "ArrayToArray" codecs, and what our current decoders do) and datasets (this would be where you'd create indexes), then we could warn about backend datasets with indexes. I think this would be a breaking change, although we could add a deprecation cycle to make the transition smoother.

Collaborator:

I've taken the liberty of pushing directly to this PR (the failing flaky test is real but unrelated, and pre-commit also fails on main), so the state in this PR should be closer to what I'm proposing above.

cc @dcherian, @benbovy

Member:

cleaner to always create the backend dataset without indexes

I fully support this. I think it's the natural way to generalize. Creating an in-memory index from a lazily-loaded array is totally unrelated to creating the lazy arrays in the first place.

@dcherian (Contributor) commented Jul 1, 2025:

I've changed my mind. In reality, we are seeing that people write backends not just for "reading arrays"; they are creating datasets with extra metadata (e.g. rioxarray might choose to add RasterIndex, xwrf might do something else). So perhaps we add a create_default_indexes helper that sets the default PandasIndex for any dimension coordinate that does not have an index on the dataset returned from the backend?

EDIT: I see that's what you've done. nice!
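A rough sketch of that idea, assuming the public xr.Coordinates and Dataset.set_xindex APIs (with set_xindex defaulting to a PandasIndex for a single 1-D coordinate); this is only an illustration, and the helper that was eventually merged may differ:

```python
import numpy as np
import xarray as xr


def create_default_indexes(ds: xr.Dataset) -> xr.Dataset:
    # Build a default (pandas) index for every dimension coordinate that
    # has none yet; existing (possibly custom) indexes are left untouched.
    for name in ds.coords:
        if name in ds.dims and name not in ds.xindexes:
            ds = ds.set_xindex(name)
    return ds


# a dataset whose coordinates were deliberately created without indexes,
# as a backend might return it:
coords = xr.Coordinates({"x": xr.Variable("x", np.arange(4))}, indexes={})
ds = xr.Dataset({"v": ("x", np.zeros(4))}, coords=coords)
assert "x" not in ds.xindexes

ds = create_default_indexes(ds)
assert "x" in ds.xindexes
```

Running such a helper in open_dataset, after the backend returns, keeps index creation out of the backends entirely while preserving any custom indexes a backend chose to attach.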

@keewis (Collaborator) commented Jul 1, 2025:

if you look at the current state of the PR, that's exactly my intention

Edit: although in an ideal world we'd add support for "dataset decoders", and then people would use that to create the indexes

Member:

So perhaps we add a create_default_indexes that sets the default PandasIndex for any dimension coordinate that does not have an index, when returned from the backend?

The general version of this would be to make the index-adding step independently pluggable, i.e. pass in a callable, which defaults to your create_default_indexes

@TomNicholas (Member) commented Jul 1, 2025:

Edit: although in an ideal world we'd add support for "dataset decoders", and then people would use that to create the indexes

This is basically what I'm suggesting - I think you should make the API something like

from typing import Protocol, runtime_checkable

import xarray as xr


@runtime_checkable
class IndexSetter(Protocol):
    def __call__(
        self,
        backend_ds: xr.Dataset,
    ) -> xr.Dataset: ...


def open_dataset(
    ...,
    index_setter: IndexSetter = create_default_indexes,
) -> xr.Dataset:
    if not isinstance(index_setter, IndexSetter):
        raise TypeError

    # all the normal backend logic
    backend_ds = ...

    return index_setter(backend_ds)

You could either add a kwarg to open_dataset like that, or alternatively let the backends be the place where people choose which IndexSetter callable to use (e.g. to always add domain-specific indexes at read time).

Collaborator:

that would be an option, but instead of a separate parameter I think it might be better to figure out how a dataset decoder API could look, then reuse that for the default indexes

@TomNicholas (Member) commented Jul 1, 2025:

That makes sense, but then would we deprecate the kwarg you're about to add?

Member Author:

Pluggable and composable Dataset (de)coders would be nice indeed for setting indexes among other things. I like the proposal made in #8548.

Even with a decoder API it might still make sense to have a special case (kwarg) for default indexes, though, since those are also created in Dataset.__init__(), DataArray.__init__(), Coordinates.__init__(), assign_coords(), etc.

data_vars[name] = var

# explicit Coordinates object with no index passed
coords = Coordinates(coord_vars, indexes={})
Contributor:

I think this is fine given our rules for propagating coordinate variables when extracting DataArrays, but it is potentially confusing with create_default_indexes=False, decode_coords=False.

@dcherian (Contributor) left a review:

Very nice!

@dcherian dcherian requested a review from keewis July 8, 2025 16:28
@dcherian dcherian added the plan to merge Final call for comments label Jul 8, 2025
@dcherian dcherian enabled auto-merge (squash) July 8, 2025 18:55
@dcherian dcherian merged commit 3679a5d into pydata:main Jul 8, 2025
35 of 37 checks passed
Labels: io · plan to merge (Final call for comments) · topic-backends · topic-documentation · topic-zarr (Related to zarr storage library)

Successfully merging this pull request may close these issues.

Opening dataset without loading any indexes?
7 participants