Persisting datasets containing arrays with a single chunk fails #4882

Closed
@sjperkins

Description

What happened:

Persisting a dataset containing dask arrays with a single chunk fails.

What you expected to happen:

dask.persist should succeed and return the persisted dataset.

Minimal Complete Verifiable Example:

import dask
import dask.array as da
import xarray as xr

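# Both variables are backed by dask arrays that consist of exactly one chunk.
data_vars = {
    "DATA": (("x", "y"), da.ones((1000, 1000), chunks=(1000, 1000))),
    "TIME": (("x",), da.ones((1000,), chunks=1000)),
}

# Raises IndexError with dask 2021.02.0 and xarray 0.16.2 (traceback below).
dask.persist(xr.Dataset(data_vars))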

$ python ~/tmp/python/dask/test_xarray_persist_fail.py
Traceback (most recent call last):
  File "/home/sperkins/tmp/python/dask/test_xarray_persist_fail.py", line 10, in <module>
    dask.persist(xr.Dataset(data_vars))
  File "/home/sperkins/venv/dask-ms/lib/python3.6/site-packages/dask/base.py", line 770, in persist
    results2 = [r({k: d[k] for k in ks}, *s) for r, ks, s in postpersists]
  File "/home/sperkins/venv/dask-ms/lib/python3.6/site-packages/dask/base.py", line 770, in <listcomp>
    results2 = [r({k: d[k] for k in ks}, *s) for r, ks, s in postpersists]
  File "/home/sperkins/venv/dask-ms/lib/python3.6/site-packages/xarray/core/dataset.py", line 877, in _dask_postpersist
    name = args2[1][0]
IndexError: tuple index out of range

Anything else we need to know?:

This started with the new dask 2021.02.0 release. The immediate cause is probably dask/dask#7142, after which the collection key name is no longer passed as the first argument by default. xarray was likely relying on that convention. One possible fix is to pass the dask collection name explicitly as an extra argument to the *_persist and *_compute methods, so that keys unrelated to the collection can still be filtered out.
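
To make the suspected failure mode concrete, here is a toy sketch in plain Python. It does not use dask or xarray, and every name in it (rebuild_by_convention, rebuild_explicit, the key layout, the shape of the argument tuple) is hypothetical and chosen only for illustration. It assumes graph keys are tuples whose first element is the collection name, and contrasts reading the name from a fixed position in the rebuild arguments with receiving it explicitly.

```python
# Toy model only: no dask or xarray involved, all names are hypothetical.

def rebuild_by_convention(graph, rebuild_args):
    """Keep only one collection's keys, reading its name by position."""
    # Assumes the collection name is always the first rebuild argument.
    # Raises IndexError as soon as the caller stops passing it there.
    name = rebuild_args[0]
    return {k: v for k, v in graph.items() if k[0] == name}


def rebuild_explicit(graph, name):
    """Keep only one collection's keys, with the name passed explicitly."""
    return {k: v for k, v in graph.items() if k[0] == name}


# A merged graph holding one block for each of two collections.
graph = {
    ("DATA-abc", 0, 0): "data block",
    ("TIME-def", 0): "time block",
}

# The name is still passed first: the positional lookup works.
old_result = rebuild_by_convention(graph, ("DATA-abc", "other-arg"))

# The name is no longer passed: the positional lookup fails.
try:
    rebuild_by_convention(graph, ())
    convention_error = None
except IndexError as exc:
    convention_error = str(exc)

# The explicit variant does not depend on the argument layout.
new_result = rebuild_explicit(graph, "DATA-abc")
```

In this sketch the explicit variant keeps working regardless of how the rebuild arguments are laid out, which is the property the proposed fix is after. Whether the real hooks can carry the name this way is for the maintainers to confirm.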

/cc @crusaderky @JSKenyon

Environment:

Ubuntu 18.04
dask: 2021.02.0

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
python-bits: 64
OS: Linux
OS-release: 5.4.0-64-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.16.2
pandas: 0.25.0
numpy: 1.19.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.02.0
distributed: 2.8.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 41.2.0
pip: 20.1
conda: None
pytest: 6.1.0
IPython: 7.11.1
sphinx: 1.8.5
