Implicit use of dask feature #4164

Closed
@inakleinbottle

What happened:
I tried to use the to_netcdf method to store a dataset in a NetCDF file, but the following exception was raised:

Traceback (most recent call last):
  File "dask-error.py", line 27, in <module>
    ds.to_netcdf("test.nc")
  File "/home/sam/dev/xarray-test/.venv/lib/python3.8/site-packages/xarray/core/dataset.py", line 1544, in to_netcdf
    return to_netcdf(
  File "/home/sam/dev/xarray-test/.venv/lib/python3.8/site-packages/xarray/backends/api.py", line 1051, in to_netcdf
    scheduler = _get_scheduler()
  File "/home/sam/dev/xarray-test/.venv/lib/python3.8/site-packages/xarray/backends/locks.py", line 79, in _get_scheduler
    actual_get = dask.base.get_scheduler(get, collection)
AttributeError: module 'dask' has no attribute 'base'

This code sample works as expected when the dask package is not installed in the environment. However, when dask is installed, the _get_scheduler function reaches the following line, which raises the error:

actual_get = dask.base.get_scheduler(get, collection)

After a little digging, the problem is that the base module in the dask package depends on the toolz package, which is not a default dependency of dask. This causes a silent import failure when dask initialises its namespace (https://github.com/dask/dask/blob/416d348f7174a302815758cb87dbf6983226ddc5/dask/__init__.py#L10). As a result, the base module is not available as an attribute of the top-level dask package, and importing it explicitly with

from dask import base

raises a ModuleNotFoundError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sam/dev/xarray-test/.venv/lib/python3.8/site-packages/dask/base.py", line 13, in <module>
    from tlz import merge, groupby, curry, identity
ModuleNotFoundError: No module named 'tlz'
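The mechanism can be reproduced without dask at all. The sketch below builds a throwaway package whose `__init__.py` swallows an ImportError from one of its submodules. This assumes dask's `__init__.py` does something similar, as the silent failure suggests. The names `demo_pkg` and `demo_missing_dep` are made up for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package layout: __init__ silently ignores a failing submodule import.
INIT_SRC = """\
try:
    from .base import helper
except ImportError:
    pass
"""
BASE_SRC = """\
import demo_missing_dep  # made-up dependency that is not installed

def helper():
    return "ok"
"""


def build_and_probe():
    """Create the package in a temp dir, import it, and report what is visible."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / "demo_pkg"
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text(INIT_SRC)
        (pkg_dir / "base.py").write_text(BASE_SRC)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            import demo_pkg  # succeeds: the ImportError is swallowed in __init__

            # The submodule never finished loading, so no attribute was set.
            has_base = hasattr(demo_pkg, "base")
            try:
                from demo_pkg import base  # noqa: F401
                explicit_error = None
            except ImportError as exc:
                explicit_error = type(exc).__name__
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg", None)
            sys.modules.pop("demo_pkg.base", None)
    return has_base, explicit_error


has_base, explicit_error = build_and_probe()
print(has_base, explicit_error)
```

The top-level import succeeds, attribute access on the package would fail with an AttributeError, and only the explicit submodule import reveals the missing dependency.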

I recommend the following fix. At the following line in the _get_scheduler function

import dask # noqa: F401

replace the import with the following

from dask.base import get_scheduler

and remove the dask.base prefix from the later call, so that it reads get_scheduler(get, collection).
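A minimal sketch of the idea behind this fix, not the actual xarray code: it assumes the import sits inside a try/except ImportError block, so that a dask installation without toolz is treated the same as dask being absent. The helper name `load_dask_get_scheduler` is hypothetical.

```python
from typing import Callable, Optional


def load_dask_get_scheduler() -> Optional[Callable]:
    """Return dask.base.get_scheduler, or None when it cannot be imported."""
    try:
        # Importing the name directly surfaces a missing dependency of
        # dask.base as an ImportError here, rather than as an
        # AttributeError on the dask module later on.
        from dask.base import get_scheduler
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None
    return get_scheduler


get_scheduler = load_dask_get_scheduler()
if get_scheduler is None:
    print("dask.base is unavailable; falling back to the non-dask code path")
else:
    print("dask.base.get_scheduler imported successfully")
```

With this arrangement the caller can decide what to do when the function is unavailable, instead of failing part-way through to_netcdf.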

I should, however, point out that get_scheduler does not appear to be part of the Dask public API.

What you expected to happen:
The to_netcdf method should have exited silently and created a new file in the working directory with the contents of the dataset.

Minimal Complete Verifiable Example:
This code is basically the "Toy weather data" example from the documentation, except for the last line.

import numpy as np
import pandas as pd

import xarray as xr

np.random.seed(123)

xr.set_options(display_style="html")

times = pd.date_range("2000-01-01", "2001-12-31", name="time")
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 365.25 - 0.28))

base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)

ds = xr.Dataset(
    {
        "tmin": (("time", "location"), tmin_values),
        "tmax": (("time", "location"), tmax_values),
    },
    {"time": times, "location": ["IA", "IN", "IL"]},
)

ds.to_netcdf("test.nc") ## error here

Anything else we need to know?:
As mentioned above, the error only manifests when the dask package is present in the environment with no extras installed. (Many of the extras require the toolz package, in which case the import error goes away.)
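A quick way to check whether an environment is in this state is to look for the two top-level modules without importing them. This is only a diagnostic sketch; the helper names are hypothetical, and it assumes the `tlz` module (shipped with toolz) is the only missing piece.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def affected_environment() -> bool:
    """True when dask is installed but the tlz module it needs is not."""
    return module_available("dask") and not module_available("tlz")


print("dask installed:", module_available("dask"))
print("tlz installed:", module_available("tlz"))
print("likely affected:", affected_environment())
```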

Environment:
In a clean virtual environment, install the following packages.

pip install xarray netCDF4 dask

The package versions installed are as follows (generated by pip freeze):

cftime==1.1.3
dask==2.18.1
netCDF4==1.5.3
numpy==1.18.5
pandas==1.0.5
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
six==1.15.0
xarray==0.15.1

(Also running python3.8.2 on Debian Linux, not that I suppose this matters.)

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.2+ (heads/3.8:882a7f44da, Apr 26 2020, 19:31:38) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.3

xarray: 0.15.1
pandas: 1.0.5
numpy: 1.18.5
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.18.1
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: None
IPython: None
sphinx: None
