
Commit 756c941

Refactor concat to use merge for non-concatenated variables (#3239)
* Add compat = 'override' and data_vars/coords='sensible'
* concat tests.
* Update docstring.
* Begin merge, combine.
* Merge non concatenated variables.
* Fix tests.
* Fix tests 2
* Fix test 3
* Cleanup: reduce number of times we loop over datasets.
* unique_variable does minimum number of loads: fixes dask test
* docstrings for compat='override'
* concat compat docstring.
* remove the sensible option.
* reduce silly changes.
* fix groupby order test.
* cleanup: var names + remove one loop through datasets.
* Add whats-new entry.
* Add note in io.rst
* fix warning.
* Update netcdf multi-file dataset section in io.rst.
* Update mfdataset in dask.rst.
* simplify parse_datasets.
* Avoid using merge_variables. unique_variable instead.
* small stuff.
* Update docs.
* minor fix.
* minor fix.
* lint.
* Better error message.
* rename to shorter variable names.
* Cleanup: fillna preserves attrs now.
* Look for concat dim in data_vars also.
* Update xarray/core/merge.py

  Co-Authored-By: Stephan Hoyer <[email protected]>
* avoid unnecessary computes.
* minor cleanups.
1 parent b65ce86 commit 756c941
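To make the refactor concrete, here is a minimal sketch (illustrative data, not part of this commit) of the behaviour it enables: with ``data_vars='minimal'``, ``coords='minimal'`` and the new ``compat='override'``, only variables that already contain the concat dimension are concatenated, while the remaining variables are merged and simply taken from the first dataset without equality checks::

    import xarray as xr

    # two datasets sharing a static variable; only "temp" has the "time" dimension
    ds1 = xr.Dataset(
        {"temp": ("time", [1.0, 2.0]), "elevation": ("x", [10.0, 20.0, 30.0])},
        coords={"time": [0, 1], "x": [0, 1, 2]},
    )
    ds2 = xr.Dataset(
        {"temp": ("time", [3.0, 4.0]), "elevation": ("x", [10.0, 20.0, 30.0])},
        coords={"time": [2, 3], "x": [0, 1, 2]},
    )

    # "temp" is concatenated along "time"; "elevation" is not compared for
    # equality but picked from ds1 (compat='override')
    combined = xr.concat(
        [ds1, ds2], dim="time",
        data_vars="minimal", coords="minimal", compat="override",
    )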


12 files changed (+402 additions, -222 deletions)


doc/dask.rst

Lines changed: 4 additions & 3 deletions
@@ -75,13 +75,14 @@ entirely equivalent to opening a dataset using ``open_dataset`` and then
 chunking the data using the ``chunk`` method, e.g.,
 ``xr.open_dataset('example-data.nc').chunk({'time': 10})``.
 
-To open multiple files simultaneously, use :py:func:`~xarray.open_mfdataset`::
+To open multiple files simultaneously in parallel using Dask delayed,
+use :py:func:`~xarray.open_mfdataset`::
 
-    xr.open_mfdataset('my/files/*.nc')
+    xr.open_mfdataset('my/files/*.nc', parallel=True)
 
 This function will automatically concatenate and merge dataset into one in
 the simple cases that it understands (see :py:func:`~xarray.auto_combine`
-for the full disclaimer). By default, ``open_mfdataset`` will chunk each
+for the full disclaimer). By default, :py:func:`~xarray.open_mfdataset` will chunk each
 netCDF file into a single Dask array; again, supply the ``chunks`` argument to
 control the size of the resulting Dask arrays. In more complex cases, you can
 open each file individually using ``open_dataset`` and merge the result, as
doc/io.rst

Lines changed: 147 additions & 97 deletions
@@ -99,7 +99,9 @@ netCDF
 The recommended way to store xarray data structures is `netCDF`__, which
 is a binary file format for self-described datasets that originated
 in the geosciences. xarray is based on the netCDF data model, so netCDF files
-on disk directly correspond to :py:class:`~xarray.Dataset` objects.
+on disk directly correspond to :py:class:`~xarray.Dataset` objects (more accurately,
+a group in a netCDF file directly corresponds to a :py:class:`~xarray.Dataset` object.
+See :ref:`io.netcdf_groups` for more.)
 
 NetCDF is supported on almost all platforms, and parsers exist
 for the vast majority of scientific programming languages. Recent versions of
@@ -121,7 +123,7 @@ read/write netCDF V4 files and use the compression options described below).
 __ https://github.com/Unidata/netcdf4-python
 
 We can save a Dataset to disk using the
-:py:attr:`Dataset.to_netcdf <xarray.Dataset.to_netcdf>` method:
+:py:meth:`~Dataset.to_netcdf` method:
 
 .. ipython:: python
 
@@ -147,19 +149,6 @@ convert the ``DataArray`` to a ``Dataset`` before saving, and then convert back
 when loading, ensuring that the ``DataArray`` that is loaded is always exactly
 the same as the one that was saved.
 
-NetCDF groups are not supported as part of the
-:py:class:`~xarray.Dataset` data model. Instead, groups can be loaded
-individually as Dataset objects.
-To do so, pass a ``group`` keyword argument to the
-``open_dataset`` function. The group can be specified as a path-like
-string, e.g., to access subgroup 'bar' within group 'foo' pass
-'/foo/bar' as the ``group`` argument.
-In a similar way, the ``group`` keyword argument can be given to the
-:py:meth:`~xarray.Dataset.to_netcdf` method to write to a group
-in a netCDF file.
-When writing multiple groups in one file, pass ``mode='a'`` to ``to_netcdf``
-to ensure that each call does not delete the file.
-
 Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
 Dataset and DataArray objects, and no array values are loaded into memory until
 you try to perform some sort of actual computation. For an example of how these
@@ -195,6 +184,24 @@ It is possible to append or overwrite netCDF variables using the ``mode='a'``
 argument. When using this option, all variables in the dataset will be written
 to the original netCDF file, regardless if they exist in the original dataset.
 
+
+.. _io.netcdf_groups:
+
+Groups
+~~~~~~
+
+NetCDF groups are not supported as part of the :py:class:`~xarray.Dataset` data model.
+Instead, groups can be loaded individually as Dataset objects.
+To do so, pass a ``group`` keyword argument to the
+:py:func:`~xarray.open_dataset` function. The group can be specified as a path-like
+string, e.g., to access subgroup ``'bar'`` within group ``'foo'`` pass
+``'/foo/bar'`` as the ``group`` argument.
+In a similar way, the ``group`` keyword argument can be given to the
+:py:meth:`~xarray.Dataset.to_netcdf` method to write to a group
+in a netCDF file.
+When writing multiple groups in one file, pass ``mode='a'`` to
+:py:meth:`~xarray.Dataset.to_netcdf` to ensure that each call does not delete the file.
+
 .. _io.encoding:
 
 Reading encoded data
@@ -203,7 +210,7 @@ Reading encoded data
 NetCDF files follow some conventions for encoding datetime arrays (as numbers
 with a "units" attribute) and for packing and unpacking data (as
 described by the "scale_factor" and "add_offset" attributes). If the argument
-``decode_cf=True`` (default) is given to ``open_dataset``, xarray will attempt
+``decode_cf=True`` (default) is given to :py:func:`~xarray.open_dataset`, xarray will attempt
 to automatically decode the values in the netCDF objects according to
 `CF conventions`_. Sometimes this will fail, for example, if a variable
 has an invalid "units" or "calendar" attribute. For these cases, you can
@@ -247,6 +254,130 @@ will remove encoding information.
     import os
     os.remove('saved_on_disk.nc')
 
+
+.. _combining multiple files:
+
+Reading multi-file datasets
+...........................
+
+NetCDF files are often encountered in collections, e.g., with different files
+corresponding to different model runs or one file per timestamp.
+xarray can straightforwardly combine such files into a single Dataset by making use of
+:py:func:`~xarray.concat`, :py:func:`~xarray.merge`, :py:func:`~xarray.combine_nested` and
+:py:func:`~xarray.combine_by_coords`. For details on the difference between these
+functions see :ref:`combining data`.
+
+Xarray includes support for manipulating datasets that don't fit into memory
+with dask_. If you have dask installed, you can open multiple files
+simultaneously in parallel using :py:func:`~xarray.open_mfdataset`::
+
+    xr.open_mfdataset('my/files/*.nc', parallel=True)
+
+This function automatically concatenates and merges multiple files into a
+single xarray dataset.
+It is the recommended way to open multiple files with xarray.
+For more details on parallel reading, see :ref:`combining.multi`, :ref:`dask.io` and a
+`blog post`_ by Stephan Hoyer.
+:py:func:`~xarray.open_mfdataset` takes many kwargs that allow you to
+control its behaviour (e.g. ``parallel``, ``combine``, ``compat``, ``join``, ``concat_dim``).
+See its docstring for more details.
+
+
+.. note::
+
+    A common use-case involves a dataset distributed across a large number of files with
+    each file containing a large number of variables. Commonly a few of these variables
+    need to be concatenated along a dimension (say ``"time"``), while the rest are equal
+    across the datasets (ignoring floating point differences). The following command
+    with suitable modifications (such as ``parallel=True``) works well with such datasets::
+
+        xr.open_mfdataset('my/files/*.nc', concat_dim="time",
+                          data_vars='minimal', coords='minimal', compat='override')
+
+    This command concatenates variables along the ``"time"`` dimension, but only those that
+    already contain the ``"time"`` dimension (``data_vars='minimal', coords='minimal'``).
+    Variables that lack the ``"time"`` dimension are taken from the first dataset
+    (``compat='override'``).
+
+
+.. _dask: http://dask.pydata.org
+.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
+
+Sometimes multi-file datasets are not conveniently organized for easy use of :py:func:`~xarray.open_mfdataset`.
+One can use the ``preprocess`` argument to provide a function that takes a dataset
+and returns a modified Dataset.
+:py:func:`~xarray.open_mfdataset` will call ``preprocess`` on every dataset
+(corresponding to each file) prior to combining them.
+
+
+If :py:func:`~xarray.open_mfdataset` does not meet your needs, other approaches are possible.
+The general pattern for parallel reading of multiple files
+using dask, modifying those datasets and then combining into a single ``Dataset`` is::
+
+    def modify(ds):
+        # modify ds here
+        return ds
+
+
+    # this is basically what open_mfdataset does
+    open_kwargs = dict(decode_cf=True, decode_times=False)
+    open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
+    tasks = [dask.delayed(modify)(task) for task in open_tasks]
+    datasets = dask.compute(tasks)  # get a list of xarray.Datasets
+    combined = xr.combine_nested(datasets)  # or some combination of concat, merge
+
+
+As an example, here's how we could approximate ``MFDataset`` from the netCDF4
+library::
+
+    from glob import glob
+    import xarray as xr
+
+    def read_netcdfs(files, dim):
+        # glob expands paths with * to a list of files, like the unix shell
+        paths = sorted(glob(files))
+        datasets = [xr.open_dataset(p) for p in paths]
+        combined = xr.concat(datasets, dim)
+        return combined
+
+    combined = read_netcdfs('/all/my/files/*.nc', dim='time')
+
+This function will work in many cases, but it's not very robust. First, it
+never closes files, which means it will fail once you need to load more than
+a few thousand files. Second, it assumes that you want all the data from each
+file and that it can all fit into memory. In many situations, you only need
+a small subset or an aggregated summary of the data from each file.
+
+Here's a slightly more sophisticated example of how to remedy these
+deficiencies::
+
+    def read_netcdfs(files, dim, transform_func=None):
+        def process_one_path(path):
+            # use a context manager, to ensure the file gets closed after use
+            with xr.open_dataset(path) as ds:
+                # transform_func should do some sort of selection or
+                # aggregation
+                if transform_func is not None:
+                    ds = transform_func(ds)
+                # load all data from the transformed dataset, to ensure we can
+                # use it after closing each original file
+                ds.load()
+                return ds
+
+        paths = sorted(glob(files))
+        datasets = [process_one_path(p) for p in paths]
+        combined = xr.concat(datasets, dim)
+        return combined
+
+    # here we suppose we only care about the combined mean of each file;
+    # you might also use indexing operations like .sel to subset datasets
+    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
+                            transform_func=lambda ds: ds.mean())
+
+This pattern works well and is very robust. We've used similar code to process
+tens of thousands of files constituting 100s of GB of data.
+
+
 .. _io.netcdf.writing_encoded:
 
 Writing encoded data
@@ -817,84 +948,3 @@ For CSV files, one might also consider `xarray_extras`_.
 .. _xarray_extras: https://xarray-extras.readthedocs.io/en/latest/api/csv.html
 
 .. _IO tools: http://pandas.pydata.org/pandas-docs/stable/io.html
-
-
-.. _combining multiple files:
-
-
-Combining multiple files
-------------------------
-
-NetCDF files are often encountered in collections, e.g., with different files
-corresponding to different model runs. xarray can straightforwardly combine such
-files into a single Dataset by making use of :py:func:`~xarray.concat`,
-:py:func:`~xarray.merge`, :py:func:`~xarray.combine_nested` and
-:py:func:`~xarray.combine_by_coords`. For details on the difference between these
-functions see :ref:`combining data`.
-
-.. note::
-
-    Xarray includes support for manipulating datasets that don't fit into memory
-    with dask_. If you have dask installed, you can open multiple files
-    simultaneously using :py:func:`~xarray.open_mfdataset`::
-
-        xr.open_mfdataset('my/files/*.nc')
-
-    This function automatically concatenates and merges multiple files into a
-    single xarray dataset.
-    It is the recommended way to open multiple files with xarray.
-    For more details, see :ref:`combining.multi`, :ref:`dask.io` and a
-    `blog post`_ by Stephan Hoyer.
-
-.. _dask: http://dask.pydata.org
-.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
-
-For example, here's how we could approximate ``MFDataset`` from the netCDF4
-library::
-
-    from glob import glob
-    import xarray as xr
-
-    def read_netcdfs(files, dim):
-        # glob expands paths with * to a list of files, like the unix shell
-        paths = sorted(glob(files))
-        datasets = [xr.open_dataset(p) for p in paths]
-        combined = xr.concat(dataset, dim)
-        return combined
-
-    combined = read_netcdfs('/all/my/files/*.nc', dim='time')
-
-This function will work in many cases, but it's not very robust. First, it
-never closes files, which means it will fail one you need to load more than
-a few thousands file. Second, it assumes that you want all the data from each
-file and that it can all fit into memory. In many situations, you only need
-a small subset or an aggregated summary of the data from each file.
-
-Here's a slightly more sophisticated example of how to remedy these
-deficiencies::
-
-    def read_netcdfs(files, dim, transform_func=None):
-        def process_one_path(path):
-            # use a context manager, to ensure the file gets closed after use
-            with xr.open_dataset(path) as ds:
-                # transform_func should do some sort of selection or
-                # aggregation
-                if transform_func is not None:
-                    ds = transform_func(ds)
-                # load all data from the transformed dataset, to ensure we can
-                # use it after closing each original file
-                ds.load()
-                return ds
-
-        paths = sorted(glob(files))
-        datasets = [process_one_path(p) for p in paths]
-        combined = xr.concat(datasets, dim)
-        return combined
-
-    # here we suppose we only care about the combined mean of each file;
-    # you might also use indexing operations like .sel to subset datasets
-    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
-                            transform_func=lambda ds: ds.mean())
-
-This pattern works well and is very robust. We've used similar code to process
-tens of thousands of files constituting 100s of GB of data.

doc/whats-new.rst

Lines changed: 24 additions & 5 deletions
@@ -93,7 +93,7 @@ New functions/methods
   By `Deepak Cherian <https://github.com/dcherian>`_ and `David Mertz
   <http://github.com/DavidMertz>`_.
 
-- Dataset plotting API for visualizing dependencies between two `DataArray`s!
+- Dataset plotting API for visualizing dependencies between two DataArrays!
   Currently only :py:meth:`Dataset.plot.scatter` is implemented.
   By `Yohai Bar Sinai <https://github.com/yohai>`_ and `Deepak Cherian <https://github.com/dcherian>`_
 
@@ -103,11 +103,30 @@ New functions/methods
 Enhancements
 ~~~~~~~~~~~~
 
-- Added ``join='override'``. This only checks that index sizes are equal among objects and skips
-  checking indexes for equality. By `Deepak Cherian <https://github.com/dcherian>`_.
+- Multiple enhancements to :py:func:`~xarray.concat` and :py:func:`~xarray.open_mfdataset`.
 
-- :py:func:`~xarray.concat` and :py:func:`~xarray.open_mfdataset` now support the ``join`` kwarg.
-  It is passed down to :py:func:`~xarray.align`. By `Deepak Cherian <https://github.com/dcherian>`_.
+  - Added ``compat='override'``. When merging, this option picks the variable from the first dataset
+    and skips all comparisons.
+
+  - Added ``join='override'``. When aligning, this only checks that index sizes are equal among objects
+    and skips checking indexes for equality.
+
+  - :py:func:`~xarray.concat` and :py:func:`~xarray.open_mfdataset` now support the ``join`` kwarg.
+    It is passed down to :py:func:`~xarray.align`.
+
+  - :py:func:`~xarray.concat` now calls :py:func:`~xarray.merge` on variables that are not concatenated
+    (i.e. variables without ``concat_dim`` when ``data_vars`` or ``coords`` are ``"minimal"``).
+    :py:func:`~xarray.concat` passes its new ``compat`` kwarg down to :py:func:`~xarray.merge`.
+    (:issue:`2064`)
+
+  Users can avoid a common bottleneck when using :py:func:`~xarray.open_mfdataset` on a large number of
+  files with variables that are known to be aligned and some of which need not be concatenated.
+  Slow equality comparisons can now be avoided, e.g.::
+
+    data = xr.open_mfdataset(files, concat_dim='time', data_vars='minimal',
+                             coords='minimal', compat='override', join='override')
+
+  By `Deepak Cherian <https://github.com/dcherian>`_.
 
 - In :py:meth:`~xarray.Dataset.to_zarr`, passing ``mode`` is not mandatory if
   ``append_dim`` is set, as it will automatically be set to ``'a'`` internally.

xarray/backends/api.py

Lines changed: 2 additions & 1 deletion
@@ -761,7 +761,7 @@ def open_mfdataset(
         `xarray.auto_combine` is used, but in the future this behavior will
         switch to use `xarray.combine_by_coords` by default.
     compat : {'identical', 'equals', 'broadcast_equals',
-              'no_conflicts'}, optional
+              'no_conflicts', 'override'}, optional
         String indicating how to compare variables of the same name for
         potential conflicts when merging:
         * 'broadcast_equals': all values must be equal when variables are
@@ -772,6 +772,7 @@ def open_mfdataset(
         * 'no_conflicts': only values which are not null in both datasets
          must be equal. The returned dataset then contains the combination
          of all non-null values.
+        * 'override': skip comparing and pick variable from first dataset
     preprocess : callable, optional
        If provided, call this function on each dataset prior to concatenation.
        You can find the file-name from which each dataset was loaded in
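As a small sketch of the new ``'override'`` option (toy data, not from this commit): stricter ``compat`` settings would raise a ``MergeError`` below because ``z`` differs between the objects, whereas ``'override'`` skips the comparison and keeps the variable from the first object::

    import xarray as xr

    a = xr.Dataset({'z': ('x', [1, 2, 3])}, coords={'x': [0, 1, 2]})
    b = xr.Dataset({'z': ('x', [9, 9, 9])}, coords={'x': [0, 1, 2]})

    # compat='no_conflicts' (the default) would fail here; 'override' picks a's "z"
    merged = xr.merge([a, b], compat='override')
    print(merged['z'].values)  # [1 2 3]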
