
Commit f8f5e82

Merge branch 'main' into async.load
2 parents a074a25 + ef180b8

File tree

11 files changed: +238 -12 lines changed


doc/user-guide/hierarchical-data.rst

Lines changed: 2 additions & 0 deletions

@@ -453,6 +453,8 @@ The result is a new tree, containing only the nodes matching the condition.
 
 (Yes, under the hood :py:meth:`~xarray.DataTree.filter` is just syntactic sugar for the pattern we showed you in :ref:`iterating over trees` !)
 
+If you want to filter out empty nodes you can use :py:meth:`~xarray.DataTree.prune`.
+
 .. _Tree Contents:
 
 Tree Contents
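A minimal sketch of the method this doc line points to, mirroring the doctest added in xarray/core/datatree.py further down in this commit (node names are illustrative):

    import xarray as xr

    dt = xr.DataTree.from_dict(
        {
            "/a": xr.Dataset({"foo": ("x", [1, 2])}),
            "/c": xr.Dataset(),  # an empty node
        }
    )

    pruned = dt.prune()  # drops the empty "/c"; the root node is always kept
    assert "/c" not in {node.path for node in pruned.subtree}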

doc/whats-new.rst

Lines changed: 14 additions & 0 deletions

@@ -12,6 +12,9 @@ v2025.07.2 (unreleased)
 
 New Features
 ~~~~~~~~~~~~
+- Added :py:meth:`DataTree.prune` method to remove empty nodes while preserving tree structure.
+  Useful for cleaning up a DataTree after time-based filtering operations (:issue:`10590`, :pull:`10598`).
+  By `Alfonso Ladino <https://github.com/aladinor>`_.
 - Added new asynchronous loading methods :py:meth:`Dataset.load_async`, :py:meth:`DataArray.load_async`, :py:meth:`Variable.load_async`.
   Note that users are expected to limit concurrency themselves - xarray does not internally limit concurrency in any way.
@@ -24,6 +27,13 @@ New Features
 
 Breaking changes
 ~~~~~~~~~~~~~~~~
 
+- When writing to NetCDF files with groups, Xarray no longer redefines dimensions
+  that have the same size in parent groups (:issue:`10241`). This conforms with
+  `CF Conventions for group scope <https://cfconventions.org/cf-conventions/cf-conventions.html#_scope>`_
+  but may require adjustments for code that consumes NetCDF files produced by
+  Xarray.
+  By `Stephan Hoyer <https://github.com/shoyer>`_.
 
 Deprecations
 ~~~~~~~~~~~~
@@ -58,6 +68,10 @@ Bug fixes
 
 Documentation
 ~~~~~~~~~~~~~
 
+- Clarify lazy behaviour and eager loading for ``chunks=None`` in :py:func:`~xarray.open_dataset`, :py:func:`~xarray.open_dataarray`, :py:func:`~xarray.open_datatree`, :py:func:`~xarray.open_groups` and :py:func:`~xarray.open_zarr` (:issue:`10612`, :pull:`10627`).
+  By `Kai Mühlbauer <https://github.com/kmuehlbauer>`_.
 
 Internal Changes
 ~~~~~~~~~~~~~~~~
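A sketch of what the breaking change above means on disk, adapted from the regression test added in this commit (test_no_redundant_dimensions); the file name "tree.nc" and the inline inspection are illustrative:

    import netCDF4 as nc4
    import xarray as xr

    dt = xr.DataTree.from_dict(
        {
            "/": xr.Dataset(coords={"x": [1, 2, 3]}),
            "/child": xr.Dataset({"foo": ("x", [4, 5, 6])}),
        }
    )
    dt.to_netcdf("tree.nc", engine="netcdf4")

    root = nc4.Dataset("tree.nc")
    assert list(root.dimensions) == ["x"]
    assert list(root.groups["child"].dimensions) == []  # "x" is inherited, not redefined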

xarray/backends/api.py

Lines changed: 16 additions & 8 deletions

@@ -578,8 +578,10 @@ def open_dataset(
     - ``chunks="auto"`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays or when large arrays are sliced before computation.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using the engine's preferred chunk
       size, generally identical to the format's chunk size. If not available, a
@@ -819,8 +821,10 @@ def open_dataarray(
     - ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays, though results may vary.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using engine preferred chunks if
       exposed by the backend, otherwise with a single chunk for all arrays.
@@ -1044,8 +1048,10 @@ def open_datatree(
     - ``chunks="auto"`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays, though results may vary.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using the engine's preferred chunk
       size, generally identical to the format's chunk size. If not available, a
@@ -1288,8 +1294,10 @@ def open_groups(
     - ``chunks="auto"`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays, though results may vary.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using the engine's preferred chunk
       size, generally identical to the format's chunk size. If not available, a
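A usage sketch of the distinction these docstrings draw between ``chunks=None`` and the dask-based options (file and variable names are illustrative):

    import xarray as xr

    # chunks=None: no dask; variables are wrapped in xarray's lazy indexing
    # classes and become in-memory numpy arrays when their values are accessed.
    ds_lazy = xr.open_dataset("data.nc", chunks=None)
    subset = ds_lazy["t"][:10]  # slicing first means only this slice is loaded on access

    # chunks={}: dask-backed, using the engine's preferred chunk sizes.
    ds_dask = xr.open_dataset("data.nc", chunks={})
    result = ds_dask["t"].mean().compute()  # computation is deferred until .compute()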

xarray/backends/common.py

Lines changed: 20 additions & 2 deletions
@@ -256,6 +256,20 @@ def find_root_and_group(ds):
     return ds, group
 
 
+def collect_ancestor_dimensions(group) -> dict[str, int]:
+    """Returns dimensions defined in parent groups.
+
+    If dimensions are defined in multiple ancestors, use the size of the closest
+    ancestor.
+    """
+    dims = {}
+    while (group := group.parent) is not None:
+        for k, v in group.dimensions.items():
+            if k not in dims:
+                dims[k] = len(v)
+    return dims
+
+
 def datatree_from_dict_with_io_cleanup(groups_dict: Mapping[str, Dataset]) -> DataTree:
     """DataTree.from_dict with file clean-up."""
     try:
@@ -315,6 +329,9 @@ class AbstractDataStore:
     def get_dimensions(self):  # pragma: no cover
         raise NotImplementedError()
 
+    def get_parent_dimensions(self):  # pragma: no cover
+        return {}
+
     def get_attrs(self):  # pragma: no cover
         raise NotImplementedError()
 
@@ -570,21 +587,22 @@ def set_dimensions(self, variables, unlimited_dims=None):
         if unlimited_dims is None:
             unlimited_dims = set()
 
+        parent_dims = self.get_parent_dimensions()
         existing_dims = self.get_dimensions()
 
         dims = {}
        for v in unlimited_dims:  # put unlimited_dims first
             dims[v] = None
         for v in variables.values():
-            dims.update(dict(zip(v.dims, v.shape, strict=True)))
+            dims |= v.sizes
 
         for dim, length in dims.items():
             if dim in existing_dims and length != existing_dims[dim]:
                 raise ValueError(
                     "Unable to update size for existing dimension"
                     f"{dim!r} ({length} != {existing_dims[dim]})"
                 )
-            elif dim not in existing_dims:
+            elif dim not in existing_dims and length != parent_dims.get(dim):
                 is_unlimited = dim in unlimited_dims
                 self.set_dimension(dim, length, is_unlimited)
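The closest-ancestor rule in collect_ancestor_dimensions can be checked with a stand-in group type; the Group class below is a hypothetical mock (not part of xarray), and the import assumes this commit is installed:

    from dataclasses import dataclass

    from xarray.backends.common import collect_ancestor_dimensions


    @dataclass
    class Group:
        # minimal stand-in exposing .parent and .dimensions like a netCDF4 group;
        # dimension sizes are taken via len(), so plain lists work here
        dimensions: dict
        parent: "Group | None" = None


    root = Group(dimensions={"x": [0, 0, 0], "y": [0]})  # x: 3, y: 1
    mid = Group(dimensions={"x": [0, 0]}, parent=root)  # x: 2 (closest ancestor wins)
    leaf = Group(dimensions={}, parent=mid)

    assert collect_ancestor_dimensions(leaf) == {"x": 2, "y": 1}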

xarray/backends/h5netcdf_.py

Lines changed: 4 additions & 0 deletions
@@ -16,6 +16,7 @@
     WritableCFDataStore,
     _normalize_path,
     _open_remote_file,
+    collect_ancestor_dimensions,
     datatree_from_dict_with_io_cleanup,
     find_root_and_group,
 )
@@ -287,6 +288,9 @@ def get_attrs(self):
     def get_dimensions(self):
         return FrozenDict((k, len(v)) for k, v in self.ds.dimensions.items())
 
+    def get_parent_dimensions(self):
+        return FrozenDict(collect_ancestor_dimensions(self.ds))
+
     def get_encoding(self):
         return {
             "unlimited_dims": {

xarray/backends/netCDF4_.py

Lines changed: 4 additions & 0 deletions
@@ -16,6 +16,7 @@
     T_PathFileOrDataStore,
     WritableCFDataStore,
     _normalize_path,
+    collect_ancestor_dimensions,
     datatree_from_dict_with_io_cleanup,
     find_root_and_group,
     robust_getitem,
@@ -518,6 +519,9 @@ def get_attrs(self):
     def get_dimensions(self):
         return FrozenDict((k, len(v)) for k, v in self.ds.dimensions.items())
 
+    def get_parent_dimensions(self):
+        return FrozenDict(collect_ancestor_dimensions(self.ds))
+
     def get_encoding(self):
         return {
             "unlimited_dims": {

xarray/backends/zarr.py

Lines changed: 4 additions & 2 deletions
@@ -1420,8 +1420,10 @@ def open_zarr(
     - ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays, though results may vary.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using engine preferred chunks if
       exposed by the backend, otherwise with a single chunk for all arrays.

xarray/core/datatree.py

Lines changed: 67 additions & 0 deletions
@@ -1450,6 +1450,73 @@ def filter_like(self, other: DataTree) -> DataTree:
         other_keys = {key for key, _ in other.subtree_with_keys}
         return self.filter(lambda node: node.relative_to(self) in other_keys)
 
+    def prune(self, drop_size_zero_vars: bool = False) -> DataTree:
+        """
+        Remove empty nodes from the tree.
+
+        Returns a new tree containing only nodes that contain data variables with actual data.
+        Intermediate nodes are kept if they are required to support non-empty children.
+
+        Parameters
+        ----------
+        drop_size_zero_vars : bool, default False
+            If True, also considers variables with zero size as empty.
+            If False, keeps nodes with data variables even if they have zero size.
+
+        Returns
+        -------
+        DataTree
+            A new tree with empty nodes removed.
+
+        See Also
+        --------
+        filter
+
+        Examples
+        --------
+        >>> dt = xr.DataTree.from_dict(
+        ...     {
+        ...         "/a": xr.Dataset({"foo": ("x", [1, 2])}),
+        ...         "/b": xr.Dataset({"bar": ("x", [])}),
+        ...         "/c": xr.Dataset(),
+        ...     }
+        ... )
+        >>> dt.prune()  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
+        <xarray.DataTree>
+        Group: /
+        ├── Group: /a
+        │       Dimensions:  (x: 2)
+        │       Dimensions without coordinates: x
+        │       Data variables:
+        │           foo      (x) int64 16B 1 2
+        └── Group: /b
+                Dimensions:  (x: 0)
+                Dimensions without coordinates: x
+                Data variables:
+                    bar      (x) float64 0B ...
+
+        The ``drop_size_zero_vars`` parameter controls whether variables
+        with zero size are considered empty:
+
+        >>> dt.prune(drop_size_zero_vars=True)
+        <xarray.DataTree>
+        Group: /
+        └── Group: /a
+                Dimensions:  (x: 2)
+                Dimensions without coordinates: x
+                Data variables:
+                    foo      (x) int64 16B 1 2
+        """
+        non_empty_cond: Callable[[DataTree], bool]
+        if drop_size_zero_vars:
+            non_empty_cond = lambda node: len(node.data_vars) > 0 and any(
+                var.size > 0 for var in node.data_vars.values()
+            )
+        else:
+            non_empty_cond = lambda node: len(node.data_vars) > 0
+
+        return self.filter(non_empty_cond)
+
     def match(self, pattern: str) -> DataTree:
         """
         Return nodes with paths matching pattern.
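One motivating workflow from the changelog entry above, sketched here (paths and variable names are illustrative): time-based filtering can leave nodes whose variables have zero length, and prune(drop_size_zero_vars=True) removes them while keeping the ancestors of non-empty nodes.

    import xarray as xr

    dt = xr.DataTree.from_dict(
        {
            "/obs/2020": xr.Dataset({"t": ("time", [1.0, 2.0])}),
            "/obs/2021": xr.Dataset({"t": ("time", [])}),  # emptied by an earlier time filter
        }
    )

    cleaned = dt.prune(drop_size_zero_vars=True)
    assert {node.path for node in cleaned.subtree} == {"/", "/obs", "/obs/2020"}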

xarray/tests/test_backends.py

Lines changed: 11 additions & 0 deletions
@@ -1704,6 +1704,17 @@ def test_write_groups(self) -> None:
         with self.open(tmp_file, group="data/2") as actual2:
             assert_identical(data2, actual2)
 
+    def test_child_group_with_inconsistent_dimensions(self) -> None:
+        base = Dataset(coords={"x": [1, 2]})
+        child = Dataset(coords={"x": [1, 2, 3]})
+        with create_tmp_file() as tmp_file:
+            self.save(base, tmp_file)
+            self.save(child, tmp_file, group="child", mode="a")
+            with self.open(tmp_file) as actual_base:
+                assert_identical(base, actual_base)
+            with self.open(tmp_file, group="child") as actual_child:
+                assert_identical(child, actual_child)
+
     @pytest.mark.parametrize(
         "input_strings, is_bytes",
         [

xarray/tests/test_backends_datatree.py

Lines changed: 17 additions & 0 deletions
@@ -265,6 +265,23 @@ def test_write_subgroup(self, tmpdir):
         assert_equal(original_dt, roundtrip_dt)
         assert_identical(expected_dt, roundtrip_dt)
 
+    @requires_netCDF4
+    def test_no_redundant_dimensions(self, tmpdir):
+        # regression test for https://github.com/pydata/xarray/issues/10241
+        original_dt = DataTree.from_dict(
+            {
+                "/": xr.Dataset(coords={"x": [1, 2, 3]}),
+                "/child": xr.Dataset({"foo": ("x", [4, 5, 6])}),
+            }
+        )
+        filepath = tmpdir / "test.zarr"
+        original_dt.to_netcdf(filepath, engine=self.engine)
+
+        root = nc4.Dataset(str(filepath))
+        child = root.groups["child"]
+        assert list(root.dimensions) == ["x"]
+        assert list(child.dimensions) == []
+
 
 @requires_netCDF4
 class TestNetCDF4DatatreeIO(DatatreeIOBase):
