
Commit 261df2e

Document Xarray zarr encoding conventions (#4047)
* document zarr encoding
* link to zarr spec
* fix typo [ci skip]
1 parent f38b0c1 commit 261df2e

2 files changed: +54 -2 lines changed


doc/internals.rst

Lines changed: 50 additions & 0 deletions
@@ -138,3 +138,53 @@ To help users keep things straight, please `let us know
<https://github.com/pydata/xarray/issues>`_ if you plan to write a new accessor
for an open source library. In the future, we will maintain a list of accessors
and the libraries that implement them on this page.
+
+.. _zarr_encoding:
+
+Zarr Encoding Specification
+---------------------------
+
+In implementing support for the `Zarr <https://zarr.readthedocs.io/>`_ storage
+format, Xarray developers made some *ad hoc* choices about how to store
+NetCDF data in Zarr.
+Future versions of the Zarr spec will likely include a more formal convention
+for the storage of the NetCDF data model in Zarr; see
+`Zarr spec repo <https://github.com/zarr-developers/zarr-specs>`_ for ongoing
+discussion.
+
+First, Xarray can only read and write Zarr groups. There is currently no support
+for reading or writing individual Zarr arrays. Zarr groups are mapped to
+Xarray ``Dataset`` objects.
+
+Second, from Xarray's point of view, the key difference between
+NetCDF and Zarr is that all NetCDF arrays have *dimension names* while Zarr
+arrays do not. Therefore, in order to store NetCDF data in Zarr, Xarray must
+somehow encode and decode the name of each array's dimensions.
+
+To accomplish this, Xarray developers decided to define a special Zarr array
+attribute: ``_ARRAY_DIMENSIONS``. The value of this attribute is a list of
+dimension names (strings), for example ``["time", "lon", "lat"]``. When writing
+data to Zarr, Xarray sets this attribute on all variables based on the variable
+dimensions. When reading a Zarr group, Xarray looks for this attribute on all
+arrays, raising an error if it can't be found. The attribute is used to define
+the variable dimension names and then removed from the attributes dictionary
+returned to the user.
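The write/read round trip described in this paragraph can be sketched in plain Python. This is an illustration only, not Xarray's actual implementation: the helper names `encode_dims` and `decode_dims` are hypothetical, and an array's attributes are modeled as an ordinary dict.

```python
# Illustrative sketch only: encode_dims and decode_dims are hypothetical
# helpers, not part of Xarray's API. Attributes are modeled as a plain dict.
DIMENSION_KEY = "_ARRAY_DIMENSIONS"


def encode_dims(dims, attrs):
    """Return a copy of attrs with the dimension names stored under the special key."""
    encoded = dict(attrs)
    encoded[DIMENSION_KEY] = list(dims)
    return encoded


def decode_dims(attrs):
    """Split stored attrs into (dims, user_attrs); raise if the key is missing."""
    if DIMENSION_KEY not in attrs:
        raise KeyError(f"array has no {DIMENSION_KEY!r} attribute")
    user_attrs = dict(attrs)
    # The special key is removed so the user never sees it in the attributes.
    dims = tuple(user_attrs.pop(DIMENSION_KEY))
    return dims, user_attrs


stored = encode_dims(("time", "lon", "lat"), {"units": "C"})
dims, user_attrs = decode_dims(stored)
print(dims)
print(user_attrs)
```

Returning copies instead of mutating the input keeps the stored attributes and the user-facing attributes separate, which mirrors the "removed from the attributes dictionary returned to the user" behavior described in the text.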
+
+Because of these choices, Xarray cannot read arbitrary array data, but only
+Zarr data with valid ``_ARRAY_DIMENSIONS`` attributes on each array.
+
+After decoding the ``_ARRAY_DIMENSIONS`` attribute and assigning the variable
+dimensions, Xarray proceeds to optionally decode each variable using the
+standard CF decoding machinery it uses for NetCDF data (see :py:func:`decode_cf`).
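As a rough illustration of what CF decoding involves, the sketch below applies the CF `scale_factor`, `add_offset`, and `_FillValue` attributes to a list of packed values. It is a simplified stand-in written in plain Python: `decode_cf_values` is a hypothetical name, the attribute values are made up, and Xarray's real `decode_cf` handles more conventions than these three (time units, for instance).

```python
import math


def decode_cf_values(packed, attrs):
    """Unpack values using CF scale/offset attributes; fill values become NaN.

    Hypothetical helper for illustration, not Xarray's implementation.
    """
    fill_value = attrs.get("_FillValue")
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)
    decoded = []
    for value in packed:
        if fill_value is not None and value == fill_value:
            # Missing data is represented as NaN after decoding.
            decoded.append(math.nan)
        else:
            decoded.append(value * scale + offset)
    return decoded


# Made-up attribute values, chosen only to make the arithmetic easy to follow.
attrs = {"scale_factor": 0.5, "add_offset": 10.0, "_FillValue": -9999}
values = decode_cf_values([0, 4, -9999], attrs)
print(values)
```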
+
+As a concrete example, here we write a tutorial dataset to Zarr and then
+re-open it directly with Zarr:
+
+.. ipython:: python
+
+    ds = xr.tutorial.load_dataset('rasm')
+    ds.to_zarr('rasm.zarr', mode='w')
+    import zarr
+    zgroup = zarr.open('rasm.zarr')
+    print(zgroup.tree())
+    dict(zgroup['Tair'].attrs)
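The same attribute can also be inspected without Zarr or Xarray installed. The sketch below assumes a Zarr version 2 directory store, where each array's attributes live in a JSON file named `.zattrs` inside the array's directory. `read_array_dimensions` is a hypothetical helper, and the demo builds a tiny stand-in directory with placeholder values, not a real store.

```python
import json
import tempfile
from pathlib import Path


def read_array_dimensions(store_path, array_name):
    """Read _ARRAY_DIMENSIONS from an array's .zattrs file.

    Assumes the Zarr v2 directory layout: <store>/<array>/.zattrs
    """
    attrs_file = Path(store_path) / array_name / ".zattrs"
    with attrs_file.open() as f:
        attrs = json.load(f)
    return attrs["_ARRAY_DIMENSIONS"]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a store written by Xarray; the contents are placeholders.
    array_dir = Path(tmp) / "demo.zarr" / "Tair"
    array_dir.mkdir(parents=True)
    (array_dir / ".zattrs").write_text(
        json.dumps({"_ARRAY_DIMENSIONS": ["time", "y", "x"], "units": "C"})
    )
    tair_dims = read_array_dimensions(Path(tmp) / "demo.zarr", "Tair")

print(tair_dims)
```

For a store using a different format version or a non-directory backend, the attribute location differs, so reading it through the `zarr` API as in the example above is the safer route.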

doc/io.rst

Lines changed: 4 additions & 2 deletions
@@ -463,7 +463,7 @@ This is not CF-compliant but again facilitates roundtripping of xarray datasets.
Invalid netCDF files
~~~~~~~~~~~~~~~~~~~~

-The library ``h5netcdf`` allows writing some dtypes (booleans, complex, ...) that aren't
+The library ``h5netcdf`` allows writing some dtypes (booleans, complex, ...) that aren't
allowed in netCDF4 (see
`h5netcdf documentation <https://github.com/shoyer/h5netcdf#invalid-netcdf-files>`_).
This feature is available through :py:meth:`DataArray.to_netcdf` and
@@ -837,7 +837,9 @@ Xarray's Zarr backend allows xarray to leverage these capabilities.
Xarray can't open just any zarr dataset, because xarray requires special
metadata (attributes) describing the dataset dimensions and coordinates.
At this time, xarray can only open zarr datasets that have been written by
-xarray. To write a dataset with zarr, we use the :py:attr:`Dataset.to_zarr` method.
+xarray. For implementation details, see :ref:`zarr_encoding`.
+
+To write a dataset with zarr, we use the :py:attr:`Dataset.to_zarr` method.
To write to a local directory, we pass a path to a directory

.. ipython:: python
