Store variables in DataTree instead of storing Dataset #2
Description
Should we prefer inheritance or composition when making the node of a datatree behave like an xarray Dataset?
Inheritance
We really want the data-containing nodes of the datatree to behave as much like xarray datasets as possible, as we will likely be calling functions/methods on them, assigning them, extracting from them and saving them as if they were actually xarray.Dataset objects. We could imagine a tree node class which directly inherits from xarray.Dataset:
import xarray
from anytree import NodeMixin  # assumption: NodeMixin is the one provided by anytree

class DatasetNode(xarray.Dataset, NodeMixin):
    ...
This would have all the attributes and API of a Dataset, and pass isinstance() checks, but also the attributes and methods needed to function as a node in a tree (e.g. .children, .parent). We would still need to decorate most inherited methods in order to apply them to all child nodes in the tree though.
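To make the "decorate most inherited methods" point concrete, here is a minimal sketch of what such a decorator could look like. It does not use xarray at all: FakeDataset is a hypothetical stand-in for xarray.Dataset, and map_over_subtree and TreeNode are names made up for this illustration.

```python
import functools


class FakeDataset:
    """Hypothetical stand-in for xarray.Dataset: a dict of name -> list of numbers."""

    def __init__(self, data):
        self.data = dict(data)

    def mean(self):
        # Reduce every variable to a single-element list holding its mean.
        return FakeDataset({k: [sum(v) / len(v)] for k, v in self.data.items()})


def map_over_subtree(method):
    """Wrap an inherited method so it is also applied to every child node."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # Apply the original Dataset method to this node's own data...
        result = method(self, *args, **kwargs)
        new_node = TreeNode(result.data)
        # ...then call the wrapped method of the same name on each child.
        for name, child in self.children.items():
            new_node.children[name] = getattr(child, method.__name__)(*args, **kwargs)
        return new_node

    return wrapper


class TreeNode(FakeDataset):
    """A node that *is* a (fake) dataset and also holds named children."""

    def __init__(self, data, children=None):
        super().__init__(data)
        self.children = dict(children or {})

    # Each inherited method has to be re-declared through the decorator.
    mean = map_over_subtree(FakeDataset.mean)


root = TreeNode({"t": [1, 3]}, {"weather": TreeNode({"p": [2, 4, 6]})})
averaged = root.mean()
```

The result still passes isinstance(averaged, FakeDataset), but every method we want mapped over the tree needs its own line in the node class.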
Mostly these don't collide, except in the important case of getting/setting children of a node. xarray.Dataset objects already use up __getitem__ for variable selection (i.e. ds[var]) as well as the .some_variable namespace via property-like access. This means we can't immediately have an API allowing operations like dt.weather = dt.weather.mean('time') because .weather is a child of the node, not a dataset variable. (It's possible we could have both behaviours simultaneously by overriding __getitem__, but then we might restrict the possible names of children/variables.)
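A rough sketch of the name restriction mentioned above, again with a hypothetical FakeDataset in place of xarray.Dataset (InheritedNode is also a made-up name). Here __getitem__ is overridden to look at children first, so a child that shares its name with a variable hides that variable:

```python
class FakeDataset:
    """Hypothetical stand-in for xarray.Dataset with ds[var] and ds.var access."""

    def __init__(self, variables):
        self.variables = dict(variables)

    def __getitem__(self, name):
        return self.variables[name]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        try:
            return self.__dict__["variables"][name]
        except KeyError:
            raise AttributeError(name) from None


class InheritedNode(FakeDataset):
    """Node that inherits from the dataset and adds children."""

    def __init__(self, variables, children=None):
        super().__init__(variables)
        self.children = dict(children or {})

    def __getitem__(self, name):
        # Children win over variables, so the two namespaces are no longer independent.
        if name in self.children:
            return self.children[name]
        return super().__getitem__(name)


child = InheritedNode({"pressure": [1, 2]})
node = InheritedNode({"weather": [0.5], "time": [0, 1]}, {"weather": child})

shadowing = node["weather"]   # the child node, not the "weather" variable
untouched = node["time"]      # still the variable
```

With this lookup order the "weather" variable can only be reached through node.variables, which is the kind of restriction on names the parenthetical above is worried about.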
I think this approach would also have the side-effect that accessor methods registered with @register_dataset_accessor would also be callable on the tree nodes.
Composition
The alternative is that, instead of each node being a Dataset, each node merely wraps a Dataset. This has the advantage of keeping the Dataset class and the Node class separate, though they would still share a large API to allow applying a method (e.g. .mean()) to all child nodes in a tree.
The disadvantage is that all the variables and dataset attributes are then behind a .ds property.
This type of syntax dt.weather = dt.weather.mean('time') would then be possible (at least if we didn't allow the tree objects to have their own .attrs, else it would have to be dt['weather'] = dt['weather'].mean('time')) because we would be calling the method of a DatasetNode (rather than Dataset) and then assigning to a DatasetNode.
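As a sketch of how that attribute-style assignment could work under composition: the node below wraps a plain dict (a hypothetical stand-in for the wrapped Dataset) and routes attribute access to its children. The __getattr__/__setattr__ routing and the simplified mean() (no 'time' argument, children ignored) are assumptions made for illustration only.

```python
class DatasetNode:
    """Node that wraps a dataset-like dict under .ds instead of inheriting from it."""

    def __init__(self, ds=None, children=None):
        # Use object.__setattr__ so our own __setattr__ does not intercept these.
        object.__setattr__(self, "ds", dict(ds or {}))
        object.__setattr__(self, "children", dict(children or {}))

    def __getattr__(self, name):
        # Only reached when normal lookup fails: treat the name as a child.
        children = self.__dict__.get("children", {})
        if name in children:
            return children[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Assigning a node attaches it as a child; anything else is a plain attribute.
        if isinstance(value, DatasetNode):
            self.children[name] = value
        else:
            object.__setattr__(self, name, value)

    def mean(self):
        # Node-level method that returns a node, so the result can be assigned back.
        # Children are ignored here to keep the sketch short.
        return DatasetNode({k: sum(v) / len(v) for k, v in self.ds.items()})


dt = DatasetNode(children={"weather": DatasetNode({"pressure": [1.0, 3.0]})})
dt.weather = dt.weather.mean()
```

Any real attribute on the node (such as .ds here, or a hypothetical .attrs) takes priority over a child of the same name, which is why the dt['weather'] form would be needed once nodes carry attributes of their own.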
Selecting a particular variable from a dataset stored at a particular node would then look like dt['weather'].ds['pressure'], which has the advantage of clarifying which one is the variable, but the disadvantage of breaking up the path-like structure to get from the root down to the variable. EDIT: As there is no problem with collisions between names of groups and variables, we can actually just override __getitem__ to check in both the data variables and the children, so we can have access like dt['weather']['pressure'].
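The combined lookup from the EDIT could be sketched like this. The wrapped dataset is again a plain dict standing in for xarray.Dataset, and the sketch assumes that a node never uses the same name for a variable and a child (it raises if that happens); that check is my reading of "no problem with collisions", not something settled in this issue.

```python
class DatasetNode:
    """Composition-style node whose __getitem__ searches variables, then children."""

    def __init__(self, ds=None, children=None):
        self.ds = dict(ds or {})               # stand-in for the wrapped Dataset
        self.children = dict(children or {})
        # Assumption: variable names and child names must not overlap on one node.
        clash = self.ds.keys() & self.children.keys()
        if clash:
            raise ValueError(f"names used for both a variable and a child: {sorted(clash)}")

    def __getitem__(self, key):
        if key in self.ds:
            return self.ds[key]
        if key in self.children:
            return self.children[key]
        raise KeyError(key)


dt = DatasetNode(children={"weather": DatasetNode({"pressure": [1013.0, 1009.5]})})

via_path = dt["weather"]["pressure"]      # path-like access from the root
via_ds = dt["weather"].ds["pressure"]     # explicit access through .ds still works
```

Both spellings return the same object, so the path-like form is purely a convenience on top of .ds.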
(There is also a possible third option described in #4)
For now the second approach seemed better, but I'm looking for other opinions!