Store variables in DataTree instead of storing Dataset

Should we prefer inheritance or composition when making the node of a datatree behave like an xarray Dataset?

### Inheritance

We [really want](https://github.com/pydata/xarray/issues/4118#issuecomment-833535376) the data-containing nodes of the datatree to behave as much like xarray datasets as possible, as we will likely be calling functions/methods on them, assigning them, extracting from them and saving them as if they were actually xarray.Dataset objects. We could imagine a tree node class which directly inherits from `xarray.Dataset`:

```python
class DatasetNode(xarray.Dataset, NodeMixin):
    ...
```

This would have all the attributes and API of a Dataset, and pass `isinstance()` checks, but also the attributes and methods needed to function as a node in a tree (e.g. `.children`, `.parent`). We would still need to decorate most inherited methods in order to apply them to all child nodes in the tree though.
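To make the "decorate most inherited methods" point concrete, here is a minimal sketch of what that could look like. Everything in it is an assumption for illustration: `FakeDataset` is a pure-Python stand-in for `xarray.Dataset`, `NodeMixin` is a hypothetical minimal tree mixin, and `map_over_subtree` is a made-up decorator name, not an existing xarray API.

```python
import functools


class FakeDataset:
    """Hypothetical stand-in for xarray.Dataset: maps variable names to lists of numbers."""

    def __init__(self, variables=None):
        self.variables = dict(variables or {})

    def mean(self):
        # Reduce every variable to its mean and return a plain FakeDataset.
        return FakeDataset({k: sum(v) / len(v) for k, v in self.variables.items()})


class NodeMixin:
    """Hypothetical minimal tree mixin providing .parent and .children."""

    def _init_node(self, name):
        self.name = name
        self.parent = None
        self.children = {}

    def add_child(self, child):
        child.parent = self
        self.children[child.name] = child


def map_over_subtree(method):
    """Wrap a dataset method so it is applied to this node and all its descendants."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # Apply the plain dataset method to this node's own data...
        result = method(self, *args, **kwargs)
        new_node = DatasetNode(self.name, result.variables)
        # ...then recurse into the children and rebuild the tree structure.
        for child in self.children.values():
            new_node.add_child(wrapper(child, *args, **kwargs))
        return new_node

    return wrapper


class DatasetNode(FakeDataset, NodeMixin):
    def __init__(self, name, variables=None):
        FakeDataset.__init__(self, variables)
        self._init_node(name)

    # Each inherited method has to be re-declared with the decorator.
    mean = map_over_subtree(FakeDataset.mean)


root = DatasetNode("root", {"a": [1, 2, 3]})
root.add_child(DatasetNode("weather", {"p": [10, 20]}))
reduced = root.mean()
```

The node still passes `isinstance(reduced, FakeDataset)`, but every method we want tree-aware needs its own re-declaration, which is the maintenance cost mentioned above.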

Mostly the dataset API and the node API don't collide, except in the important case of getting/setting children of a node. `xarray.Dataset` objects already use `__getitem__` for variable selection (i.e. `ds[var]`), and they also use the attribute namespace for property-like access to variables (i.e. `ds.some_variable`). This means we can't immediately have an API allowing [operations like](https://github.com/pydata/xarray/issues/4118#issuecomment-637382925) `dt.weather = dt.weather.mean('time')` because `.weather` is a child of the node, not a dataset variable. (It's possible we could have both behaviours simultaneously by overriding `__getitem__`, but then we might restrict the possible names of children/variables.)

I think this approach would also have the side-effect that accessor methods registered with `@register_dataset_accessor` would be callable on the tree nodes.

### Composition

The alternative is that instead of each node *being* a Dataset, each node merely *wraps* a Dataset. This has the advantage of keeping the Data class and the Node class separate, though they would still share a large API to allow applying a method (e.g. `.mean()`) to all child nodes in a tree.

The disadvantage is that then all the variables and dataset attributes are behind a `.ds` property.


Syntax like `dt.weather = dt.weather.mean('time')` would then be possible (at least if we didn't allow the tree objects to have their own `.attrs`, else it would have to be `dt['weather'] = dt['weather'].mean('time')`) because we would be calling the method of a DatasetNode (rather than a Dataset) and then assigning to a DatasetNode.
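A rough sketch of how the composition approach could support that attribute-style syntax is below. It is only an illustration under stated assumptions: `FakeDataset` is a pure-Python stand-in for `xarray.Dataset` with a single `'time'` dimension, and this `DatasetNode` is hypothetical. The node deliberately has no `.attrs` of its own, matching the caveat above.

```python
class FakeDataset:
    """Hypothetical stand-in for xarray.Dataset with a single dimension, 'time'."""

    def __init__(self, variables):
        self.variables = dict(variables)

    def mean(self, dim):
        if dim != "time":
            raise ValueError(f"unknown dimension {dim!r}")
        return FakeDataset({k: sum(v) / len(v) for k, v in self.variables.items()})


class DatasetNode:
    """Hypothetical node that wraps a dataset instead of inheriting from one."""

    def __init__(self, ds=None, children=None):
        # Bypass our own __setattr__ when storing the internal state.
        object.__setattr__(self, "ds", ds)
        object.__setattr__(self, "children", dict(children or {}))

    def __getattr__(self, name):
        # Only called when normal lookup fails, so .ds and .children are unaffected.
        children = self.__dict__.get("children", {})
        try:
            return children[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Attribute assignment means "set a child node" in this sketch.
        if not isinstance(value, DatasetNode):
            raise TypeError("only DatasetNode objects can be assigned as children")
        self.children[name] = value

    def mean(self, dim):
        # Shared API: forward to the wrapped dataset, then map over the children.
        new_ds = self.ds.mean(dim) if self.ds is not None else None
        new_children = {k: child.mean(dim) for k, child in self.children.items()}
        return DatasetNode(new_ds, new_children)


dt = DatasetNode()
dt.weather = DatasetNode(FakeDataset({"pressure": [1000.0, 1010.0]}))
dt.weather = dt.weather.mean("time")
```

Because `.mean()` here is a method of the node and returns a node, the assignment never has to decide whether `weather` names a child or a variable.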

Selecting a particular variable from a dataset stored at a particular node would then look like `dt['weather'].ds['pressure']`, which has the advantage of clarifying which one is the variable, but the disadvantage of breaking up the path-like structure to get from the root down to the variable. EDIT: As there is no problem with collisions between names of groups and variables, we can actually just override `__getitem__` to check in both the data variables and the children, so we can have access like `dt['weather']['pressure']`.
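The `__getitem__` override mentioned in the EDIT could look roughly like this. It is a sketch, not a settled design: a plain `dict` stands in for the wrapped dataset, the `DatasetNode` class is hypothetical, and it assumes (as stated above) that a child and a variable in the same node never share a name.

```python
class DatasetNode:
    """Hypothetical node wrapping a dataset (a plain dict stands in for xarray.Dataset)."""

    def __init__(self, ds=None, children=None):
        self.ds = dict(ds or {})
        self.children = dict(children or {})

    def __getitem__(self, key):
        # Look in the child nodes first, then in the wrapped dataset's variables.
        # The order only matters if names collide, which is assumed not to happen.
        if key in self.children:
            return self.children[key]
        if key in self.ds:
            return self.ds[key]
        raise KeyError(key)


weather = DatasetNode(ds={"pressure": [1000.0, 1010.0]})
dt = DatasetNode(children={"weather": weather})

pressure = dt["weather"]["pressure"]
```

The explicit `dt["weather"].ds["pressure"]` path still works and returns the same object, so the override only adds a shorter spelling.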

---

(There is also a possible third option described in #4)

---

For now the second approach seems better, but I'm looking for other opinions!
