Commit 9c0fb6c

nbren12 authored and shoyer committed
Add methods for combining variables of differing dimensionality (#1597)
* Add stack_cat and unstack_cat methods

  This partially resolves #1317.

  Change names of methods:
    stack_cat -> to_stacked_array
    unstack_cat -> to_unstacked_dataset

  Test that the dtype of the stacked dimensions is preserved. This is not passing at the moment because concatenating None with a dimension that has values upcasts the combined dtype to object.

  Fix dtypes of stacked dimensions. This commit ensures that the dtypes of the stacked coordinate match the input dimensions.

  Use new index variable rather than patching the old one. I didn't like the inplace modification of a private member.

  Handle variable_dim correctly. I also fixed:
    1. f-string formatting issue
    2. Use an OrderedDict as @jhamman recommends

  Add documentation to api.rst and reshaping.rst. I also added appropriate See Also sections to the docstrings for to_stacked_array and to_unstacked_dataset.

  Add changes to whats-new.

  Fixing style errors.

  Split up lengthy test.

  Remove "physical variable" from docs. This is in response to Joe's "nit".

* Fix to_stacked_array with new master

  An error arose when checking for the presence of a dimension in array. The code 'dim in data' no longer works. Replaced this with 'dim in data.dims'.

* Move entry in whats-new to most recent release
* Fix code styling errors

  It needs to pass `pycodestyle xarray`.

* Improve docstring of to_stacked_array

  Added example and additional description.

* Move "See Also" section to end of docstring
* Doc and comment improvements.
* Improve documented example

  @benbovy pointed out that the old example was confusing.

* Add name argument to to_stacked_array and test
* Allow level argument to be an int or str
* Remove variable_dim argument of to_unstacked_array
* Actually removed variable_dim
* Change function signature of to_stacked_array

  Previously, this function was passed a list of dimensions which should be stacked together. However, @benbovy found that the function failed when the _non-stacked_ dimensions were not shared across all variables. Thus, it is easier to specify the dimensions which should remain unchanged, rather than the dimensions to be stacked. The function to_stacked_array now takes an argument ``sample_dims`` which defines these non-stacked dimensions. If these dims are not shared across all variables, then an error is raised.

* Fix lint error

  The line was too long.

* Fix validation and failing tests

  1. The test which stacks a scalar and an array doesn't make sense anymore given the new API.
  2. Fixed a bug in the validation code which raised an error almost always.

* Fix typo
* Improve docs and error messages
* Remove extra spaces
* Test warning in to_unstacked_dataset
* Improve formatting and naming
* Fix flake8 error
* Respond to @max-sixty's suggestions
1 parent 8890fae commit 9c0fb6c

7 files changed: +298 -0 lines changed

doc/api.rst

Lines changed: 2 additions & 0 deletions
@@ -208,6 +208,7 @@ Reshaping and reorganizing
    Dataset.transpose
    Dataset.stack
    Dataset.unstack
+   Dataset.to_stacked_array
    Dataset.shift
    Dataset.roll
    Dataset.sortby
@@ -381,6 +382,7 @@ Reshaping and reorganizing
    DataArray.transpose
    DataArray.stack
    DataArray.unstack
+   DataArray.to_unstacked_dataset
    DataArray.shift
    DataArray.roll
    DataArray.sortby

doc/reshaping.rst

Lines changed: 42 additions & 0 deletions
@@ -133,6 +133,48 @@ pandas, it does not automatically drop missing values. Compare:
 We departed from pandas's behavior here because predictable shapes for new
 array dimensions is necessary for :ref:`dask`.

+Stacking different variables together
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These stacking and unstacking operations are particularly useful for reshaping
+xarray objects for use in machine learning packages, such as `scikit-learn
+<http://scikit-learn.org/stable/>`_, that usually require two-dimensional numpy
+arrays as inputs. For datasets with only one variable, we only need ``stack``
+and ``unstack``, but combining multiple variables in a
+:py:class:`xarray.Dataset` is more complicated. If the variables in the dataset
+have matching numbers of dimensions, we can call
+:py:meth:`~xarray.Dataset.to_array` and then stack along the new coordinate.
+But :py:meth:`~xarray.Dataset.to_array` will broadcast the dataarrays together,
+which will effectively tile the lower dimensional variable along the missing
+dimensions. The method :py:meth:`xarray.Dataset.to_stacked_array` allows
+combining variables of differing dimensions without this wasteful copying while
+:py:meth:`xarray.DataArray.to_unstacked_dataset` reverses this operation.
+Just as with :py:meth:`xarray.Dataset.stack` the stacked coordinate is
+represented by a :py:class:`pandas.MultiIndex` object. These methods are used
+like this:
+
+.. ipython:: python
+    data = xr.Dataset(
+        data_vars={'a': (('x', 'y'), [[0, 1, 2], [3, 4, 5]]),
+                   'b': ('x', [6, 7])},
+        coords={'y': ['u', 'v', 'w']}
+    )
+    stacked = data.to_stacked_array("z", sample_dims=['x'])
+    stacked
+    unstacked = stacked.to_unstacked_dataset("z")
+    unstacked
+
+In this example, ``stacked`` is a two dimensional array that we can easily pass to scikit-learn or another generic
+numerical method.
+
+.. note::
+
+    Unlike with ``stack``, in ``to_stacked_array``, the user specifies the dimensions they **do not** want stacked.
+    For a machine learning task, these unstacked dimensions can be interpreted as the dimensions over which samples are
+    drawn, whereas the stacked coordinates are the features. Naturally, all variables should possess these sampling
+    dimensions.
+
+
 .. _reshape.set_index:

 Set and reset index
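
To make the tiling described in the new documentation section concrete, here is a minimal pure-Python sketch. It does not call xarray: nested lists stand in for the variables `a` (dims `x`, `y`) and `b` (dim `x`) from the example above, and the names `broadcast_rows` and `stacked_rows` are illustrative assumptions rather than anything defined by this commit.

```python
# Stand-ins for the documentation example: 'a' has dims (x, y), 'b' has dims (x,).
a = [[0, 1, 2], [3, 4, 5]]  # shape (x=2, y=3)
b = [6, 7]                  # shape (x=2,)

# Broadcasting approach: 'b' is tiled along the missing 'y' dimension
# before the variables are laid out side by side for each sample.
b_tiled = [[value] * len(a[0]) for value in b]
broadcast_rows = [row_a + row_b for row_a, row_b in zip(a, b_tiled)]

# Stacking without broadcasting: 'b' contributes one feature per sample.
stacked_rows = [row_a + [value_b] for row_a, value_b in zip(a, b)]

print(broadcast_rows)  # [[0, 1, 2, 6, 6, 6], [3, 4, 5, 7, 7, 7]]
print(stacked_rows)    # [[0, 1, 2, 6], [3, 4, 5, 7]]
```

The second layout has the same values as the `(x: 2, z: 4)` array shown in the `to_stacked_array` docstring, while the first carries two redundant copies of each `b` value per sample.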

doc/whats-new.rst

Lines changed: 6 additions & 0 deletions
@@ -18,6 +18,12 @@ What's New
 v0.12.3 (unreleased)
 --------------------

+New functions/methods
+~~~~~~~~~~~~~~~~~~~~~
+
+- New methods for reshaping Datasets of variables with different dimensions
+  (:issue:`1317`). By `Noah Brenowitz <https://github.com/nbren12>`_.
+
 Enhancements
 ~~~~~~~~~~~~

xarray/core/dataarray.py

Lines changed: 66 additions & 0 deletions
@@ -1540,6 +1540,72 @@ def unstack(self, dim: Union[Hashable, Sequence[Hashable], None] = None
         ds = self._to_temp_dataset().unstack(dim)
         return self._from_temp_dataset(ds)

+    def to_unstacked_dataset(self, dim, level=0):
+        """Unstack DataArray expanding to Dataset along a given level of a
+        stacked coordinate.
+
+        This is the inverse operation of Dataset.to_stacked_array.
+
+        Parameters
+        ----------
+        dim : str
+            Name of existing dimension to unstack
+        level : int or str
+            The MultiIndex level to expand to a dataset along. Can either be
+            the integer index of the level or its name.
+        label : int, default 0
+            Label of the level to expand dataset along. Overrides the label
+            argument if given.
+
+        Returns
+        -------
+        unstacked: Dataset
+
+        Examples
+        --------
+        >>> import xarray as xr
+        >>> arr = xr.DataArray(np.arange(6).reshape(2, 3),
+        ...                    coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
+        >>> data = xr.Dataset({'a': arr, 'b': arr.isel(y=0)})
+        >>> data
+        <xarray.Dataset>
+        Dimensions:  (x: 2, y: 3)
+        Coordinates:
+          * x        (x) <U1 'a' 'b'
+          * y        (y) int64 0 1 2
+        Data variables:
+            a        (x, y) int64 0 1 2 3 4 5
+            b        (x) int64 0 3
+        >>> stacked = data.to_stacked_array("z", ['x'])
+        >>> stacked.indexes['z']
+        MultiIndex(levels=[['a', 'b'], [0, 1, 2]],
+                   labels=[[0, 0, 0, 1], [0, 1, 2, -1]],
+                   names=['variable', 'y'])
+        >>> roundtripped = stacked.to_unstacked_dataset(dim='z')
+        >>> data.identical(roundtripped)
+        True
+
+        See Also
+        --------
+        Dataset.to_stacked_array
+        """
+
+        idx = self.indexes[dim]
+        if not isinstance(idx, pd.MultiIndex):
+            raise ValueError("'{}' is not a stacked coordinate".format(dim))
+
+        level_number = idx._get_level_number(level)
+        variables = idx.levels[level_number]
+        variable_dim = idx.names[level_number]
+
+        # pull variables out of DataArray
+        data_dict = OrderedDict()
+        for k in variables:
+            data_dict[k] = self.sel({variable_dim: k}).squeeze(drop=True)
+
+        # unstacked dataset
+        return Dataset(data_dict)
+
     def transpose(self,
                   *dims: Hashable,
                   transpose_coords: Optional[bool] = None) -> 'DataArray':
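
The loop in `to_unstacked_dataset` selects every entry whose chosen MultiIndex level equals a given variable name. A simplified model of that grouping step, written in plain Python, might look like the sketch below. It is not xarray's implementation: the helper `unstack_by_level` and the tuple-based index are hypothetical stand-ins for `self.sel(...)` and `pandas.MultiIndex`.

```python
from collections import OrderedDict


def unstack_by_level(index, values, level=0):
    """Group values by one level of a tuple-based index (hypothetical helper).

    Each key of the result is a label of the chosen level; each entry keeps
    the remaining labels alongside the value.
    """
    grouped = OrderedDict()
    for labels, value in zip(index, values):
        key = labels[level]
        remaining = labels[:level] + labels[level + 1:]
        grouped.setdefault(key, []).append((remaining, value))
    return grouped


# One sample (one row along 'x') of the stacked array from the docs example.
index = [('a', 'u'), ('a', 'v'), ('a', 'w'), ('b', None)]
row = [0, 1, 2, 6]

result = unstack_by_level(index, row)
print(list(result))  # ['a', 'b']
print(result['a'])   # [(('u',), 0), (('v',), 1), (('w',), 2)]
print(result['b'])   # [((None,), 6)]
```

In the real method, the leftover length-one entry for `b` is removed by `squeeze(drop=True)`, which is how the lower-dimensional variable gets its original shape back.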

xarray/core/dataset.py

Lines changed: 113 additions & 0 deletions
@@ -2767,6 +2767,119 @@ def stack(self, dimensions=None, **dimensions_kwargs):
             result = result._stack_once(dims, new_dim)
         return result

+    def to_stacked_array(self, new_dim, sample_dims, variable_dim='variable',
+                         name=None):
+        """Combine variables of differing dimensionality into a DataArray
+        without broadcasting.
+
+        This method is similar to Dataset.to_array but does not broadcast the
+        variables.
+
+        Parameters
+        ----------
+        new_dim : str
+            Name of the new stacked coordinate
+        sample_dims : Sequence[str]
+            Dimensions that **will not** be stacked. Each array in the dataset
+            must share these dimensions. For machine learning applications,
+            these define the dimensions over which samples are drawn.
+        variable_dim : str, optional
+            Name of the level in the stacked coordinate which corresponds to
+            the variables.
+        name : str, optional
+            Name of the new data array.
+
+        Returns
+        -------
+        stacked : DataArray
+            DataArray with the specified dimensions and data variables
+            stacked together. The stacked coordinate is named ``new_dim``
+            and represented by a MultiIndex object with a level containing the
+            data variable names. The name of this level is controlled using
+            the ``variable_dim`` argument.
+
+        See Also
+        --------
+        Dataset.to_array
+        Dataset.stack
+        DataArray.to_unstacked_dataset
+
+        Examples
+        --------
+        >>> data = Dataset(
+        ...     data_vars={'a': (('x', 'y'), [[0, 1, 2], [3, 4, 5]]),
+        ...                'b': ('x', [6, 7])},
+        ...     coords={'y': ['u', 'v', 'w']}
+        ... )
+
+        >>> data
+        <xarray.Dataset>
+        Dimensions:  (x: 2, y: 3)
+        Coordinates:
+          * y        (y) <U1 'u' 'v' 'w'
+        Dimensions without coordinates: x
+        Data variables:
+            a        (x, y) int64 0 1 2 3 4 5
+            b        (x) int64 6 7
+
+        >>> data.to_stacked_array("z", sample_dims=['x'])
+        <xarray.DataArray (x: 2, z: 4)>
+        array([[0, 1, 2, 6],
+               [3, 4, 5, 7]])
+        Coordinates:
+          * z         (z) MultiIndex
+          - variable  (z) object 'a' 'a' 'a' 'b'
+          - y         (z) object 'u' 'v' 'w' nan
+        Dimensions without coordinates: x
+
+        """
+        stacking_dims = tuple(dim for dim in self.dims
+                              if dim not in sample_dims)
+
+        for variable in self:
+            dims = self[variable].dims
+            dims_include_sample_dims = set(sample_dims) <= set(dims)
+            if not dims_include_sample_dims:
+                raise ValueError(
+                    "All variables in the dataset must contain the "
+                    "dimensions {}.".format(sample_dims)
+                )
+
+        def ensure_stackable(val):
+            assign_coords = {variable_dim: val.name}
+            for dim in stacking_dims:
+                if dim not in val.dims:
+                    assign_coords[dim] = None
+
+            expand_dims = set(stacking_dims).difference(set(val.dims))
+            expand_dims.add(variable_dim)
+            # must be list for .expand_dims
+            expand_dims = list(expand_dims)
+
+            return (val.assign_coords(**assign_coords)
+                    .expand_dims(expand_dims)
+                    .stack({new_dim: (variable_dim,) + stacking_dims}))
+
+        # concatenate the arrays
+        stackable_vars = [ensure_stackable(self[key])
+                          for key in self.data_vars]
+        data_array = xr.concat(stackable_vars, dim=new_dim)
+
+        # coerce the levels of the MultiIndex to have the same type as the
+        # input dimensions. This code is messy, so it might be better to just
+        # input a dummy value for the singleton dimension.
+        idx = data_array.indexes[new_dim]
+        levels = ([idx.levels[0]]
+                  + [level.astype(self[level.name].dtype)
+                     for level in idx.levels[1:]])
+        new_idx = idx.set_levels(levels)
+        data_array[new_dim] = IndexVariable(new_dim, new_idx)
+
+        if name is not None:
+            data_array.name = name
+
+        return data_array
+
     def _unstack_once(self, dim):
         index = self.get_index(dim)
         # GH2619. For MultiIndex, we need to call remove_unused.
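
The first part of `to_stacked_array` does two things: it derives the dimensions to stack as everything not listed in `sample_dims`, and it rejects datasets in which some variable lacks a sample dimension. The sketch below isolates that logic in plain Python under stated assumptions: a dict mapping variable names to dimension tuples stands in for the Dataset, and the helper names `dims_to_stack` and `check_sample_dims` are hypothetical.

```python
def dims_to_stack(all_dims, sample_dims):
    """Return the dimensions that will be stacked: all except sample_dims."""
    return tuple(dim for dim in all_dims if dim not in sample_dims)


def check_sample_dims(variable_dims, sample_dims):
    """Raise ValueError unless every variable contains all sample_dims.

    variable_dims maps variable name -> tuple of dimension names; this is
    an assumed stand-in for inspecting each variable of a Dataset.
    """
    required = set(sample_dims)
    for name, dims in variable_dims.items():
        missing = required - set(dims)
        if missing:
            raise ValueError(
                "variable {!r} is missing sample dimensions {}".format(
                    name, sorted(missing)))


variable_dims = {'a': ('x', 'y'), 'b': ('x',)}

check_sample_dims(variable_dims, ['x'])   # passes: both variables have 'x'
print(dims_to_stack(('x', 'y'), ['x']))   # ('y',)

try:
    check_sample_dims(variable_dims, ['y'])
except ValueError as err:
    print(err)  # variable 'b' is missing sample dimensions ['y']
```

This mirrors the case covered by `test_to_stacked_array_invalid_sample_dims`, where `sample_dims=['y']` is rejected because `b` only has the dimension `x`.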

xarray/tests/test_dataarray.py

Lines changed: 6 additions & 0 deletions
@@ -1798,6 +1798,12 @@ def test_stack_nonunique_consistency(self):
         expected = DataArray(orig.to_pandas().stack(), dims='z')
         assert_identical(expected, actual)

+    def test_to_unstacked_dataset_raises_value_error(self):
+        data = DataArray([0, 1], dims='x', coords={'x': [0, 1]})
+        with pytest.raises(
+                ValueError, match="'x' is not a stacked coordinate"):
+            data.to_unstacked_dataset('x', 0)
+
     def test_transpose(self):
         da = DataArray(np.random.randn(3, 4, 5), dims=('x', 'y', 'z'),
                        coords={'x': range(3), 'y': range(4), 'z': range(5),

xarray/tests/test_dataset.py

Lines changed: 63 additions & 0 deletions
@@ -110,6 +110,14 @@ def create_test_multiindex():
     return Dataset({}, {'x': mindex})


+def create_test_stacked_array():
+    x = DataArray(pd.Index(np.r_[:10], name='x'))
+    y = DataArray(pd.Index(np.r_[:20], name='y'))
+    a = x * y
+    b = x * y * y
+    return a, b
+
+
 class InaccessibleVariableDataStore(backends.InMemoryDataStore):
     def __init__(self):
         super().__init__()
@@ -2482,6 +2490,61 @@ def test_stack_unstack_slow(self):
         actual = stacked.isel(z=slice(None, None, -1)).unstack('z')
         assert actual.identical(ds[['b']])

+    def test_to_stacked_array_invalid_sample_dims(self):
+        data = xr.Dataset(
+            data_vars={'a': (('x', 'y'), [[0, 1, 2], [3, 4, 5]]),
+                       'b': ('x', [6, 7])},
+            coords={'y': ['u', 'v', 'w']}
+        )
+        with pytest.raises(ValueError):
+            data.to_stacked_array("features", sample_dims=['y'])
+
+    def test_to_stacked_array_name(self):
+        name = 'adf9d'
+
+        # make a two dimensional dataset
+        a, b = create_test_stacked_array()
+        D = xr.Dataset({'a': a, 'b': b})
+        sample_dims = ['x']
+
+        y = D.to_stacked_array('features', sample_dims, name=name)
+        assert y.name == name
+
+    def test_to_stacked_array_dtype_dims(self):
+        # make a two dimensional dataset
+        a, b = create_test_stacked_array()
+        D = xr.Dataset({'a': a, 'b': b})
+        sample_dims = ['x']
+        y = D.to_stacked_array('features', sample_dims)
+        assert y.indexes['features'].levels[1].dtype == D.y.dtype
+        assert y.dims == ('x', 'features')
+
+    def test_to_stacked_array_to_unstacked_dataset(self):
+        # make a two dimensional dataset
+        a, b = create_test_stacked_array()
+        D = xr.Dataset({'a': a, 'b': b})
+        sample_dims = ['x']
+        y = D.to_stacked_array('features', sample_dims)\
+            .transpose("x", "features")
+
+        x = y.to_unstacked_dataset("features")
+        assert_identical(D, x)
+
+        # test on just one sample
+        x0 = y[0].to_unstacked_dataset("features")
+        d0 = D.isel(x=0)
+        assert_identical(d0, x0)
+
+    def test_to_stacked_array_to_unstacked_dataset_different_dimension(self):
+        # test when variables have different dimensionality
+        a, b = create_test_stacked_array()
+        sample_dims = ['x']
+        D = xr.Dataset({'a': a, 'b': b.isel(y=0)})
+
+        y = D.to_stacked_array('features', sample_dims)
+        x = y.to_unstacked_dataset('features')
+        assert_identical(D, x)
+
     def test_update(self):
         data = create_test_data(seed=0)
         expected = data.copy()
