Stack + to_array before to_xarray is much faster than a simple to_xarray #2459

Closed
@max-sixty

Description

I was seeing slow performance from to_xarray() on MultiIndexed series, and found that unstacking one of the dimensions before running to_xarray(), and then restacking with to_array(), was ~30x faster. The time difference persists with larger data sizes.

To reproduce:

Create a series with a MultiIndex, ensuring the MultiIndex isn't a simple product:

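import numpy as np
import pandas as pd

s = pd.Series(
    np.random.rand(100000),
    index=pd.MultiIndex.from_product([
        list('abcdefhijk'),
        list('abcdefhijk'),
        # pd.date_range replaces DatetimeIndex(start=...), which newer pandas no longer accepts
        pd.date_range(start='2000-01-01', periods=1000, freq='B'),
    ]))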

cropped = s[::3]
cropped.index = pd.MultiIndex.from_tuples(cropped.index, names=list('xyz'))

cropped.head()

# x  y  z         
# a  a  2000-01-03    0.993989
#      2000-01-06    0.850518
#      2000-01-11    0.068944
#      2000-01-14    0.237197
#      2000-01-19    0.784254
# dtype: float64
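
As a quick sanity check that the cropped index really is not a full product, one can compare the number of rows against the size of the Cartesian product of the labels present. This is only a sketch: it rebuilds the series with pd.date_range and passes the level names up front, and the names present_levels, full_size and fill_fraction are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example series; names are set directly rather than via from_tuples.
idx = pd.MultiIndex.from_product(
    [
        list("abcdefhijk"),
        list("abcdefhijk"),
        pd.date_range("2000-01-01", periods=1000, freq="B"),
    ],
    names=list("xyz"),
)
s = pd.Series(np.random.rand(len(idx)), index=idx)

# Keep every third row, so most label combinations are missing.
cropped = s.iloc[::3]

# Distinct labels actually present on each level (hypothetical helper names).
present_levels = [
    cropped.index.get_level_values(name).unique() for name in cropped.index.names
]
full_size = int(np.prod([len(level) for level in present_levels]))

# Fraction of the dense (x, y, z) grid that holds real data.
fill_fraction = len(cropped) / full_size
print(len(cropped), full_size, round(fill_fraction, 3))
```

A fill fraction well below 1 means to_xarray() has to pad the result with NaN, which is the case this issue is about.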

Two approaches for getting this into xarray:
1 - Simple .to_xarray():

current_version = cropped.to_xarray()

<xarray.DataArray (x: 10, y: 10, z: 1000)>
array([[[0.993989,      nan, ...,      nan, 0.721663],
        [     nan,      nan, ..., 0.58224 ,      nan],
        ...,
        [     nan, 0.369382, ...,      nan,      nan],
        [0.98558 ,      nan, ...,      nan, 0.403732]],

       [[     nan,      nan, ..., 0.493711,      nan],
        [     nan, 0.126761, ...,      nan,      nan],
        ...,
        [0.976758,      nan, ...,      nan, 0.816612],
        [     nan,      nan, ..., 0.982128,      nan]],

       ...,

       [[     nan, 0.971525, ...,      nan,      nan],
        [0.146774,      nan, ...,      nan, 0.419806],
        ...,
        [     nan,      nan, ..., 0.700764,      nan],
        [     nan, 0.502058, ...,      nan,      nan]],

       [[0.246768,      nan, ...,      nan, 0.079266],
        [     nan,      nan, ..., 0.802297,      nan],
        ...,
        [     nan, 0.636698, ...,      nan,      nan],
        [0.025195,      nan, ...,      nan, 0.629305]]])
Coordinates:
  * x        (x) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * y        (y) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * z        (z) datetime64[ns] 2000-01-03 2000-01-04 ... 2003-10-30 2003-10-31

This takes 536 ms

2 - Unstack in pandas first, then use to_array to do the equivalent of a restack:

proposed_version = (
    cropped
    .unstack('y')
    .to_xarray()
    .to_array('y')
)

This takes 17.3 ms

To confirm these are identical:

proposed_version_adj = (
    proposed_version
    .assign_coords(y=proposed_version['y'].astype(object))
    .transpose(*current_version.dims)
)

proposed_version_adj.equals(current_version)
# True
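
For anyone wanting to reproduce the comparison, here is a rough, self-contained timing sketch using the standard-library timeit module. It assumes xarray is installed and rebuilds the series with pd.date_range; the function names direct and via_unstack are hypothetical, and the numbers will vary by machine and by pandas/xarray version, so no expected timings are given.

```python
import timeit

import numpy as np
import pandas as pd
import xarray as xr  # noqa: F401  (pandas' to_xarray() needs xarray installed)

# Rebuild the sparse MultiIndexed series from the issue.
idx = pd.MultiIndex.from_product(
    [
        list("abcdefhijk"),
        list("abcdefhijk"),
        pd.date_range("2000-01-01", periods=1000, freq="B"),
    ],
    names=list("xyz"),
)
cropped = pd.Series(np.random.rand(len(idx)), index=idx).iloc[::3]


def direct(series):
    """Approach 1: convert the MultiIndexed series in one step."""
    return series.to_xarray()


def via_unstack(series):
    """Approach 2: unstack 'y' in pandas, convert, then recombine along 'y'."""
    return series.unstack("y").to_xarray().to_array("y")


# Best of a few single runs for each approach.
t_direct = min(timeit.repeat(lambda: direct(cropped), number=1, repeat=3))
t_unstack = min(timeit.repeat(lambda: via_unstack(cropped), number=1, repeat=3))
print("direct:      {:.1f} ms".format(t_direct * 1e3))
print("via unstack: {:.1f} ms".format(t_unstack * 1e3))

# Check the two routes agree once dimensions are put in the same order.
a = direct(cropped)
b = via_unstack(cropped).transpose(*a.dims)
same_values = np.allclose(a.values, b.values, equal_nan=True)
print("same values:", same_values)
```

The value comparison here uses NumPy on the raw arrays rather than .equals(), so it sidesteps the coordinate dtype adjustment shown above.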

Problem description

A default operation is much slower than a (potentially) equivalent operation that's not the default.

I need to look further into what's causing the slowdown. I suspect the .reindex(full_idx) call, but I'm unclear why the alternative route is so much faster, or whether there's a fix that would make the default path fast.
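
One way to probe the reindex hypothesis without xarray in the loop is to time the two pandas operations on their own. This is only a sketch under assumptions: full_idx is built here as the product of the index levels, which may not match exactly what xarray constructs internally, and the timings depend heavily on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the sparse MultiIndexed series from the issue.
idx = pd.MultiIndex.from_product(
    [
        list("abcdefhijk"),
        list("abcdefhijk"),
        pd.date_range("2000-01-01", periods=1000, freq="B"),
    ],
    names=list("xyz"),
)
cropped = pd.Series(np.random.rand(len(idx)), index=idx).iloc[::3]

# Assumed construction of the dense index: the full product of the used levels.
full_idx = pd.MultiIndex.from_product(
    cropped.index.remove_unused_levels().levels, names=cropped.index.names
)

# Time reindexing onto the dense index versus a plain unstack of one level.
t_reindex = min(timeit.repeat(lambda: cropped.reindex(full_idx), number=1, repeat=3))
t_unstack = min(timeit.repeat(lambda: cropped.unstack("y"), number=1, repeat=3))
print("reindex(full_idx): {:.1f} ms".format(t_reindex * 1e3))
print("unstack('y'):      {:.1f} ms".format(t_unstack * 1e3))

# Both produce a dense layout padded with NaN where data is missing.
reindexed = cropped.reindex(full_idx)
unstacked = cropped.unstack("y")
```

If the reindex timing dominates, that would point at the MultiIndex lookup rather than at xarray itself, though this alone does not prove where the default path spends its time.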

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.93-linuxkit-aufs
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.utf8
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.0
conda: None
pytest: 3.8.1
IPython: 5.8.0
sphinx: None
