groupby+map performance regression on MultiIndex dataset #7376

Closed
@ravwojdyla

What happened?

We upgraded to xarray 2022.12.0 and noticed a significant performance regression (orders of magnitude) in code that involves groupby+map on a dataset with a MultiIndex. The regression appears to date from the 2022.6.0 release, which I understand included a number of changes, including to the groupby code paths (release notes).

What did you expect to happen?

groupby+map should perform comparably to xarray 2022.3.0, i.e. without the regression.

Minimal Complete Verifiable Example

import contextlib
import time
from collections.abc import Iterator

import numpy as np
import pandas as pd
import xarray as xr


@contextlib.contextmanager
def log_time(label: str) -> Iterator[None]:
    """Logs execution time of the context block"""
    t_0 = time.time()
    yield
    print(f"{label} took {time.time() - t_0} seconds")


def main() -> None:
    m = 100_000
    with log_time("creating df"):
        df = pd.DataFrame(
            {
                "i1": [1] * m + [2] * m + [3] * m + [4] * m,
                "i2": list(range(m)) * 4,
                "d3": np.random.randint(0, 2, 4 * m).astype(bool),
            }
        )

        ds = df.to_xarray().set_coords(["i1", "i2"]).set_index(index=["i1", "i2"])

    with log_time("groupby"):

        def per_grp(da: xr.DataArray) -> xr.DataArray:
            return da

        (ds.assign(x=lambda ds: ds["d3"].groupby("i1").map(per_grp)))


if __name__ == "__main__":
    main()
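
To see how the runtime grows with the group size `m`, the same workload can be timed at several sizes. The following is a minimal sketch, not part of the original report: `time_for_sizes` and `stand_in` are hypothetical names, and the stand-in workload is an assumption that lets the sketch run without xarray installed.

```python
import time
from collections.abc import Callable, Sequence


def time_for_sizes(
    workload: Callable[[int], object],
    sizes: Sequence[int],
) -> dict[int, float]:
    """Run workload(m) once per size and return elapsed seconds keyed by m."""
    timings: dict[int, float] = {}
    for m in sizes:
        start = time.perf_counter()  # monotonic clock, suited to measuring intervals
        workload(m)
        timings[m] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Hypothetical stand-in workload so the sketch runs on its own.
    # Replace it with a function that builds the dataset for a given m
    # and runs the groupby+map call from the example above.
    def stand_in(m: int) -> int:
        return sum(range(m))

    for m, seconds in time_for_sizes(stand_in, [1_000, 10_000, 100_000]).items():
        print(f"m={m}: {seconds:.6f} seconds")
```

Running this against both xarray versions with a few increasing values of `m` would show whether the slowdown is a constant factor or grows with the size of each group.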

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

xarray current main `2022.12.1.dev7+g021c73e1`, but this affects all versions since 2022.6.0 (inclusive).

> creating df took 0.10657930374145508 seconds
> groupby took 129.5521149635315 seconds

xarray 2022.3.0:

> creating df took 0.09968900680541992 seconds
> groupby took 0.19161295890808105 seconds
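
For scale, the ratio of the two groupby timings can be computed directly from the logs above. This is a small illustrative calculation with made-up variable names. The "creating df" step takes roughly 0.1 seconds in both runs, so the difference is confined to the groupby step.

```python
# Timings copied from the log output above (seconds).
groupby_main = 129.5521149635315        # xarray 2022.12.1.dev7+g021c73e1
groupby_2022_3_0 = 0.19161295890808105  # xarray 2022.3.0

slowdown = groupby_main / groupby_2022_3_0
print(f"groupby is about {slowdown:.0f}x slower on current main")
```

The ratio is roughly 676, which is close to three orders of magnitude.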

Anything else we need to know?

No response

Environment

Environment of the version installed from source (2022.12.1.dev7+g021c73e1):

INSTALLED VERSIONS

commit: None
python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.12.1.dev7+g021c73e1
pandas: 1.5.2
numpy: 1.23.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None
