Description
What happened?
After upgrading to xarray 2022.12.0, we noticed a significant performance regression (orders of magnitude) in code that uses groupby + map. The regression appears to have been present since the 2022.6.0 release, which I understand included a number of changes, including changes to the groupby code paths (see the release notes).
What did you expect to happen?
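No performance regression: the groupby + map call should run about as fast as it did in 2022.3.0 (a fraction of a second for the example below).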
Minimal Complete Verifiable Example
import contextlib
import os
import time
from collections.abc import Iterator

import numpy as np
import pandas as pd
import xarray as xr


@contextlib.contextmanager
def log_time(label: str) -> Iterator[None]:
    """Logs execution time of the context block"""
    t_0 = time.time()
    yield
    print(f"{label} took {time.time() - t_0} seconds")


def main() -> None:
    m = 100_000
    with log_time("creating df"):
        df = pd.DataFrame(
            {
                "i1": [1] * m + [2] * m + [3] * m + [4] * m,
                "i2": list(range(m)) * 4,
                "d3": np.random.randint(0, 2, 4 * m).astype(bool),
            }
        )
        ds = df.to_xarray().set_coords(["i1", "i2"]).set_index(index=["i1", "i2"])

    with log_time("groupby"):
        def per_grp(da: xr.DataArray) -> xr.DataArray:
            return da

        (ds.assign(x=lambda ds: ds["d3"].groupby("i1").map(per_grp)))


if __name__ == "__main__":
    main()
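One way to check whether the slowdown depends on `m` would be to time the same workload at several sizes. The sketch below is only a stdlib harness: `time_at_sizes` and `stand_in_workload` are hypothetical names that do not appear in the example above, and the stand-in workload is a placeholder, not the xarray call itself.

```python
import time
from collections.abc import Callable


def time_at_sizes(workload: Callable[[int], object], sizes: list[int]) -> dict[int, float]:
    """Run workload(m) once per size and return elapsed seconds keyed by m."""
    timings: dict[int, float] = {}
    for m in sizes:
        t_0 = time.perf_counter()  # monotonic clock, suited to measuring intervals
        workload(m)
        timings[m] = time.perf_counter() - t_0
    return timings


def stand_in_workload(m: int) -> int:
    # Placeholder only (hypothetical): replace the body with code that builds
    # `ds` for the given m and runs ds["d3"].groupby("i1").map(per_grp).
    return sum(range(m))


if __name__ == "__main__":
    for m, seconds in time_at_sizes(stand_in_workload, [1_000, 10_000, 100_000]).items():
        print(f"m={m}: {seconds:.6f} seconds")
```

Running this under both 2022.3.0 and a current version, with the placeholder replaced by the real groupby call, would show whether the gap widens as `m` grows.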
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
xarray current main (`2022.12.1.dev7+g021c73e1`); all versions since 2022.6.0 (inclusive) are affected:
> creating df took 0.10657930374145508 seconds
> groupby took 129.5521149635315 seconds
<hr>
xarray 2022.3.0:
> creating df took 0.09968900680541992 seconds
> groupby took 0.19161295890808105 seconds
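To put a number on the regression, the two groupby timings above can be compared directly. This is plain arithmetic on the logged values (variable names are mine); it works out to roughly 676 times slower, a little under three orders of magnitude.

```python
import math

# Timings copied from the log output above, in seconds.
groupby_main = 129.5521149635315        # xarray 2022.12.1.dev7+g021c73e1
groupby_2022_3_0 = 0.19161295890808105  # xarray 2022.3.0

slowdown = groupby_main / groupby_2022_3_0
orders_of_magnitude = math.log10(slowdown)

print(f"slowdown: {slowdown:.0f}x ({orders_of_magnitude:.1f} orders of magnitude)")
```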
Anything else we need to know?
No response
Environment
Environment of the version installed from source (`2022.12.1.dev7+g021c73e1`):
INSTALLED VERSIONS
commit: None
python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2022.12.1.dev7+g021c73e1
pandas: 1.5.2
numpy: 1.23.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None