This repository was archived by the owner on Apr 10, 2024. It is now read-only.

Improving groupby-apply microperformance #8

Open
@wesm

Description


Consider the case of a DataFrame with a large number of distinct groups:

import numpy as np
import pandas as pd

arr = np.random.randn(1000000)
df = pd.DataFrame({'group': arr.astype('str').repeat(2)})
df['values'] = np.random.randn(len(df))
df.groupby('group').apply(lambda g: len(g))

I get:

In [17]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 6.45 s, sys: 68 ms, total: 6.52 s
Wall time: 6.51 s

The per-group overhead is fairly fixed -- with 5 million groups we have:

In [22]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 31 s, sys: 108 ms, total: 31.1 s
Wall time: 31.1 s

It would be interesting to see whether, by pushing down the disassembly and reassembly of the per-group DataFrame objects into C++, we can take the per-group overhead from the current ~6 microseconds (roughly 31 s across 5 million groups) down to under a microsecond, or even less.
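
For a rough sense of how much of this time is pure per-group plumbing, one can bypass the DataFrame splitting entirely and hand the user function raw NumPy slices derived from the existing grouper. The sketch below is illustrative only, not a proposed API, and the helper name apply_on_raw_slices is made up:

import numpy as np
import pandas as pd

def apply_on_raw_slices(df, key, values, func):
    # Hypothetical fast path: reuse the grouper's row positions, but hand func
    # a plain NumPy slice instead of a freshly assembled per-group DataFrame,
    # skipping the disassembly-reassembly cost that dominates with many small
    # groups.
    indices = df.groupby(key).indices      # dict: group label -> row positions
    vals = df[values].to_numpy()
    return {label: func(vals[pos]) for label, pos in indices.items()}

# Smaller-scale usage mirroring the benchmark above:
arr = np.random.randn(100000)
df = pd.DataFrame({'group': arr.astype('str').repeat(2)})
df['values'] = np.random.randn(len(df))

slow = df.groupby('group').apply(lambda g: len(g))      # per-group DataFrames
fast = apply_on_raw_slices(df, 'group', 'values', len)  # per-group ndarrays

The gap between the two paths is essentially the per-group DataFrame construction cost that a C++ pushdown would aim to eliminate.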

Note that the effects of bad memory locality are also a factor. We could look into tricks like using a background thread that "prepares" groups (up to a certain size / buffer threshold) while user apply functions are executing, to at least hide part of the group-preparation time in the overall groupby evaluation.
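
A minimal sketch of that background-preparation idea, assuming a producer thread that materializes per-group DataFrames into a bounded queue while the consumer runs the user function (the function name, the buffer size, and the use of the standard threading and queue modules are all assumptions here, not a worked-out design):

import queue
import threading

def prefetching_apply(df, key, func, buffer_size=64):
    # Producer thread: split the frame into per-group DataFrames ahead of the
    # consumer, keeping at most buffer_size prepared groups in flight.
    # _SENTINEL marks the end of the group stream.
    _SENTINEL = object()
    q = queue.Queue(maxsize=buffer_size)
    grouped = df.groupby(key)

    def producer():
        for label, group in grouped:
            q.put((label, group))
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    results = {}
    while True:
        item = q.get()
        if item is _SENTINEL:
            break
        label, group = item
        results[label] = func(group)
    return results

# Usage, with the same call shape as groupby-apply:
# results = prefetching_apply(df, 'group', lambda g: len(g))

Because both the producer and the user function hold the GIL in pure Python, the overlap achievable this way is limited; the real win would come from doing the group preparation in native code, as suggested above.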
