This repository was archived by the owner on Apr 10, 2024. It is now read-only.

Improving groupby-apply microperformance #8

Open
@wesm

Description


Consider the case of a DataFrame with a large number of distinct groups:

import numpy as np
import pandas as pd

arr = np.random.randn(1000000)
df = pd.DataFrame({'group': arr.astype('str').repeat(2)})
df['values'] = np.random.randn(len(df))
df.groupby('group').apply(lambda g: len(g))

I get:

In [17]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 6.45 s, sys: 68 ms, total: 6.52 s
Wall time: 6.51 s

The per-group overhead is fairly fixed -- with 5 million groups we have:

In [22]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 31 s, sys: 108 ms, total: 31.1 s
Wall time: 31.1 s

It would be interesting to see whether, by pushing down the disassembly and reassembly of the per-group DataFrame objects into C++, we can take the per-group overhead from the current ~6 microseconds (roughly 31 s across 5 million groups) down to under a microsecond, or even less.
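
For a rough sense of how much of this time is pure per-group plumbing, one can bypass the DataFrame splitting entirely and hand the user function raw NumPy slices derived from the existing grouper. The sketch below is illustrative only, not a proposed API, and the helper name apply_on_raw_slices is made up:

import numpy as np
import pandas as pd

def apply_on_raw_slices(df, key, values, func):
    # Hypothetical fast path: reuse the grouper's row positions, but hand func
    # a plain NumPy slice instead of a freshly assembled per-group DataFrame,
    # skipping the disassembly-reassembly cost that dominates with many small
    # groups.
    indices = df.groupby(key).indices      # dict: group label -> row positions
    vals = df[values].to_numpy()
    return {label: func(vals[pos]) for label, pos in indices.items()}

# Smaller-scale usage mirroring the benchmark above:
arr = np.random.randn(100000)
df = pd.DataFrame({'group': arr.astype('str').repeat(2)})
df['values'] = np.random.randn(len(df))

slow = df.groupby('group').apply(lambda g: len(g))      # per-group DataFrames
fast = apply_on_raw_slices(df, 'group', 'values', len)  # per-group ndarrays

The gap between the two paths is essentially the per-group DataFrame construction cost that a C++ pushdown would aim to eliminate.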

Note that the effects of bad memory locality are also a factor. We could look into tricks like using a background thread that "prepares" groups (up to a certain size / buffer threshold) while user apply functions are executing, to at least hide part of the group-preparation time in the overall groupby evaluation.
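
A minimal sketch of that background-preparation idea, assuming a producer thread that materializes per-group DataFrames into a bounded queue while the consumer runs the user function (the function name, the buffer size, and the use of the standard threading and queue modules are all assumptions here, not a worked-out design):

import queue
import threading

def prefetching_apply(df, key, func, buffer_size=64):
    # Producer thread: split the frame into per-group DataFrames ahead of the
    # consumer, keeping at most buffer_size prepared groups in flight.
    # _SENTINEL marks the end of the group stream.
    _SENTINEL = object()
    q = queue.Queue(maxsize=buffer_size)
    grouped = df.groupby(key)

    def producer():
        for label, group in grouped:
            q.put((label, group))
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    results = {}
    while True:
        item = q.get()
        if item is _SENTINEL:
            break
        label, group = item
        results[label] = func(group)
    return results

# Usage, with the same call shape as groupby-apply:
# results = prefetching_apply(df, 'group', lambda g: len(g))

Because both the producer and the user function hold the GIL in pure Python, the overlap achievable this way is limited; the real win would come from doing the group preparation in native code, as suggested above.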
