BUG: GH45912 breaks important functionality in rolling apply #47494

@DRudel

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000', '3/13/2000', '3/13/2000'])
my_df = pd.DataFrame({
    'date': s,
    'values': [10, 3, 11, 12, 3]
}, index=list(range(len(s))))

my_df['date'] = pd.to_datetime(my_df['date'])

def my_custom_function(ser: pd.Series):
    print(ser.index)
    return ser.index[0]

results = my_df.rolling('2d', on='date').apply(my_custom_function)
print(results)

Issue Description

Starting in 1.4.1, the code above raises a "No numeric types to aggregate" error. An undocumented change was made to rolling: passing the on=col parameter to rolling() now causes a later apply() to re-index to the col column, so the Series sent to the function carries the on column's values as its index instead of the DataFrame's original index.

The specific code doing this re-indexing is in RollingAndExpandingMixin:

        def apply_func(values, begin, end, min_periods, raw=raw):
            if not raw:
                # GH 45912
                values = Series(values, index=self._on)
            return window_func(values, begin, end, min_periods)

        return apply_func

This loses important functionality. Currently rolling().apply() can only process one column at a time, so if you want to recover what the windows were in order to do some operation spanning all columns, you need to record the index values of the windows passed in [using a closure or global variable] and use those index values to reconstruct the windows later. Because the on column may contain repeated values, that can mean creating an integer "primary key" index to use as the index and passing on=my_date_column to specify the windowing.

The current code assumes that the user wants to use the on column for indexing, but if that were the case, the user could have simply re-indexed in the calling code.
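For readers who need the window boundaries in the meantime, here is a minimal sketch of one possible workaround, not an official recommendation. It relies on the fact that the re-indexing quoted above only happens when raw is False, so with raw=True the function receives a plain NumPy array. The pos column is a hypothetical helper added for illustration, and the reconstruction step assumes the default closed="right", non-centered window, where each window ends at the current row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["3/11/2000", "3/12/2000", "3/13/2000", "3/13/2000", "3/13/2000"]
    ),
    "values": [10, 3, 11, 12, 3],
})

# Hypothetical helper column: the integer position of each row.
df["pos"] = np.arange(len(df))

# raw=True passes a NumPy array to the function, so no index is attached
# and the first element is simply the position of the window's first row.
starts = (
    df.rolling("2d", on="date")["pos"]
    .apply(lambda a: a[0], raw=True)
    .astype(int)
)
print(starts.tolist())  # [0, 0, 1, 1, 1]

# Assumption: default closed="right" and no centering, so the window for
# row i runs from its start position through row i inclusive.
windows = [df.iloc[s : i + 1] for i, s in enumerate(starts)]

# Example of an operation spanning several columns: for each window,
# the date on which "values" was largest.
date_of_max = [w.loc[w["values"].idxmax(), "date"] for w in windows]
```

This recovers the same window starts that the 1.3.1 output shows, without depending on which index the Series passed to apply() carries.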

Expected Behavior

In 1.3.1 the code above produces the expected behavior: a 2-column table with a numeric index, where the final column holds the first index value of each of the 5 windows [0, 0, 1, 1, 1]


        date  values
0 2000-03-11     0.0
1 2000-03-12     0.0
2 2000-03-13     1.0
3 2000-03-13     1.0
4 2000-03-13     1.0


Installed Versions

1.4.3


Labels: Apply, Docs, Window
