MondrianCP can't handle Pandas dataframe #526

@lennartvandeguchte

Describe the bug

With the new MondrianCP class I cannot fit my estimator on a pandas DataFrame, although the same data works fine with the standard MapieRegressor. My estimator is a sklearn pipeline whose column transformers select columns by their pandas column names, so converting the data to a NumPy array first is not an option: sklearn then raises an error when fitting the estimator.

To Reproduce
The code below reproduces the problem.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor, ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from mapie.regression import MapieRegressor
from mapie.mondrian import MondrianCP
from lightgbm import LGBMRegressor
import pandas as pd
from sklearn.model_selection import train_test_split

# Create some dummy data
data = pd.DataFrame(np.random.rand(100, 5), columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
data['categorical_feature'] = np.random.choice(['A', 'B', 'C'], size=100)
y = pd.Series(np.random.rand(100))

# Create bins for the partition
data['BIN'] = pd.cut(y, bins=3, labels=[1, 2, 3])

# Split the data into a train and calibration set
data_train, data_calib, y_train, y_calib = train_test_split(data, y, test_size=0.2, random_state=42)

model = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_child_samples=10,
    num_leaves=31,
    random_state=42
)

ct = ColumnTransformer([
    ("site", OneHotEncoder(), ['categorical_feature']),
    ("features", RobustScaler(), ['feature1', 'feature2', 'feature3', 'feature4', 'feature5']),
    ])
estimators = [('transformers', ct), ('model', model)]
pre_pipe = Pipeline(estimators)
pipe = TransformedTargetRegressor(regressor=pre_pipe, transformer=RobustScaler())
pipe.fit(data_train, y_train)

# "mondrian" fits on the DataFrame, "mondrian_numpy" on a NumPy array,
# and any other value falls back to the plain MapieRegressor.
strategy = "mondrian"
if strategy == "mondrian":
    mapie_regressor = MondrianCP(MapieRegressor(pipe, cv='prefit'))
    mapie_regressor.fit(data_calib, y_calib, partition=data_calib['BIN'])
elif strategy == "mondrian_numpy":
    mapie_regressor = MondrianCP(MapieRegressor(pipe, cv='prefit'))
    mapie_regressor.fit(data_calib.to_numpy(), y_calib, partition=data_calib['BIN'])
else:
    mapie_regressor = MapieRegressor(estimator=pipe, cv='prefit')
    mapie_regressor = mapie_regressor.fit(data_calib, y_calib)

Setting strategy to "mondrian_numpy" reproduces the sklearn error I get when the data is converted to a NumPy array first.
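To illustrate why converting to NumPy is not a workaround here, below is a minimal sketch that leaves MAPIE out entirely. It assumes only scikit-learn, pandas and NumPy; the data and the names `df`, `transformed` and `numpy_error` are made up for this illustration. It shows a ColumnTransformer with name-based column selectors working on a DataFrame and failing on the equivalent array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Small illustrative dataset (not the data from the report above)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 2)), columns=["feature1", "feature2"])
df["categorical_feature"] = rng.choice(["A", "B", "C"], size=20)

# Columns are selected by name, as in the pipeline above
ct = ColumnTransformer([
    ("site", OneHotEncoder(), ["categorical_feature"]),
    ("features", RobustScaler(), ["feature1", "feature2"]),
])

# Works: the DataFrame carries the column names the selectors refer to
transformed = ct.fit_transform(df)

# Fails: a NumPy array has no column names, so the string selectors
# cannot be resolved and scikit-learn raises a ValueError
try:
    ct.fit_transform(df.to_numpy())
    numpy_error = None
except ValueError as exc:
    numpy_error = exc

print(type(numpy_error).__name__, numpy_error)
```

The exact error message depends on the scikit-learn version, and the traceback inside the full reproduction may look different, but the cause should be the same: the column names are lost once the data becomes an array.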

Expected behavior
MondrianCP should accept a pandas DataFrame as input data, as MapieRegressor already does.

Labels: Backlog (in the MAPIE team development backlog, yet to be prioritised), Regression (related to regression, excluding time series)
