AttributeError when using dask_ml.model_selection.kfold object  #849

Open
@makquel

What happened:
I'm trying to build a model-training pipeline that uses LogisticRegression with nested cross-validation. Executing the pipeline raises an unexpected AttributeError:

Exception: AttributeError("'numpy.ndarray' object has no attribute 'chunks'")
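
Only the exception message is shown above. As a minimal sketch of how the full traceback could be captured (so the frame that touches `.chunks` is visible), the fit call could be wrapped as below. `run_and_capture` is a hypothetical helper name, not part of dask-ml or joblib, and it uses only the standard library.

```python
import traceback


def run_and_capture(fn, *args, **kwargs):
    """Call fn and return (result, None), or (None, traceback text) on failure.

    Hypothetical helper for illustration only.
    """
    try:
        return fn(*args, **kwargs), None
    except Exception:
        # format_exc() returns the full traceback of the active exception
        return None, traceback.format_exc()


if __name__ == "__main__":
    # Stand-in for search.fit(X_train, y_train): a plain list has no `.chunks`
    result, tb = run_and_capture(lambda: [].chunks)
    print(tb)
```

In the MWE this would be used as `result, tb = run_and_capture(search.fit, X_train, y_train)`, printing `tb` when it is not None.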

What you expected to happen:
I did not expect this error, since I double-checked that all of the objects are dask arrays. The following MWE shows what my pipeline looks like.

Minimal Complete Verifiable Example:

from typing import Tuple, Any
from dask.array.core import Array
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from dask_ml.model_selection import GridSearchCV, KFold
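# NOTE: this import shadows the sklearn LogisticRegression imported above,
# so every LogisticRegression() below is the dask_ml estimator
from dask_ml.linear_model import LogisticRegression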
from dask_ml.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.base import is_classifier
from dask.distributed import Client, progress
import dask.array as da

import numpy as np
import joblib
from dask_ml.datasets import make_classification as dask_make_classification

# import warnings filter
from warnings import simplefilter

# ignore all future warnings
simplefilter(action="ignore", category=FutureWarning)


def fake_dataset() -> Tuple[Array, Array]:
    X, y = dask_make_classification(
        n_samples=1000,
        n_features=20,
        random_state=1,
        n_informative=10,
        n_redundant=10,
        chunks=1000 // 20,
    )
    return X, y


def train_model(X: Array, y: Array) -> None:
    n_outer_splits = 2
    n_inner_splits = 2
    param_grid = [
        {
            "classifier": [LogisticRegression()],
            "classifier__penalty": ["l1", "l2"],
            "classifier__C": np.logspace(-4, 4, 20),
            "classifier__solver": ["liblinear"],
        },
    ]
    # define the model
    pipeline = Pipeline([("classifier", LogisticRegression())])
    # XXX: check that is a proper model
    try:
        if not is_classifier(pipeline["classifier"]):
            raise Exception("Not valid classification algorithm")
    except Exception as e:
        print(f"Be aware of: {e}")
    finally:
        pass
    # set-up the nested cross-validation procedure
    cv_outer = KFold(n_splits=n_outer_splits, shuffle=True, random_state=1)
    # enumerate splits
    outer_results = list()
    for kth_fold, (train_ix, test_ix) in enumerate(cv_outer.split(X)):
        print(f"Running {kth_fold} Fold")
        # split data
        X_train, X_test = X[train_ix, :], X[test_ix, :]
        y_train, y_test = y[train_ix], y[test_ix]
        # setup inner cross-validation procedure
        cv_inner = KFold(n_splits=n_inner_splits, shuffle=True, random_state=1)

        # define search
        search = GridSearchCV(
            estimator=pipeline,
            param_grid=param_grid,
            scoring="accuracy",
            cv=cv_inner,
            refit=True,
        )
        with joblib.parallel_backend("dask"):
            result = search.fit(X_train, y_train)

    return None


if __name__ == "__main__":
    client = Client(
        processes=False, threads_per_worker=1, n_workers=4, memory_limit="10GB"
    )

    X, y = fake_dataset()
    train_model(X, y)

Anything else we need to know?:
Further debugging showed that the error comes from the fit operation; however, as far as I can tell, none of the objects involved are np.ndarray objects.
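
A minimal sketch of how the array types could be checked just before the fit call, to confirm which objects actually expose a `.chunks` attribute. `describe_array` is a hypothetical helper name introduced here for illustration; it relies only on duck typing and does not import dask.

```python
def describe_array(name, obj):
    """Summarize an object's type and whether it exposes `.chunks`.

    Hypothetical diagnostic helper; dask arrays have a `.chunks`
    attribute, while plain numpy arrays do not.
    """
    kind = f"{type(obj).__module__}.{type(obj).__qualname__}"
    return f"{name}: {kind}, has .chunks={hasattr(obj, 'chunks')}"


if __name__ == "__main__":
    # A plain list stands in for a non-dask object here
    print(describe_array("example", [1, 2, 3]))
```

Inside the outer loop of `train_model`, this could be called on `train_ix`, `test_ix`, `X_train`, and `y_train`, for example `print(describe_array("X_train", X_train))`. It only inspects the objects passed to `fit`; it cannot show conversions that happen inside the search itself.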

Environment:

  • Dask version: 1.9.0
  • Python version: 3.8
  • Operating System: Ubuntu 20.04 LTS
  • Install method (conda, pip, source): conda
