Description
What happened:
I am trying to build a model-training pipeline with LogisticRegression and nested cross-validation. During pipeline execution I get an unexpected AttributeError:
Exception: AttributeError("'numpy.ndarray' object has no attribute 'chunks'")
What you expected to happen:
I was not expecting this, since I double-checked that all of the objects involved are dask arrays. A check along these lines is what I used to confirm it (standalone sketch; the make_classification parameters mirror the MWE below):
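import dask.array as da
from dask_ml.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, chunks=50, random_state=1)
for name, arr in (("X", X), ("y", y)):
    # Both inputs should be dask arrays with a defined chunk layout.
    assert isinstance(arr, da.Array), f"{name} is {type(arr)}, not a dask array"
    print(name, arr.chunks)

Both checks pass on my side, yet fit still raises. The following MWE shows what my full pipeline looks like.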
Minimal Complete Verifiable Example:
from typing import Tuple
from dask.array.core import Array
from sklearn.pipeline import Pipeline
from sklearn.base import is_classifier
from dask_ml.model_selection import GridSearchCV, KFold
# dask_ml's LogisticRegression (dask-glm based), not sklearn's
from dask_ml.linear_model import LogisticRegression
from dask_ml.datasets import make_classification as dask_make_classification
from dask.distributed import Client
import dask.array as da
import numpy as np
import joblib
from warnings import simplefilter

# ignore all future warnings
simplefilter(action="ignore", category=FutureWarning)
def fake_dataset() -> Tuple[Array, Array]:
    X, y = dask_make_classification(
        n_samples=1000,
        n_features=20,
        random_state=1,
        n_informative=10,
        n_redundant=10,
        chunks=1000 // 20,
    )
    return X, y
def train_model(X: Array, y: Array) -> None:
    n_outer_splits = 2
    n_inner_splits = 2
    param_grid = [
        {
            "classifier": [LogisticRegression()],
            "classifier__penalty": ["l1", "l2"],
            "classifier__C": np.logspace(-4, 4, 20),
            "classifier__solver": ["liblinear"],
        },
    ]
    # define the model
    pipeline = Pipeline([("classifier", LogisticRegression())])
    # XXX: check that it is a proper classification model
    try:
        if not is_classifier(pipeline["classifier"]):
            raise Exception("Not a valid classification algorithm")
    except Exception as e:
        print(f"Be aware of: {e}")
    # set up the nested cross-validation procedure
    cv_outer = KFold(n_splits=n_outer_splits, shuffle=True, random_state=1)
    # enumerate splits
    outer_results = list()
    for kth_fold, (train_ix, test_ix) in enumerate(cv_outer.split(X)):
        print(f"Running {kth_fold} Fold")
        # split data
        X_train, X_test = X[train_ix, :], X[test_ix, :]
        y_train, y_test = y[train_ix], y[test_ix]
        # set up inner cross-validation procedure
        cv_inner = KFold(n_splits=n_inner_splits, shuffle=True, random_state=1)
        # define search
        search = GridSearchCV(
            estimator=pipeline,
            param_grid=param_grid,
            scoring="accuracy",
            cv=cv_inner,
            refit=True,
        )
        with joblib.parallel_backend("dask"):
            result = search.fit(X_train, y_train)
    return None
if __name__ == "__main__":
    client = Client(
        processes=False, threads_per_worker=1, n_workers=4, memory_limit="10GB"
    )
    X, y = fake_dataset()
    train_model(X, y)
Anything else we need to know?:
Further debugging showed that the error comes from the fit operation; however, as far as I can tell, none of the objects passed to it are np.ndarray instances.
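To narrow things down, a minimal isolation step like the following should show whether fit alone triggers the error (hypothetical sketch: one hand-made fold standing in for an outer split, dask_ml's default solver, no GridSearchCV):

import numpy as np
from dask_ml.linear_model import LogisticRegression
from dask_ml.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, chunks=50, random_state=1)
train_ix = np.arange(500)  # plain numpy index array standing in for one fold
X_train, y_train = X[train_ix, :], y[train_ix]
print(type(X_train), X_train.chunks)  # fancy indexing should still return a dask array
LogisticRegression().fit(X_train, y_train)  # does fit alone raise the AttributeError?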
Environment:
- Dask version: 1.9.0
- Python version: 3.8
- Operating System: Ubuntu 20.04 LTS
- Install method (conda, pip, source): conda