Open
Description
What happened:
When running dask_ml.datasets.make_classification_df with a date range, the generated date column only has a single value, duplicated to length chunks (not n_samples as expected).
This seems to be related to this line here, which creates an array of length len(X_df) from a single value generated by datasets.random_date.
What you expected to happen:
We should be seeing:
- many randomly selected dates from within the date range dates
- the date column should be of length n_samples, not chunks
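To make the difference concrete, here is a minimal sketch in plain NumPy. It does not use or reproduce the dask_ml internals: random_dates is a hypothetical helper written only for this illustration, and the "suspected pattern" is my reading of the line mentioned above, not a confirmed description of the library code.

```python
from datetime import date, timedelta

import numpy as np


def random_dates(start, end, size, seed=None):
    """Hypothetical helper: draw `size` dates uniformly from [start, end]."""
    rng = np.random.default_rng(seed)
    span_days = (end - start).days
    # `high` is exclusive, so add 1 to make `end` itself reachable.
    offsets = rng.integers(0, span_days + 1, size=size)
    return [start + timedelta(days=int(d)) for d in offsets]


start, end = date(2020, 1, 1), date(2021, 1, 1)
n_samples = 100

# Suspected current pattern: one random date, repeated for every row.
single = random_dates(start, end, size=1, seed=123)[0]
repeated = [single] * n_samples

# Expected pattern: an independent draw for each row.
per_row = random_dates(start, end, size=n_samples, seed=123)

print("distinct dates when repeating one draw:", len(set(repeated)))
print("distinct dates when drawing per row:", len(set(per_row)))
```

The first count is always 1, which matches the single value seen in value_counts() in the example below; the second should be well above 1 for any reasonable date range.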
Minimal Complete Verifiable Example:
from dask_ml.datasets import make_classification_df
from datetime import date

X, y = make_classification_df(
    n_samples=100,
    n_features=5,
    random_state=123,
    chunks=10,
    dates=(date(2020, 1, 1), date(2021, 1, 1)),
)
X["date"].compute().value_counts()
returns
2020-02-27 10
Name: date, dtype: int64
Anything else we need to know?:
Environment:
- Dask version: 2021.7.1
- Python version: 3.9.6
- Operating System: linux (ubuntu)
- Install method (conda, pip, source): pip