make_classification_df date functionality has single repeated value #845

@ScottMGustafson

What happened:
When running dask_ml.datasets.make_classification_df with a date range, the generated date column contains only a single value, repeated so that the column has length chunks (not n_samples as expected).

This all seems to be related to this line here, which creates an array of length len(X_df) from a single value generated by datasets.random_date.

What you expected to happen:
We should be seeing:

  • many randomly selected dates from within the range given by dates
  • the date column should be of length n_samples, not chunks
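
For illustration, the expected behaviour can be sketched outside of dask-ml. This is only a sketch, not the library's implementation: `random_dates` is a hypothetical helper (it is not part of dask_ml), and it assumes dates are drawn uniformly at the day level, inclusive of both endpoints. The point is that one offset is drawn per row rather than one offset being reused for every row.

```python
from datetime import date, timedelta

import numpy as np


def random_dates(start, end, size, random_state=None):
    """Hypothetical helper: draw `size` dates uniformly from [start, end]."""
    rng = np.random.RandomState(random_state)
    n_days = (end - start).days
    # One independent day offset per row, instead of a single offset
    # repeated for the whole column.
    offsets = rng.randint(0, n_days + 1, size=size)
    return np.array([start + timedelta(days=int(d)) for d in offsets])


dates = random_dates(date(2020, 1, 1), date(2021, 1, 1), size=100, random_state=123)
print(len(dates), len(set(dates)))  # length is n_samples; more than one distinct date
```

A column produced this way has one entry per sample and many distinct values, which is what the two points above describe.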

Minimal Complete Verifiable Example:

from dask_ml.datasets import make_classification_df
from datetime import date

X, y = make_classification_df(
    n_samples=100,
    n_features=5,
    random_state=123,
    chunks=10,
    dates=(date(2020, 1, 1), date(2021, 1, 1)),
)

X["date"].compute().value_counts()

returns

2020-02-27    10
Name: date, dtype: int64

Anything else we need to know?:

Environment:

  • Dask version: 2021.7.1
  • Python version: 3.9.6
  • Operating System: linux (ubuntu)
  • Install method (conda, pip, source): pip
