Skip to content
This repository was archived by the owner on Feb 1, 2022. It is now read-only.
This repository was archived by the owner on Feb 1, 2022. It is now read-only.

A potential refinement on document #123

@0as1s

Description

@0as1s

When I started to deploy xgboost-operator on my kubeflow cluster, I referred to https://github.com/kubeflow/xgboost-operator/blob/master/config/samples/xgboost-dist/utils.py#L47 to implement my own version to read my own data. It's very common I follow this function to read parts of the whole data according to the rank manually.

However, I found that dmatrix already has an internal logic to only read parts of data when it detects distributed mode. Then my manual data reading causes each rank to only read 1/N*N instead of 1/N data.

I think it could be better if adding a comment in that function to guide the users to rewrite it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions