Question: Guaranteed zero-copy round-trip from numpy? #27211

Closed
@amueller

This is to inform a scikit-learn design decision. I briefly talked with @jorisvandenbossche about it a while ago.

The question is whether we can rely on zero-copy wrapping and unwrapping of numpy arrays in pandas dataframes, i.e. is it future-proof to assume that something like

X = np.array(...)
X_df = pd.DataFrame(X)
X_again = np.asarray(X_df)

doesn't result in a copy of the data, so that X_again shares memory with X?
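
For reference, a minimal sketch of how the property in question could be checked on a given installation. The helper name `roundtrip_shares_memory` is made up for illustration. The result for the default constructor may differ between pandas versions and dtypes, so this only observes current behaviour and says nothing about future guarantees.

```python
import numpy as np
import pandas as pd


def roundtrip_shares_memory(X, **frame_kwargs):
    """Wrap X in a DataFrame, unwrap it, and report whether memory is shared.

    Hypothetical helper for illustration; extra keyword arguments are
    passed through to the pd.DataFrame constructor.
    """
    X_df = pd.DataFrame(X, **frame_kwargs)
    X_again = np.asarray(X_df)
    return bool(np.shares_memory(X, X_again))


# A single-dtype 2-D array, the case sklearn cares about most.
X = np.arange(12, dtype=np.float64).reshape(4, 3)

# Default constructor: the answer depends on the installed pandas version.
print(pd.__version__, roundtrip_shares_memory(X))

# Explicitly requesting a copy at construction breaks sharing.
print(roundtrip_shares_memory(X, copy=True))
```

Running the first check in CI against several pandas versions would be one way to notice if the behaviour changes.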

Context: We want to attach some metadata to our numpy arrays; in particular, I'm interested in column names. Pandas is an obvious candidate for that, but core sklearn works on numpy arrays.
So if we want to use pandas, we need to make sure that wrapping and unwrapping adds no overhead.
This design decision would be very hard to undo, so I want to make sure that it's reasonably future-proof.

@jorisvandenbossche mentioned that there were thoughts about making pandas a column store, which sounds like it would break the zero-copy requirement.
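
To illustrate the concern, here is a sketch that mimics per-column storage with plain NumPy. It is not how pandas is or would be implemented; the list `columns` is just a stand-in for a column store, where each column owns its own 1-D buffer.

```python
import numpy as np

X = np.arange(12, dtype=np.float64).reshape(4, 3)

# Stand-in for a column store: every column lives in a separate 1-D buffer.
columns = [X[:, j].copy() for j in range(X.shape[1])]

# Rebuilding a 2-D array has to gather the columns into newly allocated memory.
X_again = np.column_stack(columns)

# The rebuilt array shares memory with none of the stored columns.
shares_any = any(np.shares_memory(X_again, col) for col in columns)
print(shares_any)
```

With separate buffers per column, a single 2-D array can only be assembled by copying, which is why such a layout would conflict with the zero-copy assumption above.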

Thanks!
