Description
This is for informing a scikit-learn design decision; I briefly talked with @jorisvandenbossche about this a while ago.
The question is whether we can rely on zero-copy wrapping and unwrapping of numpy arrays into pandas DataFrames, i.e. is it future-proof to assume that something like

```python
X = np.array(...)
X_df = pd.DataFrame(X)
X_again = np.asarray(X_df)
```

doesn't copy the data, so that `X_again` shares the memory of `X`?
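For reference, one way to check this empirically (a sketch, not from the issue itself; whether memory is actually shared depends on the pandas version and its copy settings) is NumPy's `np.shares_memory`:

```python
import numpy as np
import pandas as pd

# Hypothetical round-trip check: wrap a homogeneous 2D array in a
# DataFrame and unwrap it again.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
X_df = pd.DataFrame(X)
X_again = np.asarray(X_df)

# The values always round-trip; whether the memory is shared (i.e. the
# round-trip was zero-copy) depends on the pandas version/configuration.
assert np.array_equal(X, X_again)
print("zero-copy round-trip:", np.shares_memory(X, X_again))
```

On classic (pre-copy-on-write) pandas with a single-dtype array this has historically printed `True`, which is exactly the behavior the question asks about guaranteeing going forward.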
Context: we want to attach some metadata to our numpy arrays; in particular, I'm interested in column names. Pandas is an obvious candidate for that, but core scikit-learn works on numpy arrays. So if we want to use pandas, we need to make sure there's no overhead in wrapping and unwrapping. This is also a design decision that would be very hard to undo, so I want to make sure it's reasonably future-proof.
@jorisvandenbossche had mentioned that there have been thoughts about turning pandas into a column store, which sounds like it would break the zero-copy requirement.
Thanks!