From c991dce789c773a6f619ee4e3c7fe760dbe6d3bf Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Fri, 6 Dec 2019 19:53:49 +0100
Subject: [PATCH 1/4] initial writeup of the slep

---
 slep012/proposal.rst | 66 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)
 create mode 100644 slep012/proposal.rst

diff --git a/slep012/proposal.rst b/slep012/proposal.rst
new file mode 100644
index 0000000..bdde582
--- /dev/null
+++ b/slep012/proposal.rst
@@ -0,0 +1,66 @@
+.. _slep_012:
+
+==========
+InputArray
+==========
+
+This proposal suggests adding a new data structure, called ``InputArray``,
+which wraps a data matrix with some added information about the data. This was
+motivated when working on input and output feature names. Since we expect the
+feature names to be attached to the data given to an estimator, there are a few
+approaches we can take:
+
+- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
+  as a ``pandas.DataFrame``, and if so, the transformer would output a
+  ``pandas.DataFrame`` which also includes the [generated] feature names. This
+  is not a feasible solution since ``pandas`` plans to move to a per column
+  representation, which means ``pd.DataFrame(np.asarray(df))`` has two
+  guaranteed memory copies.
+- ``XArray``: we could accept a ``pandas.DataFrame``, and use
+  ``xarray.DataArray`` as the output of transformers, including feature names.
+  However, ``xarray`` depends on ``pandas``, and uses ``pandas.Series`` to
+  handle row labels and aligns rows when an operation between two
+  ``xarray.DataArray`` is done. None of these are favorable for our use-case.
+
+As a result, we need to have another data structure which we'll use to transfer
+data related information (such as feature names), which is lightweight and
+doesn't interfere with existing user code.
+
+A main constraint of this data structure is that it should be backward
+compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
+transformer, would not break. This SLEP focuses on *feature names* as the only
+meta-data attached to the data. Support for other meta-data can be added later.
+
+
+Feature Names
+*************
+
+Feature names are an array of strings aligned with the columns. They can be
+``None``.
+
+Operations
+**********
+
+All usual operations (including slicing through ``__getitem__``) return an
+``np.ndarray``. The ``__array__`` method also returns the underlying data, w/o
+any modifications. This prevents any unwanted computational overhead as a
+result of migrating to this data structure.
+
+The ``select()`` method will act like a ``__getitem__``, except that it
+understands feature names and it also returns an ``InputArray``, with the
+corresponding meta-data.
+
+Sparse Arrays
+*************
+
+All of the above applies to sparse arrays.
+
+Factory Methods
+***************
+
+There will be factory methods creating an ``InputArray`` given a
+``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray`` or
+an ``sp.SparseMatrix`` and a given set of feature names.
+
+An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a
+``toDataFrame()`` method.
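The behaviour described in this first draft can be made concrete with a small
sketch. None of the following is part of the proposal or an agreed-upon
implementation; in particular, whether ``InputArray`` would be a thin wrapper
as below or an ``np.ndarray`` subclass is left open, and the method names
simply mirror the draft above::

    import numpy as np
    import pandas as pd

    class InputArray:
        """Illustrative container: a data matrix plus feature names."""

        def __init__(self, data, feature_names=None):
            self._data = np.asarray(data)
            self.feature_names = (
                None if feature_names is None
                else np.asarray(feature_names, dtype=object))

        def __array__(self, dtype=None):
            # np.asarray(X) sees the underlying data, unmodified
            return self._data if dtype is None else self._data.astype(dtype)

        def __getitem__(self, key):
            # usual indexing/slicing drops the meta-data and returns an ndarray
            return self._data[key]

        def select(self, names):
            # name-aware column selection which keeps the meta-data
            columns = [list(self.feature_names).index(name) for name in names]
            return InputArray(self._data[:, columns], names)

        def toDataFrame(self):
            return pd.DataFrame(self._data, columns=self.feature_names)

With such a container, ``X[0]`` and ``np.asarray(X)`` behave exactly as they
would on the raw array, while ``X.select(['age']).feature_names`` keeps the
names around; a real implementation would still have to decide how much of the
rest of the ``ndarray`` API to expose.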
From 0b25e2f5e78ba0b7df6f53a2e0c77d2876134965 Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Sat, 7 Dec 2019 10:03:22 -0800
Subject: [PATCH 2/4] clarify on xarray

---
 slep012/proposal.rst | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/slep012/proposal.rst b/slep012/proposal.rst
index bdde582..6896483 100644
--- a/slep012/proposal.rst
+++ b/slep012/proposal.rst
@@ -18,9 +18,12 @@ approaches we can take:
   guaranteed memory copies.
 - ``XArray``: we could accept a ``pandas.DataFrame``, and use
   ``xarray.DataArray`` as the output of transformers, including feature names.
-  However, ``xarray`` depends on ``pandas``, and uses ``pandas.Series`` to
-  handle row labels and aligns rows when an operation between two
-  ``xarray.DataArray`` is done. None of these are favorable for our use-case.
+  However, ``xarray`` has a hard dependency on ``pandas``, and uses
+  ``pandas.Index`` to handle row labels and aligns rows when an operation
+  between two ``xarray.DataArray`` is done, which can be time consuming, and is
+  not the semantic expected in ``scikit-learn``; we only expect the number of
+  rows to be equal, and that the rows always correspond to one another in the
+  same order.
 
 As a result, we need to have another data structure which we'll use to transfer
 data related information (such as feature names), which is lightweight and

From ef37bab9c3e7c9bf95b087f2250a8d2db98b33ce Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Fri, 27 Dec 2019 13:16:45 -0800
Subject: [PATCH 3/4] address more comments

---
 slep012/proposal.rst | 124 +++++++++++++++++++++++++++++++------------
 1 file changed, 91 insertions(+), 33 deletions(-)

diff --git a/slep012/proposal.rst b/slep012/proposal.rst
index 6896483..b8dcb88 100644
--- a/slep012/proposal.rst
+++ b/slep012/proposal.rst
@@ -4,59 +4,74 @@ InputArray
 ==========
 
-This proposal suggests adding a new data structure, called ``InputArray``,
-which wraps a data matrix with some added information about the data. This was
-motivated when working on input and output feature names. Since we expect the
-feature names to be attached to the data given to an estimator, there are a few
-approaches we can take:
+This proposal presents a solution for propagating feature names through
+transformers, pipelines, and the column transformer. Ideally, we would have::
 
-- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
-  as a ``pandas.DataFrame``, and if so, the transformer would output a
-  ``pandas.DataFrame`` which also includes the [generated] feature names. This
-  is not a feasible solution since ``pandas`` plans to move to a per column
-  representation, which means ``pd.DataFrame(np.asarray(df))`` has two
-  guaranteed memory copies.
-- ``XArray``: we could accept a ``pandas.DataFrame``, and use
-  ``xarray.DataArray`` as the output of transformers, including feature names.
-  However, ``xarray`` has a hard dependency on ``pandas``, and uses
-  ``pandas.Index`` to handle row labels and aligns rows when an operation
-  between two ``xarray.DataArray`` is done, which can be time consuming, and is
-  not the semantic expected in ``scikit-learn``; we only expect the number of
-  rows to be equal, and that the rows always correspond to one another in the
-  same order.
+    df = pd.read_csv('tabular.csv')
+    # transforming the data in an arbitrary way
+    transformer0 = ColumnTransformer(...)
+    # a pipeline preprocessing the data and then a classifier (or a regressor)
+    clf = make_pipeline(transformer0, ..., SVC())
-As a result, we need to have another data structure which we'll use to transfer
-data related information (such as feature names), which is lightweight and
-doesn't interfere with existing user code.
+    # now we can investigate features at each stage of the pipeline
+    clf[-1].input_feature_names_
+
+The feature names are propagated throughout the pipeline and the user can
+investigate them at each step.
+
+This proposal suggests adding a new data structure, called ``InputArray``,
+which augments the data array ``X`` with additional meta-data. In this proposal
+we assume the feature names (and other potential meta-data) are attached to the
+data when passed to an estimator. Alternative solutions are discussed later in
+this document.
 
 A main constraint of this data structure is that it should be backward
 compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
 transformer, would not break. This SLEP focuses on *feature names* as the only
 meta-data attached to the data. Support for other meta-data can be added later.
 
+Backward/NumPy/Pandas Compatibility
+***********************************
+
+Since currently transformers return a ``numpy`` or a ``scipy`` array, backward
+compatibility in this context means the operations which are valid on those
+arrays should also be valid on the new data structure.
+
+All operations are delegated to the *data* part of the container; the
+meta-data is lost immediately after each operation, and operations result in a
+``numpy.ndarray``. This includes indexing and slicing, *i.e.* to avoid
+performance degradation, ``__getitem__`` is not overloaded, and if the user
+wishes to preserve the meta-data, they shall do so by explicitly calling a
+method such as ``select()``. Operations between two ``InputArray`` objects will
+not try to align rows and/or columns of the two given objects.
+
+``pandas`` compatibility would ideally come as ``pd.DataFrame(inputarray)``, for
+which ``pandas`` does not provide a clean API at the moment. Alternatively,
+``inputarray.todataframe()`` would return a ``pandas.DataFrame`` with the
+relevant meta-data attached.
 
 Feature Names
 *************
 
-Feature names are an array of strings aligned with the columns. They can be
-``None``.
+Feature names are an object ``ndarray`` of strings aligned with the columns.
+They can be ``None``.
 
 Operations
 **********
 
-All usual operations (including slicing through ``__getitem__``) return an
-``np.ndarray``. The ``__array__`` method also returns the underlying data, w/o
-any modifications. This prevents any unwanted computational overhead as a
-result of migrating to this data structure.
+Estimators understand the ``InputArray`` and extract the feature names from the
+given data before applying the operations and transformations on the data.
 
-The ``select()`` method will act like a ``__getitem__``, except that it
-understands feature names and it also returns an ``InputArray``, with the
-corresponding meta-data.
+All transformers return an ``InputArray`` with feature names attached to it.
+The way feature names are generated is discussed in *SLEP007 - The Style of The
+Feature Names*.
 
 Sparse Arrays
 *************
 
-All of the above applies to sparse arrays.
+Ideally sparse arrays follow the same pattern, but since ``scipy.sparse`` does
+not provide the kind of API provided by ``numpy``, we may need to find
+compromises.
 Factory Methods
 ***************
@@ -66,4 +81,47 @@ There will be factory methods creating an ``InputArray`` given a
 an ``sp.SparseMatrix`` and a given set of feature names.
 
 An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a
-``toDataFrame()`` method.
+``todataframe()`` method.
+
+``X`` being an ``InputArray``::
+
+    >>> np.array(X)
+    >>> X.todataframe()
+    >>> pd.DataFrame(X)  # only if pandas implements the API
+
+And given ``X`` a ``np.ndarray`` or an ``sp.sparse`` matrix and a set of
+feature names, one can construct the corresponding ``InputArray`` using::
+
+    >>> make_inputarray(X, feature_names)
+
+Alternative Solutions
+*********************
+
+Since we expect the feature names to be attached to the data given to an
+estimator, there are a few potential approaches we can take:
+
+- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
+  as a ``pandas.DataFrame``, and if so, the transformer would output a
+  ``pandas.DataFrame`` which also includes the [generated] feature names. This
+  is not a feasible solution since ``pandas`` plans to move to a per column
+  representation, which means ``pd.DataFrame(np.asarray(df))`` has two
+  guaranteed memory copies.
+- ``XArray``: we could accept a ``pandas.DataFrame``, and use
+  ``xarray.DataArray`` as the output of transformers, including feature names.
+  However, ``xarray`` has a hard dependency on ``pandas``, and uses
+  ``pandas.Index`` to handle row labels and aligns rows when an operation
+  between two ``xarray.DataArray`` is done, which can be time consuming, and is
+  not the semantic expected in ``scikit-learn``; we only expect the number of
+  rows to be equal, and that the rows always correspond to one another in the
+  same order.
+
+As a result, we need to have another data structure which we'll use to transfer
+data related information (such as feature names), which is lightweight and
+doesn't interfere with existing user code.
+
+Another alternative to the problem of passing meta-data around is to pass it
+as a parameter to ``fit``. This would involve heavily modifying meta-estimators,
+since they'd need to pass that information along and extract the relevant
+information from the estimators to pass it on to the next estimator. Our
+prototype implementations showed this to be significantly more challenging than
+when the meta-data is attached to the data.

From 48ce7f40175c37fbaa95ed36aec1cd640787bcea Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Tue, 18 Feb 2020 20:52:32 +0100
Subject: [PATCH 4/4] add headers

---
 slep012/proposal.rst | 8 ++++++++
 under_review.rst | 1 +
 2 files changed, 9 insertions(+)

diff --git a/slep012/proposal.rst b/slep012/proposal.rst
index b8dcb88..431bacc 100644
--- a/slep012/proposal.rst
+++ b/slep012/proposal.rst
@@ -4,6 +4,14 @@ InputArray
 ==========
 
+:Author: Adrin Jalali
+:Status: Draft
+:Type: Standards Track
+:Created: 2019-12-20
+
+Motivation
+**********
+
 This proposal presents a solution for propagating feature names through
 transformers, pipelines, and the column transformer. Ideally, we would have::

diff --git a/under_review.rst b/under_review.rst
index 51d9eab..ff52d4e 100644
--- a/under_review.rst
+++ b/under_review.rst
@@ -9,3 +9,4 @@ SLEPs under review
    :maxdepth: 1
 
    slep007/proposal
+   slep012/proposal
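To illustrate why attaching the meta-data to the data keeps meta-estimators
simple, a toy transformer written against this proposal could look roughly like
the following. This is only a sketch: it reuses the illustrative ``InputArray``
container shown after the first patch, and the class and attribute names are
hypothetical, not an existing scikit-learn API::

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    # ``InputArray`` below refers to the illustrative container sketched
    # earlier in this document, not to an existing class.

    class ToyScaler(TransformerMixin, BaseEstimator):
        """Scales columns; input and output feature names are identical."""

        def fit(self, X, y=None):
            # the estimator pulls the meta-data off the input itself ...
            self.input_feature_names_ = getattr(X, 'feature_names', None)
            # ... and works on the raw values from there on
            data = np.asarray(X)
            scale = data.std(axis=0)
            self.scale_ = np.where(scale == 0, 1.0, scale)
            return self

        def transform(self, X):
            data = np.asarray(X) / self.scale_
            # a one-to-one transformer passes the names through unchanged;
            # how generated names should look is the topic of SLEP007
            return InputArray(data, self.input_feature_names_)

Because the names travel with ``X`` itself, a pipeline or column transformer
would need no extra book-keeping to propagate them, which is the contrast with
the ``fit``-parameter alternative discussed above.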