From c991dce789c773a6f619ee4e3c7fe760dbe6d3bf Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Fri, 6 Dec 2019 19:53:49 +0100
Subject: [PATCH 1/4] initial writeup of the slep

---
 slep012/proposal.rst | 66 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)
 create mode 100644 slep012/proposal.rst

diff --git a/slep012/proposal.rst b/slep012/proposal.rst
new file mode 100644
index 0000000..bdde582
--- /dev/null
+++ b/slep012/proposal.rst
@@ -0,0 +1,66 @@
+.. _slep_012:
+
+==========
+InputArray
+==========
+
+This proposal suggests adding a new data structure, called ``InputArray``,
+which wraps a data matrix with some added information about the data. This was
+motivated when working on input and output feature names. Since we expect the
+feature names to be attached to the data given to an estimator, there are a few
+approaches we can take:
+
+- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
+  as a ``pandas.DataFrame``, and if so, the transformer would output a
+  ``pandas.DataFrame`` which also includes the [generated] feature names. This
+  is not a feasible solution since ``pandas`` plans to move to a per column
+  representation, which means ``pd.DataFrame(np.asarray(df))`` has two
+  guaranteed memory copies.
+- ``XArray``: we could accept a ``pandas.DataFrame``, and use
+  ``xarray.DataArray`` as the output of transformers, including feature names.
+  However, ``xarray`` depends on ``pandas``, and uses ``pandas.Series`` to
+  handle row labels and aligns rows when an operation between two
+  ``xarray.DataArray`` is done. None of these are favorable for our use-case.
+
+As a result, we need to have another data structure which we'll use to transfer
+data related information (such as feature names), which is lightweight and
+doesn't interfere with existing user code.
+
+A main constraint of this data structure is that it should be backward
+compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
+transformer, would not break. This SLEP focuses on *feature names* as the only
+meta-data attached to the data. Support for other meta-data can be added later.
+
+
+Feature Names
+*************
+
+Feature names are an array of strings aligned with the columns. They can be
+``None``.
+
+Operations
+**********
+
+All usual operations (including slicing through ``__getitem__``) return an
+``np.ndarray``. The ``__array__`` method also returns the underlying data, w/o
+any modifications. This prevents any unwanted computational overhead as a
+result of migrating to this data structure.
+
+The ``select()`` method will act like a ``__getitem__``, except that it
+understands feature names and it also returns an ``InputArray``, with the
+corresponding meta-data.
+
+Sparse Arrays
+*************
+
+All of the above applies to sparse arrays.
+
+Factory Methods
+***************
+
+There will be factory methods creating an ``InputArray`` given a
+``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray`` or
+an ``sp.SparseMatrix`` and a given set of feature names.
+
+An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a
+``toDataFrame()`` method.
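The behaviour described in this first draft can be made concrete with a small
sketch. None of the following is part of the proposal or an agreed-upon
implementation; in particular, whether ``InputArray`` would be a thin wrapper
as below or an ``np.ndarray`` subclass is left open, and the method names
simply mirror the draft above::

    import numpy as np
    import pandas as pd

    class InputArray:
        """Illustrative container: a data matrix plus feature names."""

        def __init__(self, data, feature_names=None):
            self._data = np.asarray(data)
            self.feature_names = (
                None if feature_names is None
                else np.asarray(feature_names, dtype=object))

        def __array__(self, dtype=None):
            # np.asarray(X) sees the underlying data, unmodified
            return self._data if dtype is None else self._data.astype(dtype)

        def __getitem__(self, key):
            # usual indexing/slicing drops the meta-data and returns an ndarray
            return self._data[key]

        def select(self, names):
            # name-aware column selection which keeps the meta-data
            columns = [list(self.feature_names).index(name) for name in names]
            return InputArray(self._data[:, columns], names)

        def toDataFrame(self):
            return pd.DataFrame(self._data, columns=self.feature_names)

With such a container, ``X[0]`` and ``np.asarray(X)`` behave exactly as they
would on the raw array, while ``X.select(['age']).feature_names`` keeps the
names around; a real implementation would still have to decide how much of the
rest of the ``ndarray`` API to expose.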
From 0b25e2f5e78ba0b7df6f53a2e0c77d2876134965 Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Sat, 7 Dec 2019 10:03:22 -0800
Subject: [PATCH 2/4] clarify on xarray

---
 slep012/proposal.rst | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/slep012/proposal.rst b/slep012/proposal.rst
index bdde582..6896483 100644
--- a/slep012/proposal.rst
+++ b/slep012/proposal.rst
@@ -18,9 +18,12 @@ approaches we can take:
   guaranteed memory copies.
 - ``XArray``: we could accept a ``pandas.DataFrame``, and use
   ``xarray.DataArray`` as the output of transformers, including feature names.
-  However, ``xarray`` depends on ``pandas``, and uses ``pandas.Series`` to
-  handle row labels and aligns rows when an operation between two
-  ``xarray.DataArray`` is done. None of these are favorable for our use-case.
+  However, ``xarray`` has a hard dependency on ``pandas``, and uses
+  ``pandas.Index`` to handle row labels and aligns rows when an operation
+  between two ``xarray.DataArray`` is done, which can be time consuming, and is
+  not the semantic expected in ``scikit-learn``; we only expect the number of
+  rows to be equal, and that the rows always correspond to one another in the
+  same order.
 
 As a result, we need to have another data structure which we'll use to transfer
 data related information (such as feature names), which is lightweight and

From ef37bab9c3e7c9bf95b087f2250a8d2db98b33ce Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Fri, 27 Dec 2019 13:16:45 -0800
Subject: [PATCH 3/4] address more comments

---
 slep012/proposal.rst | 124 +++++++++++++++++++++++++++++++------------
 1 file changed, 91 insertions(+), 33 deletions(-)

diff --git a/slep012/proposal.rst b/slep012/proposal.rst
index 6896483..b8dcb88 100644
--- a/slep012/proposal.rst
+++ b/slep012/proposal.rst
@@ -4,59 +4,74 @@ InputArray
 ==========
 
-This proposal suggests adding a new data structure, called ``InputArray``,
-which wraps a data matrix with some added information about the data. This was
-motivated when working on input and output feature names. Since we expect the
-feature names to be attached to the data given to an estimator, there are a few
-approaches we can take:
+This proposal presents a solution for propagating feature names through
+transformers, pipelines, and the column transformer. Ideally, we would have::
 
-- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
-  as a ``pandas.DataFrame``, and if so, the transformer would output a
-  ``pandas.DataFrame`` which also includes the [generated] feature names. This
-  is not a feasible solution since ``pandas`` plans to move to a per column
-  representation, which means ``pd.DataFrame(np.asarray(df))`` has two
-  guaranteed memory copies.
-- ``XArray``: we could accept a ``pandas.DataFrame``, and use
-  ``xarray.DataArray`` as the output of transformers, including feature names.
-  However, ``xarray`` has a hard dependency on ``pandas``, and uses
-  ``pandas.Index`` to handle row labels and aligns rows when an operation
-  between two ``xarray.DataArray`` is done, which can be time consuming, and is
-  not the semantic expected in ``scikit-learn``; we only expect the number of
-  rows to be equal, and that the rows always correspond to one another in the
-  same order.
+    df = pd.read_csv('tabular.csv')
+    # transforming the data in an arbitrary way
+    transformer0 = ColumnTransformer(...)
+    # a pipeline preprocessing the data and then a classifier (or a regressor)
+    clf = make_pipeline(transformer0, ..., SVC())
-As a result, we need to have another data structure which we'll use to transfer
-data related information (such as feature names), which is lightweight and
-doesn't interfere with existing user code.
+    # now we can investigate features at each stage of the pipeline
+    clf[-1].input_feature_names_
+
+The feature names are propagated throughout the pipeline and the user can
+investigate them at each step.
+
+This proposal suggests adding a new data structure, called ``InputArray``,
+which augments the data array ``X`` with additional meta-data. In this proposal
+we assume the feature names (and other potential meta-data) are attached to the
+data when passed to an estimator. Alternative solutions are discussed later in
+this document.
 
 A main constraint of this data structure is that it should be backward
 compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
 transformer, would not break. This SLEP focuses on *feature names* as the only
 meta-data attached to the data. Support for other meta-data can be added later.
 
+Backward/NumPy/Pandas Compatibility
+***********************************
+
+Since currently transformers return a ``numpy`` or a ``scipy`` array, backward
+compatibility in this context means the operations which are valid on those
+arrays should also be valid on the new data structure.
+
+All operations are delegated to the *data* part of the container; the
+meta-data is lost immediately after each operation, and operations result in a
+``numpy.ndarray``. This includes indexing and slicing, *i.e.* to avoid
+performance degradation, ``__getitem__`` is not overloaded, and if the user
+wishes to preserve the meta-data, they shall do so by explicitly calling a
+method such as ``select()``. Operations between two ``InputArray`` objects will
+not try to align rows and/or columns of the two given objects.
+
+``pandas`` compatibility would ideally come as ``pd.DataFrame(inputarray)``, for
+which ``pandas`` does not provide a clean API at the moment. Alternatively,
+``inputarray.todataframe()`` would return a ``pandas.DataFrame`` with the
+relevant meta-data attached.
 
 Feature Names
 *************
 
-Feature names are an array of strings aligned with the columns. They can be
-``None``.
+Feature names are an object ``ndarray`` of strings aligned with the columns.
+They can be ``None``.
 
 Operations
 **********
 
-All usual operations (including slicing through ``__getitem__``) return an
-``np.ndarray``. The ``__array__`` method also returns the underlying data, w/o
-any modifications. This prevents any unwanted computational overhead as a
-result of migrating to this data structure.
+Estimators understand the ``InputArray`` and extract the feature names from the
+given data before applying the operations and transformations on the data.
 
-The ``select()`` method will act like a ``__getitem__``, except that it
-understands feature names and it also returns an ``InputArray``, with the
-corresponding meta-data.
+All transformers return an ``InputArray`` with feature names attached to it.
+The way feature names are generated is discussed in *SLEP007 - The Style of The
+Feature Names*.
 
 Sparse Arrays
 *************
 
-All of the above applies to sparse arrays.
+Ideally sparse arrays follow the same pattern, but since ``scipy.sparse`` does
+not provide the kind of API provided by ``numpy``, we may need to find
+compromises.
 Factory Methods
 ***************
@@ -66,4 +81,47 @@ There will be factory methods creating an ``InputArray`` given a
 an ``sp.SparseMatrix`` and a given set of feature names.
 
 An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a
-``toDataFrame()`` method.
+``todataframe()`` method.
+
+``X`` being an ``InputArray``::
+
+    >>> np.array(X)
+    >>> X.todataframe()
+    >>> pd.DataFrame(X)  # only if pandas implements the API
+
+And given ``X`` a ``np.ndarray`` or an ``sp.sparse`` matrix and a set of
+feature names, one can construct the corresponding ``InputArray`` using::
+
+    >>> make_inputarray(X, feature_names)
+
+Alternative Solutions
+*********************
+
+Since we expect the feature names to be attached to the data given to an
+estimator, there are a few potential approaches we can take:
+
+- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
+  as a ``pandas.DataFrame``, and if so, the transformer would output a
+  ``pandas.DataFrame`` which also includes the [generated] feature names. This
+  is not a feasible solution since ``pandas`` plans to move to a per column
+  representation, which means ``pd.DataFrame(np.asarray(df))`` has two
+  guaranteed memory copies.
+- ``XArray``: we could accept a ``pandas.DataFrame``, and use
+  ``xarray.DataArray`` as the output of transformers, including feature names.
+  However, ``xarray`` has a hard dependency on ``pandas``, and uses
+  ``pandas.Index`` to handle row labels and aligns rows when an operation
+  between two ``xarray.DataArray`` is done, which can be time consuming, and is
+  not the semantic expected in ``scikit-learn``; we only expect the number of
+  rows to be equal, and that the rows always correspond to one another in the
+  same order.
+
+As a result, we need to have another data structure which we'll use to transfer
+data related information (such as feature names), which is lightweight and
+doesn't interfere with existing user code.
+
+Another alternative to the problem of passing meta-data around is to pass it
+as a parameter to ``fit``. This would involve heavily modifying meta-estimators,
+since they'd need to pass that information along and extract the relevant
+information from the estimators to pass it on to the next estimator. Our
+prototype implementations showed this to be significantly more challenging than
+when the meta-data is attached to the data.

From 48ce7f40175c37fbaa95ed36aec1cd640787bcea Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Tue, 18 Feb 2020 20:52:32 +0100
Subject: [PATCH 4/4] add headers

---
 slep012/proposal.rst | 8 ++++++++
 under_review.rst | 1 +
 2 files changed, 9 insertions(+)

diff --git a/slep012/proposal.rst b/slep012/proposal.rst
index b8dcb88..431bacc 100644
--- a/slep012/proposal.rst
+++ b/slep012/proposal.rst
@@ -4,6 +4,14 @@ InputArray
 ==========
 
+:Author: Adrin Jalali
+:Status: Draft
+:Type: Standards Track
+:Created: 2019-12-20
+
+Motivation
+**********
+
 This proposal presents a solution for propagating feature names through
 transformers, pipelines, and the column transformer. Ideally, we would have::

diff --git a/under_review.rst b/under_review.rst
index 51d9eab..ff52d4e 100644
--- a/under_review.rst
+++ b/under_review.rst
@@ -9,3 +9,4 @@ SLEPs under review
    :maxdepth: 1
 
    slep007/proposal
+   slep012/proposal
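To illustrate why attaching the meta-data to the data keeps meta-estimators
simple, a toy transformer written against this proposal could look roughly like
the following. This is only a sketch: it reuses the illustrative ``InputArray``
container shown after the first patch, and the class and attribute names are
hypothetical, not an existing scikit-learn API::

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    # ``InputArray`` below refers to the illustrative container sketched
    # earlier in this document, not to an existing class.

    class ToyScaler(TransformerMixin, BaseEstimator):
        """Scales columns; input and output feature names are identical."""

        def fit(self, X, y=None):
            # the estimator pulls the meta-data off the input itself ...
            self.input_feature_names_ = getattr(X, 'feature_names', None)
            # ... and works on the raw values from there on
            data = np.asarray(X)
            scale = data.std(axis=0)
            self.scale_ = np.where(scale == 0, 1.0, scale)
            return self

        def transform(self, X):
            data = np.asarray(X) / self.scale_
            # a one-to-one transformer passes the names through unchanged;
            # how generated names should look is the topic of SLEP007
            return InputArray(data, self.input_feature_names_)

Because the names travel with ``X`` itself, a pipeline or column transformer
would need no extra book-keeping to propagate them, which is the contrast with
the ``fit``-parameter alternative discussed above.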