[SPARK-6263][MLLIB] Python MLlib API missing items: Utils #5707

Lewuathe · 2015-04-26T13:34:16Z

Implement the missing APIs in PySpark.

MLUtils

- appendBias
- loadVectors

kFold is also missing; however, I am not sure whether a ClassTag can be passed or restored through Python.

SparkQA · 2015-04-26T14:33:47Z

Test build #30958 has finished for PR 5707 at commit 2980569.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.
This patch adds the following new dependencies:
- tachyon-0.6.4.jar
- tachyon-client-0.6.4.jar
This patch removes the following dependencies:
- tachyon-0.5.0.jar
- tachyon-client-0.5.0.jar

SparkQA · 2015-04-27T18:19:26Z

Test build #30970 has started for PR 5707 at commit c728046.

jkbradley · 2015-05-06T23:15:38Z

Jenkins test this please

SparkQA · 2015-05-07T01:02:18Z

Test build #32047 has finished for PR 5707 at commit c728046.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-05-07T02:50:28Z

@Lewuathe I'll make a pass now. Sorry about the delay! For k-fold, I think you should be able to do it following the example of the PySpark algorithms which take RDDs and do training in the JVM:

- Write a method in PythonMLLibAPI.scala which is a wrapper for MLUtils.kFold.
- In your Python implementation, use callMLlibFunc in common.py to call that wrapper method.

I think that should keep you from having to worry about ClassTags.

jkbradley · 2015-05-07T03:05:18Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

@@ -71,6 +71,14 @@ private[python] class PythonMLLibAPI extends Serializable {
      minPartitions: Int): JavaRDD[LabeledPoint] =
    MLUtils.loadLabeledPoints(jsc.sc, path, minPartitions)

+  def appendBias(data: org.apache.spark.mllib.linalg.Vector)


(not needed: see comment for appendBias in util.py)

jkbradley · 2015-05-07T03:05:47Z

@Lewuathe That's all for now!

Lewuathe · 2015-05-09T02:09:10Z

@jkbradley Thank you for reviewing. I'll handle k-fold in a separate JIRA.

SparkQA · 2015-05-09T04:23:18Z

Test build #32292 has finished for PR 5707 at commit a353354.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class HasCheckpointInterval(Params):
- class ALS(JavaEstimator, HasCheckpointInterval, HasMaxIter, HasPredictionCol, HasRegParam, HasSeed):
- class ALSModel(JavaModel):
- class ChiSqSelectorModel(JavaVectorTransformer):
- class ChiSqSelector(object):
- case class SimpleCatalystConf(caseSensitiveAnalysis: Boolean) extends CatalystConf
- class SimpleCatalog(val conf: CatalystConf) extends Catalog

jkbradley · 2015-05-09T22:57:35Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

+   * @param path file or directory path in any Hadoop-supported file system URI
+   * @return serialized vectors in a RDD
+   */
+  def loadVectors(jsc: JavaSparkContext,


Scala style: if this method declaration fits on one line, please put it on one line. (I think it will.) Otherwise, the parameters should start on the line below the method name and be indented 4 spaces.

jkbradley · 2015-05-09T22:59:46Z

@Lewuathe Thanks for the updates! Minor comments + one possible bug with appendBias. I think the PySpark test failure can be fixed by converting the Vectors in the test to Python native types at the end. (But the tests should test Vectors too; the conversion to Python native arrays can happen at the end.)

SparkQA · 2015-05-10T12:04:25Z

Test build #32336 has finished for PR 5707 at commit 62a9c7e.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-05-10T20:36:04Z

It looks like the same PySpark test failure is occurring, and it looks like a valid failure:

======================================================================
ERROR: test_load_vectors (__main__.MLUtilsTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pyspark/mllib/tests.py", line 847, in test_load_vectors
    ret.sort()
TypeError: unorderable types: DenseVector() < DenseVector()

There might be faster turnaround if you ran the tests locally before pushing the update.
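The traceback above comes from Python 3 refusing to order objects whose class defines no comparison methods, so `ret.sort()` raises `TypeError`. A minimal sketch of one way a test could work around this, assuming a hypothetical stand-in class `FakeVector` (not PySpark's `DenseVector`) and a hypothetical helper `sort_vectors`: sort on a key built from native Python values.

```python
class FakeVector:
    """Hypothetical stand-in for a vector type that defines no ordering."""

    def __init__(self, values):
        self.values = list(values)

    def to_list(self):
        # Convert to a native Python list, which does support comparison.
        return list(self.values)


def sort_vectors(vectors):
    """Return the vectors ordered by their native-list contents."""
    return sorted(vectors, key=lambda v: v.to_list())


vectors = [FakeVector([3.0, 4.0]), FakeVector([1.0, 2.0])]

try:
    sorted(vectors)  # no __lt__ defined, so Python 3 cannot compare them
    direct_sort_failed = False
except TypeError:
    direct_sort_failed = True

# Sorting by key works, and the final comparison uses native lists.
ordered = [v.to_list() for v in sort_vectors(vectors)]
```

This keeps the vector objects in the test until the very end and only converts to native types for ordering and comparison, in line with the earlier review comment.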

jkbradley · 2015-06-22T06:40:35Z

python/pyspark/mllib/util.py

+        if isinstance(vec, SparseVector):
+            l = scipy.sparse.csc_matrix(np.append(vec.toArray(), 1.0))
+            return _convert_to_vector(l.T)
+        elif isinstance(vec, Vector):


Change to else, and indent the return statement to make this easier to read.

jkbradley · 2015-06-22T06:41:21Z

@Lewuathe Thanks for iterating on this with me! Those should be the final touches.

SparkQA · 2015-06-22T12:22:50Z

Test build #35446 has finished for PR 5707 at commit 1d4714b.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-06-22T22:59:00Z

python/pyspark/mllib/util.py

+        """
+        vec = _convert_to_vector(data)
+        if isinstance(vec, SparseVector):
+            entries = dict(zip(vec.indices, vec.values))


I think this will be inefficient. Could you instead do:

    newIndices = vec.indices + [len(vec)]
    newValues = vec.values + [1.0]
    return SparseVector(len(vec)+1, newIndices, newValues)
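The idea in the suggestion above is to extend the sparse representation directly rather than rebuild it through a dictionary: the bias term goes at index `len(vec)`, the first position past the original vector. A minimal sketch of that idea on plain Python lists, assuming a hypothetical helper `append_bias_sparse` that takes and returns a `(size, indices, values)` triple instead of a real `SparseVector`:

```python
def append_bias_sparse(size, indices, values):
    """Append a trailing 1.0 to a sparse vector given as (size, indices, values).

    indices and values are plain Python lists in this sketch. The new entry
    is stored at position `size`, one past the last original index.
    """
    new_indices = list(indices) + [size]  # list concatenation, not addition
    new_values = list(values) + [1.0]
    return size + 1, new_indices, new_values


# A vector of length 4 with non-zeros at positions 0 and 2.
new_size, new_indices, new_values = append_bias_sparse(4, [0, 2], [1.5, -2.0])
```

One caveat worth checking before applying this to real vectors: if the indices and values are stored as NumPy arrays, `+` performs element-wise addition rather than concatenation, so a concatenating call such as `numpy.append` would be needed instead.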

SparkQA · 2015-06-23T15:36:35Z

Test build #35551 has finished for PR 5707 at commit 9c329d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class BroadcastHint(child: LogicalPlan) extends UnaryNode

jkbradley · 2015-06-23T22:02:06Z

LGTM merging into master
@Lewuathe Thanks very much!

jkbradley · 2015-06-23T22:02:57Z

Uh oh, there are merge conflicts now. Could you please resolve them?

SparkQA · 2015-06-24T12:35:28Z

Test build #35675 has finished for PR 5707 at commit d2aa2a0.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-06-24T21:33:45Z

Jenkins test this please

SparkQA · 2015-06-24T22:45:20Z

Test build #35724 has finished for PR 5707 at commit d2aa2a0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-25T14:48:50Z

Test build #35778 has finished for PR 5707 at commit 6084e9c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Lewuathe · 2015-06-27T09:29:21Z

@jkbradley I'm sorry to keep bothering you, but could you check it again? Thank you.

SparkQA · 2015-06-29T13:55:30Z

Test build #35983 has finished for PR 5707 at commit 3fc27e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-06-30T06:27:16Z

Jenkins, retest this please.

SparkQA · 2015-06-30T07:15:09Z

Test build #36105 has finished for PR 5707 at commit 3fc27e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-06-30T19:32:20Z

LGTM merging with master
Thank you!

jkbradley · 2015-06-30T19:33:59Z

@Lewuathe I'm sorry, but it looks like there are conflicts again. I'm working on catching up on PRs finally, so I should be able to merge this right after tests pass.

SparkQA · 2015-07-01T13:01:32Z

Test build #36248 has finished for PR 5707 at commit 16863ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-07-01T18:13:45Z

LGTM merging with master
Thank you!

[SPARK-6263] Python MLlib API missing items: Utils

2980569

Fix style

c728046

Lewuathe changed the title ~~[SPARK-6263] Python MLlib API missing items: Utils~~ [SPARK-6263][MLLIB] Python MLlib API missing items: Utils Apr 27, 2015

jkbradley reviewed May 7, 2015

Lewuathe added 2 commits May 8, 2015 20:21

Merge branch 'master' into SPARK-6263

64f72ad

Merge branch 'master' into SPARK-6263

44295c2

Remove unnecessary appendBias implementation

a353354

jkbradley reviewed May 9, 2015

Lewuathe added 2 commits May 10, 2015 17:22

Merge branch 'master' into SPARK-6263

454c73d

Fix appendBias return type

62a9c7e

jkbradley reviewed Jun 22, 2015

Lewuathe added 3 commits June 22, 2015 20:28

Merge branch 'master' into SPARK-6263

e32eb40

Remove scipy dependencies

b29e2bc

Fix style

1d4714b

jkbradley reviewed Jun 22, 2015

Lewuathe added 2 commits June 23, 2015 23:01

Merge branch 'master' into SPARK-6263

3a12a2d

Fix efficiency

9c329d8

Resolv conflict

d2aa2a0

Resolv conflict

6084e9c

Merge branch 'master' into SPARK-6263

3fc27e7

Merge master

16863ea

asfgit closed this in 184de91 Jul 1, 2015
