Skip to content

[SPARK-6263][MLLIB] Python MLlib API missing items: Utils #5707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 24 commits into from

Conversation

Lewuathe
Copy link
Contributor

Implement missing API in pyspark.

MLUtils

  • appendBias
  • loadVectors

kFold is also missing however I am not sure ClassTag can be passed or restored through python.

@SparkQA
Copy link

SparkQA commented Apr 26, 2015

Test build #30958 has finished for PR 5707 at commit 2980569.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch adds the following new dependencies:
    • tachyon-0.6.4.jar
    • tachyon-client-0.6.4.jar
  • This patch removes the following dependencies:
    • tachyon-0.5.0.jar
    • tachyon-client-0.5.0.jar

@Lewuathe Lewuathe changed the title [SPARK-6263] Python MLlib API missing items: Utils [SPARK-6263][MLLIB] Python MLlib API missing items: Utils Apr 27, 2015
@SparkQA
Copy link

SparkQA commented Apr 27, 2015

Test build #30970 has started for PR 5707 at commit c728046.

@jkbradley
Copy link
Member

Jenkins test this please

@SparkQA
Copy link

SparkQA commented May 7, 2015

Test build #32047 has finished for PR 5707 at commit c728046.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

@Lewuathe I'll make a pass now. Sorry about the delay! For k-fold, I think you should be able to do it following the example of the PySpark algorithms which take RDDs and do training in the JVM:

  • Write a method in PythonMLLibAPI.scala which is a wrapper for MLUtils.kFold.
  • In your Python implementation, use callMLlibFunc in common.py to call that wrapper method.

I think that should keep you from having the worry about ClassTags.

@@ -71,6 +71,14 @@ private[python] class PythonMLLibAPI extends Serializable {
minPartitions: Int): JavaRDD[LabeledPoint] =
MLUtils.loadLabeledPoints(jsc.sc, path, minPartitions)

def appendBias(data: org.apache.spark.mllib.linalg.Vector)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not needed: see comment for appendBias in util.py)

@jkbradley
Copy link
Member

@Lewuathe That's all for now!

@Lewuathe
Copy link
Contributor Author

Lewuathe commented May 9, 2015

@jkbradley Thank you for reviewing. For k-fold I'll do in separate JIRA.

@SparkQA
Copy link

SparkQA commented May 9, 2015

Test build #32292 has finished for PR 5707 at commit a353354.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HasCheckpointInterval(Params):
    • class ALS(JavaEstimator, HasCheckpointInterval, HasMaxIter, HasPredictionCol, HasRegParam, HasSeed):
    • class ALSModel(JavaModel):
    • class ChiSqSelectorModel(JavaVectorTransformer):
    • class ChiSqSelector(object):
    • case class SimpleCatalystConf(caseSensitiveAnalysis: Boolean) extends CatalystConf
    • class SimpleCatalog(val conf: CatalystConf) extends Catalog

* @param path file or directory path in any Hadoop-supported file system URI
* @return serialized vectors in a RDD
*/
def loadVectors(jsc: JavaSparkContext,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scala style: If this method declaration will fit on 1 line, then please do so. (I think it will.) Otherwise, the parameters should start in the line below the method name and be indented 4 spaces.

@jkbradley
Copy link
Member

@Lewuathe Thanks for the updates! Minor comments + one possible bug with appendBias. I think the PySpark test failure can be fixed by converting the Vectors in the test to Python native types at the end. (But the tests should test Vectors too; the conversion to Python native arrays can happen at the end.)

@SparkQA
Copy link

SparkQA commented May 10, 2015

Test build #32336 has finished for PR 5707 at commit 62a9c7e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

It looks like the same PySpark test failure is occurring, and it looks like a valid failure:

======================================================================
ERROR: test_load_vectors (__main__.MLUtilsTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pyspark/mllib/tests.py", line 847, in test_load_vectors
    ret.sort()
TypeError: unorderable types: DenseVector() < DenseVector()

There might be faster turnaround if you ran the tests locally before pushing the update.

if isinstance(vec, SparseVector):
l = scipy.sparse.csc_matrix(np.append(vec.toArray(), 1.0))
return _convert_to_vector(l.T)
elif isinstance(vec, Vector):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to else, and indent the return statement to make this easier to read.

@jkbradley
Copy link
Member

@Lewuathe Thanks for iterating on this with me! Those should be the final touches.

@SparkQA
Copy link

SparkQA commented Jun 22, 2015

Test build #35446 has finished for PR 5707 at commit 1d4714b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""
vec = _convert_to_vector(data)
if isinstance(vec, SparseVector):
entries = dict(zip(vec.indices, vec.values))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be inefficient. Could you instead do:

newIndices = vec.indices + [len(vec)]
newValues = vec.values + [1.0]
return SparseVector(len(vec)+1, newIndices, newValues)

@SparkQA
Copy link

SparkQA commented Jun 23, 2015

Test build #35551 has finished for PR 5707 at commit 9c329d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BroadcastHint(child: LogicalPlan) extends UnaryNode

@jkbradley
Copy link
Member

LGTM merging into master
@Lewuathe Thanks very much!

@jkbradley
Copy link
Member

Uh oh, there are merge conflicts now. Could you please resolve them?

@SparkQA
Copy link

SparkQA commented Jun 24, 2015

Test build #35675 has finished for PR 5707 at commit d2aa2a0.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Jun 24, 2015

Test build #35724 has finished for PR 5707 at commit d2aa2a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 25, 2015

Test build #35778 has finished for PR 5707 at commit 6084e9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Lewuathe
Copy link
Contributor Author

@jkbradley I'm so sorry for keeping bothering you, but could you check it again? Thank you.

@SparkQA
Copy link

SparkQA commented Jun 29, 2015

Test build #35983 has finished for PR 5707 at commit 3fc27e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jun 30, 2015

Test build #36105 has finished for PR 5707 at commit 3fc27e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM merging with master
Thank you!

@jkbradley
Copy link
Member

@Lewuathe I'm sorry, but it looks like there are conflicts again. I'm working on catching up on PRs finally, so I should be able to merge this right after tests pass.

@SparkQA
Copy link

SparkQA commented Jul 1, 2015

Test build #36248 has finished for PR 5707 at commit 16863ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM merging with master
Thank you!

@asfgit asfgit closed this in 184de91 Jul 1, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants