[SPARK-6263][MLLIB] Python MLlib API missing items: Utils #5707
Test build #30958 has finished for PR 5707 at commit
Test build #30970 has started for PR 5707 at commit |
Jenkins test this please |
Test build #32047 has finished for PR 5707 at commit
@Lewuathe I'll make a pass now. Sorry about the delay! For k-fold, I think you should be able to do it following the example of the PySpark algorithms which take RDDs and do training in the JVM:
I think that should keep you from having to worry about ClassTags.
@@ -71,6 +71,14 @@ private[python] class PythonMLLibAPI extends Serializable {
      minPartitions: Int): JavaRDD[LabeledPoint] =
    MLUtils.loadLabeledPoints(jsc.sc, path, minPartitions)

  def appendBias(data: org.apache.spark.mllib.linalg.Vector)
(not needed: see comment for appendBias in util.py)
@Lewuathe That's all for now! |
@jkbradley Thank you for reviewing. I'll handle k-fold in a separate JIRA.
Test build #32292 has finished for PR 5707 at commit
   * @param path file or directory path in any Hadoop-supported file system URI
   * @return serialized vectors in a RDD
   */
  def loadVectors(jsc: JavaSparkContext,
scala style: If this method declaration will fit on 1 line, then please do so. (I think it will.) Otherwise, the parameters should start in the line below the method name and be indented 4 spaces.
@Lewuathe Thanks for the updates! Minor comments + one possible bug with appendBias. I think the PySpark test failure can be fixed by converting the Vectors in the test to Python native types at the end. (But the tests should test Vectors too; the conversion to Python native arrays can happen at the end.) |
Test build #32336 has finished for PR 5707 at commit
It looks like the same PySpark test failure is occurring, and it looks like a valid failure:
There might be faster turnaround if you ran the tests locally before pushing the update. |
if isinstance(vec, SparseVector):
    l = scipy.sparse.csc_matrix(np.append(vec.toArray(), 1.0))
    return _convert_to_vector(l.T)
elif isinstance(vec, Vector):
Change to else, and indent the return statement to make this easier to read.
@Lewuathe Thanks for iterating on this with me! Those should be the final touches. |
Test build #35446 has finished for PR 5707 at commit
""" | ||
vec = _convert_to_vector(data)
if isinstance(vec, SparseVector):
    entries = dict(zip(vec.indices, vec.values))
I think this will be inefficient. Could you instead do:
newIndices = vec.indices + [len(vec)]
newValues = vec.values + [1.0]
return SparseVector(len(vec)+1, newIndices, newValues)
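To illustrate the suggestion above without depending on PySpark, here is a minimal local sketch. The function name `append_bias` and the plain-Python representation (a list for dense vectors, a `(size, indices, values)` tuple for sparse ones) are assumptions made for this example only, not the PR's code. One caveat: the `+ [x]` form concatenates only when the operands are Python lists. If `vec.indices` and `vec.values` are NumPy arrays, which I believe is how PySpark's SparseVector stores them, `+` adds element-wise instead, and `np.append` would be the concatenating equivalent.

```python
def append_bias(vec):
    """Append a bias term of 1.0 to a vector.

    Hypothetical representation, for this sketch only:
      dense  -> a list of floats
      sparse -> a tuple (size, indices, values) holding two Python lists
    """
    if isinstance(vec, tuple):
        size, indices, values = vec
        # list + list concatenates, so the bias lands at position `size`
        new_indices = indices + [size]
        new_values = values + [1.0]
        return (size + 1, new_indices, new_values)
    else:
        # Dense case: copy the input and add the bias at the end
        return list(vec) + [1.0]
```

The sparse branch never materializes the zero entries, which is the efficiency point of the review comment: the work is proportional to the number of stored values rather than to the vector length.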
Test build #35551 has finished for PR 5707 at commit
LGTM merging into master |
Uh oh, there are merge conflicts now. Could you please resolve them? |
Test build #35675 has finished for PR 5707 at commit
Jenkins test this please |
Test build #35724 has finished for PR 5707 at commit
Test build #35778 has finished for PR 5707 at commit
@jkbradley I'm sorry to keep bothering you, but could you check it again? Thank you.
Test build #35983 has finished for PR 5707 at commit
Jenkins, retest this please. |
Test build #36105 has finished for PR 5707 at commit
LGTM merging with master |
@Lewuathe I'm sorry, but it looks like there are conflicts again. I'm working on catching up on PRs finally, so I should be able to merge this right after tests pass. |
Test build #36248 has finished for PR 5707 at commit
LGTM merging with master |
Implement missing API in pyspark.
MLUtils
kFold is also missing; however, I am not sure ClassTag can be passed or restored through Python.
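For reference, here is a rough local sketch of what a k-fold split returns, as I understand the semantics: one (training, validation) pair per fold, where each validation set is a distinct slice of the data and the training set is its complement. The name `k_fold`, its parameters, and the use of a plain Python list are assumptions for illustration. This is not the Spark API, and it does not involve RDDs or ClassTags.

```python
import random


def k_fold(data, num_folds, seed):
    """Split a local list into num_folds (training, validation) pairs.

    Hypothetical stand-in for illustration only; `data` must be a list
    because it is traversed once per fold.
    """
    rng = random.Random(seed)
    # One uniform draw in [0, 1) per element, reused for every fold
    draws = [rng.random() for _ in data]
    folds = []
    for fold in range(num_folds):
        lower = fold / num_folds
        upper = (fold + 1) / num_folds
        # An element is in this fold's validation set when its draw
        # falls in [lower, upper); otherwise it goes to training
        validation = [x for x, d in zip(data, draws) if lower <= d < upper]
        training = [x for x, d in zip(data, draws) if not (lower <= d < upper)]
        folds.append((training, validation))
    return folds
```

Because each element gets a single draw, every element appears in exactly one validation set, and the folds are reproducible for a given seed. Fold sizes are only approximately equal.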