[SPARK-7045] [MLlib] Avoid intermediate representation when creating model #5748

MechCoder · 2015-04-28T18:35:11Z

Word2Vec used to convert its Array[Float] representation to a Map[String, Array[Float]], which Word2VecModel then converted back to an Array[Float].

This PR avoids that round trip while still supporting the older method of supplying a Map.
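
As a rough illustration of the idea (not the actual MLlib code), the sketch below keeps the vectors in one flat Array[Float] and resolves a word through an index map, so no per-word Map[String, Array[Float]] has to be built. The object name, the helper `vectorOf`, and the sample data are hypothetical.

```scala
// Hypothetical stand-ins for illustration; not the actual MLlib code.
object FlatVectorsSketch {
  // Look up a word's vector in a flat array of length numWords * vectorSize.
  def vectorOf(
      word: String,
      wordIndex: Map[String, Int],
      wordVectors: Array[Float],
      vectorSize: Int): Option[Array[Float]] =
    wordIndex.get(word).map { i =>
      wordVectors.slice(i * vectorSize, (i + 1) * vectorSize)
    }

  val vectorSize = 3
  val words = Array("spark", "mllib")
  // Trained weights as one flat array: row i holds the vector for words(i).
  val flat = Array(0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f)
  // Word -> row index; this is the only map that needs to be built.
  val wordIndex: Map[String, Int] = words.zipWithIndex.toMap
}
```

A lookup such as `FlatVectorsSketch.vectorOf("mllib", wordIndex, flat, vectorSize)` copies out only the requested slice, and an unknown word yields `None`.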

MechCoder · 2015-04-28T18:35:36Z

cc @mengxr @jkbradley

SparkQA · 2015-04-28T20:12:31Z

Test build #31150 has finished for PR 5748 at commit a17d9c9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

MechCoder · 2015-04-29T04:51:25Z

Btw, I addressed the minor comments in this.

jkbradley · 2015-04-29T17:49:31Z

I'll try to review this before the code cutoff, but it might slip to 1.5. I think that's OK since it's an internal improvement.

MechCoder · 2015-04-29T17:52:53Z

when is the release scheduled?

jkbradley · 2015-04-29T18:04:27Z

The code cutoff is this Friday

MechCoder · 2015-05-08T08:47:06Z

@jkbradley can you have a look at this too? even if it won't be in this release?

MechCoder · 2015-05-21T03:00:49Z

@jkbradley ping?

jkbradley · 2015-06-01T18:34:24Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

-      val vector = new Array[Float](vectorSize)
-      Array.copy(syn0Global, i * vectorSize, vector, 0, vectorSize)
-      word2VecMap += word -> vector
+      wordArray(i) = bcVocab.value(i).word


This is executing on the driver, so it should not use broadcast variables. Use vocab instead. It could be shorter to do:

val wordArray = vocab.map(_.word)

Hmm. I just followed the convention used before.
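
For readers following along, here is a small sketch of the two styles discussed in this review comment, using a simplified, hypothetical `VocabWord` type and plain local data in place of Spark's vocabulary and broadcast variable. It only shows that the index-based fill and `vocab.map(_.word)` produce the same array.

```scala
object VocabWordsSketch {
  // Simplified, hypothetical stand-in for the vocabulary entry type.
  final case class VocabWord(word: String, count: Long)

  // Local (driver-side) vocabulary; no broadcast variable is involved here.
  val vocab: Array[VocabWord] =
    Array(VocabWord("spark", 10L), VocabWord("mllib", 4L))

  // Index-based fill, in the style of the diff above.
  val filled = new Array[String](vocab.length)
  for (i <- vocab.indices) filled(i) = vocab(i).word

  // The shorter form suggested in the review.
  val wordArray: Array[String] = vocab.map(_.word)
}
```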

MechCoder · 2015-06-02T16:30:58Z

@jkbradley fixed!

SparkQA · 2015-06-02T17:51:54Z

Test build #33999 has finished for PR 5748 at commit 14ee596.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

MechCoder · 2015-06-09T05:19:04Z

jenkins retest this please

MechCoder · 2015-06-09T05:19:15Z

@jkbradley ping?

SparkQA · 2015-06-09T07:11:53Z

Test build #34486 has finished for PR 5748 at commit 14ee596.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-19T18:20:27Z

Test build #35302 has finished for PR 5748 at commit b1d61c4.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

MechCoder · 2015-06-19T19:06:30Z

@jkbradley I just had a proper look at this after a long time.

I think this PR succeeds in preventing the huge Word2Vec map while constructing the Word2Vec model.

However, if the user provides a Word2Vec map by himself to construct the Word2Vec model (in the future, since Word2Vec model is marked as private[mllib]), it creates a huge array of size numWords * numDims. Are we okay with that?
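
To make the cost in question concrete, the following is a minimal sketch of flattening a user-supplied map into a single array of size numWords * numDims. It is an assumption about what such a conversion looks like, not the code in this PR; `FlattenMapSketch` and `flatten` are hypothetical names.

```scala
object FlattenMapSketch {
  // Flatten a word -> vector map into (words, flat array of numWords * numDims).
  def flatten(model: Map[String, Array[Float]]): (Array[String], Array[Float]) = {
    require(model.nonEmpty, "model must contain at least one word")
    val numDims = model.head._2.length
    val words = model.keys.toArray
    // One allocation holding every vector back to back.
    val flat = new Array[Float](words.length * numDims)
    for ((word, i) <- words.zipWithIndex) {
      val vector = model(word)
      require(vector.length == numDims, s"vector for '$word' has wrong length")
      Array.copy(vector, 0, flat, i * numDims, numDims)
    }
    (words, flat)
  }
}
```

The flat array is allocated in addition to the map the caller already holds, which is the extra memory being asked about.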

SparkQA · 2015-06-19T20:10:25Z

Test build #35309 has finished for PR 5748 at commit fa04313.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- // class ParentClass(parentField: Int)
- // class ChildClass(childField: Int) extends ParentClass(1)
- // If the class type corresponding to current slot has writeObject() defined,
- // then its not obvious which fields of the class will be serialized as the writeObject()
- case class Md5(child: Expression)

MechCoder · 2015-06-25T13:20:41Z

ping @jkbradley Can you have a look? I think it is one pass away from a merge?

jkbradley · 2015-07-24T02:25:22Z

I'm sorry about the long delay! I'll take a look now.

jkbradley · 2015-07-24T02:49:53Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala


  // wordIndex: Maps each word to an index, which can retrieve the corresponding
  //            vector from wordVectors (see below).
-  private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap
+  // wordVectors: Array of length numWords * vectorSize, vector corresponding


This doc for wordIndex and wordVectors can go in the class Scala doc and use @param.

But this is not meant to be public at any point of time. Is that okay?

Good question. Do you know if it shows up in the API docs, even though it's private? (I'll check, but it may take a little while since I need to compile them.)

jkbradley · 2015-07-24T02:50:35Z

It looks good, just tiny comments. We can make sure this gets into 1.5.

However, if the user provides a Word2Vec map by himself to construct the Word2Vec model (in the future, since Word2Vec model is marked as private[mllib]), it creates a huge array of size numWords * numDims. Are we okay with that?

I think that's OK, though we could make that constructor public in the future. I think it would only be useful if someone wanted to load a model (created by another library) into MLlib.

MechCoder · 2015-07-24T17:04:27Z

jenkins my friend. retest this please

jkbradley · 2015-07-24T17:31:21Z

Thanks for the updates! It LGTM pending tests. I'm just waiting for the docs to compile to check the param doc question.

SparkQA · 2015-07-24T17:43:50Z

Test build #38375 has finished for PR 5748 at commit 5703116.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-24T17:48:40Z

Test build #90 has finished for PR 5748 at commit 5703116.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-07-24T17:55:30Z

I just checked, and the docs for the private vals won't show up. (I checked the current docs for KMeansModel, which exposes uid but hides parentModel.) Would you mind moving that doc, just to keep things well-organized? That should be it.
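
A sketch of what moving that doc into the class Scaladoc with @param could look like is below. The class name `Word2VecModelSketch` and the derived `vectorSize` member are hypothetical and only serve to keep the example self-contained; the actual class in the PR may differ.

```scala
/**
 * Hypothetical sketch of a model class, for illustration only.
 *
 * @param wordIndex maps each word to an index, which can retrieve the corresponding
 *                  vector from wordVectors
 * @param wordVectors array of length numWords * vectorSize; the vector for the word
 *                    with index i is the slice
 *                    (i * vectorSize, i * vectorSize + vectorSize)
 */
class Word2VecModelSketch(
    private val wordIndex: Map[String, Int],
    private val wordVectors: Array[Float]) {

  // Derived from the two fields; assumes every word has a vector of the same length.
  val vectorSize: Int =
    if (wordIndex.isEmpty) 0 else wordVectors.length / wordIndex.size
}
```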

MechCoder · 2015-07-24T18:00:18Z

done

jkbradley · 2015-07-24T18:14:12Z

LGTM pending tests.

Test this please

MechCoder · 2015-07-24T19:36:02Z

test this please

SparkQA · 2015-07-24T20:23:35Z

Test build #94 has finished for PR 5748 at commit e308913.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-24T21:29:16Z

Test build #38395 has finished for PR 5748 at commit e308913.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-07-24T21:57:47Z

Merging with master: the first test run passed, and the second run's failure was unrelated.
Thanks!

jkbradley reviewed Jun 1, 2015

MechCoder force-pushed the spark-7045 branch from a17d9c9 to 14ee596 Compare June 2, 2015 16:29

MechCoder added 2 commits June 19, 2015 23:42

[SPARK-7045] Avoid intermediate representation when creating model

3b32c8c

better errors and tests

b1d61c4

MechCoder force-pushed the spark-7045 branch from 14ee596 to b1d61c4 Compare June 19, 2015 18:12

style fixes

fa04313

jkbradley reviewed Jul 24, 2015

minor

5703116

move docs

e308913

asfgit closed this in a400ab5 Jul 24, 2015

MechCoder deleted the spark-7045 branch July 25, 2015 03:47
