[SPARK-7045] [MLlib] Avoid intermediate representation when creating model #5748

Closed
wants to merge 5 commits into from

MechCoder
Contributor

Word2Vec used to convert from an Array[Float] representation to a Map[String, Array[Float]] and then back to an Array[Float] through Word2VecModel.

This PR avoids that round-trip conversion while still supporting the older method of supplying a Map.
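
For readers unfamiliar with the layout, here is a minimal sketch of the two representations the description refers to. The object name `FlatVectorSketch` and its method names are hypothetical and are not taken from this PR; only the idea of one flat Array[Float] addressed by `i * vectorSize` comes from the diff under review.

```scala
// Hypothetical illustration, not code from this PR.
object FlatVectorSketch {
  /** Returns a copy of the i-th vector stored in a flat, row-major array. */
  def vectorAt(flat: Array[Float], i: Int, vectorSize: Int): Array[Float] =
    flat.slice(i * vectorSize, (i + 1) * vectorSize)

  /** Builds the intermediate Map form: one separately allocated vector per word. */
  def toMap(
      words: Array[String],
      flat: Array[Float],
      vectorSize: Int): Map[String, Array[Float]] =
    words.zipWithIndex.map { case (w, i) => w -> vectorAt(flat, i, vectorSize) }.toMap
}
```

Keeping the flat array and looking vectors up by index avoids allocating one small array per word only to flatten them again later.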

@MechCoder
Contributor Author

cc @mengxr @jkbradley

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31150 has finished for PR 5748 at commit a17d9c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@MechCoder
Contributor Author

Btw, I addressed the minor comments in this.

@jkbradley
Member

I'll try to review this before the code cutoff, but it might slip to 1.5. I think that's OK since it's an internal improvement.

@MechCoder
Contributor Author

When is the release scheduled?

@jkbradley
Member

The code cutoff is this Friday

@MechCoder
Contributor Author

@jkbradley Can you have a look at this too, even if it won't be in this release?

@MechCoder
Contributor Author

@jkbradley ping?

val vector = new Array[Float](vectorSize)
Array.copy(syn0Global, i * vectorSize, vector, 0, vectorSize)
word2VecMap += word -> vector
wordArray(i) = bcVocab.value(i).word
Member

This is executing on the driver, so it should not use broadcast variables. Use vocab instead. It could be shorter to do:

val wordArray = vocab.map(_.word)
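
To make the suggestion concrete, here is a small sketch contrasting an index-based fill with the one-line map. `VocabEntry` and `DriverSideWords` are hypothetical, simplified stand-ins (the real vocabulary element type is not shown in this thread), and no Spark APIs are used, since the point is that the driver already holds the plain `vocab` array.

```scala
// Hypothetical stand-in for the vocabulary element; the real class is not shown here.
final case class VocabEntry(word: String, count: Int)

object DriverSideWords {
  // Index-based fill, mirroring the loop in the diff but reading the local array.
  def wordsByLoop(vocab: Array[VocabEntry]): Array[String] = {
    val wordArray = new Array[String](vocab.length)
    var i = 0
    while (i < vocab.length) {
      wordArray(i) = vocab(i).word
      i += 1
    }
    wordArray
  }

  // The shorter form suggested in the review comment.
  def wordsByMap(vocab: Array[VocabEntry]): Array[String] = vocab.map(_.word)
}
```

Both forms produce the words in vocabulary order, so the index of a word still lines up with its offset in the flat vector array.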

Contributor Author

Hmm. I just followed the convention used before.

@MechCoder
Contributor Author

@jkbradley fixed!

@SparkQA

SparkQA commented Jun 2, 2015

Test build #33999 has finished for PR 5748 at commit 14ee596.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

jenkins retest this please

@MechCoder
Contributor Author

@jkbradley ping?

@SparkQA

SparkQA commented Jun 9, 2015

Test build #34486 has finished for PR 5748 at commit 14ee596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35302 has finished for PR 5748 at commit b1d61c4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

@jkbradley I just had a proper look at this after a long time.

I think this PR succeeds in preventing the huge Word2Vec map while constructing the Word2Vec model.

However, if users supply a Word2Vec map themselves to construct the Word2Vec model (possible only in the future, since the Word2Vec model constructor is marked as private[mllib]), it creates a huge array of size numWords * numDims. Are we okay with that?
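
As a rough illustration of the concern, the sketch below flattens a user-supplied map into a single array of numWords * vectorSize floats. `MapToFlatSketch` and `flatten` are hypothetical names, and this is not the constructor from the PR, only an assumption about what such a conversion would have to do.

```scala
// Hypothetical illustration of converting a user-supplied map, not code from this PR.
object MapToFlatSketch {
  /** Flattens word -> vector into (wordList, flat array of numWords * vectorSize). */
  def flatten(model: Map[String, Array[Float]]): (Array[String], Array[Float]) = {
    require(model.nonEmpty, "model must contain at least one word")
    val vectorSize = model.head._2.length
    val wordList = model.keys.toArray
    // This is the large allocation discussed above: numWords * vectorSize floats.
    val flat = new Array[Float](wordList.length * vectorSize)
    var i = 0
    while (i < wordList.length) {
      val v = model(wordList(i))
      require(v.length == vectorSize, "all vectors must have the same length")
      Array.copy(v, 0, flat, i * vectorSize, vectorSize)
      i += 1
    }
    (wordList, flat)
  }
}
```

While the conversion runs, the map and the flat array coexist in memory. As a purely illustrative figure, one million words with 100 dimensions would need 400 MB for the flat array alone at 4 bytes per Float.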

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35309 has finished for PR 5748 at commit fa04313.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // class ParentClass(parentField: Int)
    • // class ChildClass(childField: Int) extends ParentClass(1)
    • // If the class type corresponding to current slot has writeObject() defined,
    • // then its not obvious which fields of the class will be serialized as the writeObject()
    • case class Md5(child: Expression)

@MechCoder
Contributor Author

ping @jkbradley Can you have a look? I think it is one pass away from a merge?

@jkbradley
Member

I'm sorry about the long delay! I'll take a look now.


// wordIndex: Maps each word to an index, which can retrieve the corresponding
// vector from wordVectors (see below).
private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap
// wordVectors: Array of length numWords * vectorSize, vector corresponding
Member

This doc for wordIndex and wordVectors can go in the class Scala doc and use @param.

Contributor Author

But this is not meant to be public at any point of time. Is that okay?

Member

Good question. Do you know if it shows up in the API docs, even though it's private? (I'll check, but it may take a little while since I need to compile them.)

@jkbradley
Member

It looks good, just tiny comments. We can make sure this gets into 1.5.

However, if users supply a Word2Vec map themselves to construct the Word2Vec model (possible only in the future, since the Word2Vec model constructor is marked as private[mllib]), it creates a huge array of size numWords * numDims. Are we okay with that?

I think that's OK, though we could make that constructor public in the future. I think it would only be useful if someone wanted to load a model (created by another library) into MLlib.

@MechCoder
Contributor Author

jenkins my friend. retest this please

@jkbradley
Member

Thanks for the updates! It LGTM pending tests. I'm just waiting for the docs to compile to check the param doc question.

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38375 has finished for PR 5748 at commit 5703116.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2015

Test build #90 has finished for PR 5748 at commit 5703116.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

I just checked, and the docs for the private vals won't show up. (I checked the current docs for KMeansModel, which exposes uid but hides parentModel.) Would you mind moving that doc, just to keep things well-organized? That should be it.

@MechCoder
Contributor Author

done

@jkbradley
Member

LGTM pending tests.

Test this please

@MechCoder
Contributor Author

test this please

@SparkQA

SparkQA commented Jul 24, 2015

Test build #94 has finished for PR 5748 at commit e308913.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38395 has finished for PR 5748 at commit e308913.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Merging with master: the first test run passed, and the second one's failure was unrelated.
Thanks!

@asfgit asfgit closed this in a400ab5 Jul 24, 2015
@MechCoder MechCoder deleted the spark-7045 branch July 25, 2015 03:47