-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-7045] [MLlib] Avoid intermediate representation when creating model #5748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #31150 has finished for PR 5748 at commit
|
Btw, I addressed the minor comments in this. |
I'll try to review this before the code cutoff, but it might slip to 1.5. I think that's OK since it's an internal improvement. |
when is the release scheduled? |
The code cutoff is this Friday |
@jkbradley can you have a look at this too? even if it won't be in this release? |
@jkbradley ping? |
val vector = new Array[Float](vectorSize) | ||
Array.copy(syn0Global, i * vectorSize, vector, 0, vectorSize) | ||
word2VecMap += word -> vector | ||
wordArray(i) = bcVocab.value(i).word |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is executing on the driver, so it should not use broadcast variables. Use vocab
Could be shorter to do:
val wordArray = vocab.map(_.word)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I just followed the convention used before.
@jkbradley fixed! |
Test build #33999 has finished for PR 5748 at commit
|
jenkins retest this please |
@jkbradley ping? |
Test build #34486 has finished for PR 5748 at commit
|
Test build #35302 has finished for PR 5748 at commit
|
@jkbradley I just had a proper look at this after a long time. I think this PR succeeds in preventing the huge Word2Vec map while constructing the Word2Vec model. However, if the user provides a Word2Vec map by himself to construct the Word2Vec model (in the future, since Word2Vec model is marked as private[mllib]), it creates a huge array of size numWords * numDims. Are we okay with that? |
Test build #35309 has finished for PR 5748 at commit
|
ping @jkbradley Can you have a look? I think it is one pass away from a merge? |
I'm sorry about the long delay! I'll take a look now. |
|
||
// wordIndex: Maps each word to an index, which can retrieve the corresponding | ||
// vector from wordVectors (see below). | ||
private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap | ||
// wordVectors: Array of length numWords * vectorSize, vector corresponding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc for wordIndex and wordVectors can go in the class Scala doc and use @param
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is not meant to be public at any point of time. Is that okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Do you know if it shows up in the API docs, even though it's private? (I'll check, but it may take a little while since I need to compile them.)
It looks good, just tiny comments. We can make sure this gets into 1.5.
I think that's OK, though we could make that constructor public in the future. I think it would only be useful if someone wanted to load a model (created by another library) into MLlib. |
jenkins my friend. retest this please |
Thanks for the updates! It LGTM pending tests. I'm just waiting for the docs to compile to check the param doc question. |
Test build #38375 has finished for PR 5748 at commit
|
Test build #90 has finished for PR 5748 at commit
|
I just checked, and the docs for the private vals won't show up. (I checked the current docs for KMeansModel, which exposes uid but hides parentModel.) Would you mind moving that doc, just to keep things well-organized? That should be it. |
done |
LGTM pending tests. Test this please |
test this please |
Test build #94 has finished for PR 5748 at commit
|
Test build #38395 has finished for PR 5748 at commit
|
Merging with master with the first tests passed, and the second one's failure was unrelated. |
Word2Vec used to convert from an Array[Float] representation to a Map[String, Array[Float]] and then back to an Array[Float] through Word2VecModel.
This prevents this conversion while still supporting the older method of supplying a Map.