
Commit eef779b

Ishiihara authored and mengxr committed
[SPARK-2842][MLlib]Word2Vec documentation
mengxr Documentation for Word2Vec

Author: Liquan Pei <[email protected]>

Closes apache#2003 from Ishiihara/Word2Vec-doc and squashes the following commits:

4ff11d4 [Liquan Pei] minor fix
8d7458f [Liquan Pei] code reformat
6df0dcb [Liquan Pei] add Word2Vec documentation
1 parent 3c8fa50 commit eef779b

File tree

1 file changed: +62 -1 lines changed

docs/mllib-feature-extraction.md

Lines changed: 62 additions & 1 deletion
@@ -9,4 +9,65 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Feature Extraction

## Word2Vec

Word2Vec computes distributed vector representations of words. The main advantage of distributed
representations is that similar words are close in the vector space, which makes generalization to
novel patterns easier and model estimation more robust. Distributed vector representations have
been shown to be useful in many natural language processing applications such as named entity
recognition, disambiguation, parsing, tagging and machine translation.

### Model

In our implementation of Word2Vec, we use the skip-gram model. The training objective of skip-gram
is to learn word vector representations that are good at predicting a word's context in the same
sentence. Mathematically, given a sequence of training words `$w_1, w_2, \dots, w_T$`, the objective
of the skip-gram model is to maximize the average log-likelihood
`\[
\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)
\]`
where $k$ is the size of the training window.
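For example, with a window of size $k = 2$, the word $w_t$ is trained to predict its nearby words
$w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$.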

In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$ which are
vector representations of $w$ as word and context respectively. The probability of correctly
predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is
`\[
p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}
\]`
where $V$ is the vocabulary size.
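
As a toy illustration of this formula (not part of MLlib's implementation; the vocabulary and
vector values below are made up), the following Scala snippet evaluates the softmax
$p(w_i | w_j)$ directly. Note that the denominator sums over every word in the vocabulary:

{% highlight scala %}
// Toy vocabulary of V = 3 words with 2-dimensional vectors (values made up).
// u(w) is the "word" vector, v(w) is the "context" vector.
val u = Array(Array(0.1, 0.3), Array(0.2, -0.1), Array(-0.4, 0.5))
val v = Array(Array(0.0, 0.2), Array(0.3, 0.1), Array(-0.2, 0.4))

def dot(a: Array[Double], b: Array[Double]): Double =
  (a zip b).map { case (x, y) => x * y }.sum

// p(w_i | w_j): the denominator touches all V word vectors, hence the O(V) cost.
def p(i: Int, j: Int): Double =
  math.exp(dot(u(i), v(j))) / u.map(uw => math.exp(dot(uw, v(j)))).sum

println(p(0, 1))
{% endhighlight %}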

The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
is proportional to $V$, which can easily be in the order of millions. To speed up the training of
Word2Vec, we use hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$
to $O(\log(V))$.
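For a rough sense of the savings: with $V = 10^6$, the plain softmax denominator requires on the
order of a million terms per prediction, while hierarchical softmax needs only about
$\log_2(10^6) \approx 20$.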

### Example

The example below demonstrates how to load a text file, parse it as an RDD of `Seq[String]`,
construct a `Word2Vec` instance and then fit a `Word2VecModel` with the input data. Finally,
we display the top 40 synonyms of the specified word. To run the example, first download
the [text8](http://mattmahoney.net/dc/text8.zip) data and extract it to your preferred directory.
Here we assume that the extracted file is `text8` and is in the same directory in which you run
the Spark shell.

<div class="codetabs">
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.Word2Vec

// Load the corpus and split each line into a sequence of words.
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()

// Learn a Word2VecModel from the input corpus.
val model = word2vec.fit(input)

// Find the 40 words closest to "china" in the learned vector space,
// together with their cosine similarities.
val synonyms = model.findSynonyms("china", 40)

for ((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
{% endhighlight %}
</div>
</div>

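Beyond `findSynonyms`, the fitted `Word2VecModel` can also map an individual word to its learned
vector via `transform`. A brief continuation of the example above (assuming the word is present in
the training vocabulary; `transform` throws an exception otherwise):

{% highlight scala %}
// Look up the learned vector for a single word.
val vector = model.transform("china")
println(vector.size)
{% endhighlight %}
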
## TFIDF
