@@ -9,4 +9,65 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Feature Extraction
## Word2Vec
- ## TFIDF
+ Word2Vec computes distributed vector representations of words. The main advantage of distributed
+ representations is that similar words are close in the vector space, which makes generalization to
+ novel patterns easier and model estimation more robust. Distributed vector representations have
+ been shown to be useful in many natural language processing applications such as named entity
+ recognition, disambiguation, parsing, tagging and machine translation.
+
+ ### Model
+
+ In our implementation of Word2Vec, we used the skip-gram model. The training objective of skip-gram is
+ to learn word vector representations that are good at predicting each word's context in the same sentence.
+ Mathematically, given a sequence of training words `$w_1, w_2, \dots, w_T$`, the objective of the
+ skip-gram model is to maximize the average log-likelihood
+ `\[
+ \frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)
+ \]`
+ where $k$ is the size of the training window.
+
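+ As a concrete illustration, the toy snippet below (plain Scala, not MLlib code; the sentence and
+ window size are made up) enumerates the $(w_t, w_{t+j})$ pairs that the two sums above range over.
+
+ {% highlight scala %}
+ // Toy example: list the (word, context) pairs produced by a window of size k = 2.
+ val sentence = Seq("machine", "learning", "is", "fun", "with", "spark")
+ val k = 2
+
+ val pairs = for {
+   t <- sentence.indices
+   j <- -k to k
+   if j != 0 && t + j >= 0 && t + j < sentence.length
+ } yield (sentence(t), sentence(t + j))
+
+ pairs.foreach { case (word, context) => println(s"$word -> $context") }
+ {% endhighlight %}
+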
+ In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$ which are
+ vector representations of $w$ as word and context respectively. The probability of correctly
+ predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is
+ `\[
+ p(w_i | w_j) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}
+ \]`
+ where $V$ is the vocabulary size.
+
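+ The sketch below (plain Scala with made-up two-dimensional vectors, not MLlib code) evaluates this
+ softmax directly; note that the denominator sums over the entire vocabulary, which is the source of
+ the cost discussed next.
+
+ {% highlight scala %}
+ // Hypothetical word (u) and context (v) vectors for a three-word vocabulary.
+ val u = Map("cat" -> Array(0.2, 0.1), "dog" -> Array(0.3, -0.2), "car" -> Array(-0.1, 0.4))
+ val v = Map("cat" -> Array(0.1, 0.3), "dog" -> Array(0.2, 0.2), "car" -> Array(0.0, -0.1))
+
+ def dot(a: Array[Double], b: Array[Double]): Double =
+   a.zip(b).map { case (x, y) => x * y }.sum
+
+ // p(w_i | w_j): the denominator is a sum over all V words in the vocabulary.
+ def p(wi: String, wj: String): Double =
+   math.exp(dot(u(wi), v(wj))) / u.values.map(uw => math.exp(dot(uw, v(wj)))).sum
+
+ println(p("cat", "dog"))
+ {% endhighlight %}
+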
+ The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
+ is proportional to $V$, which can easily be in the order of millions. To speed up training of Word2Vec,
+ we used hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$ to
+ $O(\log(V))$.
+
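+ As a rough sketch of why this helps (assuming a binary code tree, which is one common construction
+ and not necessarily the exact MLlib implementation), hierarchical softmax turns the normalization
+ over $V$ words into a product of about $\log_2(V)$ sigmoid terms along the path from the root of the
+ tree to the leaf for the predicted word:
+
+ {% highlight scala %}
+ // Sketch only: p(w | w_j) as a product over the inner nodes on w's path in a binary tree.
+ // `path` pairs each inner node's vector with the branch taken towards w (+1 or -1).
+ def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))
+
+ def dot(a: Array[Double], b: Array[Double]): Double =
+   a.zip(b).map { case (x, y) => x * y }.sum
+
+ def pHierarchical(vContext: Array[Double], path: Seq[(Array[Double], Int)]): Double =
+   path.map { case (uNode, branch) => sigmoid(branch * dot(uNode, vContext)) }.product
+ {% endhighlight %}
+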
+ ### Example
+
+ The example below demonstrates how to load a text file, parse it as an RDD of `Seq[String]`,
+ construct a `Word2Vec` instance, and then fit a `Word2VecModel` with the input data. Finally,
+ we display the top 40 synonyms of the specified word. To run the example, first download
+ the [text8](http://mattmahoney.net/dc/text8.zip) data and extract it to your preferred directory.
+ Here we assume the extracted file is `text8` and is in the same directory in which you run the Spark shell.
+
+ <div class="codetabs">
+ <div data-lang="scala">
+ {% highlight scala %}
+ import org.apache.spark._
+ import org.apache.spark.rdd._
+ import org.apache.spark.SparkContext._
+ import org.apache.spark.mllib.feature.Word2Vec
+
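+ // Read the text8 corpus and split each line into a sequence of words.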
+ val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
+
+ val word2vec = new Word2Vec()
+
+ val model = word2vec.fit(input)
+
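+ // Find the 40 words closest to "china" in the learned vector space.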
+ val synonyms = model.findSynonyms("china", 40)
+
+ for ((synonym, cosineSimilarity) <- synonyms) {
+   println(s"$synonym $cosineSimilarity")
+ }
+ {% endhighlight %}
+ </div>
+ </div>
+
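+ The fitted model can also map an individual word to its learned vector. The snippet below is a small
+ follow-on to the example above; it reuses `model` from the previous listing and assumes the queried
+ word appears in the vocabulary.
+
+ {% highlight scala %}
+ // Look up the vector learned for a single word.
+ val vector = model.transform("china")
+ println(vector)
+ {% endhighlight %}
+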
+ ## TFIDF