Word2Vec Embedding Model

Date: 23.05.16

Writer: 9tailwolf : doryeon514@gm.gist.ac.kr


Introduction


One-hot encoding is a method that represents a word as a vector: the vector for the \(i\)th vocabulary word is 1 only at its \(i\)th index (\(\vec{v_{i}} = (0,0,...,1,...,0,0)\)). However, this representation cannot capture semantic relationships between words.

In this chapter, we use the Word2Vec model. Word2Vec represents words as dense vectors of reduced dimension that can capture the semantics of words.
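As a quick illustration of one-hot encoding (the vocabulary below is a made-up example), each word gets a vector that is zero everywhere except at its own index:

```python
import numpy as np

# Toy vocabulary (made-up example).
vocab = ["i", "like", "natural", "language", "processing"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: zeros everywhere except a 1 at its index."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("language"))                    # [0. 0. 0. 1. 0.]
print(one_hot("like") @ one_hot("language"))  # 0.0 -- distinct words are always orthogonal
```

Because the dot product of any two distinct one-hot vectors is 0, every pair of different words looks equally unrelated, which is exactly the limitation Word2Vec addresses.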


Continuous Bag-of-Words Model


The architecture of the Continuous Bag-of-Words (CBOW) model is Input-Projection-Output; it is an NNLM with the hidden layer removed. Removing the non-linear hidden layer greatly reduces computation.

1. Input Layer

First, each word is mapped to a vector: each of the \(N\) context words is encoded as a one-hot vector of dimension \(\mid V\mid\). This sequence of vectors is used as the input.

$$\vec{V} = \langle C(w_{t-1}),C(w_{t-2}),\dots,C(w_{t-N}) \rangle$$
2. Projection

CBOW does not go through a non-linear layer. At the projection step, each input vector of dimension \(\mid V\mid\) is mapped to a \(D\)-dimensional vector by the matrix \(W \in M_{\mid V\mid,D}\), and the average of the projected vectors becomes the result. The following is the formula for the projection.

$$\vec{V'} = \frac{1}{N}\sum_{i=1}^{N} V_{i}W$$
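The following is a minimal NumPy sketch of this projection step (the sizes, the random matrix \(W\), and the chosen context indices are all illustrative, not from any reference implementation):

```python
import numpy as np

V_size, D, N = 5, 3, 4                      # |V|, embedding dimension, number of context words
rng = np.random.default_rng(0)

W = rng.normal(size=(V_size, D))            # projection matrix W in M_{|V|, D}
context = np.eye(V_size)[[0, 1, 3, 4]]      # N one-hot context vectors, one per row

# V' = (1/N) * sum_i V_i W : average of the projected context vectors.
# Because each row is one-hot, "context @ W" simply selects rows of W.
V_prime = (context @ W).mean(axis=0)
print(V_prime.shape)                        # (3,) i.e. a single D-dimensional vector
```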
3. Output Layer

To find the most probable word, we need a score for each word. By another matrix multiplication, we obtain a vector of word scores. The parameter \(U\) satisfies \(U \in M_{D,\mid V\mid}\).

$$\vec{z} = \vec{V'}U$$

Then the softmax function is applied to obtain the result.

$$y = \operatorname{softmax}(\vec{z})$$
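Putting the output layer together (again a rough sketch with made-up sizes; \(\vec{V'}\) would come from the projection step above):

```python
import numpy as np

V_size, D = 5, 3
rng = np.random.default_rng(1)

V_prime = rng.normal(size=D)                # projected context vector from the previous step
U = rng.normal(size=(D, V_size))            # output matrix U in M_{D, |V|}

z = V_prime @ U                             # one score per vocabulary word
y = np.exp(z - z.max())                     # numerically stable softmax
y /= y.sum()
print(y, y.argmax())                        # probabilities and the index of the predicted word
```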


Skip-gram Model


The Skip-gram model is CBOW reversed: from a single input word, it predicts the sequence of surrounding words.

1. Input Layer

First, the word is mapped to a vector: the input center word is encoded as a one-hot vector of dimension \(\mid V\mid\).

2. Projection

Skip-gram predicts \(2m\) words: the \(m\) words before the input word and the \(m\) words after it. At the projection step, the one-hot input vector \(\vec{v}\) is mapped to a \(D\)-dimensional vector by the matrix \(W \in M_{\mid V\mid,D}\).

$$\vec{v'} = \vec{v}W$$
3. Output Layer

To find the most probable words, we need a score for each word. By matrix multiplication with \(W' \in M_{D,\mid V\mid}\), we obtain a vector of word scores.

$$\vec{z} = \vec{v'}W'$$

Then the softmax function is applied. The output \(y\) is a probability distribution over the vocabulary, and the same distribution is used for each of the \(2m\) context positions.

$$y = \operatorname{softmax}(\vec{z})$$
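A rough end-to-end sketch of a Skip-gram forward pass, with all sizes, names, and the chosen center word purely illustrative:

```python
import numpy as np

V_size, D, m = 5, 3, 2
rng = np.random.default_rng(2)

W     = rng.normal(size=(V_size, D))        # input (projection) matrix W
W_out = rng.normal(size=(D, V_size))        # output matrix W'

v = np.zeros(V_size)
v[2] = 1.0                                  # one-hot vector of the center word

v_prime = v @ W                             # projection: the center word's embedding
z = v_prime @ W_out                         # scores over the vocabulary
y = np.exp(z - z.max())
y /= y.sum()                                # softmax

# The same distribution y is used to predict each of the 2m context positions.
print(y)
```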
4. Training

During training, we look for the optimal parameters; in other words, we minimize the loss \(J\). Here \(v_{c}\) denotes the input (projected) vector of the center word \(w_{c}\) and \(u_{k}\) denotes the output vector of word \(k\).

$$\min J = - \log{P(w_{c-m},\dots,w_{c-1},w_{c+1},\dots,w_{c+m}\mid w_{c})}$$
$$ = - \log{\prod_{j=0,j\neq m}^{2m} P(w_{c-m+j}\mid w_{c})}$$
$$ = - \log{\prod_{j=0,j\neq m}^{2m} \frac{\exp(u_{c-m+j}^{\top}v_{c})}{\sum_{k=1}^{\mid V\mid} \exp(u_{k}^{\top}v_{c})}}$$
$$ = - \sum_{j=0,j\neq m}^{2m} u_{c-m+j}^{\top}v_{c} + 2m \log{\sum_{k=1}^{\mid V\mid} \exp(u_{k}^{\top}v_{c})} $$
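As a sanity check of the final expression, the loss for one training example could be computed roughly as follows (the vectors and context indices are made up; \(u_{k}\) denotes the output vector of word \(k\)):

```python
import numpy as np

V_size, D, m = 5, 3, 1
rng = np.random.default_rng(3)

U = rng.normal(size=(V_size, D))            # output vectors u_k, one row per vocabulary word
v_c = rng.normal(size=D)                    # input vector of the center word

context_ids = [1, 4]                        # indices of the 2m true context words (made up)

scores = U @ v_c                            # u_k^T v_c for every word k
log_Z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()

# J = -sum_j u_{c-m+j}^T v_c + 2m * log sum_k exp(u_k^T v_c)
J = -scores[context_ids].sum() + 2 * m * log_Z
print(J)
```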


Improved Skip-gram Model


Hierarchical Softmax

Hierarchical Softmax is a method to reduce computation. The vocabulary is stored in a binary tree with one word per leaf, and the probability of a word is the product of sigmoid terms along the path from the root to its leaf. In the formula below, \(n(w,j)\) is the \(j\)th node on the path from the root to \(w\), \(L(w)\) is the length of that path, \(\mathrm{ch}(n)\) is an arbitrarily fixed child of node \(n\), and \([\![x]\!]\) is \(1\) if \(x\) is true and \(-1\) otherwise.

$$ P(w\mid w_{I}) = \prod_{j=1}^{L(w)-1} \sigma\!\left( [\![\, n(w,j+1) = \mathrm{ch}(n(w,j)) \,]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_{I}} \right)$$
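A small sketch of how this product could be evaluated for one word, assuming the path nodes and the \(\pm 1\) signs are already known (building the binary tree itself is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(path_node_vecs, signs, v_in):
    """P(w | w_I) as a product of sigmoids along the tree path to w.

    path_node_vecs: vectors v'_{n(w,j)} of the inner nodes on the path, j = 1..L(w)-1
    signs:          +1 if n(w, j+1) is the chosen child ch(n(w, j)), otherwise -1
    v_in:           input vector v_{w_I}
    """
    prob = 1.0
    for v_node, s in zip(path_node_vecs, signs):
        prob *= sigmoid(s * np.dot(v_node, v_in))
    return prob

rng = np.random.default_rng(4)
nodes = rng.normal(size=(3, 8))             # a made-up path of three inner nodes, D = 8
print(hs_probability(nodes, [+1, -1, +1], rng.normal(size=8)))
```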
Negative Sampling

Negative Sampling, a simplified form of Noise Contrastive Estimation (NCE), is an alternative to Hierarchical Softmax. Instead of updating every output vector, it updates only the vector of the true output word \(w_{O}\) and the vectors of \(k\) negative words sampled from a noise distribution \(P_{n}(w)\). The objective to maximize is defined below.

$$ \log \sigma\!\left({v'_{w_{O}}}^{\top} v_{w_{I}}\right) + \sum_{i=1}^{k}\mathbb{E}_{w_{i}\sim P_{n}(w)}\left[\log \sigma\!\left(-{v'_{w_{i}}}^{\top} v_{w_{I}}\right)\right]$$
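A sketch of evaluating this objective for one training pair; the noise distribution here is uniform for simplicity, whereas the paper recommends the unigram distribution raised to the 3/4 power:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
V_size, D, k = 100, 16, 5

U_out = rng.normal(size=(V_size, D), scale=0.1)   # output vectors v'_w
v_in  = rng.normal(size=D, scale=0.1)             # input vector v_{w_I}

w_o = 7                                           # index of the true output word (made up)
noise = np.full(V_size, 1.0 / V_size)             # P_n(w); uniform here for simplicity
negatives = rng.choice(V_size, size=k, p=noise)   # k sampled negative words

# log sigma(v'_{w_O}^T v_{w_I}) + sum_i log sigma(-v'_{w_i}^T v_{w_I})
objective = np.log(sigmoid(U_out[w_o] @ v_in))
objective += np.log(sigmoid(-(U_out[negatives] @ v_in))).sum()
print(objective)                                  # only k + 1 output vectors would be updated
```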
Subsampling

Subsampling is a method that discards very frequent words with some probability. Each word \(w_{i}\) in the training set is discarded with the probability below, where \(f(w_{i})\) is the frequency of the word and \(t\) is a chosen threshold (around \(10^{-5}\) in the original paper).

$$ P(w_{i}) = 1 - \sqrt{\frac{t}{f(w_{i})}}$$
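A sketch of applying subsampling to a toy token stream (the corpus and the threshold value are made up):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(6)
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "dog"] * 1000   # toy corpus

counts = Counter(tokens)
total = len(tokens)
t = 1e-3                                    # threshold t (made-up value for this tiny corpus)

def discard_prob(word):
    """P(w_i) = 1 - sqrt(t / f(w_i)), clipped at 0 so rare words are always kept."""
    f = counts[word] / total
    return max(0.0, 1.0 - np.sqrt(t / f))

kept = [w for w in tokens if rng.random() >= discard_prob(w)]
print(len(tokens), len(kept), round(discard_prob("the"), 3))
```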