Word2Vec Embedding Model

Date: 23.05.16

Writer: 9tailwolf : doryeon514@gm.gist.ac.kr


Introduction


One-hot encoding is a method that represents a word as a vector: the vector for the \(i\)th vocabulary word is 1 only at its \(i\)th index (\(\vec{v_{i}} = (0,0,...,1,...,0,0)\)). However, this representation cannot capture semantic relationships between words.

In this chapter, we use the Word2Vec model. Word2Vec represents words as dense vectors of reduced dimension that can capture the semantics of words.
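As a quick illustration of one-hot encoding (the vocabulary below is a made-up example), each word gets a vector that is zero everywhere except at its own index:

```python
import numpy as np

# Toy vocabulary (made-up example).
vocab = ["i", "like", "natural", "language", "processing"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: zeros everywhere except a 1 at its index."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("language"))                    # [0. 0. 0. 1. 0.]
print(one_hot("like") @ one_hot("language"))  # 0.0 -- distinct words are always orthogonal
```

Because the dot product of any two distinct one-hot vectors is 0, every pair of different words looks equally unrelated, which is exactly the limitation Word2Vec addresses.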


Continuous Bag-of-Words Model


The architecture of the Continuous Bag-of-Words (CBOW) model is Input-Projection-Output; it is an NNLM with the hidden layer removed. Removing the non-linear hidden layer greatly reduces computation.

1. Input Layer

First, each word is mapped to a vector: each of the \(N\) context words is encoded as a one-hot vector of dimension \(\mid V\mid\). This sequence of vectors is used as the input.

$$\vec{V} = \langle C(w_{t-1}),C(w_{t-2}),\dots,C(w_{t-N}) \rangle$$
2. Projection

CBOW does not go through a non-linear layer. At the projection step, each input vector of dimension \(\mid V\mid\) is mapped to a \(D\)-dimensional vector by the matrix \(W \in M_{\mid V\mid,D}\), and the average of the projected vectors becomes the result. The following is the formula for the projection.

$$\vec{V'} = \frac{1}{N}\sum_{i=1}^{N} V_{i}W$$
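The following is a minimal NumPy sketch of this projection step (the sizes, the random matrix \(W\), and the chosen context indices are all illustrative, not from any reference implementation):

```python
import numpy as np

V_size, D, N = 5, 3, 4                      # |V|, embedding dimension, number of context words
rng = np.random.default_rng(0)

W = rng.normal(size=(V_size, D))            # projection matrix W in M_{|V|, D}
context = np.eye(V_size)[[0, 1, 3, 4]]      # N one-hot context vectors, one per row

# V' = (1/N) * sum_i V_i W : average of the projected context vectors.
# Because each row is one-hot, "context @ W" simply selects rows of W.
V_prime = (context @ W).mean(axis=0)
print(V_prime.shape)                        # (3,) i.e. a single D-dimensional vector
```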
3. Output Layer

To find the most probable word, we need a score for each word. By another matrix multiplication, we obtain a vector of word scores. The parameter \(U\) satisfies \(U \in M_{D,\mid V\mid}\).

$$\vec{z} = \vec{V'}U$$

Then the softmax function is applied to obtain the result.

$$y = \operatorname{softmax}(\vec{z})$$
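Putting the output layer together (again a rough sketch with made-up sizes; \(\vec{V'}\) would come from the projection step above):

```python
import numpy as np

V_size, D = 5, 3
rng = np.random.default_rng(1)

V_prime = rng.normal(size=D)                # projected context vector from the previous step
U = rng.normal(size=(D, V_size))            # output matrix U in M_{D, |V|}

z = V_prime @ U                             # one score per vocabulary word
y = np.exp(z - z.max())                     # numerically stable softmax
y /= y.sum()
print(y, y.argmax())                        # probabilities and the index of the predicted word
```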


Skip-gram Model


The Skip-gram model is CBOW reversed: from a single input word, it predicts the sequence of surrounding words.

1. Input Layer

First, the word is mapped to a vector: the input center word is encoded as a one-hot vector of dimension \(\mid V\mid\).

2. Projection

Skip-gram predicts \(2m\) words: the \(m\) words before the input word and the \(m\) words after it. At the projection step, the one-hot input vector \(\vec{v}\) is mapped to a \(D\)-dimensional vector by the matrix \(W \in M_{\mid V\mid,D}\).

$$\vec{v'} = \vec{v}W$$
3. Output Layer

To find the most probable words, we need a score for each word. By matrix multiplication with \(W' \in M_{D,\mid V\mid}\), we obtain a vector of word scores.

$$\vec{z} = \vec{v'}W'$$

Then the softmax function is applied. The output \(y\) is a probability distribution over the vocabulary, and the same distribution is used for each of the \(2m\) context positions.

$$y = \operatorname{softmax}(\vec{z})$$
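A rough end-to-end sketch of a Skip-gram forward pass, with all sizes, names, and the chosen center word purely illustrative:

```python
import numpy as np

V_size, D, m = 5, 3, 2
rng = np.random.default_rng(2)

W     = rng.normal(size=(V_size, D))        # input (projection) matrix W
W_out = rng.normal(size=(D, V_size))        # output matrix W'

v = np.zeros(V_size)
v[2] = 1.0                                  # one-hot vector of the center word

v_prime = v @ W                             # projection: the center word's embedding
z = v_prime @ W_out                         # scores over the vocabulary
y = np.exp(z - z.max())
y /= y.sum()                                # softmax

# The same distribution y is used to predict each of the 2m context positions.
print(y)
```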
4. Training

During training, we look for the optimal parameters; in other words, we minimize the loss \(J\). Here \(v_{c}\) denotes the input (projected) vector of the center word \(w_{c}\) and \(u_{k}\) denotes the output vector of word \(k\).

$$\min J = - \log{P(w_{c-m},\dots,w_{c-1},w_{c+1},\dots,w_{c+m}\mid w_{c})}$$
$$ = - \log{\prod_{j=0,j\neq m}^{2m} P(w_{c-m+j}\mid w_{c})}$$
$$ = - \log{\prod_{j=0,j\neq m}^{2m} \frac{\exp(u_{c-m+j}^{\top}v_{c})}{\sum_{k=1}^{\mid V\mid} \exp(u_{k}^{\top}v_{c})}}$$
$$ = - \sum_{j=0,j\neq m}^{2m} u_{c-m+j}^{\top}v_{c} + 2m \log{\sum_{k=1}^{\mid V\mid} \exp(u_{k}^{\top}v_{c})} $$
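As a sanity check of the final expression, the loss for one training example could be computed roughly as follows (the vectors and context indices are made up; \(u_{k}\) denotes the output vector of word \(k\)):

```python
import numpy as np

V_size, D, m = 5, 3, 1
rng = np.random.default_rng(3)

U = rng.normal(size=(V_size, D))            # output vectors u_k, one row per vocabulary word
v_c = rng.normal(size=D)                    # input vector of the center word

context_ids = [1, 4]                        # indices of the 2m true context words (made up)

scores = U @ v_c                            # u_k^T v_c for every word k
log_Z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()

# J = -sum_j u_{c-m+j}^T v_c + 2m * log sum_k exp(u_k^T v_c)
J = -scores[context_ids].sum() + 2 * m * log_Z
print(J)
```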


Improved Skip-gram Model


Hierarchical Softmax

Hierarchical Softmax is a method to reduce computation. The vocabulary is stored in a binary tree with one word per leaf, and the probability of a word is the product of sigmoid terms along the path from the root to its leaf. In the formula below, \(n(w,j)\) is the \(j\)th node on the path from the root to \(w\), \(L(w)\) is the length of that path, \(\mathrm{ch}(n)\) is an arbitrarily fixed child of node \(n\), and \([\![x]\!]\) is \(1\) if \(x\) is true and \(-1\) otherwise.

$$ P(w\mid w_{I}) = \prod_{j=1}^{L(w)-1} \sigma\!\left( [\![\, n(w,j+1) = \mathrm{ch}(n(w,j)) \,]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_{I}} \right)$$
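A small sketch of how this product could be evaluated for one word, assuming the path nodes and the \(\pm 1\) signs are already known (building the binary tree itself is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(path_node_vecs, signs, v_in):
    """P(w | w_I) as a product of sigmoids along the tree path to w.

    path_node_vecs: vectors v'_{n(w,j)} of the inner nodes on the path, j = 1..L(w)-1
    signs:          +1 if n(w, j+1) is the chosen child ch(n(w, j)), otherwise -1
    v_in:           input vector v_{w_I}
    """
    prob = 1.0
    for v_node, s in zip(path_node_vecs, signs):
        prob *= sigmoid(s * np.dot(v_node, v_in))
    return prob

rng = np.random.default_rng(4)
nodes = rng.normal(size=(3, 8))             # a made-up path of three inner nodes, D = 8
print(hs_probability(nodes, [+1, -1, +1], rng.normal(size=8)))
```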
Negative Sampling

Negative Sampling, a simplified form of Noise Contrastive Estimation (NCE), is an alternative to Hierarchical Softmax. Instead of updating every output vector, it updates only the vector of the true output word \(w_{O}\) and the vectors of \(k\) negative words sampled from a noise distribution \(P_{n}(w)\). The objective to maximize is defined below.

$$ \log \sigma\!\left({v'_{w_{O}}}^{\top} v_{w_{I}}\right) + \sum_{i=1}^{k}\mathbb{E}_{w_{i}\sim P_{n}(w)}\left[\log \sigma\!\left(-{v'_{w_{i}}}^{\top} v_{w_{I}}\right)\right]$$
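A sketch of evaluating this objective for one training pair; the noise distribution here is uniform for simplicity, whereas the paper recommends the unigram distribution raised to the 3/4 power:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
V_size, D, k = 100, 16, 5

U_out = rng.normal(size=(V_size, D), scale=0.1)   # output vectors v'_w
v_in  = rng.normal(size=D, scale=0.1)             # input vector v_{w_I}

w_o = 7                                           # index of the true output word (made up)
noise = np.full(V_size, 1.0 / V_size)             # P_n(w); uniform here for simplicity
negatives = rng.choice(V_size, size=k, p=noise)   # k sampled negative words

# log sigma(v'_{w_O}^T v_{w_I}) + sum_i log sigma(-v'_{w_i}^T v_{w_I})
objective = np.log(sigmoid(U_out[w_o] @ v_in))
objective += np.log(sigmoid(-(U_out[negatives] @ v_in))).sum()
print(objective)                                  # only k + 1 output vectors would be updated
```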
Subsampling

Subsampling is a method that discards very frequent words with some probability. Each word \(w_{i}\) in the training set is discarded with the probability below, where \(f(w_{i})\) is the frequency of the word and \(t\) is a chosen threshold (around \(10^{-5}\) in the original paper).

$$ P(w_{i}) = 1 - \sqrt{\frac{t}{f(w_{i})}}$$
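A sketch of applying subsampling to a toy token stream (the corpus and the threshold value are made up):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(6)
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "dog"] * 1000   # toy corpus

counts = Counter(tokens)
total = len(tokens)
t = 1e-3                                    # threshold t (made-up value for this tiny corpus)

def discard_prob(word):
    """P(w_i) = 1 - sqrt(t / f(w_i)), clipped at 0 so rare words are always kept."""
    f = counts[word] / total
    return max(0.0, 1.0 - np.sqrt(t / f))

kept = [w for w in tokens if rng.random() >= discard_prob(w)]
print(len(tokens), len(kept), round(discard_prob("the"), 3))
```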