Language Model
Date: 23.05.04
Writer: 9tailwolf : doryeon514@gm.gist.ac.kr
Introduction
A language model is a base model that predicts the next word when the preceding words are given.
Language Model Based on Statistics
Conditional Probability
The simplest language model just calculates probabilities. For a word sequence \(W = w_{1}w_{2}\dots w_{n}\), the chain rule gives \(P(w_{1},w_{2},\dots,w_{n}) = P(w_{1})P(w_{2}\mid w_{1})P(w_{3}\mid w_{1},w_{2})\cdots\). But this is too hard to calculate directly, and it needs an enormous corpus (training data).
But when we apply the Markov assumption, we can suppose that the probability of \(w_{i}\) appearing depends only on the previous word, \(P(w_{i}\mid w_{i-1})\).
Then the equation becomes \(P(w_{1},w_{2},\dots,w_{n}) = \prod_{i=1}^{n} P(w_{i} \mid w_{i-1})\), with \(w_{0}\) taken as a sentence-start symbol.
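As a minimal sketch of this bigram model (the toy corpus and the `<s>` start symbol standing in for \(w_{0}\) are assumptions for illustration):

```python
from collections import Counter

# Toy corpus; <s> is an assumed start symbol standing in for w_0.
corpus = [["<s>", "i", "like", "nlp"],
          ["<s>", "i", "like", "music"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1).
    return bigrams[(w1, w2)] / unigrams[w1]

def sentence_prob(sent):
    # Markov assumption: multiply P(w_i | w_{i-1}) over the sentence.
    p = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        p *= bigram_prob(w1, w2)
    return p

print(sentence_prob(["<s>", "i", "like", "nlp"]))  # 0.5
```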
N-gram Language Model
The N-gram language model addresses the case where words that are related to each other sit far apart; there the bigram Markov assumption does not work correctly. To solve this problem, we assume that the previous \(n-1\) words affect the next word: \(P(w_{i} \mid w_{i-n+1},\dots,w_{i-1})\).
But there is a problem. As \(N\) increases, the performance of the N-gram language model improves, but beyond about \(N = 5\) it stops improving, and the generated sentences become too similar to the corpus (overfitting).
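A minimal sketch generalizing the counting estimate to an arbitrary \(N\) (the `<s>` padding and the toy corpus are illustrative assumptions):

```python
from collections import Counter

def ngram_counts(corpus, n):
    # Count n-grams and their (n-1)-word prefixes in a tokenized corpus.
    grams, prefixes = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] * (n - 1) + sent  # pad so early words have context
        for i in range(len(sent)):
            context = tuple(padded[i:i + n - 1])
            grams[context + (padded[i + n - 1],)] += 1
            prefixes[context] += 1
    return grams, prefixes

def ngram_prob(w, context, grams, prefixes):
    # P(w | context) = count(context + w) / count(context).
    return grams[tuple(context) + (w,)] / prefixes[tuple(context)]

corpus = [["i", "like", "nlp"], ["i", "like", "music"]]
grams, prefixes = ngram_counts(corpus, 3)
print(ngram_prob("nlp", ("i", "like"), grams, prefixes))  # 0.5
```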
Log Probability
To prevent numerical underflow when multiplying many small probabilities, language models work with log probabilities. Applying this to the equation above gives \(\log P(w_{1},w_{2},\dots,w_{n}) = \sum_{i=1}^{n} \log P(w_{i} \mid w_{i-1})\).
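Continuing the bigram sketch above, the same sentence probability computed in log space:

```python
import math

def sentence_logprob(sent):
    # Summing log probabilities avoids the floating-point underflow
    # that the raw product hits on long sentences. Unseen bigrams
    # (probability 0) would need smoothing first; see the next section.
    return sum(math.log(bigram_prob(w1, w2))
               for w1, w2 in zip(sent, sent[1:]))

print(sentence_logprob(["<s>", "i", "like", "nlp"]))  # log(0.5) ≈ -0.693
```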
Smoothing
Assigning a probability to word combinations that never appeared in the corpus is called smoothing. Normally, \(P(w_{2}\mid w_{1})\) is estimated as \(\frac{C(w_{1}, w_{2})}{C(w_{1})}\), where \(C(\cdot)\) is a count in the corpus. When we apply add-\(\alpha\) smoothing, \(P(w_{2}\mid w_{1}) \approx \frac{C(w_{1}, w_{2})+\alpha}{C(w_{1})+\alpha V}\), where \(V\) is the size of the vocabulary.
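A minimal sketch of add-\(\alpha\) smoothing on the earlier bigram counts (\(\alpha = 1\) gives Laplace smoothing):

```python
def smoothed_bigram_prob(w1, w2, alpha=1.0):
    # Unseen bigrams now get a small nonzero probability instead of 0.
    V = len(unigrams)  # vocabulary size, from the bigram sketch above
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V)

print(smoothed_bigram_prob("like", "pizza"))  # 1/7, although never seen
```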
But there is still a problem: simple smoothing makes it hard to generalize in a way that considers word similarity. To solve this problem, we use the algorithms below.
- Interpolation : for an N-gram language model, compute the 1-gram, 2-gram, …, N-gram probabilities, multiply each by a weight, and add them all up to determine the final probability (see the sketch after this list).
- Back-off : for an N-gram language model, use the highest-order estimate whose probability is not 0, checking the N-gram, (N-1)-gram, …, 1-gram in decreasing order of N.
- Good-Turing Smoothing
- Kneser-Ney Discounting
- Witten-Bell Smoothing
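As referenced in the interpolation item above, a minimal sketch of linear interpolation for a trigram model, reusing `ngram_counts` from the N-gram section (the \(\lambda\) weights are illustrative assumptions; in practice they are tuned on held-out data):

```python
corpus = [["i", "like", "nlp"], ["i", "like", "music"]]
tri, tri_pre = ngram_counts(corpus, 3)
bi, bi_pre = ngram_counts(corpus, 2)
uni, _ = ngram_counts(corpus, 1)
total = sum(uni.values())

def interpolated_prob(w, context, lambdas=(0.1, 0.3, 0.6)):
    # P(w | w1, w2) ≈ l1*P(w) + l2*P(w | w2) + l3*P(w | w1, w2),
    # with the weights summing to 1.
    l1, l2, l3 = lambdas
    w1, w2 = context
    p1 = uni[(w,)] / total
    p2 = bi[(w2, w)] / bi_pre[(w2,)] if bi_pre[(w2,)] else 0.0
    p3 = tri[(w1, w2, w)] / tri_pre[(w1, w2)] if tri_pre[(w1, w2)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

print(interpolated_prob("nlp", ("i", "like")))  # ≈ 0.467
```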