Lexical Analysis

Date: 23.04.27

Writer: 9tailwolf : doryeon514@gm.gist.ac.kr


Introduction


Lexical analysis is a foundation of NLP: without lexical analysis, words carry no usable information and their meaning is hard to understand.


Spacing Correction


Spacing in a sentence is essential because it is the easiest way to separate morphemes. Correcting spacing is therefore an important step before any further language processing.

Rule-Based Spacing Correction

Because a language has numerous rules, rule-based spacing correction is hard to apply to every sentence. The Word Block Bidirectional Algorithm (어절 블록 양방향 알고리즘) is an algorithm that uses heuristics built on such rules.

Statistics-Based Spacing Correction

With enough data we can estimate spacing probabilities. However, collecting data is difficult because, for Korean, there is little reliable correctly spaced text available online.
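
As a rough illustration of the statistical idea, the sketch below counts, for every pair of adjacent characters in a correctly spaced corpus, how often a space separates them, and inserts a space wherever that estimated probability is high. The class name, threshold, and training sentences are assumptions for illustration only, not a production spacing model.

```python
from collections import defaultdict


class BigramSpacingModel:
    """Toy character-bigram spacing corrector (illustrative sketch only)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.spaced = defaultdict(int)  # (left, right) -> times a space separated the pair
        self.total = defaultdict(int)   # (left, right) -> times the pair was observed

    def train(self, sentences):
        for sent in sentences:
            chars, has_space = [], []
            for ch in sent:
                if ch == " ":
                    if has_space:
                        has_space[-1] = True   # a space followed the previous character
                else:
                    chars.append(ch)
                    has_space.append(False)
            for i in range(len(chars) - 1):
                pair = (chars[i], chars[i + 1])
                self.total[pair] += 1
                if has_space[i]:
                    self.spaced[pair] += 1

    def correct(self, text):
        chars = [c for c in text if c != " "]
        out = []
        for i, ch in enumerate(chars):
            out.append(ch)
            if i + 1 < len(chars):
                pair = (ch, chars[i + 1])
                if self.total[pair] and self.spaced[pair] / self.total[pair] > self.threshold:
                    out.append(" ")
        return "".join(out)


model = BigramSpacingModel()
model.train(["나는 학교에 간다", "너는 집에 간다"])
print(model.correct("나는학교에간다"))  # -> "나는 학교에 간다"
```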


Grammar Correction


Grammar correction is essential for conveying meaning precisely. A grammar corrector must perform the following tasks.

  • Detect errors in the text.
  • Fix errors in the text.

There are four types of grammar errors (toy examples are shown after this list).

  • Insertion
  • Deletion
  • Substitution
  • Transposition
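
As a quick illustration, here are hypothetical misspellings of the English word "grammar", one for each error type (the examples are invented for illustration):

```python
# Toy examples of the four error types, relative to the intended word "grammar".
errors = {
    "insertion":     "grammaer",  # an extra character was inserted
    "deletion":      "grammr",    # a character was dropped
    "substitution":  "gremmar",   # one character was replaced by another
    "transposition": "gramamr",   # two adjacent characters were swapped
}
```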

Rule-Based Grammar Correction

The pros and cons are similar to those of rule-based spacing correction, because rule-based algorithms share the same problems. Rule-based grammar correction works through the following process (a code sketch follows the list).

  1. Pre-processing.
  2. Part-of-speech tagging.
  3. Rule detection.
  4. Suggesting a correction result.
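
A minimal sketch of such a pipeline, assuming a hypothetical tag lexicon and a tiny hand-written rule table; the tags, rules, and function names are invented for illustration.

```python
import re

# Hypothetical rule table: (pattern over tagged tokens, suggestion message).
RULES = [
    # e.g. the determiner "a" followed by a plural noun is flagged
    (("DT:a", "NNS"), 'use "a" only with singular nouns'),
]


def preprocess(text):
    """Step 1: normalize whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def tag(tokens):
    """Step 2: assign a part-of-speech tag to each token.
    A real system would call a trained tagger; this stub uses a tiny lookup."""
    lexicon = {"a": "DT", "apples": "NNS", "apple": "NN", "eat": "VB", "I": "PRP"}
    return [(tok, lexicon.get(tok, "UNK")) for tok in tokens]


def detect(tagged):
    """Step 3: scan the tagged sentence for rule violations."""
    findings = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        for (left, right), message in RULES:
            l_tag, _, l_word = left.partition(":")
            if t1 == l_tag and (not l_word or w1 == l_word) and t2 == right:
                findings.append((i, message))
    return findings


def correct(text):
    """Step 4: report the detected problems together with suggestions."""
    tokens = preprocess(text).split()
    return [(tokens[i], msg) for i, msg in detect(tag(tokens))]


print(correct("I eat a  apples"))  # [('a', 'use "a" only with singular nouns')]
```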

Statistics-Based Grammar Correction

By using a Bayesian inference model, we can perform statistics-based grammar correction.

\[\widehat{c} = argmax_{c \in C} P(t | c)P(c)\]

\(\widehat{c}\) is the correction result, \(c\) is a candidate correct word, \(C\) is the set of correct words, and \(t\) is the observed incorrect word.
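
A minimal sketch of this Bayesian correction, assuming toy word counts for \(P(c)\) and an edit-distance-based stand-in for \(P(t | c)\); a real system would estimate both from large corpora and error data.

```python
from collections import Counter

# Toy corpus counts standing in for P(c).
WORD_COUNTS = Counter({"the": 5000, "then": 150, "ten": 50})
TOTAL = sum(WORD_COUNTS.values())


def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def p_c(c):
    """Language model P(c): relative frequency in the toy counts."""
    return WORD_COUNTS[c] / TOTAL


def p_t_given_c(t, c):
    """Error model P(t | c): assumed to shrink by a factor of 0.1 per edit.
    This is an assumption for illustration; real systems learn it from error data."""
    return 0.1 ** edit_distance(t, c)


def correct(t, candidates=WORD_COUNTS):
    """Return the c in C maximizing P(t | c) * P(c)."""
    return max(candidates, key=lambda c: p_t_given_c(t, c) * p_c(c))


print(correct("teh"))  # -> "the": its high prior outweighs "ten"'s closer spelling
```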


Part of Speech Tagging


Part-of-speech tagging is important for understanding the meaning of a sentence.

Hidden Markov Model (HMM)

When the length of a sentence is \(n\), we can write \(w_{1,n} = w_{1}, w_{2}, w_{3}, \dots, w_{n}\), where \(w_{i}\) is the word that appears \(i\)th in \(w_{1,n}\). Since we want the most likely analysis, we need to find \(c_{1,n}\), the corresponding sequence of part-of-speech tags, using Bayes' rule. The following shows how the HMM is applied: by computing the expressions below, we can find the most likely \(c_{1,n}\).

$$argmax_{c_{1,n}} P(c_{1,n}|w_{1,n}) = argmax_{c_{1,n}} P(c_{1,n},w_{1,n})$$
$$P(c_{1,n},w_{1,n}) = P(c_{1},w_{1})\prod_{i=2}^{n}P(c_{i}|w_{1,i-1},c_{1,i-1})P(w_{i}|w_{1,i-1},c_{1,i})$$
$$= P(c_{1},w_{1})\prod_{i=2}^{n}P(w_{i}|w_{1,i-1},c_{1,i-1})P(c_{i}|w_{1,i},c_{1,i-1})$$

When we apply the Markov assumption, which assumes that the future state depends only on the current state rather than on past states, the formula becomes the following.

$$P(c_{i}|w_{1,i-1},c_{1,i-1}) = P(c_{i}|w_{i-1},c_{i-1})$$
$$P(w_{i}|w_{1,i-1},c_{1,i-1}) = P(w_{i}|w_{i-1},c_{i-1})$$
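
A minimal sketch of HMM tagging with Viterbi decoding; for simplicity it uses the common first-order factorization \(P(c_{i}|c_{i-1})P(w_{i}|c_{i})\) rather than the word-conditioned form above, and the toy corpus and tag names are assumptions.

```python
import math
from collections import defaultdict


def train_hmm(tagged_sentences):
    """Estimate transition P(c_i | c_{i-1}) and emission P(w_i | c_i)
    by relative frequency from a tagged corpus."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag

    def normalize(table):
        return {k: {x: n / sum(row.values()) for x, n in row.items()}
                for k, row in table.items()}

    return normalize(trans), normalize(emit)


def viterbi(words, trans, emit, smooth=1e-6):
    """Find the most likely tag sequence c_{1,n} for the word sequence w_{1,n}."""
    tags = list(emit)
    # best[i][t] = (log prob of the best path ending in tag t at position i, backpointer)
    best = [{t: (math.log(trans.get("<s>", {}).get(t, smooth)
                          * emit[t].get(words[0], smooth)), None) for t in tags}]
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            best[i][t] = max(
                (best[i - 1][pt][0]
                 + math.log(trans.get(pt, {}).get(t, smooth)
                            * emit[t].get(words[i], smooth)), pt)
                for pt in tags)
    # backtrack from the best final tag
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


corpus = [[("나는", "NP"), ("학교에", "NP"), ("간다", "VV")],
          [("나는", "NP"), ("간다", "VV")]]
trans, emit = train_hmm(corpus)
print(viterbi(["나는", "간다"], trans, emit))  # expected: ['NP', 'VV']
```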


Tagging in Deep Learning

Without a dictionary, finding the rules of a sentence by statistics alone is almost impossible. To solve this problem, Andrew Matteson (2018) suggested an algorithm that produces a tag for every input unit: it works at the character level and applies a Bi-LSTM model. Other approaches will be dealt with later.
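
A minimal sketch of a character-level Bi-LSTM tagger in PyTorch; the layer sizes, tag set, and data are placeholder assumptions, not Matteson's exact model.

```python
import torch
import torch.nn as nn


class CharBiLSTMTagger(nn.Module):
    """Character-level Bi-LSTM tagger: one tag prediction per input character."""

    def __init__(self, num_chars, num_tags, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_chars, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, char_ids):          # char_ids: (batch, seq_len)
        x = self.embed(char_ids)          # (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)             # (batch, seq_len, 2 * hidden_dim)
        return self.out(h)                # (batch, seq_len, num_tags) logits


# Toy usage: 100 character types, 10 tag types, one 8-character input.
model = CharBiLSTMTagger(num_chars=100, num_tags=10)
chars = torch.randint(0, 100, (1, 8))
logits = model(chars)
print(logits.argmax(dim=-1))  # predicted tag id for every character
```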