Lexical Analysis
Date: 23.04.27
Writer: 9tailwolf : doryeon514@gm.gist.ac.kr
Introduction
Lexical analysis is a basic step of NLP, because without lexical analysis a word carries no structured information and its meaning is hard to understand.
Spacing Correction
Spacing in a sentence is essential because spacing is an easy way to divide morphemes. Before processing language, correcting the spacing is important.
Rule Based Spacing Correction
Since a language has numerous rules, rule-based spacing correction is hard to apply to every sentence. The Word Block Bidirectional Algorithm (어절 블록 양방향 알고리즘) is an algorithm that uses heuristics based on rules.
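The details of that algorithm are beyond this note, but the toy sketch below illustrates the general idea of rule-based spacing: match known word blocks against a dictionary and insert spaces at the match boundaries. The dictionary and sentence are hypothetical examples, not the actual 어절 블록 양방향 알고리즘.

```python
# Toy illustration of rule-based spacing (not the actual
# 어절 블록 양방향 알고리즘): greedily match the longest known word
# block from the left and insert spaces at the boundaries.
KNOWN_WORDS = {"아버지가", "방에", "들어가신다"}  # hypothetical dictionary

def rule_based_spacing(text: str) -> str:
    blocks, start = [], 0
    while start < len(text):
        # Prefer the longest matching block at the current position.
        for end in range(len(text), start, -1):
            if text[start:end] in KNOWN_WORDS:
                blocks.append(text[start:end])
                start = end
                break
        else:
            # No rule matched: emit a single character and move on.
            blocks.append(text[start])
            start += 1
    return " ".join(blocks)

print(rule_based_spacing("아버지가방에들어가신다"))  # 아버지가 방에 들어가신다
```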
Statistics Based Spacing Correction
By using a large amount of data, we can estimate probabilities. However, collecting data is hard because, for Korean, there is no reliable spacing data online.
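As a minimal sketch of the statistical idea, the code below estimates, from a tiny hypothetical corpus, how often a space appears between each pair of adjacent characters, and inserts a space wherever that probability exceeds 0.5. A real system needs far more reliable data than this.

```python
from collections import defaultdict

# Hypothetical toy corpus; real spacing models need much more data.
corpus = ["아버지가 방에 들어가신다", "어머니가 방에 들어가신다"]

pair_count = defaultdict(int)   # occurrences of each character pair
space_count = defaultdict(int)  # occurrences where a space split the pair
for sentence in corpus:
    chars, boundaries = [], set()
    for token in sentence.split():
        chars.extend(token)
        boundaries.add(len(chars) - 1)  # a space follows this index
    boundaries.discard(len(chars) - 1)  # ...except after the last char
    for i in range(len(chars) - 1):
        pair = (chars[i], chars[i + 1])
        pair_count[pair] += 1
        if i in boundaries:
            space_count[pair] += 1

def statistical_spacing(text: str) -> str:
    out = [text[0]]
    for i in range(len(text) - 1):
        pair = (text[i], text[i + 1])
        # Insert a space when the estimated probability exceeds 0.5.
        if pair_count[pair] and space_count[pair] / pair_count[pair] > 0.5:
            out.append(" ")
        out.append(text[i + 1])
    return "".join(out)

print(statistical_spacing("어머니가방에들어가신다"))  # 어머니가 방에 들어가신다
```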
Grammar Correction
Grammar correction is essential for conveying meaning exactly. Grammar correction must perform the tasks below.
- Detect errors in text.
- Fix errors in text.
And there are four types of grammar errors; a sketch classifying them follows the list.
- Insertion
- Deletion
- Substitution
- Transposition
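To make the four types concrete, the sketch below classifies a single-edit error into one of them. The English word pairs are hypothetical examples.

```python
def classify_single_edit(correct: str, typo: str) -> str:
    """Classify how `typo` differs from `correct`, assuming exactly
    one of the four error types occurred."""
    if len(typo) == len(correct) + 1:
        return "insertion"      # an extra character was inserted
    if len(typo) == len(correct) - 1:
        return "deletion"       # a character was deleted
    if len(typo) == len(correct):
        diffs = [i for i, (a, b) in enumerate(zip(correct, typo)) if a != b]
        if len(diffs) == 1:
            return "substitution"   # one character replaced by another
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and correct[diffs[0]] == typo[diffs[1]]
                and correct[diffs[1]] == typo[diffs[0]]):
            return "transposition"  # two adjacent characters swapped
    return "more than one edit"

# Hypothetical examples of each error type:
print(classify_single_edit("word", "woprd"))  # insertion
print(classify_single_edit("word", "wrd"))    # deletion
print(classify_single_edit("word", "ward"))   # substitution
print(classify_single_edit("word", "wrod"))   # transposition
```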
Rule Based Grammar Correction
The pros and cons are similar to those of rule-based spacing correction, because rule-based algorithms share the same problems. Rule-based grammar correction works through the process below, sketched in code after the list.
- Pre-processing.
- Tagging parts of speech.
- Detecting rules.
- Suggesting a correction result.
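The sketch below wires these four steps together. Every rule, tag, and word here is a hypothetical toy example, not a real rule set.

```python
import re

# Hypothetical toy lexicon for the tagging step.
TOY_LEXICON = {"I": "PRON", "is": "VERB", "am": "VERB", "happy": "ADJ"}

def preprocess(text: str) -> list[str]:
    # Step 1: pre-processing (normalize whitespace, tokenize).
    return re.sub(r"\s+", " ", text.strip()).split(" ")

def tag(tokens: list[str]) -> list[tuple[str, str]]:
    # Step 2: part-of-speech tagging via dictionary lookup.
    return [(t, TOY_LEXICON.get(t, "UNK")) for t in tokens]

def detect(tagged):
    # Step 3: detect rule violations. Toy rule: "I" must pair with "am".
    return [i for i, (t, _) in enumerate(tagged)
            if t == "is" and i > 0 and tagged[i - 1][0] == "I"]

def suggest(tokens: list[str]) -> str:
    # Step 4: suggest a corrected sentence.
    tagged = tag(tokens)
    for i in detect(tagged):
        tokens[i] = "am"
    return " ".join(tokens)

print(suggest(preprocess("I  is happy")))  # I am happy
```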
Statistics Based Grammar Correction
By using a Bayesian inference model, we can perform statistics-based grammar correction.
\[\widehat{c} = \operatorname{argmax}_{c \in C} P(t \mid c)P(c)\]
Here \(\widehat{c}\) is the correction result, \(c\) is a candidate correct word, \(C\) is the set of correct words, and \(t\) is the incorrect word.
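Below is a minimal sketch of this model in the style of Norvig's well-known spelling corrector. The toy corpus is a hypothetical example, and \(P(t \mid c)\) is crudely approximated as uniform over single-edit candidates, so the argmax reduces to the prior \(P(c)\).

```python
from collections import Counter

# Toy corpus for estimating the prior P(c); hypothetical example data.
WORDS = Counter("the cat sat on the mat the cat ran".split())
TOTAL = sum(WORDS.values())

def edits1(word: str) -> set[str]:
    """All strings one insertion/deletion/substitution/transposition away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

def correct(t: str) -> str:
    # Candidates C: known words within one edit of t (or t itself).
    candidates = {c for c in edits1(t) if c in WORDS} or {t}
    if t in WORDS:
        candidates.add(t)
    # P(t|c) is approximated as uniform, so argmax reduces to the prior P(c).
    return max(candidates, key=lambda c: WORDS[c] / TOTAL)

print(correct("cst"))  # cat
```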
Part of Speech Tagging
Tagging is important for understanding the meaning of a sentence.
Hidden Markov Model (HMM)
When the length of a sentence is \(N\), we can write \(w_{1,N} = w_{1}, w_{2}, w_{3}, \dots, w_{N}\), where \(w_{i}\) is the word that appears \(i\)th in \(w_{1,N}\). Since we want the most likely case, we should find \(c_{1,N}\), the sequence of part-of-speech tags, by using Bayes' rule. The HMM algorithm is applied as follows: by calculating the expression below, we can find the most likely \(c_{1,N}\).
\[\widehat{c}_{1,N} = \operatorname{argmax}_{c_{1,N}} P(c_{1,N} \mid w_{1,N}) = \operatorname{argmax}_{c_{1,N}} P(w_{1,N} \mid c_{1,N})\,P(c_{1,N})\]
When we apply the Markov assumption, which assumes that the future state is determined only by the current state, not by past states, the formula becomes:
\[\widehat{c}_{1,N} = \operatorname{argmax}_{c_{1,N}} \prod_{i=1}^{N} P(w_{i} \mid c_{i})\,P(c_{i} \mid c_{i-1})\]
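Below is a minimal Viterbi decoding sketch for this factorized objective. The tag set and all probabilities are hypothetical toy values; a real tagger estimates \(P(w_{i} \mid c_{i})\) and \(P(c_{i} \mid c_{i-1})\) by counting in a tagged corpus.

```python
# Viterbi decoding for the factorized HMM objective above. All values
# here are hypothetical toy probabilities.
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.7, "VERB": 0.3}                        # P(c_1)
TRANS = {("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.4,
         ("VERB", "NOUN"): 0.7, ("VERB", "VERB"): 0.3}    # P(c_i | c_{i-1})
EMIT = {("NOUN", "dogs"): 0.5, ("VERB", "dogs"): 0.1,
        ("NOUN", "bark"): 0.2, ("VERB", "bark"): 0.6}     # P(w_i | c_i)

def viterbi(words):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (START[t] * EMIT.get((t, words[0]), 1e-6), [t]) for t in TAGS}
    for w in words[1:]:
        new_best = {}
        for t in TAGS:
            # Choose the previous tag that maximizes the path probability.
            prob, path = max(
                (best[p][0] * TRANS[(p, t)] * EMIT.get((t, w), 1e-6),
                 best[p][1] + [t])
                for p in TAGS)
            new_best[t] = (prob, path)
        best = new_best
    return max(best.values())[1]

print(viterbi(["dogs", "bark"]))  # ['NOUN', 'VERB']
```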
Tagging in Deep Learning
Without a dictionary, finding the rules of a sentence through statistics is almost impossible. To solve this problem, Andrew Matteson (2018) suggested an algorithm that performs tagging for every input. It is a character-level algorithm that applies a Bi-LSTM model. Other approaches will be dealt with later.
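A minimal character-level Bi-LSTM tagger sketch in PyTorch follows. The layer sizes and tag count are hypothetical choices that show the general shape of such a model, not Matteson's exact architecture.

```python
import torch
import torch.nn as nn

# Minimal character-level Bi-LSTM tagger: embed each character, run a
# bidirectional LSTM over the sequence, and predict one tag per character.
# All sizes below are hypothetical choices, not Matteson's (2018) settings.
class CharBiLSTMTagger(nn.Module):
    def __init__(self, n_chars: int, n_tags: int,
                 emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        # The Bi-LSTM concatenates both directions, hence 2 * hidden.
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(char_ids)   # (batch, seq, emb_dim)
        h, _ = self.lstm(x)        # (batch, seq, 2 * hidden)
        return self.out(h)         # (batch, seq, n_tags) logits

# One forward pass on a dummy batch of 1 sentence, 10 characters.
model = CharBiLSTMTagger(n_chars=100, n_tags=5)
logits = model(torch.randint(0, 100, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 5])
```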