Background
Stage | Tweet |
---|---|
Original tweet | xe đón hồ ngọc hà gây tai nạn kinhh hoàng: sẽ khởi tố tài xế http://fb.me/2MwvznBbj |
Step 1: Normalization | xe đón hồ ngọc hà gây tai nạn kinh hoàng: sẽ khởi tố tài xế |
Step 2: Capitalization | Xe đón Hồ Ngọc Hà gây tai nạn kinh hoàng: sẽ khởi tố tài xế |
Step 3: NE recognition | Xe đón <PER> Hồ Ngọc Hà </PER> gây tai nạn kinh hoàng: sẽ khởi tố tài xế |
Related work
NER
Vietnamese NER
System | Entity types | Precision (%) | Recall (%) | F1 (%) |
---|---|---|---|---|
[19] | PER | 84 | 82.56 | 83.39 |
[36] | PER, ORG, LOC, NA, FA, RE | 92 | 76 | 83 |
[38] | PER, ORG, LOC | 86.05 | 81.11 | 83.51 |
[46] | PER, ORG, LOC | 93.13 | 88.15 | 79.35 |
[48] | PER, ORG, LOC, CUR, NUM, PERC, TIME | 86.44 | 85.86 | 89.12 |
[49] | PER, ORG, LOC, CUR, NUM, PERC, TIME | 89.05 | 86.49 | 87.75 |
[52] | PER, ORG, LOC, CUR, NUM, PERC, TIME, MISC | 83.69 | 87.41 | 85.51 |
NER in tweets
Normalization
Proposed method
The theoretical background
- Consonants: The Vietnamese language has 27 consonants: “b,” “ch,” “c,” “d,” “đ,” “gi,” “gh,” “g,” “h,” “kh,” “k,” “l,” “m,” “ngh,” “ng,” “nh,” “n,” “ph,” “q,” “r,” “s,” “th,” “tr,” “t,” “v,” “x,” and “p.” Eight of them can also occur as tail consonants: “c,” “ch,” “n,” “nh,” “ng,” “m,” “p,” and “t.”
- Syllables: A syllable may be a vowel, a combination of vowels, or a combination of vowels and tail consonants. According to the syllable dictionary of Hoang Phe, the Vietnamese language has 158 syllables, and no vowel is repeated consecutively within a syllable, except in the syllables “ooc” and “oong.”
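The consonant and syllable inventory above suggests a simple structural check on a morphosyllable. The following is a minimal sketch, not the method used in this work: it splits a morphosyllable into initial consonant, vowel nucleus, and tail consonant by longest match, and rejects forms whose nucleus contains non-vowels or a doubled vowel. The function names `split_syllable` and `is_plausible` are hypothetical, and the check does not consult the syllable dictionary itself.

```python
import unicodedata

# The 27 consonants listed above, longest first so that "ngh" wins over "ng" and "n".
INITIALS = ["ngh", "ch", "gh", "gi", "kh", "ng", "nh", "ph", "th", "tr",
            "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "q",
            "r", "s", "t", "v", "x"]
# The eight tail consonants, again longest first.
TAILS = ["ch", "ng", "nh", "c", "m", "n", "p", "t"]


def is_vowel(ch):
    """True if the base letter of ch (diacritics removed) is a vowel letter."""
    return unicodedata.normalize("NFD", ch)[0] in "aeiouy"


def split_syllable(word):
    """Split a morphosyllable into (initial consonant, nucleus, tail consonant)."""
    w = unicodedata.normalize("NFC", word.lower())
    initial = next((c for c in INITIALS if w.startswith(c)), "")
    rest = w[len(initial):]
    tail = next((t for t in TAILS if rest.endswith(t)), "")
    nucleus = rest[:len(rest) - len(tail)]
    return initial, nucleus, tail


def is_plausible(word):
    """Structural check only: vowel-only nucleus with no doubled vowel."""
    _, nucleus, tail = split_syllable(word)
    if nucleus + tail in ("ooc", "oong"):  # the two exceptions named above
        return True
    if not nucleus or not all(is_vowel(ch) for ch in nucleus):
        return False
    return all(a != b for a, b in zip(nucleus, nucleus[1:]))


for w in ["kinh", "kinhh", "siinh", "xoong"]:
    print(w, split_syllable(w), is_plausible(w))
```

A form that passes this check can still be misspelled (for example “Nguễn”), so the check only narrows down candidates before any dictionary lookup.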
Normalization
Error detection
Typing errors
- In Telex typing, Vietnamese vowels are formed by key combinations: aa for â, aw for ă, ee for ê, oo for ô, ow for ơ, and uw for ư. One consonant is formed the same way: dd for đ. Tone marks are typed as s for the acute accent, f for the grave accent, r for the hook accent, x for the tilde, and j for the heavy accent.
- VNI typing uses similar combinations: a6 for â, a8 for ă, e6 for ê, o6 for ô, o7 for ơ, u7 for ư, and d9 for đ. Tone marks are typed as 1 for the acute accent, 2 for the grave accent, 3 for the hook accent, 4 for the tilde, and 5 for the heavy accent.
- For the word “Nguyễn,” typing errors include “nguyeenx,” “nguyênx,” or “nguyeenxx” with Telex typing, and “nguye6n4,” “nguyên4,” or “nguye6n44” with VNI typing.
- For the word “người,” incorrect forms include “ngươif,” “ngươfi,” “nguowfi,” “nguowif,” “nguofwi,” “nguofiw,” “nguoifw,” “nguoiwf,” or “nguowff” with Telex typing, and “nguwowi2,” “ngươ2i,” “nguo72i,” “nguo7i2,” “nguo27i,” “nguo2i7,” “nguoi27,” or “nguoi72” with VNI typing.
- “án”: “asn,” “ans,” “a1n,” or “an1”
- “àn”: “afn,” “anf,” “a2n,” or “an2”
- “ản”: “arn,” “anr,” “a3n,” or “an3”
- “ãn”: “axn,” “anx,” “a4n,” or “an4”
- “ạn”: “ajn,” “anj,” “a5n,” or “an5”
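The key tables above can be turned into a small decoder for undecoded Telex or VNI keystrokes. This is a rough sketch under simplifying assumptions, not the correction procedure of this work: tone keys are recognized only after the first vowel, a repeated tone key overwrites the earlier one, letter keys must be adjacent (so forms such as “nguoi72” or the “uow” shorthand are not handled), and the tone is placed on the last modified vowel or, failing that, on the first vowel. All function names are hypothetical.

```python
import unicodedata

# Key combinations taken from the Telex and VNI descriptions above.
TELEX_LETTERS = {"aa": "â", "aw": "ă", "ee": "ê", "oo": "ô", "ow": "ơ", "uw": "ư", "dd": "đ"}
VNI_LETTERS = {"a6": "â", "a8": "ă", "e6": "ê", "o6": "ô", "o7": "ơ", "u7": "ư", "d9": "đ"}
# Combining marks in key order: acute, grave, hook above, tilde, dot below (heavy).
MARKS = ["\u0301", "\u0300", "\u0309", "\u0303", "\u0323"]
TELEX_TONES = dict(zip("sfrxj", MARKS))
VNI_TONES = dict(zip("12345", MARKS))


def nfc(text):
    return unicodedata.normalize("NFC", text)


PLAIN_VOWELS = "aeiouy"
MODIFIED_VOWELS = nfc("âăêôơư")


def decode(typed, letters, tones):
    """Decode raw keystrokes into a Vietnamese morphosyllable (simplified)."""
    word = nfc(typed.lower())
    vowels = PLAIN_VOWELS + MODIFIED_VOWELS
    first = next((i for i, ch in enumerate(word) if ch in vowels), None)
    if first is None:
        return word
    # Tone keys only count after the first vowel, so initial "s", "r", "x" survive.
    head, tail = word[:first + 1], word[first + 1:]
    mark, kept = "", []
    for ch in tail:
        if ch in tones:
            mark = tones[ch]
        else:
            kept.append(ch)
    word = head + "".join(kept)
    for keys, letter in letters.items():
        word = word.replace(keys, nfc(letter))
    if mark:
        modified = [i for i, ch in enumerate(word) if ch in MODIFIED_VOWELS]
        pos = modified[-1] if modified else next(
            i for i, ch in enumerate(word) if ch in PLAIN_VOWELS)
        word = word[:pos + 1] + mark + word[pos + 1:]
    return nfc(word)


def decode_telex(typed):
    return decode(typed, TELEX_LETTERS, TELEX_TONES)


def decode_vni(typed):
    return decode(typed, VNI_LETTERS, VNI_TONES)


print(decode_telex("nguyeenxx"), decode_vni("nguye6n4"), decode_telex("ans"))
```

Because both the letter keys and the tone keys may be typed twice or out of place, a decoder of this kind is best treated as a candidate generator whose output is then compared against a dictionary.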
Spelling errors
- Wrong tone mark: “quyển sách” (book) written as “quyễn sách”
- Initial consonant error: “bóng chuyền” (volleyball) written as “bóng truyền”
- End consonant error: “bài hát” (song) written as “bài hác”
- Regional pronunciation error: “tìm kiếm” (find) written as “tìm kím”
Error correction
a. Similarity of two morphosyllables
Dice coefficient
- \(\mid {\text {bigram}}_{w_i} \mid\) and \(\mid {\text {bigram}}_{w_j}\mid\) are the total numbers of bigrams of \(w_i\) and \(w_j\), respectively.
- \(\mid {\text {bigram}}_{w_i} \cap {\text {bigram}}_{w_j}\mid\) is the number of bigrams that appear in both \(w_i\) and \(w_j\).
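With these quantities, the usual Dice coefficient is \(\frac{2 \mid {\text {bigram}}_{w_i} \cap {\text {bigram}}_{w_j}\mid }{\mid {\text {bigram}}_{w_i}\mid + \mid {\text {bigram}}_{w_j}\mid }\). A minimal sketch of that computation over character bigrams follows; lowercasing and Unicode NFC normalization are assumptions made here so that “tượg” and “Tượng” share bigrams, and the helper names are hypothetical.

```python
import unicodedata
from collections import Counter


def bigrams(word):
    """Multiset of character bigrams of a morphosyllable."""
    w = unicodedata.normalize("NFC", word).lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def dice(w_i, w_j):
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams of w_i| + |bigrams of w_j|)."""
    b_i, b_j = bigrams(w_i), bigrams(w_j)
    total = sum(b_i.values()) + sum(b_j.values())
    if total == 0:
        return 0.0  # both words are shorter than two characters
    shared = sum((b_i & b_j).values())
    return 2 * shared / total


print(round(dice("Nguễn", "Nguyễn"), 3))  # 0.667
print(round(dice("tượg", "Tương"), 3))    # 0.286
```

“Nguễn” has four bigrams and “Nguyễn” has five, three of which are shared, giving 6/9.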
Proposed method to improve the Dice coefficient
Error morphosyllable | Correct morphosyllable | Dice | fDice |
---|---|---|---|
rat | rất | 0 | 0.333 |
rat | rác | 0 | 0 |
Nguễn | Nguyễn | 0.667 | 0.727 |
Nguễn | Nguy | 0.571 | 0.571 |
tượg | Tượng | 0.571 | 0.667 |
tượg | Tương | 0.286 | 0.444 |
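The fDice values in the table are consistent with one simple rule, stated here as an assumption rather than as the definition used in this work: when two morphosyllables share both their first and their last character, one extra shared bigram is counted (adding 1 to the intersection and 1 to each bigram total); otherwise the score equals plain Dice. The sketch below implements that assumed rule under the hypothetical name `fdice_assumed`.

```python
import unicodedata
from collections import Counter


def normalize(word):
    return unicodedata.normalize("NFC", word).lower()


def bigrams(word):
    w = normalize(word)
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def fdice_assumed(w_i, w_j):
    """Assumed rule: Dice plus one extra shared 'first+last character' bigram."""
    b_i, b_j = bigrams(w_i), bigrams(w_j)
    shared = sum((b_i & b_j).values())
    total = sum(b_i.values()) + sum(b_j.values())
    n_i, n_j = normalize(w_i), normalize(w_j)
    same_ends = bool(n_i and n_j) and n_i[0] == n_j[0] and n_i[-1] == n_j[-1]
    if same_ends:
        shared += 1  # the first+last pair counts as one shared bigram
        total += 2   # and is added to both bigram sets
    return 2 * shared / total if total else 0.0


pairs = [("rat", "rất"), ("rat", "rác"), ("Nguễn", "Nguyễn"),
         ("Nguễn", "Nguy"), ("tượg", "Tượng"), ("tượg", "Tương")]
for wrong, candidate in pairs:
    print(wrong, candidate, round(fdice_assumed(wrong, candidate), 3))
```

Under this rule “rat” and “rất” score 2/6 because only the first-and-last pair is shared, while “rat” and “rác” stay at 0 because their last characters differ.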
b. Similarity of two sentences
Spelling error tweets | Normalized tweets |
---|---|
xe đón hồ ngọc hà gây tai nạn kinhh hoàng: sẽ khởi tố tài xế http://fb.me/2MwvznBbj | xe đón hồ ngọc hà gây tai nạn kinh hoàng: sẽ khởi tố tài xế (the car that picked up ho ngoc ha caused a terrible accident: the driver will be prosecuted) |
hôm nay, siinh viên ddaijj học tôn dduwcss thắng được nghỉ học | hôm nay, sinh viên đại học tôn đức thắng được nghỉ học (today, students of ton duc thang university were given the day off) |
Capitalization classifier
Tweets before capitalization | Tweets after capitalization classifier |
---|---|
xe đón hồ ngọc hà gây tai nạn kinh hoàng: sẽ khởi tố tài xế | xe đón Hồ Ngọc Hà gây tai nạn kinh hoàng: sẽ khởi tố tài xế (the car that picked up Ho Ngoc Ha caused a terrible accident: the driver will be prosecuted) |
hôm nay, sinh viên đại học tôn đức thắng được nghỉ học | hôm nay, sinh viên Đại học Tôn Đức Thắng được nghỉ học (today, students of Ton Duc Thang University were given the day off) |
Word segmentation and part of speech (POS) tagging
Extraction of features
- I: the current morphosyllable is inside a named entity (NE).
- O: the current morphosyllable is outside an NE.
- B: the current morphosyllable is the beginning of an NE.
- Word position: the position of the word in the sentence.
- POS: the POS tag of the current word.
- Orthographic: capitalization of the first character, capitalization of all letters, lowercase, punctuation, and numbers.
- Gazetteer: we build several gazetteer lists, such as person, location, organization, and prefixes. These lists contain more than 50,000 names of people, nearly 12,000 names of locations, and 7,000 names of organizations.
- Prefix, Suffix: the first and second characters, and the last and next-to-last characters, of the current word.
- POS Prefix, POS Suffix: the POS tags of the two previous words and of the two following words.
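The feature list above can be illustrated with a small extractor for a single token. This is only a sketch: the feature names, the `"NONE"` padding value, the token-level gazetteer lookup, and the POS tags in the example are all hypothetical and chosen for illustration.

```python
import string


def token_features(tokens, pos_tags, i, gazetteer=frozenset()):
    """Build the features listed above for the token at index i."""
    tok = tokens[i]

    def pos_at(j):
        # Pad with a placeholder tag outside the sentence boundaries.
        return pos_tags[j] if 0 <= j < len(pos_tags) else "NONE"

    return {
        "position": i,                              # word position
        "pos": pos_tags[i],                         # POS tag of the current word
        "init_cap": tok[:1].isupper(),              # orthographic features
        "all_caps": tok.isupper(),
        "lowercase": tok.islower(),
        "punct": bool(tok) and all(ch in string.punctuation for ch in tok),
        "number": tok.isdigit(),
        "in_gazetteer": tok.lower() in gazetteer,   # toy single-token lookup
        "char_first": tok[0:1],                     # prefix characters
        "char_second": tok[1:2],
        "char_last": tok[-1:],                      # suffix characters
        "char_next_to_last": tok[-2:-1],
        "pos_prev2": pos_at(i - 2),                 # POS prefix
        "pos_prev1": pos_at(i - 1),
        "pos_next1": pos_at(i + 1),                 # POS suffix
        "pos_next2": pos_at(i + 2),
    }


tokens = ["xe", "đón", "Hồ", "Ngọc", "Hà", "gây"]
pos_tags = ["N", "V", "Np", "Np", "Np", "V"]  # hypothetical tags
print(token_features(tokens, pos_tags, 2))
```

Slicing (for example `tok[1:2]`) rather than indexing keeps the extractor safe for one-character tokens such as punctuation.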
Label | Value | Meaning |
---|---|---|
O | [1] | Outside a named entity |
B-PER | [2] | Beginning morphosyllable of an NE of the Person class |
I-PER | [3] | Inside morphosyllable of an NE of the Person class |
B-LOC | [4] | Beginning morphosyllable of an NE of the Location class |
I-LOC | [5] | Inside morphosyllable of an NE of the Location class |
B-ORG | [6] | Beginning morphosyllable of an NE of the Organization class |
I-ORG | [7] | Inside morphosyllable of an NE of the Organization class |
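The label scheme in the table can be applied to a tweet annotated with `<PER>`, `<LOC>`, or `<ORG>` markup as follows. This is a minimal sketch that assumes the opening and closing tags are separated from the text by whitespace, as in the examples in this paper; `bio_labels` is a hypothetical helper name.

```python
import re

# Label values from the table above.
LABEL_IDS = {"O": 1, "B-PER": 2, "I-PER": 3, "B-LOC": 4,
             "I-LOC": 5, "B-ORG": 6, "I-ORG": 7}


def bio_labels(tagged_text):
    """Return (morphosyllables, labels) for whitespace-separated markup."""
    tokens, labels = [], []
    entity, at_start = None, False
    for tok in tagged_text.split():
        opening = re.fullmatch(r"<(PER|LOC|ORG)>", tok)
        closing = re.fullmatch(r"</(PER|LOC|ORG)>", tok)
        if opening:
            entity, at_start = opening.group(1), True
        elif closing:
            entity = None
        else:
            if entity is None:
                label = "O"
            else:
                label = ("B-" if at_start else "I-") + entity
                at_start = False
            tokens.append(tok)
            labels.append(label)
    return tokens, labels


tokens, labels = bio_labels("xe đón <PER> Hồ Ngọc Hà </PER> gây tai nạn")
print(labels)                          # ['O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O']
print([LABEL_IDS[l] for l in labels])  # [1, 1, 2, 3, 3, 1, 1, 1]
```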
Evaluation
Data used for normalization
NER training set
- Noun prefixes for people, such as you, sister, uncle, and president.
- Noun prefixes for organizations, such as company, firm, and corporation.
- Noun prefixes for locations, such as province, city, and district.
- Dictionary lists of states, provinces of Vietnam, and others.
Tweets | Tweets after assigning labels |
---|---|
xe đón Hồ Ngọc Hà gây tai nạn kinh hoàng: sẽ khởi tố tài xế | xe đón <PER> Hồ Ngọc Hà </PER> gây tai nạn kinh hoàng: sẽ khởi tố tài xế (the car that picked up Ho Ngoc Ha caused a terrible accident: the driver will be prosecuted) |
hôm nay, sinh viên Đại học Tôn Đức Thắng được nghỉ học | hôm nay, sinh viên <ORG> Đại học Tôn Đức Thắng </ORG> được nghỉ học (today, students of Ton Duc Thang University were given the day off) |
- <label>: a value from 1 to 7 corresponding to the seven labels (O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG).
- <index>:<value>: the index of a feature and the value of that feature for a word, respectively.
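The two fields described above resemble the sparse `<label> <index>:<value> ...` line format used by many SVM toolkits. The sketch below writes one such line per word, assuming the fields are separated by single spaces and the features are listed in increasing index order; both the layout and the feature indices in the example are assumptions made for illustration.

```python
# Label values as defined for the seven labels.
LABEL_IDS = {"O": 1, "B-PER": 2, "I-PER": 3, "B-LOC": 4,
             "I-LOC": 5, "B-ORG": 6, "I-ORG": 7}


def to_training_line(label, feature_values):
    """Format one word as '<label> <index>:<value> ...'."""
    fields = [str(LABEL_IDS[label])]
    for index, value in sorted(feature_values.items()):
        fields.append(f"{index}:{value:g}")
    return " ".join(fields)


# Hypothetical feature vector: index -> numeric value.
print(to_training_line("B-PER", {1: 3, 2: 1, 3: 0}))  # 2 1:3 2:1 3:0
```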
Entity type | Number of named entities |
---|---|
PER | 10,842 |
LOC | 19,037 |
ORG | 12,311 |
Experiments
- Precision (P): the number of correctly fixed errors divided by the total number of errors detected.
- Recall (R): the number of correctly fixed errors divided by the total number of errors.
- Balanced F-measure (F1): \(F_1= \frac{2*P*R}{P+R}\)
Method | Precision (%) | Recall (%) | F-Measure (%) |
---|---|---|---|
Dice | 83.85 | 82.76 | 83.30 |
fDice | 89.66 | 88.50 | 89.08 |
- Precision (P): the number of correctly recognized named entities divided by the total number of named entities recognized by the NER system.
- Recall (R): the number of correctly recognized named entities divided by the total number of named entities in the test set.
- Balanced F-measure (F1): \(F_1= \frac{2*P*R}{P+R}\)
Case | # NEs in testing set | # recognized NEs | # correctly recognized NEs | # wrongly recognized NEs | P (%) | R (%) | F1 (%) |
---|---|---|---|---|---|---|---|
1 | 3186 | 2593 | 2163 | 430 | 83.41 | 67.89 | 74.86 |
2 | 3186 | 2982 | 2533 | 449 | 84.94 | 79.50 | 82.13 |
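The three measures can be recomputed directly from the counts in the table. The sketch below does so for case 2; the function name `precision_recall_f1` is a hypothetical helper, and the guards against empty denominators are an addition for robustness.

```python
def precision_recall_f1(n_correct, n_recognized, n_gold):
    """P, R, and balanced F1 from raw counts, as fractions in [0, 1]."""
    p = n_correct / n_recognized if n_recognized else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Case 2: 2533 correct out of 2982 recognized, 3186 NEs in the testing set.
p, r, f1 = precision_recall_f1(2533, 2982, 3186)
print(f"P={100 * p:.2f} R={100 * r:.2f} F1={100 * f1:.2f}")
# P=84.94 R=79.50 F1=82.13
```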