Abstract
Word segmentation underpins semantic analysis, machine translation, question answering, and knowledge mapping, and it is widely used in information retrieval, text processing, data processing, and many other areas of natural language processing. Implementing it well is therefore worthwhile. The method in this paper first segments a Lao text corpus into syllables and then performs maximal matching of the syllables against a dictionary. The segmentation results are then matched against an error dictionary, which is used to correct some wrongly segmented words. Finally, regular expressions match the corresponding word strings in the segmentation results, and the remaining errors are corrected with manually formulated rules for letters, numbers, and similar elements of the Lao language. This improves the efficiency and accuracy of Lao word segmentation.
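The three stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`max_match`, `apply_error_dictionary`, `merge_runs`), the `max_len` and `max_span` limits, the shape of the error dictionary (a mapping from a wrongly segmented word sequence to its correction), and the single digit/letter merging rule are all hypothetical, and the input is assumed to be already split into syllables. The Lao strings are toy data.

```python
import re

def max_match(syllables, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest run of
    syllables (up to max_len) whose concatenation is a dictionary word."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            # A single syllable is always accepted so the scan cannot stall.
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words

def apply_error_dictionary(words, error_dict, max_span=3):
    """Replace known wrong word sequences with their corrections.
    error_dict maps a tuple of words to a tuple of words (an assumed format)."""
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_span, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in error_dict:
                out.extend(error_dict[key])
                i += n
                break
        else:
            # No known error starts here; keep the word unchanged.
            out.append(words[i])
            i += 1
    return out

# Assumed rule: adjacent tokens made only of digits (ASCII or Lao digits
# U+0ED0..U+0ED9) or only of Latin letters belong to one token.
DIGITS = re.compile(r"[0-9\u0ED0-\u0ED9]+")
LATIN = re.compile(r"[A-Za-z]+")

def merge_runs(words):
    """Merge consecutive digit-only tokens, and consecutive Latin-only tokens."""
    merged, prev_kind = [], None
    for word in words:
        if DIGITS.fullmatch(word):
            kind = "digit"
        elif LATIN.fullmatch(word):
            kind = "latin"
        else:
            kind = None
        if kind is not None and kind == prev_kind:
            merged[-1] += word
        else:
            merged.append(word)
        prev_kind = kind
    return merged

if __name__ == "__main__":
    syllables = ["ພາ", "ສາ", "ລາວ", "2", "0", "1", "7"]
    dictionary = {"ພາສາ", "ລາວ"}
    error_dict = {("ພາສາ", "ລາວ"): ("ພາສາລາວ",)}

    words = max_match(syllables, dictionary)
    words = apply_error_dictionary(words, error_dict)
    words = merge_runs(words)
    print(words)  # ['ພາສາລາວ', '2017']
```

Keeping the three stages as separate functions mirrors the order given in the abstract: dictionary matching first, then corrections from the error dictionary, then rule-based clean-up with regular expressions.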
Highlights
Words are the smallest meaningful units in natural language
We obtained a Lao dictionary containing 15,768 commonly used words, together with a large number of English-Chinese and Chinese-Lao dictionaries, from Lao-language and English-language websites
Syllable segmentation is applied to a Lao text of 30,000 words, followed by longest matching based on syllables
Summary
Words are the smallest meaningful units in natural language. The goal of word segmentation is to divide a sentence into words, and it is the most basic task in natural language processing. Text similarity computation, text clustering, search engines, and information retrieval all depend on word segmentation, and such information processing tasks are an indispensable part of the Internet age. The Lao language is composed of syllables, and the syllables are composed of letters. At present there are several commonly used methods of word segmentation: the method based on semantics[1] and the method based on a dictionary[2]. The dictionary-based method learns language knowledge[3,4] from a large-scale corpus to obtain a segmentation model, and then uses this model to segment words.