Abstract

Word segmentation underpins semantic analysis, machine translation, question answering (QA), and knowledge mapping research, and it is widely used in information retrieval, text processing, data processing, and many other areas of natural language processing. Implementing word segmentation is therefore meaningful work. The method in this paper first segments a Lao text corpus into syllables and then performs maximum matching of the syllables against a dictionary. Next, the segmentation results are matched against an error dictionary, which is used to correct some wrongly segmented words. Finally, regular expressions match the corresponding word strings in the segmentation results, and the wrong words are corrected with manually formulated rules for the letters, numbers, and other elements of the Lao language. This improves the efficiency and accuracy of Lao word segmentation.
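The two correction passes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the shape of the error dictionary (a wrongly segmented word sequence mapped to its corrected sequence), the digit-merging rule, and the placeholder Latin tokens are all assumptions made for illustration. The only Lao-specific detail is the Unicode range U+0ED0 to U+0ED9, which holds the Lao digits.

```python
import re

# ASCII digits plus the Lao digits (U+0ED0..U+0ED9).
DIGITS = re.compile(r"[0-9\u0ED0-\u0ED9]+")

def apply_error_dictionary(words, error_dict):
    """Replace known-bad word sequences with their corrected forms.

    error_dict maps a non-empty tuple of wrongly segmented words to the
    tuple of words that should replace it (hypothetical format).
    """
    out = []
    i = 0
    while i < len(words):
        for wrong, right in error_dict.items():
            n = len(wrong)
            if tuple(words[i:i + n]) == wrong:
                out.extend(right)
                i += n
                break
        else:
            # No error entry starts here, so keep the word unchanged.
            out.append(words[i])
            i += 1
    return out

def merge_digit_runs(words):
    """Example rule: adjacent all-digit tokens are joined into one number."""
    out = []
    for w in words:
        if out and DIGITS.fullmatch(w) and DIGITS.fullmatch(out[-1]):
            out[-1] += w
        else:
            out.append(w)
    return out

# Toy input with placeholder tokens, not real Lao words.
error_dict = {("ab", "c"): ("a", "bc")}
words = ["ab", "c", "1", "9", "x"]
corrected = merge_digit_runs(apply_error_dictionary(words, error_dict))
print(corrected)
```

Running the error dictionary before the rules mirrors the order given in the abstract; a real rule set would cover letters and other token classes as well as digits.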

Highlights

  • Words are the smallest meaningful units in natural language

  • We obtained a Lao dictionary containing 15,768 commonly used words in total, together with a large number of English-Chinese and Chinese-Lao dictionaries, from Lao-language and English-language websites

  • We use syllable segmentation to split a Lao text containing 30,000 words into syllables and then perform longest matching based on those syllables
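Longest matching over syllables, as mentioned in the highlights, could look roughly like the Python sketch below. It assumes the text has already been split into syllables and that matching runs greedily from left to right; the source does not state the matching direction, and `longest_match`, `max_len`, and the placeholder Latin syllables are hypothetical.

```python
def longest_match(syllables, dictionary, max_len=4):
    """Greedy forward longest matching over a list of syllables.

    At each position, the longest run of up to max_len syllables that
    forms a dictionary word is taken; a single syllable is the fallback.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink one syllable at a time.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words

# Toy data: placeholder syllables and dictionary, not real Lao.
syllables = ["ka", "la", "ma", "na", "pa"]
dictionary = {"kala", "kalama", "napa"}
segmented = longest_match(syllables, dictionary)
print(segmented)
```

Matching on syllables rather than characters keeps every candidate boundary on a syllable edge, so the dictionary lookup never splits a syllable.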

Summary

Introduction

Words are the smallest meaningful units in natural language, and the goal of word segmentation is to divide a sentence into words. Word segmentation is the most basic task in natural language processing: text similarity computation, text clustering, search engines, and information retrieval all depend on it. These information-processing tasks are an indispensable part of the Internet age. The Lao language is composed of syllables, and the syllables are composed of letters. At present there are several commonly used methods of word segmentation, including methods based on semantics[1] and methods based on a dictionary[2]. The dictionary-based method learns language knowledge[3,4] from a large-scale corpus to obtain a word segmentation model, and then uses this model to complete the word segmentation.

Get a matching dictionary
The syllable rule of Lao language
Syllable segmentation
Longest syllable matching
The acquisition and rule making of the error dictionary
Experimental results and analysis
Word segmentation by different dictionary structure
The influence of error dictionary on word segmentation
The influence of rule correction on word segmentation
Conclusions