Abstract

Pattern matching is widely used in information retrieval, natural language processing (NLP), data mining, and network security. Research on pattern matching is also ongoing for Uyghur, a typical agglutinative, low-resource language with complex morphology, spoken by the ethnic Uyghur group in Xinjiang, China. Because of these language characteristics, pattern matching that uses characters or words as its basic unit performs poorly. Two problems stand out: (1) vowel weakening and (2) morphological changes caused by suffixes. To address them, this paper proposes the Boyer–Moore-U (BM-U) algorithm, an improvement of the Boyer–Moore (BM) algorithm based on the syllable features of Uyghur, together with a retrievable syllable coding format. The algorithm performs pattern matching on syllable features, which effectively solves the vowel-weakening problem and better matches words whose stems change shape. In pattern matching experiments on vowel-weakened words, BM-U improves precision, recall, F1-measure, and accuracy over the BM algorithm by 4%, 55%, 33%, and 25% on character-encoded text and by 10%, 52%, 38%, and 38% on syllable-encoded text, respectively.

Highlights

  • Pattern matching takes a string T of length n and another string P of length m (m ≤ n) and finds the occurrences of P in T

  • The Boyer–Moore algorithm was improved using syllable feature information about the morphological changes of words, and pattern matching was performed on both ordinary text and the syllable-encoded text proposed in this paper

  • The Tz format proposed in this paper is a searchable compressed text format based on syllable encoding

Summary

Introduction

Pattern matching takes a string T of length n (hereinafter the text) and another string P of length m, m ≤ n (hereinafter the pattern), and finds the occurrences of P in T. The agglutinative nature and morphological complexity of Uyghur are among the main difficulties in pattern matching research for the language. In languages such as English, Chinese, and Uyghur, characters and words are constituent units of different granularity and are often used as the basic unit in pattern matching research. In this paper, the Boyer–Moore algorithm was improved using syllable feature information about the morphological changes of words, and pattern matching was performed on both ordinary text and the syllable-encoded text proposed here. (1) Our research on the structural features of Uyghur words and syllables, and the proposed searchable compression format based on syllables, will help improve the performance of existing pattern matching algorithms. (2) Through a limited expansion of pattern matching sequences, mismatches caused by morphological changes are resolved, and both the matching of semantically similar words and the recall, precision, accuracy, and F1 values improve significantly. (3) The research on pattern matching in this paper is applicable to other syllabic agglutinative languages and can serve as a useful reference for pattern matching research on other languages of the same type.
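To make the definition above concrete, the following is a minimal Python sketch of the classic Boyer–Moore search using only the bad-character rule. It is not the BM-U algorithm proposed in the paper. The function names (bad_character_table, bm_search) and the Latin-script syllable tokens in the usage lines are hypothetical illustrations; they are not the paper's Tz format or its syllable segmentation. The search works on any sequence of units, so the same code can be run over characters or over syllable tokens.

```python
from typing import Dict, Hashable, List, Sequence


def bad_character_table(pattern: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Map each unit of the pattern to the index of its last occurrence."""
    return {unit: i for i, unit in enumerate(pattern)}


def bm_search(text: Sequence[Hashable], pattern: Sequence[Hashable]) -> List[int]:
    """Return every start index at which pattern occurs in text.

    Simplified Boyer-Moore: bad-character rule only, no good-suffix rule.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = bad_character_table(pattern)
    hits: List[int] = []
    s = 0  # current alignment of the pattern against the text
    while s <= n - m:
        j = m - 1
        # Compare from the right end of the pattern toward the left.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1  # conservative shift after a full match
        else:
            # Align the mismatched text unit with its last occurrence in
            # the pattern, or skip past it if it does not occur at all.
            s += max(1, j - last.get(text[s + j], -1))
    return hits


# Character units: the text and pattern are plain strings.
char_hits = bm_search("abracadabra", "abra")

# Syllable units: hypothetical tokens chosen only for illustration.
syllable_hits = bm_search(["men", "ki", "tab", "lar", "ni"], ["ki", "tab"])
```

Treating each syllable as one unit means a mismatch can shift the pattern by whole syllables rather than single characters, which is the kind of unit change the paper's syllable-based approach builds on.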

Related Research
Uyghur Alphabet
Morphological Changes of Words
Syllable-Encoded Text
Basic Concepts
Retrieval Parameters and Calculation Formulas
Preparation of Experimental Corpus
Matching of Existing Algorithms
Analysis
Solutions
Results
Improvement of BM Algorithm
Experiment and Analysis
BM-U Word Morphology Matching Ability
Experimental Results
Analysis of Experimental Results
Matching Experiments on Natural Language Sentences
Monosyllabic and Non-syllabic Retrieval
Comparison with Other Related Studies
Conclusions