Research on Uyghur Pattern Matching Based on Syllable Features

Wayit Abliz, Aishan Wumaier, Kahaerjiang Abiderexiti, Hao Wu, Maihemuti Maimaiti, Tuergen Yibulayin, Jiamila Wushouer

doi:10.3390/info11050248

Abstract

Pattern matching is widely used in fields such as information retrieval, natural language processing (NLP), data mining, and network security. Research on pattern matching is also ongoing for Uyghur, a typical agglutinative, low-resource language with complex morphology, spoken by the Uyghur ethnic group in Xinjiang, China. Because of the language's characteristics, pattern matching that uses characters or words as basic units performs poorly. Two problems stand out: (1) vowel weakening and (2) morphological changes caused by suffixes. To address them, this paper proposes the Boyer–Moore-U (BM-U) algorithm, an improvement of the Boyer–Moore (BM) algorithm based on the syllable features of the Uyghur language, together with a retrievable syllable coding format. The algorithm uses syllable features to perform pattern matching, which effectively solves the vowel-weakening problem and better matches words whose stem shape changes. In pattern matching experiments on vowel-weakened words, the precision, recall, F1-measure, and accuracy of BM-U improve over the BM algorithm by 4%, 55%, 33%, and 25% on character-encoded text and by 10%, 52%, 38%, and 38% on syllable-encoded text, respectively.

Highlights

Pattern matching takes a string T of length n and another string P of length m (m ≤ n) and finds the occurrences of P in T
The Boyer–Moore algorithm was improved by combining the syllable feature information of the morphological changes of words, and pattern matching was performed on the ordinary text and the syllable-encoded text proposed in this paper
The Tz format proposed in this paper is a searchable compressed text format based on syllable encoding

Summary

Introduction

Given a string T of length n (hereinafter, the text) and another string P of length m, with m ≤ n (hereinafter, the pattern), pattern matching is the task of finding the occurrences of P in T. The agglutinative nature and morphological complexity of the Uyghur language are among the main difficulties in its pattern matching research. In languages such as English, Chinese, and Uyghur, characters and words are constituent units of different granularities and are often used as the basic unit of pattern matching research. In this paper, the Boyer–Moore algorithm was improved by incorporating syllable feature information about the morphological changes of words, and pattern matching was performed on both ordinary text and the syllable-encoded text proposed here. (1) Our research on the structural features of Uyghur words and syllables, and the proposed searchable compression format based on syllables, will help improve the performance of existing pattern matching algorithms. (2) Through a limited expansion of pattern matching sequences, the mismatches caused by morphological changes are resolved, and both the matching of semantically similar forms and the recall, precision, accuracy, and F1 values improve significantly. (3) The research on pattern matching in this paper is applicable to other syllabic agglutinative languages and can serve as a useful reference for pattern matching research on other languages of the same type.
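
To make the choice of basic unit concrete, the following is a minimal sketch of the classic Boyer–Moore search restricted to the bad-character rule. It is not the BM-U algorithm proposed in the paper. The function name `bm_search` is an assumption introduced here for illustration, and the syllable lists in the usage lines are placeholder tokens, not a real Uyghur segmentation. Because the function accepts any sequence of hashable units, the same code runs over characters or over pre-segmented syllables.

```python
from typing import Hashable, List, Sequence


def bm_search(text: Sequence[Hashable], pattern: Sequence[Hashable]) -> List[int]:
    """Return every start index at which pattern occurs in text.

    Simplified Boyer-Moore: only the bad-character rule is used
    (the good-suffix rule is omitted to keep the sketch short).
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Rightmost index of each unit (character or syllable) in the pattern.
    last = {unit: i for i, unit in enumerate(pattern)}

    hits: List[int] = []
    s = 0  # pattern[0] is currently aligned with text[s]
    while s <= n - m:
        j = m - 1
        # Compare from right to left, as Boyer-Moore does.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1  # conservative shift after a full match
        else:
            # Align the mismatched text unit with its rightmost occurrence
            # in the pattern, or jump past it if it never occurs there.
            s += max(1, j - last.get(text[s + j], -1))
    return hits


# Character units: the text and pattern are plain strings.
char_hits = bm_search("abracadabra", "abra")

# Syllable units: placeholder tokens standing in for a segmented text.
syllable_hits = bm_search(["ka", "ta", "ma", "ka", "ta"], ["ka", "ta"])
```

With syllables as units, a mismatch can shift the pattern by a whole syllable sequence at a time, and a match can only begin on a syllable boundary. How BM-U extends this to handle vowel weakening and suffix-induced changes is described in the paper itself and is not reproduced here.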

Related Research

Uyghur Alphabet

Morphological Changes of Words

Syllable-Encoded Text

Basic Concepts

Retrieval Parameters and Calculation Formulas

Preparation of Experimental Corpus

Matching of Existing Algorithms

Analysis

Solutions

Results

Improvement of BM Algorithm

Experiment and Analysis

BM-U Word Morphology Matching Ability

Experimental Results

Analysis of Experimental Results

Matching Experiments on Natural Language Sentences

Monosyllabic and Non-syllabic Retrieval

Comparison with Other Related Studies

Conclusions

Journal: Information
Publication Date: May 2, 2020
License type: CC BY 4.0

Similar Papers

Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur
Turdi Tohti ... Xing Tan
Information | VOL. 10 | 24 Jul 2019

Comparison of Knuth Morris Pratt and Boyer Moore algorithms for a web-based dictionary of computer terms
Ali Khumaidi ... Yusuf Aras Ronisah
Jurnal Informatika | VOL. 14 | 01 Jan 2020

Performance assessment of dead-zone single keyword pattern matching
Melanie Mauch ... Tinus Strauss
01 Oct 2012

Information Retrieval in Biomedicine: Natural Language Processing for Knowledge Integration
Martha F Earl
Journal of the Medical Library Association | VOL. 98 | 01 Apr 2010
