Parallel String Matching for Urdu Language Text

Mirza Baber Baig, Taoshen S Li

doi:10.1007/978-981-13-6052-7_32

Abstract

String matching is a fundamental problem in computer science. Urdu is the national language of Pakistan, and Urdu text is encoded in UTF-8, a variable-length encoding. If string matching algorithms written for single-byte ASCII text are applied directly to Urdu text, the reported match positions may be incorrect. This paper analyzes the characteristics of Urdu text and its character encoding, and finds that correct match positions are obtained when the text is processed as Unicode characters using the wchar_t type. It then implements parallel versions of the Boyer-Moore, Knuth-Morris-Pratt, and Sunday string matching algorithms for Urdu text and evaluates their execution performance experimentally on a large number of Urdu patterns and text strings.
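The position problem described in the abstract can be illustrated with a short sketch. It is written in Python rather than with the wchar_t type the paper uses, and it is not the authors' implementation: the function names sunday_search and parallel_search, the sample sentence, and the chunking scheme with an overlap of m - 1 characters are all assumptions made for illustration. Python threads are used only to show the partitioning idea; they are not expected to give a real speedup here.

```python
from concurrent.futures import ThreadPoolExecutor


def sunday_search(text: str, pattern: str) -> list[int]:
    """Return every start index (counted in Unicode code points) of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Sunday shift table: for each pattern character, the distance from its
    # last occurrence to one past the end of the pattern.
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    hits = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            hits.append(i)
        nxt = i + m  # character just after the current window
        if nxt >= n:
            break
        # Characters absent from the pattern allow a full shift of m + 1.
        i += shift.get(text[nxt], m + 1)
    return hits


def parallel_search(text: str, pattern: str, workers: int = 4) -> list[int]:
    """Hypothetical chunked search: each worker owns a range of start positions."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    step = max(1, -(-n // workers))  # ceiling division

    def work(lo: int) -> list[int]:
        # Extend the chunk by m - 1 characters so matches that straddle a
        # chunk boundary are found exactly once, by the chunk where they start.
        hi = min(n, lo + step + m - 1)
        return [lo + k for k in sunday_search(text[lo:hi], pattern)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(work, range(0, n, step)))
    return [pos for part in parts for pos in part]


# Sample Urdu sentence (an assumption, not taken from the paper's data set).
text = "اردو زبان پاکستان کی قومی زبان ہے"
pattern = "زبان"

# Positions counted in characters (code points).
char_hits = sunday_search(text, pattern)

# Position of the first match counted in UTF-8 bytes. Urdu letters occupy
# more than one byte each, so this offset differs from the character index.
byte_offset = text.encode("utf-8").find(pattern.encode("utf-8"))

print("character positions:", char_hits)
print("first byte offset:  ", byte_offset)
print("parallel positions: ", parallel_search(text, pattern))
```

Searching over decoded characters, as the sketch does with Python's str type, plays the same role as the wchar_t representation in the abstract: every index refers to one character regardless of how many bytes it occupies in UTF-8.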

Full Text