Abstract

The Internet is a major source of textual data: users generate vast amounts of information daily through social media and news agencies. Extracting meaningful information from such large datasets is challenging and costly. Text pre-processing is a crucial first step in any Natural Language Processing (NLP) task because it affects the overall performance of the task. Its main objective is to transform unstructured text into a standard, linguistically meaningful form, which makes it easier to extract information for any text-processing task. This paper introduces TPTS, a text pre-processing model for the Sindhi language. TPTS performs essential NLP tasks for Sindhi: tokenization, normalization, stop-word removal, stemming, and part-of-speech (POS) tagging. The Sindhi Text Corpus (STC), consisting of 1.5k Sindhi text documents collected from various online news websites, is used for experimentation. A TF-IDF approach is used to identify high-frequency stop-words in Sindhi, and a rule-based system tags each word of the input text with its part of speech. Evaluated with the ROUGE metric, the proposed TPTS technique achieves 89% accuracy on the STC. Sindhi is spoken by over 30 million people worldwide, and the lack of adequate NLP tools and resources limits the development of technology and natural language applications that can benefit its speakers. The proposed TPTS model can aid in developing such applications: beyond text pre-processing, it is useful for other Sindhi text-processing tasks such as text summarization, sentiment analysis, speech processing, text mining, and information retrieval.
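To illustrate the stop-word step, the following is a minimal sketch of one way TF-IDF statistics could be used to surface stop-word candidates. It is not the TPTS implementation: the function names `tokenize` and `stop_word_candidates`, the punctuation set, and the ranking rule (lowest inverse document frequency first, ties broken by total term frequency) are all assumptions made for illustration, and the toy documents are English placeholders rather than STC data.

```python
import math
from collections import Counter

# Hypothetical punctuation set (ASCII plus a few Arabic-script marks);
# real Sindhi tokenization and normalization would need more care.
PUNCT = ".,!?;:\"'()[]{}،؛؟۔"


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation (illustrative only)."""
    tokens = (tok.strip(PUNCT) for tok in text.split())
    return [tok for tok in tokens if tok]


def stop_word_candidates(documents, top_k=10):
    """Rank terms as stop-word candidates.

    Assumption: a term that occurs in most documents has a low inverse
    document frequency (IDF) and is therefore a likely stop-word.
    Ties are broken by higher total term frequency, then alphabetically.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing the term
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    ranked = sorted(idf, key=lambda t: (idf[t], -term_freq[t], t))
    return ranked[:top_k]


docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a bird saw the cat",
]
print(stop_word_candidates(docs, top_k=2))
```

In practice the candidate list would be reviewed by a speaker of the language before being used as a stop-word list, since frequent content words can also score low on IDF in a single-domain corpus such as news text.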

