Development of Sindhi text corpus

Mazhar Ali Dootio, Asim Imdad Wagan

doi:10.1016/j.jksuci.2019.02.002

Abstract

Sindhi is a rich language with plenty of literary and general texts. A number of books, newspapers, magazines and internet materials are available for developing a Sindhi text corpus, yet no proper and useful corpus has been developed and presented online for research, language feature analysis, linguistic analysis and information retrieval systems. The lack of resources for research on computational linguistics and natural language processing (NLP) applications for the Sindhi language is a challenge at this stage. We have therefore developed Sindhi text corpora to provide text resources to computational linguists, NLP experts and researchers. Online books, newspapers, magazines, blogs and social websites are used to build the Sindhi text corpus. A sentiment-based Sindhi text corpus is developed and analyzed with Document Term Matrix and TF-IDF models using the 2-gram technique of the n-gram model. The corpus may be useful for research on language variation analysis, sentiment analysis, aspect-based sentiment analysis, semantic analysis, machine translation, information retrieval, Word2Vec, topic modeling and cluster analysis.
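
As a rough illustration of the analysis the abstract describes, the sketch below builds a Document Term Matrix over 2-grams and applies one common TF-IDF weighting (term frequency times log(N / document frequency)). It is not the authors' implementation: the function names, the whitespace tokenizer and the toy documents are assumptions made for this example, and real Sindhi text would need proper tokenization and normalization first.

```python
import math
from collections import Counter


def bigrams(text):
    """Split on whitespace and return adjacent word pairs (2-grams)."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def document_term_matrix(docs):
    """Return (vocab, rows); rows[i][j] counts bigram vocab[j] in docs[i]."""
    counts = [Counter(bigrams(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


def tf_idf(rows):
    """Weight raw counts by tf * log(N / df), one common TF-IDF variant."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency: number of documents containing each bigram.
    df = [sum(1 for r in rows if r[j] > 0) for j in range(n_terms)]
    weighted = []
    for r in rows:
        total = sum(r) or 1  # avoid division by zero for very short documents
        weighted.append(
            [(r[j] / total) * math.log(n_docs / df[j]) for j in range(n_terms)]
        )
    return weighted


# Hypothetical placeholder documents; the tokens stand in for Sindhi words.
docs = ["a b c", "a b d"]
vocab, rows = document_term_matrix(docs)
weights = tf_idf(rows)
```

With this weighting, a bigram that occurs in every document (here "a b") gets weight zero, while bigrams unique to one document get the highest weights.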


Journal: Journal of King Saud University - Computer and Information Sciences
Publication Date: Feb 11, 2019
Citations: 7
License type: cc-by-nc-nd


Similar Papers

An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods
Mazhar Ali ... Asim Imdad Wagan
Mehran University Research Journal of Engineering and Technology | VOL. 38
01 Jan 2019

TPTS: Text pre-processing Techniques for Sindhi Language
Ali Nawaz ... Muhammad Khalid
Pakistan Journal of Emerging Science and Technologies (PJEST) | VOL. 4
28 Jun 2023

Natural Language Processing in Strategy and Implementation
Tankiso Moloi ... Tshilidzi Marwala
01 Jan 2020

A Survey of Transformer and GNN for Aspect-based Sentiment Analysis
Wenqing Luo ... Wei Zhang
01 Sep 2021
