Abstract

Textual data is growing rapidly, and a range of techniques for extracting the required information from text are being researched. Some of these techniques require the data to be presented in a tabular or matrix format. The first approach builds a Document Term Matrix for a Marathi corpus (DTMM), converting unstructured text into a tabular format; however, it does not consider the semantics of the terms. We therefore propose a second approach that groups terms into synsets, which in turn reduces the number of dimensions, to build a Document Synset Matrix for a Marathi corpus (DSMM). Because synsets capture semantics better, this representation is context-based. We carry out document-similarity experiments with DSMM on a corpus of more than 1200 Marathi documents, comprising both verse and prose. Marathi text processing has remained a largely unexplored area. Precision, recall, accuracy, F1-score, and error rate are used to evaluate the improvement offered by the proposed technique.
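The evaluation measures named above can be sketched from confusion-matrix counts. This is a minimal illustration using the standard definitions; the function name `evaluation_metrics` and the counts in the usage line are hypothetical and are not results from the paper.

```python
def evaluation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    # Guard every division so empty inputs yield 0.0 instead of an error.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    error_rate = (fp + fn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "error_rate": error_rate,
    }


# Hypothetical counts, chosen only to exercise the function.
metrics = evaluation_metrics(tp=40, fp=10, fn=10, tn=40)
print(metrics)
```

Error rate is the complement of accuracy, so the two always sum to 1 for a non-empty evaluation set.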

Highlights

  • India is a diverse country with around 23 official languages, which has opened a wide area for natural language processing researchers

  • Marathi text is generated daily because many websites offer multilingual options

  • The similarity of more than 1000 Marathi documents, including prose and verse, is calculated using the Document Synset Matrix for Marathi (DSMM), which uses the semantic relationships between words to group similar terms into synsets
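The difference between a term-level matrix and a synset-level matrix can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the romanized tokens, the hand-made `synset_of` mapping, and the helper names `build_matrix` and `cosine` are all hypothetical, and cosine similarity is assumed as the similarity measure.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: romanized tokens standing in for Marathi words.
docs = {
    "d1": ["ghar", "sundar", "aahe"],
    "d2": ["sadan", "chhan", "aahe"],
}

# Hypothetical hand-made synset mapping (not taken from any real WordNet).
synset_of = {
    "ghar": "SYN_HOUSE", "sadan": "SYN_HOUSE",
    "sundar": "SYN_NICE", "chhan": "SYN_NICE",
}


def build_matrix(docs, mapping=None):
    """Return (columns, rows); rows[name] is a count vector over columns.

    With mapping=None the columns are raw terms (term matrix); with a
    mapping, terms are first replaced by their synset label (synset matrix).
    """
    mapped = {
        name: [mapping.get(t, t) if mapping else t for t in tokens]
        for name, tokens in docs.items()
    }
    columns = sorted({t for tokens in mapped.values() for t in tokens})
    rows = {}
    for name, tokens in mapped.items():
        counts = Counter(tokens)
        rows[name] = [counts[c] for c in columns]
    return columns, rows


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


term_cols, term_rows = build_matrix(docs)
syn_cols, syn_rows = build_matrix(docs, synset_of)

print(len(term_cols), cosine(term_rows["d1"], term_rows["d2"]))
print(len(syn_cols), cosine(syn_rows["d1"], syn_rows["d2"]))
```

In this toy case the synset mapping shrinks the matrix from five columns to three, and the two documents, which share only one literal term, become near-identical once synonymous terms are merged.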

Summary

Introduction

India is a diverse country with around 23 official languages, which has opened a wide area for natural language processing researchers. Marathi text is generated daily because many websites offer multilingual options. To process this data, natural language processing (NLP) techniques [18], along with machine learning algorithms, are available in the literature. Prose and verse act as a guide to children about their behavior. Document similarity determines how close two pieces of text are, both semantically and lexically. The qualitative approach deals with the sentiment and general meaning of the corpus, whereas the quantitative approach considers numerical measures such as the total number of tokens and the size of the document.
