Marathi Text Analysis using Unsupervised Learning and Word Cloud

Prafulla B Bafna, Jatinderkumar R Saini

doi:10.35940/ijeat.c4727.029320

Abstract

Managing a large number of textual documents is a critical task that supports many applications, ranging from information retrieval to clustering search engine results. Marathi is one of the oldest regional languages in the Indo-Aryan language family, dating from about AD 1000. The abundance of Marathi literature has produced a large corpus and a need to summarize its information. The objective of this study is to overcome the scalability problem in managing documents and to summarize the Marathi corpus by extracting tokens. The approach scales better and maintains consistent cluster quality on incremental data sets. Most past and contemporary research has targeted document management for English corpora. Marathi corpora have mostly been used by researchers to explore stemming, single-document summarization, and classifier design. Applying unsupervised learning to a Marathi corpus to summarize multiple documents through a word cloud remains an unexplored area. Technically, the current work applies TF-IDF, cosine-based document similarity measures, and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities. Entropy and precision are used to evaluate the experiments carried out on different datasets, and the results show the robustness of the proposed approach for Marathi corpora.
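As a rough illustration of the TF-IDF and cosine-similarity steps named in the abstract, the sketch below weights tokens and compares documents using only the Python standard library. It is not the authors' implementation: the whitespace tokenizer, the unsmoothed IDF formula, the function names, and the three toy Marathi sentences are all assumptions chosen for illustration. The resulting pairwise similarities are the kind of input a hierarchical clustering routine would use to build a dendrogram.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {token: weight} dict per whitespace-tokenized document.

    Assumption: tf = count / document length, idf = log(N / df), no smoothing.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        vectors.append({
            token: (count / total) * math.log(n_docs / df[token])
            for token, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {token: weight} vectors."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus (not taken from the paper's datasets).
docs = [
    "पुणे शहर शिक्षण केंद्र",
    "पुणे शहर उद्योग केंद्र",
    "क्रिकेट सामना आज",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents share three tokens
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents share no tokens
```

Documents that share vocabulary score higher than documents that share none, which is what lets a clustering step group related Marathi documents before their tokens are summarized in a word cloud.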

Highlights

In online and offline systems, documents are continuously generated, stored, and accessed every day in large volumes.
Fuzzy K-means (FKM) gives better entropy as the data size grows: for 300 documents, the entropy produced by FKM is improved by 40%.
The current study achieves the summarization of the clusters of Marathi corpus, unlike the other published research works which have focused only on single-document summarization.
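The entropy measure quoted in the highlights is not defined in this excerpt. The sketch below assumes the commonly used size-weighted cluster entropy, where lower values mean purer clusters. The function name `cluster_entropy`, the base-2 logarithm, and the example labels are illustrative assumptions rather than the authors' exact formulation.

```python
import math
from collections import Counter


def cluster_entropy(cluster_labels, true_labels):
    """Size-weighted average entropy of true classes within each cluster.

    Assumption: H = sum_j (n_j / n) * (-sum_i p_ij * log2(p_ij)),
    where p_ij is the fraction of cluster j belonging to class i.
    """
    n = len(cluster_labels)
    members_by_cluster = {}
    for cluster, label in zip(cluster_labels, true_labels):
        members_by_cluster.setdefault(cluster, []).append(label)

    total = 0.0
    for members in members_by_cluster.values():
        size = len(members)
        counts = Counter(members)
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        total += (size / n) * h
    return total


# Hypothetical labels: each cluster holds a single topic (pure clustering).
pure = cluster_entropy([0, 0, 1, 1], ["sports", "sports", "politics", "politics"])
# Hypothetical labels: each cluster is an even mix of two topics.
mixed = cluster_entropy([0, 0, 1, 1], ["sports", "politics", "sports", "politics"])
```

Under this definition, an algorithm that lowers entropy on a growing document set is producing clusters whose members more consistently share one topic.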

Summary

Introduction

In online and offline systems, documents are continuously generated, stored, and accessed every day in large volumes. Information technology has generated huge amounts of data on the internet [44]. This data is mainly in English, so the majority of data mining research addresses English text documents, and most work in document management likewise focuses on English corpora. However, Marathi text on the web has come of age since the advent of Unicode standards for Indic languages. As internet usage has increased, so has data in other languages such as Marathi, Tamil, Telugu, and Punjabi. Within India, Maharashtra, Gujarat, and Tamil Nadu are highly industrialized states. The largest recent indigenous empire of India was the Maratha empire.

Objectives

Methods

Results

Conclusion