Hindi Multi-document Word Cloud based Summarization through Unsupervised Learning

Prafulla B Bafna, Jatinderkumar R Saini

doi:10.1109/icetet-sip-1946815.2019.9092259

Abstract

Managing documents is a critical task that supports many applications, ranging from information retrieval to clustering search engine results. The multilingual support provided by websites makes Hindi a major language in today's digital domain. This work focuses on document management and summarization of a Hindi corpus. The objective is to manage the documents and summarize the corpus by extracting tokens and clustering documents. The work offers better scalability and supports consistent cluster quality for incremental data sets. Most past and contemporary research has targeted document management for English corpora. Hindi corpora have mostly been used by researchers to explore stemming, single-document summarization, and classifier design. Applying unsupervised learning to a Hindi corpus to summarize multiple documents through a Word Cloud is still an untouched area. Technically, the current work is an application of TF-IDF, cosine-based document similarity measures, and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities. Entropy and precision are used to evaluate the experiments carried out on different live and available/tested datasets and results
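
The pipeline named in the abstract (TF-IDF weighting, cosine similarity between documents, hierarchical clustering, and per-cluster term weights that a word cloud could display) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes whitespace tokenization with no stemming or stop-word removal, uses average-linkage merging as one possible way to build the cluster hierarchy, and runs on six toy Hindi documents invented for illustration. The function names `tfidf_vectors`, `cosine`, `agglomerate`, and `cloud_weights` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {token: tf-idf weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    return [
        {tok: (cnt / len(toks)) * math.log(n_docs / df[tok])
         for tok, cnt in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering over cosine similarity.

    Returns the final clusters and the merge history, which is the
    information a dendrogram would draw.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        # Merge the two clusters whose members are most similar on average.
        x, y = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges


def cloud_weights(vectors, cluster, top_k=3):
    """Sum tf-idf weights over a cluster; larger totals mean larger cloud words."""
    totals = Counter()
    for i in cluster:
        totals.update(vectors[i])
    return totals.most_common(top_k)


# Toy corpus (invented for illustration): three cricket and three weather items.
docs = [
    "भारत क्रिकेट मैच जीत",
    "भारत क्रिकेट टीम खिलाड़ी",
    "क्रिकेट मैच भारत कप्तान",
    "बारिश मौसम बाढ़ चेतावनी",
    "मौसम विभाग बारिश अनुमान",
    "बारिश मौसम चेतावनी किसान",
]

vectors = tfidf_vectors(docs)
clusters, merges = agglomerate(vectors, n_clusters=2)

for cluster in clusters:
    terms = [tok for tok, _ in cloud_weights(vectors, cluster)]
    print(sorted(cluster), terms)
```

On this toy corpus the two topics share no tokens, so every cross-topic similarity is zero and the merging separates the cricket documents from the weather documents. A real Hindi pipeline would also need the other NLP steps the abstract alludes to, such as stop-word removal and stemming, before the weights are meaningful.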

Full Text