Abstractive Text Summarization (ATS) is the task of creating a novel summary by generating fresh sentences that introduce new words or rephrase the source article. It is a complex task because the model must understand the semantic relationships between the sentences of the text. Training such models requires a large annotated benchmark dataset, which is available for resource-rich languages such as English and other non-Indic languages. In contrast, for less-resourced languages such as the Indic languages, the available datasets are limited and contain only very short summaries. Hence, a language-specific abstractive summarization dataset for Hindi, called HindiSumm, is introduced, consisting of 570,000 text-summary pairs from Navbharat Times across 21 domains. The quality of the HindiSumm dataset is evaluated both intrinsically and extrinsically using several metrics. Furthermore, two recent multilingual cased pre-trained models are each fine-tuned on the HindiSumm dataset. In addition, an ensemble approach using weighted averaging is applied to check the efficacy of the proposed dataset. The models are tested on the in-house dataset, and ROUGE scores show a significant improvement of around 13.2% for the proposed HindiSumm compared with other benchmark datasets. In the future, the HindiSumm dataset is expected to promote the progress of ATS for Indian languages.
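
To make the ROUGE-based evaluation mentioned above concrete, the following is a minimal sketch of ROUGE-1 (unigram overlap) between a reference summary and a generated one. It is a simplified illustration and not the authors' evaluation code: the function name `rouge_1` is hypothetical, tokenization is assumed to be plain whitespace splitting, and official ROUGE implementations may tokenize and normalize text differently, so their scores can differ.

```python
from collections import Counter


def rouge_1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between two summaries.

    Assumption: whitespace tokenization, no stemming or normalization.
    """
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())

    # Counter intersection keeps the minimum count of each shared token,
    # so a repeated word is only credited as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())

    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())

    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    reference = "भारत ने मैच जीत लिया"
    candidate = "भारत ने मैच जीता"
    print(rouge_1(reference, candidate))
```

Because the sketch splits on whitespace only, it works on Devanagari text without any language-specific preprocessing, but it treats inflected forms of the same word as different tokens.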