Abstract
Abstractive Text Summarization (ATS) is a task to create a novel summary by generating fresh sentences incorporating new words or rephrasing the article. It is a complex task as the model needs to understand the semantic similarity between the sentences of the text. To fulfill this, there is a need for a large annotated benchmark dataset, which is available for resource-rich languages such as English and non-indic languages. In contrast, for the less-resourced languages, such as Indic languages, the available datasets are limited and involve very short summary sentences. Hence, a language-specific abstractive summarization dataset called HindiSumm was introduced for Hindi, consisting of 570,000 text-summary pairs from Navbharat Times across 21 domains. The HindiSumm dataset’s efficiency is evaluated extrinsically and intrinsically by using various metrics. Furthermore, two recent multilingual-cased pre-trained models are fine-tuned on the HindiSumm dataset individually. In addition, an ensembled approach using weighted averaging is also incorporated to check the efficacy of the proposed dataset. The model is tested with the in-house created dataset, and results are evaluated on ROUGE scores and show significant improvements of around 13.2% for the proposed HindiSumm compared to other benchmark datasets. In the future, the HindiSumm dataset will promote the progress of ATS for the Indian language.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: ACM Transactions on Asian and Low-Resource Language Information Processing
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.