Abstract

Automatic summarization of text documents is a widely researched domain in natural language processing. Most of this research targets the most commonly spoken languages in the world. Automatic text summarization needs to be extended to less widely used languages to help sustain them and promote their use. This calls for a language-independent summarization system that can be readily extended to other such languages, which may have only limited resources for this kind of research. In this paper, we examine the efficiency of supervised linear regression models for single-document extractive automatic text summarization on a Konkani language folktales dataset. We use 13 language-independent features and linear regression models to learn feature weights. These weights are then used to calculate each sentence's score, and the top-ranking sentences are chosen to generate the summary. We employ a k-fold evaluation strategy to compare the system-generated summary against a human-generated summary using the ROUGE evaluation toolkit. We also evaluate the effect of L1 and L2 regularization on the summarization task. This work represents an early attempt at automatic text summarization for the Konkani language, and the dataset used in these experiments is unique and was devised specifically to facilitate research in this domain. The language-independent features can be readily extended to other low-resource languages. The systems implemented in this work outperformed an unsupervised system based on the k-means approach and also beat the baseline systems.

Keywords: Supervised machine learning; Text summarization; Konkani; Regression; Extractive; Natural language processing; Low resource; Language-independent features
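
The scoring pipeline outlined in the abstract (learn feature weights by linear regression, score each sentence, keep the top-ranking ones) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `fit_weights`, `score`, and `summarize` are hypothetical, the toy data uses 2 features rather than the 13 the authors describe, the regression target is assumed to be a per-sentence relevance label, weights are fitted by plain gradient descent, and only an L2 (ridge) penalty is shown.

```python
from typing import List, Sequence


def fit_weights(X: Sequence[Sequence[float]], y: Sequence[float],
                l2: float = 0.0, lr: float = 0.1, epochs: int = 2000) -> List[float]:
    """Fit linear-regression weights by batch gradient descent.

    Minimizes mean squared error plus l2 * ||w||^2 (l2 = 0 means no penalty).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad[j] += 2.0 * err * xi[j] / n
        # Gradient step; the ridge penalty contributes 2 * l2 * w.
        w = [wj - lr * (gj + 2.0 * l2 * wj) for wj, gj in zip(w, grad)]
    return w


def score(w: Sequence[float], x: Sequence[float]) -> float:
    """Sentence score: weighted sum of its feature values."""
    return sum(wj * xj for wj, xj in zip(w, x))


def summarize(sentences: Sequence[str], features: Sequence[Sequence[float]],
              w: Sequence[float], k: int) -> List[str]:
    """Pick the k highest-scoring sentences.

    Returning them in document order is an assumption of this sketch.
    """
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(w, features[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


# Toy data (hypothetical): one row of feature values per sentence,
# with an assumed relevance label of 1.0 for sentences in a reference summary.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_labels = [1.0, 0.0, 1.0, 0.0]
toy_sentences = ["s1", "s2", "s3", "s4"]

weights = fit_weights(toy_features, toy_labels)
print(summarize(toy_sentences, toy_features, weights, k=2))
```

Passing a positive `l2` to `fit_weights` shrinks the learned weights toward zero, which is the general effect of L2 regularization; an L1 penalty would need a different update rule and is left out of this sketch.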

