Abstractive text summarization for the Malayalam language is still in its infancy. The lack of benchmark datasets for this task is one of the main constraints on developing and testing good models. Malayalam has seven nominal case forms, two nominal number forms, and three gender forms, and it is highly agglutinative and inflected. As a result, translating existing text summarization datasets into Malayalam may not capture these forms effectively, so datasets curated from scratch are needed for specific text-processing applications in Malayalam. This paper introduces a novel dataset designed specifically to advance automatic abstractive text summarization in Malayalam. The dataset, named Social-sum-Mal, is curated to address the unique linguistic characteristics of the language and supports three types of summarization tasks: long, extreme, and query-based summarization. In addition, Social-sum-Mal can be extended to other applications such as text classification, multi-document summarization, and question answering. To enhance dataset transparency, a datasheet is created for Social-sum-Mal. Data accuracy and annotator bias are evaluated using Jaccard, cosine, and overlap similarity measures. The correctness of the dataset is further evaluated by comparing it with a deep-learning-based text summarization model.
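For readers unfamiliar with the three similarity measures named above, the following is a minimal sketch of how they are commonly computed between two texts, for example two annotators' summaries of the same source. It is not the paper's implementation: the function names are hypothetical, and the whitespace tokenizer is an assumption that is crude for an agglutinative language such as Malayalam, where a morphological analyzer or subword tokenizer would likely be preferable.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization (illustrative only).
    return text.split()


def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B| over the sets of distinct tokens.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(sa & sb) / len(sa | sb)


def overlap_similarity(a, b):
    # Overlap coefficient: |A ∩ B| / min(|A|, |B|).
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def cosine_similarity(a, b):
    # Cosine of the angle between raw term-frequency vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

All three scores lie in the range 0 to 1. Jaccard penalizes length differences between the two texts, whereas the overlap coefficient reaches 1 whenever the shorter text's vocabulary is fully contained in the longer one, which is one reason the measures are often reported together.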