An Efficient Training Dataset Generation Method for Extractive Text Summarization

Esther Hannah, Saswati Mukherjee

doi:10.1007/978-81-322-1602-5_101

Abstract

This work presents a method for automatically generating a training dataset for text document summarization using feature extraction. The goal is to design a dataset that helps a system summarize much as a human would. A document summary is a text produced from one or more source texts that conveys the important information in those texts. The proposed system consists of three stages: pre-processing, feature extraction, and training dataset generation. To implement the system, 50 test documents from DUC2002 are used. Each document is cleaned with pre-processing techniques such as sentence segmentation, tokenization, stop-word removal, and word stemming. Eight important features are extracted for each sentence and converted into attributes of the training dataset. A proper, high-quality training dataset is needed to achieve good quality in document summarization, and the proposed system aims to generate a well-defined training dataset that is sufficiently large and noise-free for text summarization. The training dataset uses a set of common features that can be applied to all subtasks of data mining. Preliminary subjective evaluation shows that our training dataset is effective and efficient, and that the performance of the system is promising.
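The pipeline named in the abstract (sentence segmentation, tokenization, stop-word removal, stemming, then one attribute row per sentence) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the stop-word list is a tiny hypothetical one, `crude_stem` is a toy suffix stripper standing in for a real stemmer, and the two attributes (`position`, `length`) are generic examples, not the eight features used in the paper, which the abstract does not list.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for", "that", "this", "it"}


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", sentence.lower())


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = [crude_stem(t) for t in tokenize(sentence) if t not in STOP_WORDS]
        result.append(tokens)
    return result


def to_rows(text):
    """Build one attribute row per sentence with two illustrative features."""
    sentences = preprocess(text)
    n = len(sentences)
    rows = []
    for i, tokens in enumerate(sentences):
        rows.append({
            "position": 1.0 - i / n,  # earlier sentences score higher
            "length": len(tokens),    # number of content tokens kept
        })
    return rows
```

Calling `to_rows(document_text)` yields a list of dictionaries, one per sentence, which could be written out as rows of a training table. A real system would replace the toy stemmer and stop-word list with established resources and add the remaining features.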
