Identifying High Quality Document–Summary Pairs through Text Matching

Yongshuai Hou, Qingcai Chen, Xiaolong Wang, Buzhou Tang, Yang Xiang, Fangze Zhu

doi:10.3390/info8020064

Abstract

Text summarization, namely automatically generating a short summary of a given document, is a difficult task in natural language processing. Deep learning has gradually been applied to text summarization, but large-scale, high-quality datasets for training such models are still lacking. In this paper, we propose a novel deep learning method to identify high-quality document–summary pairs for building a large-scale dataset of such pairs. Concretely, a long short-term memory (LSTM)-based model was designed to measure the quality of document–summary pairs. To leverage information across all parts of each document, we further propose an improved LSTM-based model that removes the forget gate from the LSTM unit. Experiments on the training set and the test set built from Sina Weibo (a Chinese microblog website similar to Twitter) showed that the LSTM-based models significantly outperformed the baseline models in terms of the area under the receiver operating characteristic curve (AUC).
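As a rough illustration of what removing the forget gate can mean, the block below first writes out the standard LSTM update and then one possible gate-free variant. The symbols (W, U, b, the candidate state \tilde{c}_t) are the usual textbook notation and do not come from this page. The second cell-state equation is an assumption made here for illustration only; the paper's exact formulation may differ.

```latex
% Standard LSTM unit (textbook form)
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}

% Assumed variant without a forget gate: the previous cell state is
% carried over unchanged, so earlier parts of the document are never
% explicitly discarded.
c_t = c_{t-1} + i_t \odot \tilde{c}_t
```

Under this reading, every token's contribution stays in the cell state, which is consistent with the stated goal of using information from all parts of the document.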

Highlights

In the era of the Internet, we share our experiences and exchange information with each other through multimedia channels such as instant messaging, question-answering communities, and microblogs.
The experiments were arranged as follows: (1) we compared the area under the receiver operating characteristic curve (AUC) values of the baselines and the proposed methods to examine the learning ability of the long short-term memory (LSTM)-based models on the training set and the test set; (2) as the size of the labeled training set was limited, we examined how the training size affects the AUC values, to judge whether annotating more samples could improve the current results; (3) because document–summary pair identification is treated as a binary classification problem in our study, we compared models trained on the two-class dataset and on the five-class dataset; (4) as LSTM units can retain information, as described in Section 4.3, we examined how the results change as the dimension of the character embeddings increases.
The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) method outperformed support vector machines (SVM) by 5.74% and convolutional neural networks (CNN) by 2.97% on the test set, and even outperformed LSTM-I by 0.19% on the training set, which shows that the number of words shared by a document and its summary is an effective feature for identifying document–summary pairs.
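The two ideas in these highlights, scoring a pair by shared words and comparing methods by AUC, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `overlap_recall` is a hypothetical ROUGE-1-style score (the fraction of summary tokens that also occur in the document), `auc` uses the pairwise-ranking definition of AUC, and the toy pairs are invented. Whitespace tokenization is an assumption that would not suit Chinese text, where characters or a word segmenter would be needed.

```python
from collections import Counter


def overlap_recall(doc_tokens, summary_tokens):
    """Fraction of summary tokens that also occur in the document (clipped counts)."""
    doc_counts = Counter(doc_tokens)
    sum_counts = Counter(summary_tokens)
    total = sum(sum_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(n, doc_counts[tok]) for tok, n in sum_counts.items())
    return matched / total


def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy (document, summary, label) triples, invented for illustration only.
pairs = [
    ("the cat sat on the mat", "cat on mat", 1),
    ("stocks rose sharply today", "stocks rose", 1),
    ("the cat sat on the mat", "buy cheap pizza", 0),
]
scores = [overlap_recall(d.split(), s.split()) for d, s, _ in pairs]
labels = [y for _, _, y in pairs]
print(auc(scores, labels))
```

A learned model such as the LSTM-based ones would replace `overlap_recall` as the scoring function, while the AUC computation would stay the same.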

Summary

Introduction

In the era of the Internet, we share our experiences and exchange information with each other through multimedia channels such as instant messaging, question-answering communities, and microblogs. Many of us receive hundreds or thousands of messages every day, posted by official news agencies, professional commentators, or people we follow, and most of them offer valuable information. Processing all this information is a heavy load, so it becomes increasingly important to capture the main ideas that the messages convey. Automatic text summarization is the task of generating short summaries from given long documents [1]. A summary contains the main idea of a given document, which is essentially what the document aims to express. A weibo is a microblog message from Sina Weibo (http://www.weibo.com), one of the most popular Chinese microblog websites.

Journal: Information
Publication Date: Jun 12, 2017
Citations: 3
License type: CC BY 4.0
