Abstract
Unstructured textual news data are produced every day; analyzing these data with an abstractive summarization algorithm provides advanced analytics to decision-makers. Deep learning networks with a copy mechanism are finding increasing use in abstractive summarization, because the copy mechanism allows sequence-to-sequence models to choose words from the input and place them directly in the output. However, since Chinese sentences have no explicit word delimiters, most existing models for Chinese abstractive summarization can only copy individual characters, which is inefficient. To solve this problem, we propose a lexicon-constrained copying network that models multi-granularity in both the encoder and the decoder. On the source side, words and characters are aggregated into the same input memory using a Transformer-based encoder. On the target side, the decoder can copy either a character or a multi-character word at each time step, and the decoding process is guided by a word-enhanced search algorithm that facilitates parallel computation and encourages the model to copy more words. Moreover, we adopt a word selector to integrate keyword information. Experimental results on a Chinese social media dataset show that our model can work standalone or with the word selector; both forms outperform previous character-based models and achieve competitive performance.
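The core idea of lexicon-constrained copying is that a copy target in an unsegmented Chinese source can be either a single character or a multi-character span that matches a word in a lexicon. A minimal sketch of how such copy candidates could be enumerated is shown below; the function name, the toy lexicon, and the `max_word_len` cutoff are illustrative assumptions, not the paper's implementation.

```python
def copy_candidates(source, lexicon, max_word_len=4):
    """Enumerate copy candidates from an unsegmented source string:
    every single character, plus every multi-character span that
    appears in the lexicon (up to max_word_len characters long)."""
    candidates = set(source)  # single characters are always copyable
    for i in range(len(source)):
        for j in range(i + 2, min(i + max_word_len, len(source)) + 1):
            span = source[i:j]
            if span in lexicon:
                candidates.add(span)
    return candidates

# Toy example: "自然语言处理" with a three-word lexicon.
lexicon = {"自然语言", "语言", "处理"}
print(sorted(copy_candidates("自然语言处理", lexicon)))
```

A character-only copy model would be restricted to the six single-character candidates; the lexicon constraint adds the three word-level spans, letting the decoder emit a whole word in one step.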
Highlights
In recent years, abstractive summarization [1] has made impressive progress with the development of the sequence-to-sequence framework [2, 3]
The gap between our lexicon-constrained copying network (LCCN) and the vanilla Transformer is further widened to 1.8 ROUGE-1, 2.1 ROUGE-2, and 2.5 ROUGE-L, which confirms the superiority of lexicon-constrained copying over character-based copying
Compared to other recent models, our LCCN achieves state-of-the-art performance in terms of ROUGE-1 and ROUGE-2 and is second only to the Keyword and Generated Word Attention (KGWA) model in terms of ROUGE-L
Summary
Abstractive summarization [1] has made impressive progress with the development of the sequence-to-sequence (seq2seq) framework [2, 3]. This framework consists of an encoder and a decoder. The encoder processes the source text and extracts the information the decoder needs to predict each word of the summary. Thanks to their generative nature, abstractive summaries can include novel expressions never seen in the source text; at the same time, they are more difficult to produce than extractive summaries [4, 5], which are formed by directly selecting a subset of the source text. The copy mechanism bridges the two approaches: it allows the decoder both to generate summary words from scratch and to copy words directly from the source text
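At each decoding step, a copy mechanism typically mixes a generation distribution over the vocabulary with a copy distribution induced by attention over source positions, in the style of pointer-generator networks. The sketch below illustrates that mixture with plain Python dictionaries; the function name and the toy numbers are assumptions for illustration, not the paper's exact formulation.

```python
def copy_mixture(p_gen, vocab_dist, attn, source_tokens):
    """Combine generation and copy probabilities:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention
    weights at source positions whose token equals w."""
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for pos, tok in enumerate(source_tokens):
        final[tok] = final.get(tok, 0.0) + (1 - p_gen) * attn[pos]
    return final

# Toy step: the model leans toward copying (p_gen = 0.3), and
# attention concentrates on the second source token.
dist = copy_mixture(
    p_gen=0.3,
    vocab_dist={"新闻": 0.6, "报道": 0.4},
    attn=[0.1, 0.8, 0.1],
    source_tokens=["今天", "新闻", "摘要"],
)
```

Because the copy term is keyed by token rather than position, a word that appears both in the vocabulary and in the source (here "新闻") accumulates probability from both routes, which is what lets the decoder reproduce source words it could not generate reliably on its own.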