Multidocument Arabic Text Summarization Based on Clustering and Word2Vec to Reduce Redundancy

Samer Abdulateef, Xuequn Shang, Naseer Ahmed Khan, Bolin Chen

doi:10.3390/info11020059

Open Access

https://doi.org/10.3390/info11020059

Journal: Information
Publication Date: Jan 23, 2020
Citations: 32
License type: CC BY 4.0

Affiliation: Northwestern Polytechnical University

Abstract

Arabic is one of the most semantically and syntactically complex languages in the world. Text summarization is a key challenge in text mining, so we propose an unsupervised score-based method that combines the vector space model, continuous bag of words (CBOW), clustering, and a statistical method. The problems with multidocument text summarization are noisy data, redundancy, diminished readability, and sentence incoherency. In this study, we adopt a preprocessing strategy to solve the noise problem and use the word2vec model for two purposes: first, to map the words to fixed-length vectors and, second, to obtain the semantic relationship between each vector based on the dimensions. Similarly, we use the k-means algorithm for two purposes: (1) selecting the distinctive documents and tokenizing these documents into sentences, and (2) using another iteration of the k-means algorithm to select the key sentences based on a similarity metric, which overcomes the redundancy problem and generates the initial summary. Lastly, we use weighted principal component analysis (W-PCA) to map the sentences' encoded weights based on a list of features. The sentences with the highest weights, which correspond to the important sentences, are then selected to address the incoherency and readability problems. We adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the evaluation measure to examine our proposed technique and compare it with state-of-the-art methods. Finally, an experiment on the Essex Arabic Summaries Corpus (EASC) using the ROUGE-1 and ROUGE-2 metrics showed promising results in comparison with existing methods.
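To make the evaluation measure concrete, the following is a minimal sketch of ROUGE-N recall, the n-gram overlap between a candidate summary and a reference summary divided by the number of n-grams in the reference. The function names, the whitespace tokenization, and the English toy sentences are assumptions made for illustration; this is not the evaluation setup used in the study, and a real evaluation of Arabic text would need proper tokenization and normalization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap / n-grams in the reference.

    Tokenization by whitespace is a simplifying assumption.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (clipped counts).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Toy example (hypothetical sentences, not taken from EASC).
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n_recall(candidate, reference, n=1)  # ROUGE-1 recall
r2 = rouge_n_recall(candidate, reference, n=2)  # ROUGE-2 recall
print(r1, r2)
```

ROUGE-1 rewards shared vocabulary, while ROUGE-2 also rewards matching word order, which is why the two are typically reported together.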

Highlights

Automatic text summarization (ATS) is a technique designed to automatically extract salient information from related documents, which helps to produce a summarized document from a related set of documents [1].
We proposed an unsupervised technique to overcome the problems with Arabic natural language processing (ANLP), as Arabic is one of the most complex languages in the world.
We used an unsupervised technique for multidocument Arabic text summarization and focused on text summarization problems such as noisy information, redundancy elimination, and sentence ordering.

Summary

Introduction

Automatic text summarization (ATS) is a technique designed to automatically extract salient information from related documents, which helps to produce a summarized document from a related set of documents [1]. The amount of text data is increasing rapidly in areas such as news, official documents, and medical reports, so there is a need to compress such data using machine learning techniques, and text summarization can assist in extracting the significant sentences from various related documents [2]. The main problems related to document summarization are redundancy, noisy information, incoherency, and diminished readability [3]. Text clustering is used to eliminate redundancy by grouping the sentences into semantically correlated clusters. Text summarization is then used to select the key sentences (those rich in significant information) from the correlated documents. When selecting two sentences and making the

Information 2020, 11, 59; doi:10.3390/info11020059 www.mdpi.com/journal/information
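The idea of clustering sentences and keeping one key sentence per cluster to reduce redundancy can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 2-D vectors stand in for sentence embeddings (which in the study come from word2vec), the deterministic "first k points" initialization is chosen only to keep the example reproducible, and the helper names `kmeans` and `select_representatives` are hypothetical.

```python
import math


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; initial centroids are the first k points (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def select_representatives(vectors, k):
    """Return the index of the sentence closest to each cluster centroid."""
    labels, centroids = kmeans(vectors, k)
    chosen = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            chosen.append(min(members,
                              key=lambda i: dist(vectors[i], centroids[c])))
    return sorted(chosen)


# Toy sentence vectors: two groups of near-duplicate sentences.
sentence_vectors = [
    [0.0, 0.0], [5.0, 5.0], [0.2, 0.0],
    [0.1, 0.1], [5.2, 5.0], [5.1, 4.9],
]
selected = select_representatives(sentence_vectors, k=2)
print(selected)  # indices of one representative sentence per cluster
```

Keeping only the sentence nearest each centroid means that near-duplicate sentences, which fall into the same cluster, contribute a single sentence to the initial summary.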

Results

Discussion

Conclusion