Big Data Full-Text Search Index Minimization Using Text Summarization

Waheed Iqbal, Waqas Ilyas Malik, Khaled Mohamad Almustafa, Faisal Bukhari, Zubiar Nawaz

doi:10.5755/j01.itc.50.2.25470

Open Access

https://doi.org/10.5755/j01.itc.50.2.25470

Abstract

An efficient full-text search is achieved by indexing the raw data with an additional 20 to 30 percent storage cost. In the context of Big Data, this additional storage space is huge and introduces challenges to entertain full-text search queries with good performance. It also incurs overhead to store, manage, and update the large size index. In this paper, we propose and evaluate a method to minimize the index size to offer full-text search over Big Data using an automatic extractive-based text summarization method. To evaluate the effectiveness of the proposed approach, we used two real-world datasets. We indexed actual and summarized datasets using Apache Lucene and studied average simple overlapping, Spearman’s rho correlation, and average ranking score measures of search results obtained using different search queries. Our experimental evaluation shows that automatic text summarization is an effective method to reduce the index size significantly. We obtained a maximum of 82% reduction in index size with 42% higher relevance of the search results using the proposed solution to minimize the full-text index size.
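To make the comparison of search results more concrete, here is a minimal Python sketch of how two ranked result lists for the same query (one from the index of the actual data, one from the index of the summarized data) might be compared. The function names and the exact definitions are assumptions for illustration: overlap is taken as the fraction of baseline results that also appear in the summarized results, and Spearman's rho is computed only over documents present in both lists, assuming no ties and no duplicate document IDs. The paper's own definitions may differ.

```python
def simple_overlap(baseline, summarized):
    """Fraction of baseline results that also appear in the summarized-index results.

    Assumed definition for illustration; not taken from the paper.
    """
    if not baseline:
        return 0.0
    return len(set(baseline) & set(summarized)) / len(baseline)


def spearman_rho(baseline, summarized):
    """Spearman's rho over the documents shared by both ranked lists (no ties)."""
    in_summarized = set(summarized)
    common = [doc for doc in baseline if doc in in_summarized]
    n = len(common)
    if n < 2:
        return None  # correlation is undefined for fewer than two shared documents
    in_common = set(common)
    # Re-rank the shared documents 0..n-1 by their order within each list
    rank_base = {doc: r for r, doc in enumerate(common)}
    rank_summ = {
        doc: r
        for r, doc in enumerate(d for d in summarized if d in in_common)
    }
    d2 = sum((rank_base[doc] - rank_summ[doc]) ** 2 for doc in common)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical document IDs returned for one query
baseline_hits = ["d1", "d2", "d3", "d4"]
summarized_hits = ["d2", "d1", "d5", "d3"]
print(simple_overlap(baseline_hits, summarized_hits))
print(spearman_rho(baseline_hits, summarized_hits))
```

Averaging these values over a set of queries would give per-dataset figures comparable in spirit to the averaged measures named in the abstract.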

Highlights

On increasing the size of datasets, the performance of Lucene decreases significantly
We propose and evaluate a method to minimize the index size to offer full-text search over Big Data using an automatic extractive-based text summarization method
The main contributions of this paper include: (1) we propose an automatic extractive-based text summarization for Big Data index minimization for the full-text search problem; (2) we evaluate the effectiveness of the proposed method by studying relevance and overlapping of the search query results with baseline datasets; (3) we study the effect of different text summarization threshold levels on data index minimization and search results

Summary

Introduction

Recent advancements and adaptation of technology are contributing to growing digital data exponentially. ... can be reduced to a smaller representative dataset for indexing to offer full-text search. The expected line is plotted by fitting the line using small size datasets varying from 1 GB to 10 GB ... with actual datasets. This shows that on increasing the size of datasets, the performance of Lucene decreases significantly. We propose an automatic extractive-based text summarization for Big Data index minimization for the full-text search problem. The main contributions of this paper include: (1) we propose an automatic extractive-based text summarization for Big Data index minimization for the full-text search problem; (2) we study the effect of different text summarization threshold levels on data index minimization and search results.
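As a rough illustration of extractive summarization with a tunable threshold, the following Python sketch scores each sentence by cosine similarity to a document centroid built from raw term counts and keeps only the highest-scoring fraction of sentences. Everything here is an assumption for illustration: the `summarize` function, the `keep_ratio` parameter, the naive sentence splitter, and the use of raw counts rather than TF-IDF weights are not taken from the paper.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on ., ! and ?; a real pipeline would use a proper tokenizer
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphanumeric tokens
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def summarize(text, keep_ratio=0.5):
    """Keep the sentences closest to the document centroid, in original order.

    keep_ratio is a hypothetical threshold: the fraction of sentences retained.
    """
    sentences = split_sentences(text)
    if not sentences:
        return ""
    vectors = [Counter(tokenize(s)) for s in sentences]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)  # centroid = summed term counts of all sentences
    scores = [cosine(v, centroid) for v in vectors]
    keep = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))


doc = "Lucene builds an index. The index maps terms. The index is large. Cats sleep."
print(summarize(doc, keep_ratio=0.5))
```

Indexing the output of such a function instead of the full text is what shrinks the index; a lower `keep_ratio` keeps fewer sentences and so trades index size against how much of the original content remains searchable.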

Related Work

Apache Lucene

Apache Solr

Elasticsearch

Cloudera Search

Sphinx

Xapian

Centroid-based Text Summarization

Datasets and Search Queries

E Search

Search Results

Experiment 1

Experiment 3

Experimental Summary

Conclusion and Future Work

Journal: Information Technology and Control	Publication Date: Jun 17, 2021
Citations: 4	License type: cc-by


