Abstract

An efficient full-text search is achieved by indexing the raw data at an additional storage cost of 20 to 30 percent. In the context of Big Data, this additional storage space is huge and makes it challenging to serve full-text search queries with good performance. It also incurs overhead to store, manage, and update the large index. In this paper, we propose and evaluate a method to minimize the index size to offer full-text search over Big Data using an automatic extractive-based text summarization method. To evaluate the effectiveness of the proposed approach, we used two real-world datasets. We indexed actual and summarized datasets using Apache Lucene and studied average simple overlapping, Spearman’s rho correlation, and average ranking score measures of search results obtained using different search queries. Our experimental evaluation shows that automatic text summarization is an effective method to reduce the index size significantly. We obtained a maximum of 82% reduction in index size with 42% higher relevance of the search results using the proposed solution to minimize the full-text index size.
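Two of the measures named above, simple overlap and Spearman's rho, can be sketched for a single query as follows. This is a minimal illustration under stated assumptions, not the paper's exact definitions: the function names are hypothetical, overlap is taken relative to the baseline result list, and rho is computed only over documents returned by both indexes, with no tied ranks. The average ranking score is not sketched because its definition is not given in this text.

```python
def simple_overlap(baseline, candidate):
    """Fraction of baseline result IDs that also appear in the candidate list.

    Assumption: overlap is normalized by the baseline list length.
    """
    if not baseline:
        return 0.0
    return len(set(baseline) & set(candidate)) / len(baseline)


def spearman_rho(baseline, candidate):
    """Spearman's rho over the documents present in both ranked lists.

    Assumption: no tied ranks, so the closed form
    1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies.
    Returns None when fewer than two documents are shared.
    """
    candidate_ids = set(candidate)
    common = [doc for doc in baseline if doc in candidate_ids]
    n = len(common)
    if n < 2:
        return None
    # Re-rank the shared documents within each list (1 = best).
    rank_a = {doc: i for i, doc in enumerate(common, start=1)}
    shared_in_candidate = [doc for doc in candidate if doc in rank_a]
    rank_b = {doc: i for i, doc in enumerate(shared_in_candidate, start=1)}
    d_squared = sum((rank_a[doc] - rank_b[doc]) ** 2 for doc in common)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical top-4 results for one query against the two indexes.
full_index_hits = ["d1", "d2", "d3", "d4"]
summary_index_hits = ["d2", "d1", "d5", "d3"]
overlap = simple_overlap(full_index_hits, summary_index_hits)
rho = spearman_rho(full_index_hits, summary_index_hits)
```

Averaging these per-query values over a set of search queries would give dataset-level figures comparable in spirit to those reported in the abstract.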

Highlights

  • As the size of the datasets increases, the performance of Lucene decreases significantly (Information Technology and Control)

  • We propose and evaluate a method to minimize the index size to offer full-text search over Big Data using an automatic extractive-based text summarization method

  • The main contributions of this paper include: (1) we propose an automatic extractive-based text summarization for Big Data index minimization for the full-text search problem; (2) we evaluate the effectiveness of the proposed method by studying relevance and overlapping of the search query results with baseline datasets; (3) we study the effect of different text summarization threshold levels on data index minimization and search results

Summary

Introduction

Recent advancements in and adoption of technology are driving exponential growth in digital data. The data can be reduced to a smaller representative dataset for indexing to offer full-text search. The expected line is plotted by fitting a line to small datasets varying from 1 GB to 10 GB. This shows that as the size of the datasets increases, the performance of Lucene decreases significantly. The main contributions of this paper include: (1) we propose an automatic extractive-based text summarization for Big Data index minimization for the full-text search problem; (2) we study the effect of different text summarization threshold levels on data index minimization and search results.
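The idea of shrinking each document to a representative summary before indexing can be illustrated with a small centroid-style extractive summarizer. This is a rough sketch under stated assumptions, not the authors' implementation: it uses plain term-frequency vectors rather than any particular weighting scheme, the names `summarize` and `keep_ratio` are hypothetical, and `keep_ratio` (the fraction of sentences retained) is only a stand-in for the summarization threshold mentioned in the text, whose precise definition is not given here.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(text, keep_ratio=0.5):
    """Keep the sentences most similar to the document centroid.

    keep_ratio is a hypothetical parameter: the fraction of sentences retained.
    The kept sentences are returned in their original order.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    sentences = split_sentences(text)
    if not sentences:
        return ""
    vectors = [Counter(tokenize(s)) for s in sentences]
    # The centroid is the summed term-frequency vector of the whole document.
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)
    keep = max(1, math.ceil(keep_ratio * len(sentences)))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )
    chosen = sorted(ranked[:keep])
    return " ".join(sentences[i] for i in chosen)


document = (
    "Lucene builds an inverted index. The index maps terms to documents. "
    "Cats sleep all day. A smaller index makes search faster."
)
summary = summarize(document, keep_ratio=0.5)
```

In this sketch the summary, rather than the full document, would be passed to the indexer, so a lower `keep_ratio` trades index size against how much of the original text remains searchable.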

Related Work
Apache Lucene
Apache Solr
Elasticsearch
Cloudera Search
Sphinx
Xapian
Centroid-based Text Summarization
Datasets and Search Queries
E Search
Search Results
Experiment 1
Experiment 3
Experimental Summary
Conclusion and Future Work