Algorithm for Fast Finding High-Frequency Strings from Large-Scale Corpus

Haijun Zhang

doi:10.4304/jsw.9.8.2154-2159

Abstract

In high-frequency string extraction, counting the enormous number of low-frequency strings wastes a great deal of time and memory, which lowers efficiency. Based on the incremental n-gram model, this paper puts forward the Hierarchical Pruning Algorithm (HPA), which filters out low-frequency garbage strings and extracts candidate repeats, reducing the number of I/O read and write operations and improving memory utilization. An external sort is then applied to merge all candidate repeats into the final repeat set. To make this merging step more efficient, the paper proposes an improved Radix Sort that processes strings in O(dn) time. Experiments on a 32-gigabyte plain-text corpus show that the number of I/O read and write operations performed by HPA grows nearly linearly with corpus size, and that the algorithm can efficiently extract repeats from a corpus much larger than the available memory.

Index Terms: repeats, hash table, low-frequency string, hierarchical pruning algorithm
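The two ideas named in the abstract, level-by-level pruning of n-grams and radix sorting of strings, can be illustrated with a minimal in-memory sketch. This is not the paper's implementation: the function names `frequent_ngrams` and `radix_sort`, the parameters `max_n` and `min_count`, and the pruning rule (an n-gram is counted only if both its prefix and its suffix of length n-1 survived the previous level) are assumptions made for illustration. The sort shown is the textbook LSD radix sort, not the improved variant the paper proposes, and the disk-based external merge is not reproduced.

```python
from collections import Counter


def frequent_ngrams(text, max_n, min_count):
    """Level-wise count of substrings, pruning low-frequency ones early.

    Assumed pruning rule (not taken from the paper): an n-gram can occur
    at least min_count times only if both of its (n-1)-gram parts did.
    """
    repeats = {}
    survivors = None  # frequent (n-1)-grams from the previous level
    for n in range(1, max_n + 1):
        counts = Counter()
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Skip grams whose prefix or suffix was already infrequent.
            if survivors is not None and (
                gram[:-1] not in survivors or gram[1:] not in survivors
            ):
                continue
            counts[gram] += 1
        survivors = {g for g, c in counts.items() if c >= min_count}
        if not survivors:
            break  # no longer string can be frequent
        for g in survivors:
            repeats[g] = counts[g]
    return repeats


def radix_sort(strings):
    """Plain LSD radix sort: one stable bucket pass per character position."""
    if not strings:
        return []
    d = max(len(s) for s in strings)
    out = list(strings)
    for pos in range(d - 1, -1, -1):
        buckets = {}
        for s in out:
            # Strings too short for this position get key -1 and sort first.
            key = ord(s[pos]) if pos < len(s) else -1
            buckets.setdefault(key, []).append(s)
        out = [s for key in sorted(buckets) for s in buckets[key]]
    return out


if __name__ == "__main__":
    found = frequent_ngrams("abcabcabd", max_n=3, min_count=2)
    for gram in radix_sort(list(found)):
        print(gram, found[gram])
```

With d passes over n strings, the sort does O(dn) bucket insertions, plus a small cost per pass for ordering the bucket keys. Pruning keeps the hash table at each level limited to grams that can still reach the threshold, which is the memory saving the abstract attributes to filtering low-frequency strings.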

Published in: Journal of Software
