Abstract

In high-frequency string extraction, counting the enormous number of low-frequency strings wastes a great deal of time and memory and lowers efficiency. Building on the incremental n-gram model, this paper proposes the Hierarchical Pruning Algorithm (HPA), which filters out low-frequency garbage strings and extracts candidate repeats, reducing the number of I/O reads and writes and improving memory utilization. The candidate repeats are then merged with an external sort to obtain the final repeat set. To speed up this merging step, the paper employs an improved radix sort that processes the strings in O(dn) time. Experiments on a 32-gigabyte plain-text corpus show that the number of I/O reads and writes performed by HPA grows nearly linearly with corpus size, and that the algorithm can efficiently extract repeats from a corpus much larger than main memory.

Index Terms: repeats, hash table, low-frequency string, hierarchical pruning algorithm
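
The two ideas named in the abstract can be illustrated with a minimal in-memory sketch. It is not the paper's implementation: the abstract does not give HPA's details, so the level-wise pruning rule below (an n-gram is counted only if its (n-1)-gram prefix was frequent) is an assumed reading of "hierarchical pruning" on an incremental n-gram model, and the sort is a plain LSD radix sort rather than the paper's improved variant. The names `frequent_ngrams`, `radix_sort`, `max_n`, and `min_count` are hypothetical, and taking d as the maximum string length and n as the number of strings is also an assumption.

```python
from collections import Counter


def frequent_ngrams(text, max_n, min_count):
    """Level-wise n-gram counting with pruning (illustrative, in-memory only).

    Assumed rule: an n-gram is counted only if its (n-1)-gram prefix was
    frequent at the previous level, so low-frequency strings are never extended.
    """
    repeats = {}
    survivors = None  # frequent (n-1)-grams from the previous level
    for n in range(1, max_n + 1):
        counts = Counter()
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Prune: skip any n-gram whose prefix was already filtered out.
            if survivors is not None and gram[:-1] not in survivors:
                continue
            counts[gram] += 1
        survivors = {g for g, c in counts.items() if c >= min_count}
        if not survivors:
            break  # nothing frequent at this level, so no longer repeats exist
        for g in survivors:
            repeats[g] = counts[g]
    return repeats


def radix_sort(strings):
    """Plain LSD radix sort over characters (not the paper's improved variant).

    Makes d stable bucketing passes over n strings, where d is the maximum
    string length; a position past the end of a string sorts first.
    """
    if not strings:
        return []
    d = max(len(s) for s in strings)
    items = list(strings)
    for pos in range(d - 1, -1, -1):
        buckets = {}
        for s in items:
            key = ord(s[pos]) + 1 if pos < len(s) else 0  # 0 means "past the end"
            buckets.setdefault(key, []).append(s)
        # Concatenate buckets in key order; order within a bucket is preserved.
        items = [s for key in sorted(buckets) for s in buckets[key]]
    return items


if __name__ == "__main__":
    found = frequent_ngrams("abcabcabx", max_n=3, min_count=2)
    print(radix_sort(list(found)))
```

Each radix pass also sorts the bucket keys, which costs little when the alphabet is small and fixed. A disk-based version would write each level's survivors to files and merge them externally, which this sketch leaves out.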
