Abstract

Pattern matching with wildcards and length constraints (PMWL) is a complex problem with important applications in bioinformatics, network security, and information retrieval. Existing algorithms use the traditional left-most strategy when selecting among multiple candidate matching positions, which leads to incomplete final matching results. This paper presents a new data structure, CluTree, and a new matching algorithm, RBCT, based on it. A CluTree is a cluster of trees with red and black nodes built from a pattern P and a text T. The RBCT algorithm uses the sharing degree, correlation degree, and mixed information entropy of each node in the CluTree for path selection and dynamic pruning. It traverses the CluTree in linear time and finds more occurrences under the one-off condition than existing algorithms. Theoretical analysis and experimental results show that RBCT outperforms peer algorithms in retrieval precision and matching efficiency.
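
To make the problem setting concrete, the following is a minimal sketch of a naive left-most baseline under the one-off condition. It is not the RBCT algorithm and does not reproduce any specific published method. The function names `leftmost_one_off` and `is_occurrence`, and the representation of the pattern as a string of characters plus a list of `(lo, hi)` wildcard-count bounds between consecutive characters, are assumptions made for illustration only.

```python
def is_occurrence(text, chars, gaps, positions):
    """Check that `positions` spells `chars` in `text` and respects every gap bound."""
    if len(positions) != len(chars):
        return False
    if any(text[p] != c for p, c in zip(positions, chars)):
        return False
    pairs = zip(positions, positions[1:])
    # The number of wildcards between two matched positions p < q is q - p - 1.
    return all(lo <= q - p - 1 <= hi for (p, q), (lo, hi) in zip(pairs, gaps))


def leftmost_one_off(text, chars, gaps):
    """Illustrative greedy baseline: always commit to the left-most unused candidate.

    One-off condition: each text position is used by at most one occurrence.
    No backtracking is performed, which is what makes the result incomplete.
    """
    used = [False] * len(text)
    occurrences = []
    for start, ch in enumerate(text):
        if used[start] or ch != chars[0]:
            continue
        positions = [start]
        for nxt, (lo, hi) in zip(chars[1:], gaps):
            prev = positions[-1]
            last = min(prev + hi + 1, len(text) - 1)
            found = None
            for pos in range(prev + lo + 1, last + 1):
                if not used[pos] and text[pos] == nxt:
                    found = pos  # commit to the left-most candidate
                    break
            if found is None:
                break  # this start position is abandoned
            positions.append(found)
        else:
            # Every pattern character was placed: reserve the positions.
            for pos in positions:
                used[pos] = True
            occurrences.append(tuple(positions))
    return occurrences


# Pattern a, 0..1 wildcards, b, exactly 0 wildcards, c.
gaps = [(0, 1), (0, 0)]
print(leftmost_one_off("abbc", "abc", gaps))          # the greedy choice b@1 dead-ends
print(is_occurrence("abbc", "abc", gaps, (0, 2, 3)))  # yet (0, 2, 3) is a valid occurrence
```

On this toy input the left-most choice for `b` (position 1) cannot be followed by `c`, so the baseline reports nothing even though the occurrence at positions (0, 2, 3) exists. This is the kind of incompleteness that motivates selecting among candidate positions with more information than their order alone.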
