Abstract

Kernel methods are powerful tools in machine learning. They have to be computationally efficient. This paper builds on our previous work which proposed a list-based approach to compute efficiently the string subsequence kernel (SSK). In this paper we present a novel Geometric-based approach, our main idea is that the SSK computation reduces to the range query problem. We started with the construction of a match list $$L(s,t)=\left\{ (i,j):s_{i}=t_{j}\right\} $$ where s and t are the strings to be compared; such a match list contains only the required data that contribute to the result. To compute the SSK efficiently, we extended the layered range tree data structure to a layered range sum tree, a range-aggregation data structure. The SSK computation takes $$O(p|L|\log |L|)$$ time and $$O(|L|\log |L|)$$ space, where |L| is the size of the match list and p is the length of the SSK. We present an empirical evaluation of our approach against the dynamic and the sparse dynamic programming approaches both on synthetically generated data and on newswire article data. Experimental results show the efficiency of our approach for large alphabets except for very short strings. So it can be used in many applications like text categorization, information extraction and music genre classification. Moreover, compared to the sparse dynamic approach, the proposed approach outperforms also for long strings.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.