Efficient geometric-based computation of the string subsequence kernel

Slimane Bellaouar,Djelloul Ziadi,Hadda Cherroun

doi:10.1007/s10618-017-0545-7

Abstract

Kernel methods are powerful tools in machine learning. They have to be computationally efficient. This paper builds on our previous work which proposed a list-based approach to compute efficiently the string subsequence kernel (SSK). In this paper we present a novel Geometric-based approach, our main idea is that the SSK computation reduces to the range query problem. We started with the construction of a match list $$L(s,t)=\left\{ (i,j):s_{i}=t_{j}\right\} $$ where s and t are the strings to be compared; such a match list contains only the required data that contribute to the result. To compute the SSK efficiently, we extended the layered range tree data structure to a layered range sum tree, a range-aggregation data structure. The SSK computation takes $$O(p|L|\log |L|)$$ time and $$O(|L|\log |L|)$$ space, where |L| is the size of the match list and p is the length of the SSK. We present an empirical evaluation of our approach against the dynamic and the sparse dynamic programming approaches both on synthetically generated data and on newswire article data. Experimental results show the efficiency of our approach for large alphabets except for very short strings. So it can be used in many applications like text categorization, information extraction and music genre classification. Moreover, compared to the sparse dynamic approach, the proposed approach outperforms also for long strings.

Full Text