Owing to its remarkable capability for handling arbitrary cluster shapes, support vector clustering (SVC) benefits data analysis in terms of data description. However, on large-scale data such as network traffic, it frequently suffers from the intensive computation needed to solve the dual problem and from the heavy storage needed to hold the kernel matrix. Fortunately, the support vectors that describe the clusters are, in a sense, expected to lie on the cluster boundaries. Exploiting this property, we propose an efficient training method for SVC with appropriate boundary information (ETSVC), which features excellent flexibility and scalability. In ETSVC, we first present a shrinkable boundary selection (SBS) method that collects appropriate boundaries while reducing redundancy and noise. Based on the boundary information, a redefined dual problem is then designed without sacrificing the principle of SVC. Finally, we design a reformative solver (RSolver) to reformulate the training phase, which estimates the support vector function by solving the dual problem. Since only a subset of boundaries is employed for model training, theoretical analysis suggests that ETSVC improves training efficiency, and that it can consume much less memory when some of that efficiency is traded for reduced storage. Experimental results on grouping P2P flows, large-scale intrusion traffic, and other non-traffic data confirm that ETSVC can be applied on resource-constrained platforms while achieving accuracy comparable to that of state-of-the-art methods.
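For context, the dual problem and kernel matrix mentioned above can be written in the conventional SVC form. This is a sketch of the standard formulation only, not the redefined dual problem of ETSVC, which the abstract does not spell out. The symbols $\beta_j$, $C$, $q$, $N$, and $M$ are assumed notation introduced here for illustration.

```latex
% Conventional SVC dual over N training points x_1, ..., x_N
% (assumed notation; not the ETSVC redefinition).
\begin{aligned}
\max_{\beta}\quad & W(\beta) = \sum_{j=1}^{N} \beta_j K(x_j, x_j)
                    - \sum_{i=1}^{N} \sum_{j=1}^{N} \beta_i \beta_j K(x_i, x_j) \\
\text{s.t.}\quad  & \sum_{j=1}^{N} \beta_j = 1, \qquad 0 \le \beta_j \le C, \quad j = 1, \dots, N, \\
\text{with}\quad  & K(x_i, x_j) = \exp\!\left(-q \,\lVert x_i - x_j \rVert^{2}\right).
\end{aligned}
```

The kernel matrix $[K(x_i, x_j)]$ has $N \times N$ entries, so its storage grows as $O(N^2)$. If training were restricted to $M \ll N$ selected boundary points, as the abstract suggests, the corresponding matrix would presumably shrink to $M \times M$.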