Finding top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math></inline-formula> frequent items is a long-studied problem in databases. Finding top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math></inline-formula> persistent items is a newer problem that has attracted increasing attention in recent years. In practice, users often want to know which items are significant, <i>i.e.</i>, not only frequent but also persistent. No prior work addresses both problems at the same time. Moreover, on high-speed data streams, prior work cannot achieve high accuracy when memory is tight. In this paper, we define a new problem, finding significant items, and propose a novel algorithm, LTC, to address it. LTC combines two key techniques, Long-tail Restoring and CLOCK, with three optimizations. In addition, we extend LTC to support finding significant items with thresholds. We theoretically derive the correct rate and error bound, and conduct extensive experiments on three real datasets to evaluate the performance of LTC. Our experimental results show that LTC can achieve <inline-formula><tex-math notation="LaTeX">$10^5$</tex-math></inline-formula> times higher accuracy, measured by average relative error, than other related algorithms. Lastly, we apply LTC to a DDoS detection task, where finding significant items proves more effective than finding frequent items.
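To make the notion of a significant item concrete, the following is a minimal sketch of an exact baseline, not the LTC algorithm itself. It assumes that the stream is already split into time periods, that persistency means the number of periods in which an item appears, and that significance is a weighted sum of frequency and persistency. The function name `top_k_significant` and the weights `alpha` and `beta` are illustrative assumptions that this abstract does not specify. Unlike a memory-bounded sketch, this baseline keeps one counter per distinct item.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def top_k_significant(
    periods: Iterable[Iterable[Hashable]],
    k: int,
    alpha: float = 1.0,  # assumed weight on frequency
    beta: float = 1.0,   # assumed weight on persistency
) -> List[Tuple[Hashable, float]]:
    """Exact baseline: rank items by alpha * frequency + beta * persistency."""
    frequency = Counter()    # total occurrences across all periods
    persistency = Counter()  # number of periods in which the item appears

    for period in periods:
        items = list(period)
        frequency.update(items)
        # Count each item at most once per period.
        persistency.update(set(items))

    scores = {
        item: alpha * frequency[item] + beta * persistency[item]
        for item in frequency
    }
    # Highest score first; ties broken by repr for a deterministic order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], repr(kv[0])))
    return ranked[:k]


# Hypothetical stream: "a" is bursty (frequent, not persistent),
# "b" is persistent (appears in every period).
example_periods = [
    ["a", "a", "a", "a", "a", "a", "b"],
    ["b", "c"],
    ["b"],
]
frequent_first = top_k_significant(example_periods, k=2, alpha=1.0, beta=0.0)
persistent_first = top_k_significant(example_periods, k=2, alpha=0.0, beta=1.0)
balanced = top_k_significant(example_periods, k=2, alpha=1.0, beta=2.0)
```

In this toy stream, weighting only frequency ranks the bursty item `a` first, while weighting persistency moves `b` to the top, which illustrates why a combined score can surface items that a frequency-only ranking would miss.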