Abstract
Data summarization, i.e., extracting a representative subset (a data summary) from a massive data set, is widely used in big data processing. A good summary not only significantly reduces information redundancy but also provides a better understanding of the original data. The utility function used to evaluate the quality of a summary usually has a natural diminishing returns property, also known as submodularity. Because of the rapid growth in data scale, traditional offline processing struggles to handle massive data, and streaming methods that require less space have attracted increasing attention, leading to many related studies. In this paper, we first give an algorithmic overview of the methods widely used in streaming submodular maximization under a knapsack constraint. After analyzing the ideas behind them, we propose a new algorithm, called RSStream, for the same problem. RSStream is an innovative combination of the traditional sieve approach, the multi-candidate-set method, and an augmentation strategy with data sampling. It achieves the state-of-the-art approximation ratio with near-linear time and space complexity. Finally, we run our algorithm on two real data summarization applications to demonstrate its effectiveness and efficiency.
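To make the terms above concrete, the following is a minimal illustrative sketch, not the RSStream algorithm itself. It assumes a toy set-coverage utility (a standard example of a submodular function) and shows (a) the diminishing returns property and (b) a single-threshold, one-pass filter under a knapsack budget, which is the basic building block that sieve-style methods typically run for several candidate thresholds. All names here (coverage, marginal_gain, threshold_stream, tau, the toy data) are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def coverage(selected: Iterable[str], covers: Dict[str, Set[int]]) -> int:
    """Utility f(S): number of distinct elements covered by the items in S."""
    covered: Set[int] = set()
    for item in selected:
        covered |= covers[item]
    return len(covered)


def marginal_gain(item: str, selected: List[str], covers: Dict[str, Set[int]]) -> int:
    """f(S + item) - f(S); for a submodular f this never grows as S grows."""
    return coverage(selected + [item], covers) - coverage(selected, covers)


def threshold_stream(
    stream: Iterable[Tuple[str, float]],
    covers: Dict[str, Set[int]],
    budget: float,
    tau: float,
) -> List[str]:
    """One pass over (item, cost) pairs: keep an item if it still fits in the
    budget and its marginal gain per unit cost is at least tau (assumed fixed)."""
    summary: List[str] = []
    spent = 0.0
    for item, cost in stream:
        if spent + cost > budget:
            continue  # item would violate the knapsack constraint
        if marginal_gain(item, summary, covers) / cost >= tau:
            summary.append(item)
            spent += cost
    return summary


# Toy data (assumed): each item covers a set of element ids and has a cost.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5, 6, 7, 8}}
stream = [("a", 2.0), ("b", 1.0), ("c", 1.0), ("d", 3.0)]

# Diminishing returns: "b" adds less once "a" is already in the summary.
gain_small = marginal_gain("b", [], covers)
gain_large = marginal_gain("b", ["a"], covers)

summary = threshold_stream(stream, covers, budget=5.0, tau=1.0)
print(gain_small, gain_large, summary)
```

In this toy run the item "d" is skipped only because it no longer fits in the remaining budget, even though it has the largest standalone value. A single fixed threshold can miss such items, which is one reason streaming methods keep multiple candidate sets or add an augmentation step.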