Abstract

Simple random sampling (SRS) is one of the most popular data reduction techniques for large-scale data mining, but it often loses small clusters when the dataset is unevenly distributed. Grid-based density biased sampling avoids this problem; however, the granularity of the grid division affects both the efficiency and the effectiveness of the algorithm. Variable grid density biased sampling has been proposed to overcome this drawback on large-scale unevenly distributed datasets, but its efficiency is restricted by dimensionality. To address this, an efficient density biased sampling algorithm is proposed for large high-dimensional datasets. First, an efficient feature selection method is designed to obtain feature subsets. Second, variable grid division is performed in the selected feature subsets. Finally, the sample is drawn from the resulting grid space. Experiments on synthetic datasets and UCI datasets show that the proposed algorithm achieves higher sample quality than SRS, while consuming less sampling time than both grid-based density biased sampling and density biased sampling based on variable grid division.
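The three steps above (select features, divide the selected subspace into a grid, sample from the grid) can be illustrated with a minimal Python sketch. It is not the proposed algorithm: the abstract does not specify the feature selection method or the variable grid division, so the sketch assumes a variance-based selector and a fixed equal-width grid as stand-ins. The function names, the `bins` parameter, and the bias exponent `e` are hypothetical, and the inclusion probability `alpha * n**(-e)` for a point in a cell holding `n` points is one common formulation of density biased sampling, assumed here for illustration.

```python
import random
from collections import defaultdict


def select_features_by_variance(points, k):
    """Hypothetical stand-in for feature selection: keep the k highest-variance dimensions."""
    dims = len(points[0])
    n = len(points)
    variances = []
    for d in range(dims):
        mean = sum(p[d] for p in points) / n
        variances.append(sum((p[d] - mean) ** 2 for p in points) / n)
    return sorted(range(dims), key=lambda d: variances[d], reverse=True)[:k]


def grid_cells(points, features, bins):
    """Map each point index to a fixed equal-width grid cell over the selected features."""
    lows = {d: min(p[d] for p in points) for d in features}
    highs = {d: max(p[d] for p in points) for d in features}
    # Fall back to width 1.0 when a feature is constant, to avoid division by zero.
    widths = {d: (highs[d] - lows[d]) / bins or 1.0 for d in features}
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(
            min(int((p[d] - lows[d]) / widths[d]), bins - 1) for d in features
        )
        cells[key].append(idx)
    return cells


def density_biased_sample(points, target_size, features, bins=10, e=0.5, seed=0):
    """Return indices of sampled points; sparse cells are sampled at a higher rate.

    e = 0 gives uniform sampling; e = 1 aims for equal expected counts per cell.
    """
    rng = random.Random(seed)
    cells = grid_cells(points, features, bins)
    # Normalizer chosen so the expected sample size is about target_size
    # (exactly so whenever no per-point probability is clipped at 1).
    alpha = target_size / sum(len(m) ** (1 - e) for m in cells.values())
    sample = []
    for members in cells.values():
        prob = min(1.0, alpha * len(members) ** (-e))
        sample.extend(i for i in members if rng.random() < prob)
    return sample
```

With `e` close to 1, a small cluster that occupies its own cell contributes about as many sample points as a large one, which is the behavior that keeps small clusters from being lost under SRS. Running the grid step only over the selected features keeps the number of cells from growing with the full dimensionality.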
