Modern technologies produce data at an enormous rate. This high-throughput generation results in big data: large, varied datasets with many features (attributes) that are challenging to store and process in real time. In machine learning, streaming feature selection is widely regarded as an effective technique for selecting a relevant subset of features from high-dimensional data and thus reducing learning complexity. Instance and feature selection, which chooses a subset of relevant instances and features for use in model building, has long been a key research area in data mining, and it has become an effective approach to handling the enormous volume of data that is continuously being produced in research. This paper provides an overview of instance and feature selection methods for big data mining. It first discusses the current challenges and difficulties faced when mining valuable information from big data. Many systems find such large datasets difficult to process. Although traditional techniques are useful, they face scaling problems when the numbers of instances and features run into the hundreds, thousands, or millions. The proposed work focuses on scalable instance and feature selection in a big data environment. Locality-sensitive hashing instance selection (LSH-IS), a two-pass method, is used to find similar instances, together with the Pearson correlation coefficient for feature selection. LSH-IS relies on a family of hash functions, a general means of reducing the size of a set by re-indexing its elements into buckets. Because similar instances fall into the same bucket, the number of instances can be reduced. The work aims to improve the performance of locality-sensitive hashing by storing in each bucket additional information about the instances and features assigned to each class, and to improve the accuracy of the instance and feature selection algorithm.
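The bucket-based reduction and the correlation-based ranking described above can be illustrated with a minimal sketch. It is not the procedure evaluated in this work. The random-projection hash family, the parameters `n_hashes` and `width`, the rule of keeping the first instance of each class per bucket, and the function names `lsh_select_instances` and `select_features` are all assumptions made for illustration. The sketch also collapses the selection into a single pass for brevity.

```python
import math
import random


def lsh_select_instances(X, y, n_hashes=4, width=1.0, seed=0):
    """Keep the first instance of each class in each hash bucket.

    Assumed hash family (illustrative only):
    h(x) = floor((a . x + b) / width), with Gaussian a and uniform b.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    A = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_hashes)]
    B = [rng.uniform(0.0, width) for _ in range(n_hashes)]

    def bucket(x):
        # Concatenate the hash values into one bucket key.
        return tuple(
            math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / width)
            for a, b in zip(A, B)
        )

    seen = set()   # (bucket key, class label) pairs already represented
    kept = []      # indices of the retained instances
    for i, (x, label) in enumerate(zip(X, y)):
        key = (bucket(x), label)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept


def pearson(u, v):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


def select_features(X, y, k):
    """Return the indices of the k features with the largest |r| against y."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

In this sketch, identical instances of the same class always share a bucket, so only one of them is retained, while instances of different classes are kept even when they collide. A larger `width` merges more instances per bucket and therefore removes more of them, at the cost of discarding less similar instances.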