Millions Of Instances Research Articles

Data reduction is becoming increasingly relevant due to the enormous amounts of data that are constantly being produced in many fields of research. Instance selection is one of the most widely used methods for this task. At the same time, most recent pattern recognition problems involve highly complex datasets with a large number of possible explanatory variables. For many reasons, this abundance of variables significantly hinders classification and recognition tasks. There are efficiency issues, too, because the speed of many classification algorithms is greatly improved when the complexity of the data is reduced. Thus, feature selection is also a widely used method for data reduction and for gaining an understanding of feature information.

Although most methods address instance and feature selection separately, the two problems are interwoven, and benefits are expected from performing these two tasks jointly. However, few algorithms have been proposed for simultaneously addressing the tasks of instance and feature selection. Furthermore, most of those methods are based on complex heuristics that are very difficult to scale up even to moderately large datasets. This paper proposes a new algorithm for dealing with many instances and many features simultaneously by performing joint instance and feature selection using a simple heuristic search and several scaling-up mechanisms that can be successfully applied to datasets with millions of features and instances.

In the proposed method, a forward selection search is performed in the feature space jointly with the application of standard instance selection in a constructive subspace built stepwise. Several simplifications are adopted in the search to obtain a scalable method. An extensive comparison using 95 large datasets shows the usefulness of our method and its ability to deal with millions of instances and features simultaneously. The method is able to obtain better classification performance results than state-of-the-art approaches while achieving considerable data reduction.
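The general idea of pairing a forward feature search with instance selection in the growing subspace can be illustrated with a toy sketch. This is not the algorithm from the paper: the abstract does not specify the instance selection method, the evaluation criterion, or the scaling-up mechanisms, so the sketch assumes Hart's condensed nearest neighbour as a stand-in instance selector, 1-NN validation accuracy as the score, and hypothetical function names and data throughout.

```python
# Toy sketch only: greedy forward feature selection where each candidate
# subspace is paired with an instance selection step. All names, the data,
# and the choice of condensed nearest neighbour are illustrative assumptions.

def sq_dist(a, b, feats):
    """Squared Euclidean distance restricted to the feature subset."""
    return sum((a[f] - b[f]) ** 2 for f in feats)

def nn_predict(x, X, y, keep, feats):
    """1-NN prediction using only the kept instances and chosen features."""
    nearest = min(keep, key=lambda i: sq_dist(x, X[i], feats))
    return y[nearest]

def condensed_nn(X, y, feats):
    """Hart's condensed nearest neighbour in the given subspace:
    add every instance that the current kept set misclassifies."""
    keep = [0]
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            if nn_predict(X[i], X, y, keep, feats) != y[i]:
                keep.append(i)
                changed = True
    return keep

def accuracy(X, y, keep, feats, X_val, y_val):
    """Fraction of validation points classified correctly by 1-NN."""
    hits = sum(nn_predict(x, X, y, keep, feats) == t
               for x, t in zip(X_val, y_val))
    return hits / len(X_val)

def joint_selection(X, y, X_val, y_val):
    """Add one feature per round while validation accuracy strictly improves;
    instances are re-selected inside each candidate subspace."""
    feats, keep, best_acc = [], list(range(len(X))), 0.0
    improved = True
    while improved:
        improved = False
        best_candidate = None
        for f in range(len(X[0])):
            if f in feats:
                continue
            trial = feats + [f]
            trial_keep = condensed_nn(X, y, trial)
            acc = accuracy(X, y, trial_keep, trial, X_val, y_val)
            if acc > best_acc:
                best_acc = acc
                best_candidate = (trial, trial_keep)
                improved = True
        if best_candidate is not None:
            feats, keep = best_candidate
    return feats, keep, best_acc

# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 4.0], [1.1, 8.0], [1.2, 0.0]]
y = [0, 0, 0, 1, 1, 1]
X_val = [[0.05, 7.0], [0.15, 2.0], [1.05, 6.0], [1.15, 3.0]]
y_val = [0, 0, 1, 1]

feats, keep, best_acc = joint_selection(X, y, X_val, y_val)
print(feats, keep, best_acc)
```

On this toy data the search keeps one feature and two of the six training instances. A quadratic-time selector like this one would not scale to millions of instances, which is the gap the paper's simplifications and scaling-up mechanisms are meant to close.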

Structuring data is crucial for managing the massive amounts of data now available. A hierarchy (taxonomy) provides a natural and convenient way to organize information. Hierarchies have been used extensively in several domains, such as gene taxonomies for organizing gene sequences, the international patent hierarchy for easy browsing and retrieval of patent documents, the DMOZ taxonomy for web-page categorization, and the ImageNet database for indexing millions of images according to the WordNet hierarchy. Given a hierarchy containing thousands of classes (categories) and millions of instances (examples), there is an essential need to develop efficient and automated approaches to categorize unlabeled test instances. This problem is referred to as the Hierarchical Classification (HC) task. HC is an important machine learning problem that has been researched and explored extensively in the past few years (Silla Jr & Freitas, 2011). The popularity of the large-scale HC problem is evident from the various HC competitions organized in the past few years, such as LSHTC, BioASQ and ILSVRC. HC poses several challenges: (i) data imbalance, with a large number of classes having very few positive examples for training (rare categories), (ii) multi-label classification, (iii) feature selection, (iv) inconsistent hierarchies resulting from manual design by domain experts, and (v) scalability, due to the large number of examples, features and classes. Several approaches that address these issues individually (or multiple issues together) have been developed over the years (Gopal & Yang, 2013; Babbar et al., 2013); however, there is still considerable room for improving the existing methods. Specifically, we have developed methods for handling the rare categories and inconsistent hierarchy problems.

Articles published on Millions Of Instances

A distributed evolutionary based instance selection algorithm for big data using Apache Spark

Role of Artificial Intelligence (AI) in Changing Consumer Buying Behaviour

Alpha-divergence minimization for deep Gaussian processes

Improved meta‐heuristic algorithm for selecting optimal features: A big data classification model

Parallel and accurate k‐means algorithm on CPU‐GPU architectures for spectral clustering

Multimodal Data Fusion in High-Dimensional Heterogeneous Datasets Via Generative Models

SI(FS)[formula omitted]: Fast simultaneous instance and feature selection for datasets with many features

Parallel Fractional Hot-Deck Imputation and Variance Estimation for Big Incomplete Data Curing

Sample size determination for biomedical big data with limited labels

Alpha divergence minimization in multi-class Gaussian process classification

Developments in Pyrotechnic-Assisted Fuses [Happenings

On building and publishing Linked Open Schema from social Web sites

Isolation‐based anomaly detection using nearest‐neighbor ensembles

On Building and Publishing Linked Open Schema from Social Web Sites

Embedded Feature Selection Method for a Network-Level Behavioural Analysis Detection Model

IMPACT OF ARTIFICIAL INTELLIGENCE ON HUMAN RIGHTS IN INDIA: A CRITICAL STUDY

Online estimation of discrete, continuous, and conditional joint densities using classifier chains

SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification

Knowledge Base Semantic Integration Using Crowdsourcing

Large-scale hierarchical classification with rare categories and inconsistencies
