Use Of Unlabeled Data Research Articles

Background: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem because the data distribution is highly imbalanced, i.e., there are few splice sites compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers.

Results: Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when labeled data make up less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines.

Conclusions: In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
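The idea of balancing the labeled set during self-training can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single nearest-centroid classifier rather than an ensemble, and the balancing rule (add the same number of pseudo-labeled positives and negatives per round) is an assumption made for illustration. The names `centroid_scores`, `self_train_balanced`, `per_class` and `rounds` are hypothetical.

```python
import numpy as np


def centroid_scores(X_lab, y_lab, X):
    """Nearest-centroid score: positive means closer to the positive class."""
    pos_c = X_lab[y_lab == 1].mean(axis=0)
    neg_c = X_lab[y_lab == 0].mean(axis=0)
    d_pos = np.linalg.norm(X - pos_c, axis=1)
    d_neg = np.linalg.norm(X - neg_c, axis=1)
    return d_neg - d_pos


def self_train_balanced(X_lab, y_lab, X_unl, per_class=1, rounds=5):
    """Self-training that adds equal numbers of pseudo-labeled positives
    and negatives in every round (an assumed balancing rule)."""
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(rounds):
        scores = centroid_scores(X_lab, y_lab, X_unl)
        # Never add more of one class than the other class can match.
        k = min(per_class, int((scores > 0).sum()), int((scores < 0).sum()))
        if k == 0:
            break
        order = np.argsort(scores)
        # Most confident positives first, then most confident negatives.
        picked = np.concatenate([order[-k:], order[:k]])
        pseudo = np.concatenate([np.ones(k, dtype=int), np.zeros(k, dtype=int)])
        X_lab = np.vstack([X_lab, X_unl[picked]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unl = np.delete(X_unl, picked, axis=0)
    return X_lab, y_lab
```

Because the loop stops as soon as one class has no confident candidates left, the abundant negative class cannot flood the labeled set, which is the main risk of plain self-training on 1-to-99 data.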

Predicting protein subcellular location is one of the major challenges in bioinformatics, since such knowledge helps us understand protein functions and enables us to select target proteins during the drug discovery process. While many computational techniques have been proposed to improve predictive performance for protein subcellular location, they have several shortcomings. In this work, we propose a method that addresses three main issues in such techniques: i) handling of multiplex proteins, which may exist in or move between multiple cellular compartments, ii) handling of high dimensionality in the input and output spaces, and iii) the requirement of sufficient labeled data for model training. To address these issues, this work presents a new computational method for predicting the locations of proteins that have either single or multiple locations. The proposed technique, namely iFLAST-CORE, combines dimensionality reduction in the feature and label spaces with the co-training paradigm for semi-supervised multi-label classification. For this purpose, Singular Value Decomposition (SVD) is applied to transform the high-dimensional feature space and label space into lower-dimensional spaces. After that, because labeled data are limited, co-training regression makes use of unlabeled data by predicting the target values of the unlabeled data in the lower-dimensional spaces. In the last step, the SVD components are used to project labels in the lower-dimensional space back to the original space, and an adaptive threshold is used to map each numeric value to a binary value for label determination. A set of experiments on viral proteins and gram-negative bacterial proteins shows that the proposed method improves classification performance in terms of various evaluation metrics, such as Aiming (or Precision), Coverage (or Recall) and macro F-measure, compared to the traditional method that uses only labeled data.
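The reduce, regress, project back and threshold steps described above can be sketched as follows. This is not iFLAST-CORE itself: plain ridge regression on labeled data stands in for co-training regression, and a fixed cut-off of 0.5 stands in for the adaptive threshold, whose exact form the abstract does not give. All function and parameter names are hypothetical.

```python
import numpy as np


def fit_svd_projection(M, k):
    """Top-k right singular vectors of M as columns, shape (n_cols, k)."""
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[:k].T


def ridge_fit(X, Z, lam=1e-3):
    """Closed-form ridge regression weights mapping X to Z."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Z)


def predict_labels(X_lab, Y_lab, X_new, k_feat, k_label, threshold=0.5):
    # 1. Reduce the feature space and the label space with SVD.
    P_x = fit_svd_projection(X_lab, k_feat)    # (n_features, k_feat)
    P_y = fit_svd_projection(Y_lab, k_label)   # (n_labels, k_label)
    X_red, Y_red = X_lab @ P_x, Y_lab @ P_y
    # 2. Regress reduced labels on reduced features.
    #    (Assumption: ridge regression replaces co-training regression.)
    W = ridge_fit(X_red, Y_red)
    # 3. Predict in the reduced label space, then project back.
    scores = (X_new @ P_x) @ W @ P_y.T
    # 4. Map numeric scores to binary labels.
    #    (Assumption: fixed threshold instead of an adaptive one.)
    return (scores >= threshold).astype(int), scores
```

A protein can receive several labels at once because each column of `scores` is thresholded independently, which is what allows the same pipeline to handle both singleplex and multiplex proteins.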

Articles published on Use Of Unlabeled Data

CUPID: consistent unlabeled probability of identical distribution for image classification

A Framework for pre-training hidden-unit conditional random fields and its extension to long short term memory networks

A novel cross-modal hashing algorithm based on multimodal deep learning

An empirical study of self-training and data balancing techniques for splice site prediction

Semi-supervised learning for ordinal Kernel Discriminant Analysis

Asymptotic comparison of semi-supervised and supervised linear discriminant functions for heteroscedastic normal populations

Semi-supervised Collective Classification in Multi-attribute Network Data

Towards Safe Semi-Supervised Learning for Multivariate Performance Measures

Semi-supervised prediction of gene regulatory networks using machine learning algorithms

Semi-supervised deep extreme learning machine for Wi-Fi based localization

An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

A Novel Speech Emotion Recognition Method via Incomplete Sparse Least Square Regression

A top-down information theoretic word clustering algorithm for phrase recognition

Lexicon expansion for latent variable grammars

Predict Subcellular Locations of Singleplex and Multiplex Proteins by Semi-Supervised Learning and Dimension-Reducing General Mode of Chou's PseAAC

Multilabel relationship learning

Semi-Supervised Novelty Detection Using SVM Entire Solution Path

Cross-Lingual Adaptation Using Structural Correspondence Learning

On multivariate calibration with unlabeled data

Software defect detection with ROCUS
