An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets.

Ana Stanescu, Doina Caragea

doi:10.1186/1752-0509-9-s5-s1

Abstract

Background: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers.

Results: Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines.

Conclusions: In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
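To make the idea of an ensemble of self-training classifiers with dynamic balancing more concrete, here is a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation. The nearest-centroid base learner, the names `CentroidClassifier`, `self_train_balanced`, `train_ensemble` and `ensemble_prob`, the confidence threshold, and the toy 2-D data are all assumptions made for illustration. Each ensemble member starts from all labeled positives plus one equally sized slice of the labeled negatives. In every self-training iteration it adds the same number of confidently predicted positives and negatives from the unlabeled pool, so its labeled set stays balanced.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


class CentroidClassifier:
    """Toy base learner (an assumption; this page does not name the base classifier)."""

    def fit(self, X, y):
        self.pos = centroid([x for x, t in zip(X, y) if t == 1])
        self.neg = centroid([x for x, t in zip(X, y) if t == 0])
        return self

    def prob_pos(self, x):
        # Pseudo-probability in [0, 1]: closer to the positive centroid means higher score.
        d_pos, d_neg = math.dist(x, self.pos), math.dist(x, self.neg)
        return 0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)


def self_train_balanced(X_lab, y_lab, X_unl, iterations=5, per_class=2, threshold=0.8):
    """Self-training that adds equal numbers of confident positives and negatives."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(iterations):
        clf = CentroidClassifier().fit(X, y)
        ranked = sorted(pool, key=clf.prob_pos)  # ascending positive score
        conf_neg = [x for x in ranked if clf.prob_pos(x) <= 1 - threshold]
        conf_pos = [x for x in reversed(ranked) if clf.prob_pos(x) >= threshold]
        k = min(per_class, len(conf_pos), len(conf_neg))
        if k == 0:  # nothing confident enough in one of the classes: stop early
            break
        chosen = conf_pos[:k] + conf_neg[:k]
        X += chosen
        y += [1] * k + [0] * k  # pseudo-labels, added in a 1:1 ratio
        chosen_ids = {id(x) for x in chosen}
        pool = [x for x in pool if id(x) not in chosen_ids]
    return CentroidClassifier().fit(X, y)


def train_ensemble(X_lab, y_lab, X_unl, seed=0):
    """One member per balanced slice: all positives plus a chunk of the negatives."""
    pos = [x for x, t in zip(X_lab, y_lab) if t == 1]
    neg = [x for x, t in zip(X_lab, y_lab) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one labeled instance of each class")
    random.Random(seed).shuffle(neg)
    members = []
    for i in range(0, len(neg), len(pos)):
        chunk = neg[i:i + len(pos)]
        labels = [1] * len(pos) + [0] * len(chunk)
        members.append(self_train_balanced(pos + chunk, labels, X_unl))
    return members


def ensemble_prob(members, x):
    """Average the members' positive scores."""
    return sum(m.prob_pos(x) for m in members) / len(members)


# Toy data (hypothetical): positives cluster near (5, 5), negatives near (0, 0).
X_lab = [[5.0, 5.1], [4.8, 5.2], [0.0, 0.1], [0.2, -0.1],
         [-0.1, 0.0], [0.1, 0.3], [0.3, 0.1], [-0.2, -0.2]]
y_lab = [1, 1, 0, 0, 0, 0, 0, 0]
X_unl = [[5.1, 4.9], [4.9, 5.0], [0.1, 0.0], [0.0, 0.2],
         [-0.3, 0.1], [0.2, 0.2], [5.2, 5.3], [0.1, -0.1]]
members = train_ensemble(X_lab, y_lab, X_unl)
print(len(members), ensemble_prob(members, [5.0, 5.0]))
```

Splitting the majority class across members lets every negative example contribute to some member while each member still trains on a balanced set. With real splice site data, the feature vectors and the base learner would need to be replaced by sequence-appropriate choices.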

Highlights

Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data
A specific challenge that we address is the “data imbalance” problem, which is prevalent in many domains, including bioinformatics
While the trends are generally maintained for individual organisms, we report averages of area under the Precision-Recall Curve (auPRC) values over the five organisms, for easier interpretation
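The evaluation measure named in the last highlight can be illustrated with a short sketch. It uses the step-wise (average precision) estimate of the area under the Precision-Recall curve and then averages the per-dataset values, as the highlight describes. The function names `average_precision` and `mean_auprc`, the placeholder organism keys, and the toy labels and scores are assumptions for illustration, not data from the paper. Tied scores are ranked in an arbitrary order here, which a more careful implementation would handle explicitly.

```python
def average_precision(labels, scores):
    """Step-wise estimate of auPRC: mean precision at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank instances from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall level
    return total / n_pos


def mean_auprc(per_dataset):
    """Average auPRC over datasets; per_dataset maps a name to (labels, scores)."""
    values = [average_precision(y, s) for y, s in per_dataset.values()]
    return sum(values) / len(values)


# Toy example with two hypothetical datasets (placeholder names, made-up scores).
toy = {
    "organism_a": ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
    "organism_b": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
}
print(mean_auprc(toy))
```

Unlike accuracy, this measure does not reward a classifier for simply predicting the majority class, which is why precision-recall summaries are a common choice for 1-to-99 class ratios.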

Summary

Introduction

Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Advances in biochemical technologies over the past decades have given rise to Next Generation Sequencing platforms that quickly produce genomic data at much lower costs than ever before. Such overwhelmingly large volumes of sequenced DNA remain difficult to annotate. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. For a scenario in which the amount of labeled data is relatively small and the amount of unlabeled data is substantially larger, semi-supervised learning represents a cost-effective alternative to manual labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches.

Journal: BMC Systems Biology	Publication Date: Jan 1, 2015
Citations: 37	License type: cc-by
