Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning.

Qijuan Gao, Enhua Xia, Hanwei Yan, Xiangwei Wu, Xiu Jin, Shaowen Li, Lichuan Gu, Yingchun Xia

doi:10.3389/fgene.2020.00820

Abstract

Orphan genes are associated with regulatory patterns, but experimental methods for identifying them are both time-consuming and expensive. Designing an accurate and robust classification model that separates orphan from non-orphan genes in datasets with an unbalanced class distribution is particularly challenging. Synthetic minority over-sampling technique (SMOTE) algorithms were selected as a preliminary step to handle unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis thaliana gene datasets, SMOTE algorithms were then combined with traditional and advanced ensemble classification algorithms, namely Support Vector Machine, Random Forest (RF), AdaBoost (adaptive boosting), GBDT (gradient boosting decision tree), and XGBoost (extreme gradient boosting). In a comparison of these ensemble models, SMOTE with XGBoost achieved an F1 score of 0.94 on the balanced A. thaliana gene datasets, but a lower score on the unbalanced datasets. The proposed ensemble method therefore combines each of several data-balancing algorithms, including Borderline SMOTE (BSMOTE), Adaptive Synthetic Sampling (ADSYN), SMOTE-Tomek, and SMOTE-ENN, with the XGBoost model separately. The SMOTE-ENN-XGBoost model, which combines over-sampling and under-sampling algorithms with XGBoost, achieved higher predictive accuracy than the other balancing algorithms paired with XGBoost. Thus, SMOTE-ENN-XGBoost provides a theoretical basis for developing evaluation criteria for identifying orphan genes in unbalanced biological datasets.
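The idea behind SMOTE-ENN (over-sample the minority class, then clean the result by under-sampling) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the function names, the toy two-feature data, the neighbour count k=3, and the majority-vote form of the ENN rule are all assumptions made for illustration. A real pipeline would more likely use an established library and pass the resampled data to an XGBoost classifier.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def knn_indices(point, pool, k, skip=None):
    """Indices of the (up to) k points in `pool` closest to `point`."""
    order = sorted((i for i in range(len(pool)) if i != skip),
                   key=lambda i: dist(point, pool[i]))
    return order[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples, each interpolated between a random
    minority point and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(knn_indices(minority[i], minority, k, skip=i))
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


def enn_clean(X, y, k=3):
    """Edited Nearest Neighbours (majority-vote variant, an assumption):
    keep a sample only if most of its k nearest neighbours share its label."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        neighbours = knn_indices(x, X, k, skip=i)
        agree = sum(1 for j in neighbours if y[j] == label)
        if agree * 2 > len(neighbours):
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


def smote_enn(X, y, minority_label=1, k=3, seed=0):
    """Over-sample the minority class to parity, then apply ENN cleaning."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    new = smote(minority, n_majority - len(minority), k=k, seed=seed)
    X_bal = list(X) + new
    y_bal = list(y) + [minority_label] * len(new)
    return enn_clean(X_bal, y_bal, k=k)


# Toy data (hypothetical): 8 "non-orphan" points (label 0), 3 "orphan" points (label 1).
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
     (0.2, 0.4), (0.4, 0.1), (0.1, 0.1), (0.3, 0.4),
     (2.0, 2.0), (2.2, 2.1), (2.1, 1.9)]
y = [0] * 8 + [1] * 3

X_res, y_res = smote_enn(X, y)
print("class counts after SMOTE-ENN:", y_res.count(0), y_res.count(1))
```

Because the two toy classes are well separated, the ENN step removes nothing here. On real gene features, it would discard samples lying in regions where the classes overlap, which is the under-sampling half of the combination described in the abstract.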

Highlights

The process of identifying orphan genes is an emerging field.
The whole-genome data of the angiosperm A. thaliana were obtained from The Arabidopsis Information Resource (TAIR8) dataset (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR8_genome_release), which contained a total of 32,825 gene sequences.
The known orphan genes of A. thaliana were downloaded from the public website https://www.biomedcentral.com/content/supplementary/1471-2148-10-%2041-S2.TXT (Lin et al, 2010).

Summary

Introduction

The process of identifying orphan genes is an emerging field. Orphan genes play critical roles in the evolution of species and in adaptability to the environment (Davies and Davies, 2010; Donoghue et al, 2011; Huang, 2013; Cooper, 2014; Gao et al, 2014). Many attempts have been made to identify orphan genes in multiple species or taxa and to analyze their functions (Arendsee et al, 2014). Orphan genes are detected mainly by comparing the genome and transcriptome sequences of related species using BLAST (Basic Local Alignment Search Tool; Altschul et al, 1990; Tollriera et al, 2009). This approach requires substantial server resources and time, and problems with complexity and timeliness are common (Ye et al, 2012).
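As a rough illustration of the BLAST-based approach described above, the sketch below flags genes with no significant hit in related species as orphan candidates. It assumes the hits come from BLAST's default 12-column tabular output (query ID in the first column, e-value in the eleventh) and contain only searches against other species. The gene IDs, the hit lines, and the e-value cutoff of 1e-3 are hypothetical and are not taken from the paper.

```python
def orphan_candidates(all_genes, blast_lines, max_evalue=1e-3):
    """Return genes that have no hit with e-value <= max_evalue.

    blast_lines: iterable of tab-separated BLAST tabular rows
    (qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore).
    """
    has_homolog = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            has_homolog.add(query)
    return sorted(set(all_genes) - has_homolog)


# Hypothetical example: geneA has a strong hit, geneB only a weak one,
# and geneC has no hit at all.
genes = ["geneA", "geneB", "geneC"]
hits = [
    "geneA\trelated_sp_1\t92.5\t300\t20\t1\t1\t300\t5\t304\t1e-120\t410",
    "geneB\trelated_sp_7\t31.0\t45\t30\t1\t10\t54\t100\t144\t2.5\t28.1",
]
candidates = orphan_candidates(genes, hits)
print(candidates)
```

The filtering itself is cheap. The expensive part is producing the hit table, since every gene must be searched against every related genome, which is the resource cost the paragraph above refers to.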

Journal: Frontiers in Genetics	Publication Date: Oct 2, 2020
Citations: 21	License type: CC BY 4.0
