Multi-class computational evolution: development, benchmark evaluation and application to RNA-Seq biomarker discovery

Nathaniel M Crabtree, John F Bowyer, Nysia I George, Jason H Moore

doi:10.1186/s13040-017-0134-8

Abstract

Background: A computational evolution system (CES) is a knowledge discovery engine that can identify subtle, synergistic relationships in large datasets. Pareto optimization allows CESs to balance accuracy with model complexity when evolving classifiers. Using Pareto optimization, a CES is able to identify a very small number of features while maintaining high classification accuracy. A CES can be designed for various types of data, and the user can exploit expert knowledge about the classification problem in order to improve discrimination between classes. These characteristics give CES an advantage over other classification and feature selection algorithms, particularly when the goal is to identify a small number of highly relevant, non-redundant biomarkers. Previously, CESs have been developed only for binary class datasets. In this study, we developed a multi-class CES.

Results: The multi-class CES was compared to three common feature selection and classification algorithms: support vector machine (SVM), random k-nearest neighbor (RKNN), and random forest (RF). The algorithms were evaluated on three distinct multi-class RNA sequencing datasets. The comparison criteria were run-time, classification accuracy, number of selected features, and stability of the selected feature set (as measured by the Tanimoto distance). The performance of each algorithm was data-dependent. CES performed best on the dataset with the smallest sample size, indicating that CES has a unique advantage, since the accuracy of most classification methods suffers when sample size is small.

Conclusion: The multi-class extension of CES increases the appeal of its application to complex, multi-class datasets in order to identify important biomarkers and features.
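As a rough illustration of the stability criterion mentioned above, the sketch below computes a Tanimoto distance between selected feature sets. It assumes the common set-based form, one minus the ratio of shared features to the union of features; the exact variant used in the study is not stated here. The function names and the gene labels are hypothetical and chosen only for this example.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """Set-based Tanimoto distance: 1 - |A and B| / |A or B| (assumed form)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty selections are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(feature_sets):
    """Average Tanimoto distance over all pairs of selected feature sets."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical feature sets selected on three resampled runs.
    runs = [{"g1", "g2", "g3"}, {"g2", "g3", "g4"}, {"g1", "g2", "g3"}]
    print(mean_pairwise_distance(runs))
```

Under this definition, a distance of 0 means two runs selected exactly the same features and a distance of 1 means they share none, so a lower average across repeated runs suggests a more stable feature set.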

Highlights

A computational evolution system (CES) is a knowledge discovery engine that can identify subtle, synergistic relationships in large datasets
Many classification algorithms identify a set of discriminative features by performing both feature selection and model fitting, which typically leads to better accuracy and efficiency
Given the advantages of the CES, we developed a multi-class CES and present a comparative analysis of the CES against three competing feature selection and classification algorithms for multi-class data: support vector machine (SVM), random k-nearest neighbor (RKNN), and random forest (RF)

Summary

Introduction

A computational evolution system (CES) is a knowledge discovery engine that can identify subtle, synergistic relationships in large datasets. A CES can be designed for various types of data, and the user can exploit expert knowledge about the classification problem in order to improve discrimination between classes. These characteristics give CES an advantage over other classification and feature selection algorithms, particularly when the goal is to identify a small number of highly relevant, non-redundant biomarkers. Although several multi-class classification algorithms exist, the CES has the advantage that it performs better on small-sample datasets and requires fewer features to do so. In high-dimensional (‘large p, small n’) settings, better performance and interpretability are achieved through feature selection, a dimensionality reduction technique in which a small, relevant subset of the original features is selected according to some evaluation criterion. Feature selection techniques such as filter methods are performed as a data preprocessing step and implemented independently of classifier learning. Many classification algorithms instead identify a set of discriminative features by performing both feature selection and model fitting (e.g. wrapper and hybrid methods), which typically leads to better accuracy and efficiency.
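To make the filter approach concrete, the following minimal sketch ranks features before any classifier is trained. It is not the method used in this study: the scoring rule (an ANOVA-like ratio of between-class to within-class variability), the function names, and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def class_separation_score(values, labels):
    """Between-class over within-class sum of squares for one feature."""
    overall = mean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        # Constant feature scores 0; perfectly separated feature scores inf.
        return float("inf") if between > 0 else 0.0
    return between / within


def filter_select(X, labels, k):
    """Return the indices of the k best-scoring features, in column order."""
    n_features = len(X[0])
    scores = [
        class_separation_score([row[j] for row in X], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy data: 6 samples, 3 classes, 3 features (hypothetical values).
    X = [
        [1.0, 5.0, 2.0],
        [1.2, 3.0, 2.0],
        [5.0, 4.0, 2.0],
        [5.1, 5.5, 2.0],
        [9.0, 3.5, 2.0],
        [9.3, 4.5, 2.0],
    ]
    labels = ["a", "a", "b", "b", "c", "c"]
    print(filter_select(X, labels, k=1))
```

Because each feature is scored on its own, a filter of this kind cannot detect features that are informative only in combination, which is the kind of synergistic relationship wrapper and hybrid methods are better placed to capture.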

Methods

Results

Discussion

Conclusion