Abstract

Background
High-throughput genomic and proteomic data have important applications in medicine, including the prevention, diagnosis, treatment, and prognosis of diseases, and in molecular biology, for example pathway identification. Many such applications can be formulated as classification and dimension reduction problems in machine learning. Accurately classifying these data is computationally challenging owing to, among other factors, their high dimensionality, noise, and redundancy. The principle of sparse representation has been applied to the analysis of high-dimensional biological data within clustering, classification, and dimension reduction frameworks. However, the existing sparse representation methods are inefficient, their kernel extensions are not well addressed, and sparse representation techniques have not yet been comprehensively studied in bioinformatics.

Results
In this paper, a Bayesian treatment of sparse representation is presented. Various sparse coding and dictionary learning models are discussed. We propose a fast parallel active-set optimization algorithm for each model, and devise kernel versions based on their dimension-free property. These models are applied to the classification of high-dimensional biological data.

Conclusions
In our experiments, we compared our models with other methods in terms of both accuracy and computing time. The results show that our models achieve satisfactory accuracy and are highly efficient.
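As a concrete illustration of how sparse representation can drive classification, the sketch below codes a test sample as an l1-regularized (lasso) combination of training samples and assigns the label of the class whose samples best reconstruct it. This is a generic SRC-style sketch using scikit-learn's coordinate-descent Lasso as a stand-in solver; it does not reproduce the paper's fast parallel active-set algorithm or its kernel variants, and the function name and regularization weight are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

def sparse_representation_classify(A, labels, b, lam=0.01):
    """Classify test sample b given a training matrix A (columns = samples).

    A      : array of shape (n_features, n_train), training samples as columns
    labels : array of shape (n_train,), class label of each column of A
    b      : array of shape (n_features,), the test sample
    lam    : l1 regularization weight (illustrative value, not from the paper)
    """
    # Sparse coding step: min_x 0.5*||Ax - b||_2^2 + lam*||x||_1
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    coder.fit(A, b)
    x = coder.coef_

    # Classification step: pick the class whose columns reconstruct b best.
    best_label, best_residual = None, np.inf
    for cls in np.unique(labels):
        mask = labels == cls
        residual = np.linalg.norm(A[:, mask] @ x[mask] - b)
        if residual < best_residual:
            best_label, best_residual = cls, residual
    return best_label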

Highlights

  • Studies in biology and medicine have been revolutionized by the invention of many high-throughput sensing techniques

  • This study comprehensively investigates l1-regularized and non-negative sparse representation models for the classification of high-dimensional biological data

  • We prove that the sparse coding and dictionary learning models are equivalent to maximum a posteriori (MAP) estimations (a sketch of this equivalence is given after this list)
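The equivalence stated in the last highlight can be sketched as follows, assuming Gaussian observation noise and an i.i.d. Laplace prior on the coefficients; these are standard assumptions used here for illustration, and the paper's exact priors may differ.

% Sketch: l1-regularized sparse coding as MAP estimation.
% Assumptions (illustrative, not taken from the paper):
%   b = Ax + e with e ~ N(0, sigma^2 I), and x_i ~ Laplace(0, beta) independently.
\begin{align*}
\hat{x}_{\mathrm{MAP}}
  &= \arg\max_x \; p(x \mid b, A)
   = \arg\max_x \; p(b \mid A, x)\, p(x) \\
  &= \arg\min_x \; \frac{1}{2}\lVert A x - b \rVert_2^2
      + \lambda \lVert x \rVert_1,
  \qquad \lambda = \frac{\sigma^2}{\beta}.
\end{align*}

An exponential prior supported on $x \ge 0$ yields the non-negative variants in the same way.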



Introduction
NNLS denotes the non-negative least squares problem $\min_{x \ge 0} \frac{1}{2}\lVert A x - b \rVert_2^2$, where the columns of $A$ hold the training samples and $b$ is the sample to be represented.
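Below is a minimal sketch of solving the NNLS problem above with SciPy's Lawson-Hanson active-set solver; this generic routine stands in for the paper's fast parallel active-set algorithm, and the toy data shapes are assumptions.

import numpy as np
from scipy.optimize import nnls

# Toy data (assumed shapes): columns of A are training samples, b is a new sample.
rng = np.random.default_rng(0)
A = rng.random((500, 40))                        # 500 features, 40 training samples
x_true = np.maximum(rng.standard_normal(40), 0)  # non-negative ground-truth coefficients
b = A @ x_true

# Solve min_{x >= 0} ||Ax - b||_2 with the Lawson-Hanson active-set method.
x, residual_norm = nnls(A, b)
print("nonzero coefficients:", int(np.sum(x > 1e-8)), "of", x.size)
print("residual norm:", residual_norm)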
