Abstract

Background: Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and are often used to separate samples into groups for research or clinical treatment. Although many methods and applications are available in the literature, powerful methods that can account for complex data and detect the most informative signatures are still lacking.

Methods: In this article, we present a new framework for identifying gene signatures in RNA-seq data using Pareto-optimal cluster size identification. We first performed pre-filtering and normalization, then used the empirical Bayes test in the Limma package to identify differentially expressed genes (DEGs). Next, we applied a multi-objective optimization technique, “Multi-objective optimization for collecting cluster alternatives” (MOCCA, available as an R package), to these DEGs to find the Pareto-optimal cluster size, and then applied k-means clustering to the RNA-seq data using that cluster size. The best cluster was selected by computing the average pairwise Spearman’s correlation score among all genes belonging to each cluster (module). The best cluster is treated as the signature for the respective disease or cellular condition.

Results: We applied our framework to a cervical cancer RNA-seq dataset comprising 253 squamous cell carcinoma (SCC) samples and 22 adenocarcinoma (ADENO) samples. Limma analysis of SCC versus ADENO samples identified 582 DEGs, of which 260 were up-regulated and 322 were down-regulated. Using MOCCA, we obtained seven Pareto-optimal clusters. The best cluster contained 35 DEGs, all of them up-regulated. For validation, we ran the PAMR (prediction analysis for microarrays) classifier on the selected best cluster and assessed its classification performance. Our evaluation, measured by sensitivity, specificity, precision, and accuracy, showed high confidence.

Conclusions: Our framework identified a multi-objective-based cluster, treated as a signature, that can classify the disease and control groups of samples with high classification performance (accuracy 0.935) for the corresponding disease. The method can be used to find signatures in any RNA-seq or microarray data.
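The cluster-selection step described in the Methods can be illustrated with a small sketch. This is not the authors' code: it assumes that expression values are stored as a dictionary from gene name to a list of per-sample values, that "best" means the highest mean signed pairwise Spearman correlation, and that no gene is constant across samples. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_pairwise_spearman(expr, genes):
    """Average Spearman correlation over all gene pairs in one cluster."""
    pairs = list(combinations(genes, 2))
    if not pairs:  # a cluster with fewer than two genes has no pairs
        return float("-inf")
    return sum(spearman(expr[a], expr[b]) for a, b in pairs) / len(pairs)


def best_cluster(expr, clusters):
    """Return the label of the cluster with the highest mean correlation."""
    return max(clusters, key=lambda c: mean_pairwise_spearman(expr, clusters[c]))


# Toy data (hypothetical): four genes measured in five samples.
expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 5, 8, 9],
    "g3": [1, 3, 2, 5, 4],
    "g4": [5, 1, 4, 2, 3],
}
clusters = {"A": ["g1", "g2"], "B": ["g3", "g4"]}
print(best_cluster(expr, clusters))
```

In this toy example the genes in cluster A rise together across samples, so that cluster scores higher than cluster B and would be taken as the signature. In practice the cluster labels would come from k-means run at the Pareto-optimal cluster size.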

Highlights

  • Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and are often used to separate samples into groups for research or clinical treatment

  • Voom normalization [18] is very useful for RNA-seq data, and the Linear Models for Microarray Data (Limma) tool [2, 19, 20, 21] is useful for analyzing this kind of data

  • In this article, we developed a new framework for identifying gene signatures using Pareto-optimal cluster identification for RNA-seq data


Summary

Introduction

Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and are often used to separate samples into groups for research or clinical treatment. For RNA-seq data, proper selection of the normalization method and statistical test is very important; otherwise, incorrect p-values may be generated for the transcripts. Before applying any clustering algorithm, it is necessary to estimate the Pareto-optimal cluster size, using the combination of several clustering algorithms and various cluster validation indices as multiple objectives. Doing so reduces the error rate when clustering the underlying data. MOCCA (Multi-objective optimization for collecting cluster alternatives) [27] is a recent, robust estimator of the Pareto-optimal cluster size that aggregates the best cluster numbers from various clustering algorithms and several cluster validation indices as its multiple objectives. MOCCA ranks the Pareto-optimal cluster sizes based upon their domination
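The idea of Pareto-optimal cluster sizes ranked by domination can be sketched as follows. This is a generic illustration of Pareto dominance, not MOCCA's actual implementation: it assumes each candidate cluster size has one score per objective (for example, one per combination of clustering algorithm and validation index), that higher scores are better, and that front members are ordered by how many other candidates they dominate. The scores below are made up for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Cluster sizes whose score vector is not dominated by any other candidate."""
    return sorted(
        k for k, s in scores.items()
        if not any(dominates(t, s) for j, t in scores.items() if j != k)
    )


def rank_by_domination(scores):
    """Order the Pareto front by number of candidates dominated (most first)."""
    front = pareto_front(scores)
    counts = {
        k: sum(dominates(scores[k], t) for j, t in scores.items() if j != k)
        for k in front
    }
    return sorted(front, key=lambda k: (-counts[k], k))


# Hypothetical scores: cluster size -> one value per objective (higher is better).
scores = {
    2: (0.40, 0.55, 0.30),
    3: (0.62, 0.60, 0.45),
    4: (0.58, 0.71, 0.50),
    5: (0.50, 0.52, 0.40),
    6: (0.35, 0.50, 0.52),
}
print(pareto_front(scores))
print(rank_by_domination(scores))
```

Here cluster sizes 2 and 5 are each beaten on every objective by another candidate, so they drop out of the front. The remaining sizes each win on at least one objective, which is why a multi-objective method returns a set of alternatives and not a single answer.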
