Abstract

Background: We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases relative to the training reference dataset. Conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets have the same distribution under all conditions, because class-specific gene signatures change with the condition. As a result, a trained classifier may work well under one condition but not under another.

Methods: To address this limitation of current normalization approaches, we propose a novel algorithm called CrossLink (CL). CL recognizes that there is no universal, condition-independent normalization mapping of signatures. Instead, it exploits the fact that a signature is unique to its associated class under any condition, and it employs an unsupervised clustering algorithm to discover this unique signature.

Results: We assessed the performance of CL for cross-condition prediction of PAM50 subtypes of breast cancer using a simulated dataset modeled after TCGA BRCA tumor samples with a cross-validation scheme, as well as datasets with known and unknown PAM50 classification. CL achieved prediction accuracy >73 %, the highest among the methods we evaluated. We also applied the algorithm to a set of breast cancer tumors from an Arab population to assign a PAM50 classification to each tumor based on its gene expression profile.

Conclusions: We proposed CrossLink, a novel algorithm for cross-condition prediction of cancer classes. In all test datasets, CL showed robust and consistent improvement in prediction performance over other state-of-the-art normalization and classification algorithms.

Highlights

  • We considered the prediction of cancer classes using patient gene expression profiles that contain both systematic and condition-specific biases when compared with the training reference dataset

  • In the absence of true PAM50 labels, we propose the Indirect Summed Evaluation Probability (ISEP) to evaluate the PAM50 prediction results

  • This section is separated into three parts: (1) the ability of CL to perform PAM50 classification is first demonstrated in several scenarios; (2) the application of CL to Cancer2000 classification is demonstrated; (3) microarray data from breast cancer patients in Qatar are analyzed

Summary

Introduction

We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases relative to the training reference dataset. Conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets have the same distribution under all conditions, because class-specific gene signatures change with the condition. Overcoming the systematic and condition-specific biases present in expression data, which arise from different technological platforms, varying experimental or measurement conditions, and heterogeneity in patient age, gender and race, remains an incompletely solved problem. The well-known Microarray Quality Control (MAQC) project spearheaded algorithm development on this front and demonstrated that, with careful algorithm-based normalization, consistently differentially expressed genes can be reproduced in data from different platforms [11]. However, a normalization algorithm may work well under one condition but not under another [12].

