Identifying novel transcript biomarkers for hepatocellular carcinoma (HCC) using RNA-Seq datasets and machine learning

Rajinder Gupta,Jos Kleinjans,Florian Caiment

doi:10.1186/s12885-021-08704-9


Open Access



Journal: BMC Cancer
Publication Date: Aug 27, 2021
Citations: 10
License type: open-access

Affiliation: Maastricht University

Abstract

Background: Hepatocellular carcinoma (HCC) is one of the leading causes of cancer death in the world owing to limitations in its prognosis. The current prognosis approaches include radiological examination and detection of serum biomarkers; however, both have limited efficiency and are ineffective in early prognosis. Given these limitations, we propose using RNA-Seq data to evaluate putative higher-accuracy biomarkers at the transcript level that could help in early prognosis.

Methods: To identify such potential transcript biomarkers, RNA-Seq data for healthy liver and various HCC cell models were subjected to five different machine learning algorithms: random forest, K-nearest neighbor, Naïve Bayes, support vector machine, and neural networks. Several metrics, namely sensitivity, specificity, Matthews correlation coefficient (MCC), informedness, and AUC-ROC (except for support vector machine), were evaluated. The algorithms that produced the highest values for all metrics were chosen to extract the top features, which were then subjected to recursive feature elimination. Recursive feature elimination yielded the smallest set of features able to differentiate between the healthy and HCC cell models.

Results: The metrics show that the known protein biomarkers for HCC perform comparatively worse than complete transcriptomics data. Among the machine learning algorithms, random forest and support vector machine demonstrated the best performance. Applying recursive feature elimination to the top features of random forest and support vector machine selected three transcripts that had an accuracy of 0.97 and a kappa of 0.93. Of the three transcripts, two were protein coding (PARP2-202 and SPON2-203) and one was non-coding (CYREN-211). Lastly, we demonstrated that these three selected transcripts outperformed randomly chosen sets of three transcripts (15,000 combinations); hence, they were not chance findings and could be interesting candidates for new HCC biomarker development.

Conclusion: Using RNA-Seq data combined with machine learning approaches can aid in finding novel transcript biomarkers. The three biomarkers identified (PARP2-202, SPON2-203, and CYREN-211) presented the highest accuracy among all transcripts in differentiating the healthy and HCC cell models. The machine learning pipeline developed in this study can be used for any RNA-Seq dataset to find novel transcript biomarkers.

Code: www.github.com/rajinder4489/ML_biomarkers
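The evaluation metrics named in the abstract can all be derived from the four counts of a binary confusion matrix. The following is a minimal Python sketch of those standard formulas, not the authors' implementation (their pipeline is in the linked repository). The function name `binary_metrics`, the convention that "positive" means HCC, and the example counts are assumptions made purely for illustration.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts.

    Assumed convention (hypothetical): "positive" = HCC, "negative" = healthy.
    Both classes must be present, otherwise sensitivity or specificity
    is undefined (division by zero).
    """
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / n
    informedness = sensitivity + specificity - 1  # Youden's J statistic

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "informedness": informedness,
        "mcc": mcc,
        "kappa": kappa,
    }


# Illustrative counts only; these are not results from the study.
example = binary_metrics(tp=45, fp=10, tn=40, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Informedness and MCC both combine performance on the two classes into one number, so unlike accuracy alone they are not inflated by doing well on only one class. This matters when the healthy and HCC groups differ in size.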

Highlights

The liver, one of the largest organs in the body, performs various important functions, such as filtering harmful substances from the blood to be excreted from the body, producing bile to help in the digestion of fats from food, and storing glycogen that will be used for energy
According to independent reports by the World Health Organization (WHO) [2] and the US Centers for Disease Control and Prevention (CDC) [3], liver cancer is among the top causes of cancer death worldwide; hepatocellular carcinoma (HCC) is the most common type of primary liver cancer, accounting for ~80% of liver cancers
RNA-Seq data for all 250 cell models were searched on the European Nucleotide Archive (ENA) through its application programming interface (API), retaining only data generated on Illumina HiSeq or newer platforms with a paired-end library layout
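
As a rough illustration of the ENA search described in the last highlight, the sketch below builds a query URL for the ENA Portal API search endpoint. It is not taken from the authors' code. The helper name `build_ena_query_url`, the use of the `cell_line` field to match a cell model, and the chosen return fields are assumptions that should be checked against the current ENA documentation. The sketch only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# ENA Portal API search endpoint.
ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_ena_query_url(cell_line: str) -> str:
    """Build a search URL for paired-end Illumina RNA-Seq runs of one cell model.

    Hypothetical helper: matching on the `cell_line` field is an assumption;
    the study may have identified cell models differently.
    """
    query = (
        f'cell_line="{cell_line}" '
        'AND library_strategy="RNA-Seq" '
        'AND library_layout="PAIRED" '
        'AND instrument_platform="ILLUMINA"'
    )
    params = {
        "result": "read_run",   # search at the level of sequencing runs
        "query": query,
        "fields": "run_accession,instrument_model,library_layout,fastq_ftp",
        "format": "tsv",
    }
    return f"{ENA_SEARCH_URL}?{urlencode(params)}"


# Example with HepG2, a widely used HCC cell line (illustrative choice).
url = build_ena_query_url("HepG2")
print(url)
```

The "HiSeq or newer" restriction is not encoded in the query here. Under this sketch it would be applied afterwards by filtering the returned `instrument_model` column.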

Summary

Introduction

The liver, one of the largest organs in the body, performs various important functions, such as filtering harmful substances from the blood to be excreted from the body, producing bile to help in the digestion of fats from food, and storing glycogen (sugar) that will be used for energy. Hepatocellular carcinoma (HCC) is one of the leading causes of cancer death in the world owing to limitations in its prognosis. The current prognosis approaches, radiological examination and detection of serum biomarkers, both have limited efficiency and are ineffective in early prognosis. Given these limitations, we propose using RNA-Seq data to evaluate putative higher-accuracy biomarkers at the transcript level that could help in early prognosis.
