Abstract
Breast cancer metastasis can be fatal, and predicting metastasis is critical for establishing effective treatment strategies. RNA-sequencing (RNA-seq) is a useful tool for identifying genes that promote and support the development of metastasis. Hub gene analysis is a bioinformatics method that can effectively analyze RNA-seq results and can be used to specify the set of genes most relevant to the function of the cells involved in metastasis. Herein, a new machine learning model based on RNA-seq data, using the random forest algorithm and hub genes, was developed, and its accuracy in predicting breast cancer metastasis was estimated. Single-cell breast cancer samples (56 metastatic and 38 non-metastatic samples) were obtained from the Gene Expression Omnibus (GEO) database, and the Weighted Gene Correlation Network Analysis (WGCNA) package was used to select gene modules and hub genes (which function in mitochondrial metabolism). A machine learning prediction model using the hub gene set was devised and its accuracy was evaluated. A prediction model comprising 54-functional-gene modules and the hub gene set (NDUFA9, NDUFB5, and NDUFB3) showed an accuracy of 0.769 ± 0.02, 0.782 ± 0.012, and 0.945 ± 0.016, respectively. The test accuracy of the hub gene set was over 93%, and that of the prediction model with random forest and hub genes was over 91%. A breast cancer metastasis dataset from The Cancer Genome Atlas was used for external validation, showing an accuracy of over 91%. The hub gene assay can therefore be used to predict breast cancer metastasis by machine learning.
Highlights
RNA-sequencing (RNA-seq) is being used to diagnose cancer and predict the behavior of cancer cells [1], which is directly linked to the expression of certain genes.
These data were taken from the GEO database and were generated by a previous breast cancer study that defined cancer cells detected in patients' lymph nodes as metastatic and those detected in the breast tumor as non-metastatic [31].
Although the results of analyzing imaging factors, such as computerized tomography and magnetic resonance imaging, have been applied to machine learning to estimate the accuracy of predictive models, this approach is still difficult to apply in the clinical setting.
Summary
RNA-sequencing (RNA-seq) is being used to diagnose cancer and predict the behavior of cancer cells [1], which is directly linked to the expression of certain genes. Genes involved in metastasis can be identified by comparing RNA-seq results of confirmed metastatic and non-metastatic breast cancer samples. Genes such as SETDB1 [3], MALAT1 [4], EHMT2 [5], RAB11B-AS1 [6], STAT3 [7], and RAS [8] have been identified as playing a role in lymph node metastasis of breast cancer. Although several studies have explored these particular genes, it is still impossible to effectively predict lymph node metastasis of breast cancer solely through gene expression analysis. This limitation arises because RNA-seq results only indicate the current state of breast cancer cells. Gene modules were created using a systematic biological strategy for evaluating gene association patterns among different samples with bioinformatics tools such as WGCNA or GSEA [12].
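The idea of a hub gene within a co-expression module can be illustrated with a minimal sketch. It is not the authors' pipeline and does not use the WGCNA package. It assumes an unsigned network in which the adjacency between two genes is the absolute Pearson correlation raised to a soft-threshold power `beta` (6 is a commonly used value, assumed here), and it ranks genes by the sum of their adjacencies. The gene names `GENE_A` to `GENE_D`, the function `rank_hub_genes`, and the expression values are hypothetical and synthetic.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def rank_hub_genes(expression, beta=6):
    """Rank genes by connectivity, most connected (hub-like) first.

    expression: dict mapping gene name -> expression values across samples.
    beta: soft-threshold power (an assumed value, not taken from the paper).
    """
    genes = list(expression)
    connectivity = {
        g: sum(
            abs(pearson(expression[g], expression[h])) ** beta
            for h in genes
            if h != g
        )
        for g in genes
    }
    return sorted(genes, key=lambda g: connectivity[g], reverse=True)


# Synthetic toy data: three genes follow a shared driver, one is pure noise.
random.seed(0)
driver = [random.gauss(0, 1) for _ in range(30)]
expression = {
    "GENE_A": [v + random.gauss(0, 0.1) for v in driver],
    "GENE_B": [v + random.gauss(0, 0.1) for v in driver],
    "GENE_C": [v + random.gauss(0, 0.1) for v in driver],
    "GENE_D": [random.gauss(0, 1) for _ in range(30)],
}

ranking = rank_hub_genes(expression)
print(ranking)  # the co-expressed genes rank ahead of the noise gene
```

In a workflow like the one summarized above, the top-ranked genes of a selected module would then serve as input features for a classifier such as a random forest; that step is omitted from this sketch.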