Feature Selection for Breast Cancer Classification by Integrating Somatic Mutation and Gene Expression.

Qin Jiang,Min Jin

doi:10.3389/fgene.2021.629946

Abstract

Exploring the molecular mechanisms of breast cancer is essential for the early prediction, diagnosis, and treatment of cancer patients. The large volume of data produced by high-throughput sequencing makes it difficult to identify driver mutations and a minimal optimal set of genes that are critical to cancer classification. In this study, we propose a novel method that identifies mutated genes associated with breast cancer without any prior information. The somatic mutation data are processed into a mutation matrix, from which the mutation frequency of each gene is obtained. By setting a reasonable threshold on the mutation frequency, a mutated gene set is filtered from the mutation matrix. The gene expression data are used to generate a gene expression matrix, and the mutated gene set is mapped onto this matrix to construct a co-expression profile. For feature selection, we propose a staged algorithm that uses fold change and false discovery rate to select differentially expressed genes, mutual information to remove irrelevant and redundant features, and an embedded method based on a gradient boosting decision tree with Bayesian optimization to obtain an optimal model. For evaluation, we propose a weighted metric that modifies traditional accuracy to address sample imbalance. We apply the proposed method to The Cancer Genome Atlas breast cancer data and identify a mutated gene set, among which the implicated genes are oncogenes or tumor suppressors previously reported to be associated with carcinogenesis. For comparison with the integrative network, we also run the optimal model on gene expression alone and on the gold standard PAM50. The results show that the integrative network outperforms gene expression alone and PAM50 on average for most metrics, which indicates that integrating multiple data sources is effective and can discover mutated genes associated with breast cancer.
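
The mutation-frequency filtering and mapping step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the gene labels, the toy values, and the threshold of 0.5 are all hypothetical, since the abstract does not state the threshold that was used.

```python
def mutation_frequency(mutation_matrix):
    """Fraction of samples in which each gene is mutated.

    mutation_matrix maps gene -> list of 0/1 flags, one flag per sample.
    """
    return {gene: sum(flags) / len(flags) for gene, flags in mutation_matrix.items()}


def filter_mutated_genes(mutation_matrix, threshold):
    """Keep genes whose mutation frequency is at least `threshold`."""
    freqs = mutation_frequency(mutation_matrix)
    return sorted(gene for gene, freq in freqs.items() if freq >= threshold)


def map_to_expression(expression, gene_set):
    """Restrict the expression matrix to genes in the mutated gene set."""
    return {gene: expression[gene] for gene in gene_set if gene in expression}


# Toy data (made up for illustration): 3 genes, 4 samples.
mutations = {
    "GENE_A": [1, 1, 0, 1],  # mutated in 3 of 4 samples
    "GENE_B": [1, 0, 1, 0],  # mutated in 2 of 4 samples
    "GENE_C": [0, 0, 0, 1],  # mutated in 1 of 4 samples
}
expression = {
    "GENE_A": [2.1, 3.4, 1.9, 2.8],
    "GENE_B": [5.0, 4.7, 5.3, 4.9],
    "GENE_D": [0.2, 0.1, 0.3, 0.2],  # expressed but not in the mutation data
}

selected = filter_mutated_genes(mutations, threshold=0.5)  # hypothetical threshold
profile = map_to_expression(expression, selected)
```

Here the mapping simply keeps the expression rows of the retained genes; how the paper builds the co-expression profile from that mapping is not specified in the abstract.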

Highlights

Breast cancer is considered the most prevalent cancer among women and the second most common cause of death in both developed and developing countries.
This research presents a staged feature selection method for breast cancer classification based on gene expression and somatic mutation datasets.
Fold change (FC) and false discovery rate (FDR) were used to select differentially expressed genes, mutual information (MI) was adopted to remove irrelevant and redundant features, and an embedded method based on a gradient boosting decision tree (GBDT) with Bayesian optimization was used to obtain the informative features.
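
The first stage of the selection above (fold change plus false discovery rate) might look something like the sketch below. It assumes per-gene p-values are already available from some differential expression test, uses the Benjamini-Hochberg procedure as one common way to control the FDR, and uses cutoffs of 1.0 (absolute log2 fold change) and 0.05 (adjusted p-value) purely as placeholders. None of these choices is confirmed by the source, and the mutual information and GBDT stages are not covered here.

```python
import math


def log2_fold_change(tumor, normal):
    """log2 of the ratio of mean expression in tumor vs. normal samples."""
    mean_tumor = sum(tumor) / len(tumor)
    mean_normal = sum(normal) / len(normal)
    return math.log2(mean_tumor / mean_normal)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, fold_changes, pvalues, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Keep genes passing both the |log2 FC| and adjusted p-value cutoffs."""
    qvalues = benjamini_hochberg(pvalues)
    return [
        gene
        for gene, fc, q in zip(genes, fold_changes, qvalues)
        if abs(fc) >= fc_cutoff and q <= fdr_cutoff
    ]


# Toy inputs (made up for illustration).
genes = ["g1", "g2", "g3", "g4"]
fold_changes = [2.0, 1.5, 0.2, -3.0]
pvalues = [0.01, 0.04, 0.03, 0.5]
degs = select_degs(genes, fold_changes, pvalues)
```

In this toy run only `g1` passes both cutoffs: `g2` has a large enough fold change but an adjusted p-value just above 0.05, which shows why the two filters are applied together rather than separately.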

Summary

Introduction

Breast cancer is considered the most prevalent cancer among women and the second most common cause of death in both developed and developing countries. Multiple factors, including genomic, transcriptomic, and epigenomic alterations, are involved in its formation and development. Distinguishing driver mutations from passenger mutations, which have no critical effect on cancer cells, is a crucial and challenging step in understanding the molecular mechanisms of cancer. Doing so can guide effective treatment and prognosis for cancer patients and promote the development of targeted drugs. Because of the complexity of the cancer genome, driver genes contain both driver mutations and passenger mutations. This makes this kind of approach sometimes ineffective.

Journal: Frontiers in Genetics	Publication Date: Feb 26, 2021
Citations: 10	License type: CC BY 4.0

Similar Papers

Integrating mutation and gene expression cross-sectional data to infer cancer progression.
Julia L Fleck ... Ana B Pavel
BMC Systems Biology | VOL. 10
25 Jan 2016

Selected Aspects of Molecular Diagnostics of Constitutional Alterations in BRCA1 and BRCA2 Genes Associated with Increased Risk of Breast Cancer in the Polish Population
Bohdan Górski
Hereditary Cancer in Clinical Practice | VOL. 4
01 Jan 2006

Differential Allele-Specific Expression Uncovers Breast Cancer Genes Dysregulated by Cis Noncoding Mutations.
Pawel F Przytycki ... Mona Singh
Cell Systems | VOL. 10
01 Feb 2020

Heterozygous Mutations in DNA Repair Genes and Hereditary Breast Cancer: A Question of Power
Nathan A Ellis ... Kenneth Offit
PLoS Genetics | VOL. 8
27 Sep 2012
