Abstract
A biomedically explainable and validatable text mining pipeline can aid cancer gene panel discovery. We built a pipeline that contextualizes genes using text-mined co-occurrence features, applying Biomedical Natural Language Processing (BioNLP) techniques to mine the literature on cancer gene panels. From the literature we derived a 4,679 × 4,630 gene term-feature matrix. The genetic variants EGFR L858R and T790M and BRAF V600E, which are frequently mutated in cancer, are important mutation term features in the text mining results. We validated the cancer gene panel against the mutational landscape of different cancer types: the cosine similarity between gene frequencies from text mining and a statistical result from clinical sequencing data is 80.8%. Across different machine learning models, the best accuracy for predicting two gene panels, MSK-IMPACT (Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets) and the Oncomine cancer gene panel, is 0.959 and 0.989, respectively. Receiver operating characteristic (ROC) curve analysis confirmed that the neural net model achieved better prediction performance (area under the ROC curve (AUC) = 0.992). Text-mined co-occurrence features can thus contextualize each gene. We believe the approach can be used to evaluate several existing gene panels, and we show that part of a gene panel set can be used to predict the remaining genes for cancer discovery.
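The cosine similarity reported above compares two gene-frequency vectors, one from text mining and one from clinical sequencing. The following is a minimal sketch of that comparison, not the authors' implementation. The function name, the gene list, and all counts are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # An all-zero vector has no direction; treat the similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy counts (NOT data from the study). Both vectors must
# list the genes in the same order.
genes = ["EGFR", "BRAF", "KRAS", "TP53"]
text_mining_freq = [12, 7, 9, 20]   # assumed literature mention counts
sequencing_freq = [10, 5, 11, 25]   # assumed mutation counts in a cohort

similarity = cosine_similarity(text_mining_freq, sequencing_freq)
print(f"cosine similarity: {similarity:.3f}")
```

Cosine similarity depends only on the direction of the two vectors, so it is unaffected by the difference in scale between literature counts and patient counts.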
Highlights
Scientific articles have described text mining (TM) applications in cancer biology (Zhu et al, 2013; Azam et al, 2019; Wang et al, 2020)
This study develops a gene panel analysis framework that can discover the characteristics of a gene panel based on biomedical literature mining
We performed term feature selection for each individual gene panel so that the term features generated in the previous step become stronger and correspond to the target gene panel
Summary
Scientific articles have described text mining (TM) applications in cancer biology (Zhu et al, 2013; Azam et al, 2019; Wang et al, 2020). We developed a biomedically explainable and validatable text mining pipeline for cancer gene panel discovery. Building on the abovementioned studies, we established a fully integrated text mining pipeline to find gene term features, a mutational landscape heatmap, and cancer information topics. The MSK-IMPACT panel performed well in the above study and in a large-scale clinical sequencing project with more than 10,000 patients (Zehir et al, 2017). They provided a comprehensive gene panel database including actionable drug targets, cancer susceptibility genes in hematological malignancies, and solid tumors. The Oncomine Cancer Panel (OCP) is only used for the clinical screening of actionable genetic mutations in solid tumors (Luthra et al, 2017). We believe the approach can be used to evaluate several existing gene panels, and we show that part of a gene panel set can be used to predict the remaining genes.
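As a rough illustration of what a gene term-feature matrix built from co-occurrence might look like, the sketch below counts how many documents mention each gene together with each term. It is a simplified stand-in under stated assumptions, not the pipeline described here. The function `build_cooccurrence`, the toy sentences, and the whole-document co-occurrence window are all hypothetical.

```python
import re


def build_cooccurrence(documents, genes, terms):
    """Return a gene x term matrix of document-level co-occurrence counts.

    A gene and a term "co-occur" here if both appear as tokens in the
    same document (an assumed, deliberately simple definition).
    """
    counts = {gene: {term: 0 for term in terms} for gene in genes}
    for doc in documents:
        # Uppercase and split on non-alphanumerics for a crude tokenizer.
        tokens = set(re.findall(r"[A-Z0-9]+", doc.upper()))
        present_genes = [g for g in genes if g.upper() in tokens]
        present_terms = [t for t in terms if t.upper() in tokens]
        for gene in present_genes:
            for term in present_terms:
                counts[gene][term] += 1
    # Rows follow the order of `genes`, columns the order of `terms`.
    return [[counts[g][t] for t in terms] for g in genes]


# Hypothetical toy sentences (NOT text from the mined literature).
documents = [
    "EGFR L858R mutation predicts response in lung cancer.",
    "Acquired EGFR T790M mutation confers resistance.",
    "BRAF V600E is frequent in melanoma.",
    "EGFR L858R and T790M were detected together.",
]
genes = ["EGFR", "BRAF"]
terms = ["L858R", "T790M", "V600E"]

matrix = build_cooccurrence(documents, genes, terms)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Each row of such a matrix is a feature vector that contextualizes one gene, which is the kind of input a classifier could use to predict panel membership.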