Abstract
Somatic mutations often occur at highly recurrent sites in protein sequences, which suggests that the positional clustering of somatic missense mutations can be used to identify driver genes. However, traditional clustering approaches have several problems: they over-fit the background signal, their clustering algorithms are not well suited to mutation data, and their performance in identifying genes with low-frequency mutations needs improvement. In this paper, we propose a linear clustering algorithm based on likelihood ratio test knowledge to identify driver genes. First, the polynucleotide mutation rate is calculated based on prior knowledge from the likelihood ratio test. Then, a simulated data set is generated from the background mutation rate model. Finally, an unsupervised peak clustering algorithm is applied separately to the somatic mutation data and the simulated data to identify driver genes. The experimental results show that our method achieves a better balance of precision and sensitivity. It can also identify driver genes missed by other methods, making it an effective complement to them. We also discover some potential linkages between genes, and between genes and mutation sites, which are of great value to targeted drug therapy research.
Method framework: Our proposed model framework is as follows.
a. Count the mutation sites and the number of mutations in tumor gene elements.
b. Count the nucleotide context mutation frequencies based on likelihood ratio test knowledge to obtain the background mutation rate model.
c. Using Monte Carlo simulation, randomly sample data sets with the same number of mutations as the corresponding gene elements to obtain simulated mutation data; the sampling frequency of each mutation site depends on the polynucleotide mutation rate.
d. Cluster the original mutation data and the randomly reconstructed simulated mutation data separately by peak density, and obtain the corresponding clustering scores.
e. From step d, obtain the clustering statistics and the score of each gene segment for the original single nucleotide mutation data.
f. Calculate the p-value of each gene fragment from the observed score (step e) and the simulated clustering scores (step g).
g. From step d, obtain the clustering statistics and the score of each gene segment for the simulated single nucleotide mutation data.
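The simulation and scoring steps of the framework (c through g) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `window` parameter, and the uniform `site_weights` are hypothetical; in the described method the per-site weights would come from the polynucleotide context mutation rates. The scoring function is a deliberately simple local-density score standing in for the peak density clustering named in the abstract. The p-value is the usual empirical Monte Carlo estimate with an add-one correction, which is also an assumption, since the abstract does not state the formula.

```python
import random
from collections import Counter


def clustering_score(positions, window=3):
    """Toy density score (a stand-in, not the paper's peak clustering).

    For every mutated position, count the mutations lying within `window`
    residues of it, then return the largest such count divided by the
    total number of mutations.
    """
    if not positions:
        return 0.0
    counts = Counter(positions)
    best = 0
    for p in counts:
        density = sum(c for q, c in counts.items() if abs(q - p) <= window)
        best = max(best, density)
    return best / len(positions)


def simulate_positions(n_mutations, site_weights, rng):
    """Draw n_mutations sites with probability proportional to their weight."""
    sites = list(site_weights)
    weights = [site_weights[s] for s in sites]
    return rng.choices(sites, weights=weights, k=n_mutations)


def empirical_p_value(observed, site_weights, n_sim=1000, window=3, seed=0):
    """Compare the observed score with scores from simulated data sets."""
    rng = random.Random(seed)
    obs_score = clustering_score(observed, window)
    hits = 0
    for _ in range(n_sim):
        # Each simulated set has the same number of mutations as the observed one.
        sim = simulate_positions(len(observed), site_weights, rng)
        if clustering_score(sim, window) >= obs_score:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return obs_score, (hits + 1) / (n_sim + 1)


# Hypothetical gene segment of 200 residues with a uniform background
# (assumption: real weights would reflect context mutation rates).
site_weights = {pos: 1.0 for pos in range(1, 201)}

# Hypothetical observed mutations, most of them concentrated near residue 50.
observed = [50, 50, 51, 50, 49, 50, 52, 50, 120, 7]

score, p = empirical_p_value(observed, site_weights)
print(f"clustering score = {score:.2f}, empirical p-value = {p:.4f}")
```

In this toy setting a segment whose mutations pile up around one residue receives a small p-value, while evenly spread mutations do not, which is the behavior the framework relies on to flag candidate driver genes.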
More From: Interdisciplinary sciences, computational life sciences