Abstract
Microsatellite instability (MSI) is characterized by a high degree of polymorphism in microsatellite lengths caused by a deficiency in the mismatch repair (MMR) system. MSI is associated with several tumor types, and MSI status is an important indicator of patient prognosis. Conventional clinical diagnosis of MSI examines PCR products of a panel of microsatellite markers using electrophoresis (MSI-PCR), which is laborious, costly, and time-consuming. We developed MSIpred, a Python package for automatic MSI classification using a support vector machine (SVM), a machine learning technique. MSIpred computes 22 features characterizing tumor somatic mutational load from mutation data in mutation annotation format (MAF) generated from paired tumor-normal exome sequencing data, and then uses these features to predict tumor MSI status with an SVM classifier trained on MAF data from 1074 tumors belonging to four types. Evaluation of MSIpred on an independent testing set, MAF data from another 358 tumors, achieved an overall accuracy of ≥98% and an area under the receiver operating characteristic (ROC) curve of 0.967. Further analysis of discrepant cases revealed that the discrepancies were partially due to misclassification by MSI-PCR. Additional testing of MSIpred on non-TCGA data also validated its good classification performance. These results indicate that MSIpred is a robust pan-tumor MSI classification tool and can serve as a complementary diagnostic to MSI-PCR in MSI diagnosis.
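The two evaluation metrics quoted in the abstract, overall accuracy and area under the ROC curve, can be illustrated with a minimal sketch. This is not MSIpred code: the function names, toy labels (1 for MSI-H, 0 for non-MSI-H), scores, and the 0.5 decision threshold are all assumptions chosen only to show how the two quantities are computed.

```python
def accuracy(y_true, y_pred):
    """Fraction of tumors whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via its rank interpretation.

    Equals the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0),
    with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical reference labels and classifier scores for eight tumors.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

# Assumed decision threshold of 0.5 to turn scores into hard labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

toy_accuracy = accuracy(y_true, y_pred)
toy_auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen threshold, whereas the AUC summarizes how well the scores rank MSI-H tumors above non-MSI-H tumors across all thresholds, which is why the two are usually reported together.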
Highlights
Microsatellites are tandemly repeated sequences with typical repeat unit length varying from 1 to 6 bases[1]
We explored the distributions of somatic mutational load in tumors classified by MSI-PCR as microsatellite instability high (MSI-H) or non-MSI-H (microsatellite stable (MSS) and microsatellite instability low (MSI-L)), using their corresponding mutation annotation format (MAF) files obtained from The Cancer Genome Atlas (TCGA) Research Network[22], and subsequently developed a Python package, MSIpred, implementing a pan-tumor binary MSI classifier to predict MSI status from tumor MAF files
According to the tumor clinical data provided by TCGA, 1123 of the 1432 tumors were determined to be microsatellite stable (MSS) by MSI-PCR, and the other 309 tumors were determined to be MSI-H
Summary
Microsatellites are tandemly repeated sequences with typical repeat unit length varying from 1 to 6 bases[1]. Slippage events during DNA replication can lead to the gain or loss of repeat units at microsatellite loci throughout the genome. Under normal circumstances, these spontaneous mutations are sensed and corrected by the mismatch repair (MMR) system. MSIsensor[10] and mSINGS[11] determine tumor MSI status by measuring the prevalence of unstable microsatellite loci in paired tumor-normal sequencing data. Both tools perform well, but both require substantial computational resources because they directly examine aligned reads in BAM format. MSIseq and MOSAIC implement decision tree classifiers for MSI classification that depend on just a single feature derived from tumor somatic mutation information, which makes them prone to overfitting their training datasets. We explored the distributions of somatic mutational load in tumors classified by MSI-PCR as MSI-H or non-MSI-H (MSS and MSI-L), using their corresponding MAF files obtained from The Cancer Genome Atlas (TCGA) Research Network[22], and subsequently developed a Python package, MSIpred, implementing a pan-tumor binary MSI classifier to predict MSI status from tumor MAF files.
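To make the idea of deriving mutational-load features from a MAF file concrete, here is a minimal sketch. It is not MSIpred's implementation, and it does not reproduce the package's features: it only counts single-nucleotide variants and indels per megabase for each tumor. The function name `mutational_load`, the 44 Mb exome size, and the toy records are assumptions; `Variant_Type` and `Tumor_Sample_Barcode` are standard MAF column names.

```python
import csv
import io
from collections import defaultdict

# Assumed capture size in megabases; replace with the size of the actual exome kit.
EXOME_SIZE_MB = 44.0


def mutational_load(maf_text, exome_size_mb=EXOME_SIZE_MB):
    """Return {tumor_barcode: {"snv_per_mb": float, "indel_per_mb": float}}."""
    counts = defaultdict(lambda: {"snv": 0, "indel": 0})
    # MAF files are tab-delimited; lines starting with '#' are comments.
    lines = (ln for ln in io.StringIO(maf_text) if not ln.startswith("#"))
    for row in csv.DictReader(lines, delimiter="\t"):
        tumor = row["Tumor_Sample_Barcode"]
        variant_type = row["Variant_Type"]
        if variant_type == "SNP":
            counts[tumor]["snv"] += 1
        elif variant_type in ("INS", "DEL"):
            counts[tumor]["indel"] += 1
    return {
        tumor: {
            "snv_per_mb": c["snv"] / exome_size_mb,
            "indel_per_mb": c["indel"] / exome_size_mb,
        }
        for tumor, c in counts.items()
    }


# Toy MAF with only the columns this sketch needs (hypothetical records).
TOY_MAF = "\n".join([
    "#version 2.4",
    "Hugo_Symbol\tVariant_Type\tTumor_Sample_Barcode",
    "GENE_A\tSNP\tTUMOR_1",
    "GENE_B\tDEL\tTUMOR_1",
    "GENE_C\tINS\tTUMOR_1",
    "GENE_D\tDEL\tTUMOR_1",
    "GENE_A\tSNP\tTUMOR_2",
    "GENE_E\tSNP\tTUMOR_2",
])

# A 2 Mb "exome" keeps the toy numbers easy to read.
toy_loads = mutational_load(TOY_MAF, exome_size_mb=2.0)
```

Because this approach reads only the mutation calls rather than aligned reads, it avoids the BAM-level processing that the read-based tools require. A per-tumor feature vector of this kind is what a classifier such as an SVM would then take as input.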