Abstract

Background: DNA extracted from formalin-fixed and paraffin-embedded (FFPE) tissues is fragmented and contains a significant amount of single-stranded DNA (ssDNA). We found that most of the false-positive mutations in clinical sequencing of FFPE samples are ssDNA-derived artifacts. During the end-repair step of library preparation, chimeric reads are generated by mis-annealing of ssDNA molecules that comprise homologous sequences containing a mismatched base. The mismatched base is called as a false-positive mutation, and its position is biased towards the effective ends of the individual reads because one side of the chimeric read is expected to be soft-clipped by the mapping pipeline. Based on this theory, we developed a post hoc filter, MicroSEC, to predict such artifacts in FFPE samples.

Methods: Fifty-three fresh frozen (FF) and 190 FFPE normal breast tissue samples and 23 FF and 33 FFPE breast cancer samples were obtained from 26 patients and subjected to capture-based panel sequencing (Todai OncoPanel, TOP). We also obtained TOP data from 54 FFPE samples of various tumor types. MicroSEC was developed to predict artifacts from BAM files based on the positional bias of mutations within each read. The predictions were validated by amplicon-based sequencing of 97 mutations. We also developed a model that predicts artifacts from the mutated base information alone, without the corresponding BAM files, using 5,034 MicroSEC predictions as supervised training data with the LightGBM technique. A total of 742,030 mutations from the AACR Project GENIE were examined with the model.

Results: Among the normal breast tissues, we identified 0.3 and 11.7 somatic mutations per sample in FF and FFPE specimens, respectively, of which 0 (0%) and 10.1 (86.0%) were filtered out by MicroSEC. Two unique mutations with a variant allele frequency of >50% in FFPE samples were eliminated. From the breast cancer specimens, we identified 4.0 and 10.7 mutations per sample in FF and FFPE specimens, respectively, of which 0 (0%) and 3.2 (30.2%) were filtered out. In the clinical sequencing data of the 54 FFPE tumor samples, we identified 21.6 mutations per sample, of which 3.6 (16.6%), including five unique pathogenic mutations, were filtered out. The validation study showed that the sensitivity and specificity of MicroSEC for artifacts were 97% (95% confidence interval (CI): 82%-100%) and 96% (95% CI: 88%-99%), respectively. Further, the prediction model reproduced the MicroSEC predictions with an area under the ROC curve of 0.995. The model classified 2,512 mutations (0.34%) in the Project GENIE data as artifacts. The artifact detection rate differed between institutions; 815 (1.45%) of the 56,100 mutations reported by UCSF were predicted to be artifacts.

Conclusions: MicroSEC removes only FFPE artifacts without eliminating true mutations found in FF samples. Our pipeline will increase the reliability of clinical sequencing and advance cancer research using FFPE samples.

Citation Format: Masachika Ikegami, Shinji Kohsaka, Takeshi Hirose, Toshihide Ueno, Satoshi Inoue, Naoki Kanomata, Hideko Yamauchi, Taisuke Mori, Shigeki Sekine, Yoshihiro Inamoto, Yasushi Yatabe, Hiroshi Kobayashi, Sakae Tanaka, Toru Akiyama, Tomotake Okuma, Hiroyuki Mano. MicroSEC: Sequence error filter for formalin-fixed and paraffin-embedded samples [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 2185.
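The positional bias described in the Background can be illustrated with a small Python sketch. This is not the MicroSEC implementation, whose details the abstract does not give. It only shows one way to measure how close a mutated base lies to the effective end of a read, taking the effective read to be the portion that remains after leading and trailing soft clips are removed. The function names, the representation of each read as a (CIGAR string, 0-based position in the read) pair, and the `max_distance` threshold of 10 bases are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# CIGAR operations as defined in the SAM format.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def effective_span(cigar: str) -> Tuple[int, int]:
    """Return (start, end) read indices (0-based, end exclusive) of the
    part of the read that is left after removing soft-clipped ends."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Hard-clipped bases are absent from the stored read sequence.
    ops = [(n, op) for n, op in ops if op != "H"]
    # Operations that consume bases of the read itself.
    read_len = sum(n for n, op in ops if op in "MIS=X")
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, read_len - trail


def distance_to_effective_end(cigar: str, read_pos: int) -> int:
    """Distance (in bases) from the mutated base to the nearer effective end."""
    start, end = effective_span(cigar)
    if not start <= read_pos < end:
        raise ValueError("mutated base lies in a soft-clipped region")
    return min(read_pos - start, end - 1 - read_pos)


def fraction_near_end(reads: List[Tuple[str, int]], max_distance: int = 10) -> float:
    """Fraction of supporting reads whose mutated base is within
    max_distance bases of an effective end (hypothetical threshold)."""
    if not reads:
        raise ValueError("no supporting reads")
    near = sum(
        1 for cigar, pos in reads
        if distance_to_effective_end(cigar, pos) <= max_distance
    )
    return near / len(reads)


# Hypothetical supporting reads for one candidate mutation.
supporting = [("20S80M", 22), ("70M30S", 68), ("100M", 50)]
print(fraction_near_end(supporting))
```

In this toy input, the two soft-clipped reads carry the mutated base within two bases of the clip boundary, while the fully aligned read carries it mid-read. A mutation supported mostly by reads of the first kind would show the bias towards effective ends that the abstract attributes to chimeric, ssDNA-derived reads.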