EP320: AI-based method to estimate the probability of a variant being an artifact

Oscar Bastidas, Juan Carlos Marco, Luz Dorta, Alfredo García Martínez

doi:10.1016/j.gim.2022.01.355

Abstract

Next-generation sequencing (NGS) is progressively becoming the leading DNA and RNA sequencing technology. It allows massively parallel sequencing of DNA/RNA in a relatively short time and at reduced cost compared with Sanger sequencing, producing large quantities of short sequence reads. However, NGS technologies introduce a certain number of artifacts that are not always removed by the bioinformatics pipeline during secondary data analysis. These artifacts can complicate the analysis of results and, if they remain undetected, can be a source of false positives at the variant interpretation stage. We developed an automated method based on artificial intelligence to detect these artifacts so that they can be removed early in the variant interpretation pipeline, avoiding false positives and saving genetic analysts valuable time. A total of 200 variants were extracted from a set of 11 samples from patients at La Fe Hospital in Spain. These variants were a combination of real sequence variants and artifacts that had previously been classified by two expert geneticists. Of these, 80 were classified as real sequence variants with agreement between the two geneticists, 78 were classified as artifacts, and 42 were either declared unknown by the geneticists or did not reach a consensus classification. From the selected variants and artifacts, a training set of 158 variants was created to feed an artificial intelligence system. For each variant, the inputs to the system were: 1) sequencing depth; 2) total number of reference reads; 3) total number of alternative reads; 4) total forward reference reads; 5) total reverse reference reads; 6) total forward alternative reads; 7) total reverse alternative reads; 8) whether the variant lies in a repetitive section of the sequence. The system was written in Python using the NumPy and samtools libraries.
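
The eight inputs listed above can be pictured as one fixed-length feature vector per variant. The following is a minimal sketch of such a container, not the authors' implementation: the class name VariantCounts, its field names, and the validate check are all hypothetical, and the check assumes that the per-strand counts add up to the per-allele totals. In practice the counts would be read from the .VCF and .BAM files; here they are passed in by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VariantCounts:
    """Read-level evidence for one variant (all names are illustrative)."""

    depth: int        # 1) sequencing depth at the site
    ref_reads: int    # 2) total reference reads
    alt_reads: int    # 3) total alternative reads
    ref_forward: int  # 4) forward-strand reference reads
    ref_reverse: int  # 5) reverse-strand reference reads
    alt_forward: int  # 6) forward-strand alternative reads
    alt_reverse: int  # 7) reverse-strand alternative reads
    in_repeat: bool   # 8) variant lies in a repetitive section

    def validate(self) -> None:
        # Assumption: strand-specific counts sum to the per-allele totals.
        if self.ref_forward + self.ref_reverse != self.ref_reads:
            raise ValueError("reference strand counts do not sum to ref_reads")
        if self.alt_forward + self.alt_reverse != self.alt_reads:
            raise ValueError("alternative strand counts do not sum to alt_reads")

    def to_features(self) -> List[float]:
        """Return the eight inputs as floats, in the order listed in the text."""
        self.validate()
        return [
            float(self.depth),
            float(self.ref_reads),
            float(self.alt_reads),
            float(self.ref_forward),
            float(self.ref_reverse),
            float(self.alt_forward),
            float(self.alt_reverse),
            1.0 if self.in_repeat else 0.0,
        ]


# Made-up example: most alternative reads come from one strand.
example = VariantCounts(
    depth=100, ref_reads=60, alt_reads=40,
    ref_forward=30, ref_reverse=30,
    alt_forward=38, alt_reverse=2,
    in_repeat=True,
)
features = example.to_features()
```

Encoding the repeat flag as 0 or 1 keeps every input numeric, which is what a classifier such as logistic regression expects.
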
To extract the required parameters, the system needed access to both the .VCF and .BAM files. After an initial training period with several artificial intelligence engines, a logistic regression classifier was selected as the best-performing candidate. The proposed method agreed with the human classification in 97.7% of cases in a sample of 200 variants classified in the laboratory. Precision reached 94.11% with a recall of 100%. We present a novel method for artifact detection in NGS tertiary analysis. The proposed methodology could be considered a complement to the existing ACMG guidelines for establishing the validity of variant interpretation results. The method can reduce the time and the number of steps needed to validate variants before they are included in clinical reports. It could be improved by adding new input parameters, such as nucleotide quality scores, as well as cohort analysis of samples from the same sequencer batch.
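
To make the classifier and the reported metrics concrete, the sketch below fits a small logistic regression by plain gradient descent and then computes precision, recall, and concordance. It is an illustration under stated assumptions, not the authors' system: the function names are hypothetical, the data are synthetic and hand-made, the two inputs (alternative-allele fraction and strand imbalance) are simplified stand-ins for the eight raw inputs, and "artifact" is assumed to be the positive class.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def artifact_probability(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated probability that a variant with features x is an artifact."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = artifact_probability(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def concordance(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of variants where the prediction matches the human label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Synthetic, hand-made examples (NOT data from the study).
# Columns: [alt-allele fraction, strand imbalance of alt reads], both in [0, 1].
X_toy = [[0.05, 0.95], [0.08, 0.90], [0.10, 1.00],
         [0.45, 0.10], [0.50, 0.05], [0.52, 0.00]]
y_toy = [1, 1, 1, 0, 0, 0]  # 1 = artifact, 0 = real variant (assumed encoding)

w, b = fit_logistic(X_toy, y_toy)
pred = [1 if artifact_probability(w, b, x) >= 0.5 else 0 for x in X_toy]
precision, recall = precision_recall(y_toy, pred)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"concordance={concordance(y_toy, pred):.3f}")
```

Because the model outputs a probability rather than a hard label, the 0.5 cut-off used here is only one possible choice; a laboratory could pick a threshold that favors recall so that fewer artifacts slip through.
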
