Abstract

It remains difficult to diagnose rare diseases by genome or exome sequencing (GS or ES), with the diagnostic yield of these tests estimated to range from 25% to 40%. Initially unsolved cases can become more solvable over time as knowledgebase annotations become more complete or as an individual patient’s phenotypic profile develops. However, it is not feasible to allocate human effort to re-analyzing the growing backlog of unsolved cases at regular intervals. The goal of this work is to identify the cases most likely to be solvable, based on a combination of demographic information and quantitative markers of the likely causality of the genetic variation present in each patient. In a cohort of 611 patients who received ES, 187 (30.6%) were diagnosed and 79 had no medically meaningful findings. These groups were designated “solvable” and “not solvable”, respectively. The remaining 345 cases were designated “potentially solvable”. For each patient, 28 features were recorded: sex, age, time since previous analysis, and the likelihood ratio (LR) scores of the top 25 variants as calculated by a modified version of LIRICAL, a likelihood ratio-driven algorithm that computes the post-test probability that each variant found in a patient’s Variant Call Format (VCF) file is causative of their phenotype. With an 80:20 train-test split of the 266 cases marked as either “solvable” or “not solvable”, we evaluated logistic regression and random forest models in a parameter grid search with 5-fold cross-validation, decomposing the LR scores into their principal components and using SMOTE to correct for class imbalance. The grid search selected a random forest model with 100 trees, a maximum depth of 10, and no minimum impurity requirement for splits. The selected model classified the test set with recall = 0.868 and precision = 0.767.
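The post-test probability mentioned above follows from the standard odds form of Bayes' rule: post-test odds equal pre-test odds multiplied by the likelihood ratio. The sketch below illustrates only that generic relationship. It is not LIRICAL's implementation, and the function name, the pretest probability, and the LR values are hypothetical.

```python
def post_test_probability(pretest_prob, likelihood_ratios):
    """Combine a pretest probability with one or more likelihood ratios.

    Generic Bayes-odds update (an illustration, not LIRICAL's code):
    post-test odds = pretest odds * product of the likelihood ratios.
    """
    if not 0.0 < pretest_prob < 1.0:
        raise ValueError("pretest_prob must lie strictly between 0 and 1")
    if any(lr < 0 for lr in likelihood_ratios):
        raise ValueError("likelihood ratios must be non-negative")

    # Convert the probability to odds.
    odds = pretest_prob / (1.0 - pretest_prob)

    # Each likelihood ratio scales the odds multiplicatively.
    for lr in likelihood_ratios:
        odds *= lr

    # Convert the odds back to a probability.
    return odds / (1.0 + odds)


# Hypothetical values: a 1% pretest probability and two supporting LRs.
example = post_test_probability(0.01, [20.0, 5.0])
```

An LR above 1 raises the probability, an LR below 1 lowers it, and an LR of exactly 1 leaves it unchanged.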
Of the 345 potentially solvable cases, the model labeled 234 as “solvable”; based on the model’s precision, we estimate that 179 of these cases are likely to be solvable, with the remainder being false positives. Based on the model’s false negative rate, we estimate that the model misses 23 solvable cases, implying a total of 202 solvable cases among the 345 potentially solvable cases. Thus, the model recovers 179/202 (88.6%) of the remaining solvable cases while requiring inspection of only 234 (67.8%) of the 345 potentially solvable cases. By contrast, we would expect 234 randomly chosen potentially solvable cases to include approximately 118/202 (58.4%) of the remaining solvable cases. A machine learning approach therefore has the potential to prioritize re-analysis effort and obtain a higher diagnostic yield with less human effort. This approach does not depart altogether from the default approach of re-analyzing cases in order of time since previous analysis, as time since previous analysis was a highly weighted feature in many of the parameter configurations searched. Applying a framework that includes both biological data and case-level demographics at regular intervals should give every case a chance to eventually be considered for re-analysis.
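The yield estimates above can be retraced with a few lines of arithmetic. This is a minimal sketch, not the authors' code: the function and key names are hypothetical, and the count of 23 missed cases is taken from the text as stated rather than re-derived.

```python
def triage_estimates(n_pool, n_flagged, precision, n_missed):
    """Retrace the triage arithmetic from the reported counts.

    n_pool:    potentially solvable cases (345 in the text)
    n_flagged: cases the model labeled "solvable" (234)
    precision: test-set precision of the selected model (0.767)
    n_missed:  solvable cases the model is estimated to miss (23, as stated)
    """
    # Flagged cases expected to be truly solvable: flagged * precision.
    est_true_positives = round(n_flagged * precision)

    # Estimated solvable cases in the pool: recovered plus missed.
    est_total_solvable = est_true_positives + n_missed

    return {
        "est_true_positives": est_true_positives,
        "est_total_solvable": est_total_solvable,
        "fraction_recovered": est_true_positives / est_total_solvable,
        "fraction_inspected": n_flagged / n_pool,
    }


estimates = triage_estimates(n_pool=345, n_flagged=234,
                             precision=0.767, n_missed=23)
```

With the reported inputs, `est_true_positives` is 179 and `est_total_solvable` is 202, which gives the 88.6% recovery and 67.8% inspection figures quoted in the text.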
