Abstract

During the past decade, hundreds of bioinformatics tools have been developed and released for detecting structural variants from next-generation sequencing and third-generation sequencing (e.g., PacBio, Nanopore) data. Each tool claims specific advantages over the others: some are reported to identify indels accurately, while others are better at detecting large and complex structural variants. For a data analyst, especially one with limited algorithmic expertise, this raises a practical question: how should one choose a variant caller that suits the data at hand? To facilitate data analysis, we developed an automatic recommendation system. Given a sequencing dataset (a BAM/SAM file), the system suggests the variant caller that is likely to perform best from a set of state-of-the-art bioinformatics tools. The system is built on a machine learning framework known as meta-learning, which enables it to identify associations between features of the data and the performance of the tools. The meta-learning method consists of two steps. First, it extracts metadata from the training dataset; this step comprises meta-feature collection and meta-target identification. When the training dataset is large enough, a comprehensive meta-database is obtained. Second, it constructs a recommendation model from the meta-database. Because each tool has only two recommendation statuses (recommended or not recommended), the recommendation model can be built with a classification approach; we adopted RandomForest in the system. When a new sequencing dataset is given to the system, it collects the meta-features of that dataset and offers a recommendation based mainly on the classification results for those meta-features. In addition, different scenarios may place different priorities on variant calling performance.
We therefore considered three meta-targets: F-measure, precision, and recall. When F-measure is set as the meta-target, the variant caller expected to achieve the highest F-measure is recommended; precision and recall are handled in the same way. We conducted a series of experiments to test the validity of the proposed recommendation system. We selected five popular variant callers for PacBio sequencing data. The training data were generated by a published simulator. The validation data included both simulated data and real data with benchmark structural variants. The experimental results indicated that different variant callers often suit different data, and that no single caller appears to perform well on all data. For the recommendation system, the F-measure and AUC reached 76.90% and 90.80%, respectively, in some tests, and the recommendation accuracy remained above 90% on average.

Citation Format: Shenjie Wang, Xiaoyan Zhu, Yuqian Liu, Xiao Xiao, Jiayin Wang. Too many bioinformatics tools? Machine learning enables automatic recommendation on variant callers for long-read sequencing data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr LB502.
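
As a rough illustration of the meta-target identification step described in the abstract, the sketch below labels each caller as recommended (1) or not recommended (0) for a single training sample, given per-caller precision and recall. The caller names, the scores, and the helpers `f_measure`, `meta_target_value`, and `label_callers` are hypothetical and are not taken from the study. The F-measure is assumed to be the usual harmonic mean of precision and recall, and marking only the single best caller as recommended is likewise an assumption about how the two recommendation statuses might be assigned.

```python
from typing import Dict, Tuple

# Hypothetical per-caller (precision, recall) on one training sample.
# Names and numbers are illustrative only, not results from the study.
Scores = Dict[str, Tuple[float, float]]


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


def meta_target_value(precision: float, recall: float, meta_target: str) -> float:
    """Pick the score that the chosen meta-target ranks callers by."""
    if meta_target == "f_measure":
        return f_measure(precision, recall)
    if meta_target == "precision":
        return precision
    if meta_target == "recall":
        return recall
    raise ValueError(f"unknown meta-target: {meta_target}")


def label_callers(scores: Scores, meta_target: str = "f_measure") -> Dict[str, int]:
    """Assumed labeling rule: 1 for the best caller on the meta-target, else 0."""
    values = {
        name: meta_target_value(p, r, meta_target)
        for name, (p, r) in scores.items()
    }
    best = max(values, key=values.get)
    return {name: int(name == best) for name in values}


example: Scores = {
    "caller_a": (0.90, 0.60),
    "caller_b": (0.80, 0.75),
    "caller_c": (0.70, 0.85),
}

for target in ("f_measure", "precision", "recall"):
    print(target, label_callers(example, target))
```

Under these assumptions, the same sample yields a different recommended caller for each meta-target. Labels of this kind, paired with the meta-features of each sample, would form the rows of a meta-database on which a classifier could be trained.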