Sustained software development, not number of citations or journal choice, is indicative of accurate bioinformatic software

Paul P Gardner, Michael A Black, Stephanie McGimpsey, James M Paterson, Fatemeh Ashari-Ghomi, Alex Gavryushkin, Aleksandra Pawlik, Sinan U Umu

doi:10.1186/s13059-022-02625-x

Abstract

Background: Computational biology provides software tools for testing and making inferences about biological data. In the face of increasing volumes of data, heuristic methods that trade accuracy for speed may be employed. We have studied these trade-offs using the results of a large number of independent software benchmarks, and evaluated whether external factors, including speed, author reputation, journal impact, recency and developer efforts, are indicative of accurate software.

Results: We find that software speed, author reputation, journal impact, number of citations and age are unreliable predictors of software accuracy. This is unfortunate because these are frequently cited reasons for selecting software tools. However, GitHub-derived statistics and high version numbers show that accurate bioinformatic software tools are generally the product of many improvements over time. We also find an excess of slow and inaccurate bioinformatic software tools, and this is consistent across many sub-disciplines. There are few tools that are middle-of-the-road in terms of accuracy and speed trade-offs.

Conclusions: Our findings indicate that accurate bioinformatic software is primarily the product of long-term commitments to software development. In addition, we hypothesise that bioinformatics software suffers from publication bias. Software that is intermediate in terms of both speed and accuracy may be difficult to publish, possibly due to author, editor and reviewer practices. This leaves an unfortunate hole in the literature, as ideal tools may fall into this gap. High accuracy tools are not always useful if they are slow, while high speed is not useful if the results are also inaccurate.

Highlights

Computational biology provides software tools for testing and making inferences about biological data
We have collected relative accuracy and speed ranks for 498 distinct software tools
This software has been developed for solving a broad cross-section of computational biology tasks

Summary

Introduction

Computational biology provides software tools for testing and making inferences about biological data. Computational biology software is widely used and has produced some of the most cited publications in the entire scientific corpus [1,2,3]. These highly cited software tools include implementations of methods for sequence alignment and homology inference [4,5,6,7], phylogenetic analysis [8,9,10,11,12], biomolecular structure analysis [13,14,15,16,17], and visualisation and data collection [18,19]. As datasets become larger, even small error rates can have major impacts on the number of false inferences [25].
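The point about error rates and dataset size can be illustrated with simple arithmetic. The sketch below is not taken from the paper: the function name, the fixed per-test error rate of 0.1%, and the assumption that errors are independent are all illustrative choices. It shows that the expected number of false inferences grows linearly with the number of tests, even when the error rate stays constant.

```python
def expected_false_inferences(n_tests: int, error_rate: float) -> float:
    """Expected count of false inferences, assuming each test errs
    independently with the same probability (an illustrative assumption)."""
    return n_tests * error_rate


# Hypothetical per-test error rate of 0.1%, applied to growing datasets.
ERROR_RATE = 0.001

for n_tests in (1_000, 1_000_000, 1_000_000_000):
    expected = expected_false_inferences(n_tests, ERROR_RATE)
    print(f"{n_tests:>13,} tests -> about {expected:,.0f} expected false inferences")
```

Under these assumptions, an error rate that yields roughly one false inference per thousand tests yields roughly a million at a billion tests, which is why accuracy differences that look negligible in small benchmarks can matter at scale.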

Methods

Results

Conclusion