Abstract

Computational tools for genomic analysis are becoming more accurate, but also more sophisticated and complex. As a result, these programs expose a large number of tunable parameters, and the values chosen often strongly influence the reported results. We quantify the impact of parameter choice on transcript assembly and take first steps toward a truly automated genomic analysis pipeline by developing a method that automatically chooses input-specific parameter values for reference-based transcript assembly with the Scallop tool. Choosing parameter values for each input increases the area under the receiver operating characteristic curve (AUC), measured by comparing assembled transcripts to a reference transcriptome, by an average of 28.9% over the default parameter choices on 1595 RNA-Seq samples from the Sequence Read Archive. The approach is general: applied to StringTie, it increases the AUC by an average of 13.1% on a set of 65 RNA-Seq experiments from ENCODE. Parameter advisors for both Scallop and StringTie are available on GitHub.

Highlights

  • As the field of computational biology has matured, the amount of data that needs to be processed has increased significantly, and users without computational expertise rely more and more on the highly complicated programs that perform the analyses.

  • Our results show that sample-specific parameter vectors are important for developing any genomic pipeline that includes transcriptome assembly as a step.

  • We begin to answer the question of how to produce transcriptome assemblies effectively for any input without sacrificing quality or expanding manpower. We do so by combining parameter tuning, through exploration using coordinate ascent, with the established method of parameter advising.
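To make the coordinate ascent step mentioned above concrete, here is a minimal sketch of a greedy search over a discrete grid of parameter values. It is not the authors' implementation: the function `coordinate_ascent`, the objective `toy_score`, and the parameter names `p1` and `p2` are all hypothetical. In practice the objective would be something like the AUC of an assembly produced with the trial parameter vector.

```python
from typing import Callable, Dict, Sequence

Params = Dict[str, float]


def coordinate_ascent(
    score: Callable[[Params], float],
    start: Params,
    candidates: Dict[str, Sequence[float]],
    max_rounds: int = 10,
) -> Params:
    """Greedy coordinate ascent: vary one parameter at a time, keep any improvement."""
    best = dict(start)
    best_score = score(best)
    for _ in range(max_rounds):
        improved = False
        for name, values in candidates.items():
            for value in values:
                trial = dict(best)
                trial[name] = value  # change a single coordinate
                trial_score = score(trial)
                if trial_score > best_score:
                    best, best_score = trial, trial_score
                    improved = True
        if not improved:
            break  # no single-coordinate change helps: local optimum on the grid
    return best


# Hypothetical stand-in objective with its maximum at p1 = 3, p2 = 200.
# A real objective would run the assembler and score its output.
def toy_score(p: Params) -> float:
    return -((p["p1"] - 3.0) ** 2) - (p["p2"] - 200.0) ** 2


tuned = coordinate_ascent(
    toy_score,
    start={"p1": 1.0, "p2": 100.0},
    candidates={"p1": [1.0, 2.0, 3.0, 4.0], "p2": [100.0, 200.0, 300.0]},
)
print(tuned)
```

Because each evaluation of a real objective means a full assembler run, a search like this is typically done offline on training samples rather than for every new input.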

Summary

INTRODUCTION

As the field of computational biology has matured, the amount of data that needs to be processed has increased significantly, and users without computational expertise rely more and more on the highly complicated programs that perform the analyses. Tuning the parameter choices to increase accuracy for one input does not imply that the results will improve for all inputs. This means that, for optimum performance, tuning must be repeated for each new piece of data. In high-throughput genomic analysis, this manual procedure is infeasible. For these applications, without some automatic parameter choice system, the defaults must be used. To address the automated parameter choice problem for multiple sequence alignment (MSA), DeBlasio and Kececioglu (2017b) defined a framework that automatically selects the parameter values for an input. This process, called "parameter advising," has been shown to greatly increase the accuracy of MSA without sacrificing wall-clock running time in most cases, and it can readily be applied to new domains. We use the same measure for selecting parameter choices for a given input.
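The general shape of parameter advising can be sketched as follows: run the tool once for each parameter vector in a fixed advisor set, estimate the accuracy of each result, and keep the best. This is an illustrative sketch under stated assumptions, not the published implementation. The names `advise`, `run_tool`, `estimate_accuracy`, and the parameter name `p1` are hypothetical, and the toy tool and estimator stand in for an assembler run and an accuracy estimate.

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

Params = Dict[str, float]
Result = TypeVar("Result")


def advise(
    run_tool: Callable[[Params], Result],
    estimate_accuracy: Callable[[Result], float],
    advisor_set: Iterable[Params],
) -> Tuple[Params, Result, float]:
    """Return the parameter vector whose result has the highest estimated accuracy."""
    best = None
    for params in advisor_set:
        result = run_tool(params)  # one tool run per candidate vector
        estimate = estimate_accuracy(result)
        if best is None or estimate > best[2]:
            best = (params, result, estimate)
    if best is None:
        raise ValueError("advisor set is empty")
    return best


# Hypothetical stand-ins: the "tool" doubles p1, and the "estimator"
# prefers outputs close to 6. Neither models a real assembler.
def toy_tool(params: Params) -> float:
    return params["p1"] * 2.0


def toy_estimator(result: float) -> float:
    return -abs(result - 6.0)


chosen_params, chosen_result, chosen_estimate = advise(
    toy_tool,
    toy_estimator,
    advisor_set=[{"p1": 1.0}, {"p1": 3.0}, {"p1": 5.0}],
)
print(chosen_params)
```

The runs for different parameter vectors are independent of one another, so they can be executed in parallel, which is one way wall-clock time can stay close to that of a single run.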

Contributions
DEVELOPING A PARAMETER ADVISOR FOR TRANSCRIPT ASSEMBLY
Advisor estimator
Finding an advisor set using coordinate ascent
Assessing the generality of learned parameter vectors
Advising for StringTie
Justification for a reference-based advising metric
CONCLUSIONS
FUNDING INFORMATION