Abstract

Background: There is considerable ongoing effort towards making DNA sequencing machines faster and more affordable. Improving the accuracy of next-generation sequencers directly lowers sequencing costs by reducing the need for resequencing, making genome-based diagnostics and research more affordable [1]. In this paper, we show that the accuracy of next-generation sequencing machines can be significantly improved using supervised learning, specifically multi-class support vector machines. We demonstrate our methods on the SOLiD 5500/5500 XL platform. Base-calling is the process of determining the order of nucleotides in the read sequence. In SOLiD, base-calling involves color calling, since the SOLiD platform uses an encoding system in which each adjacent pair of nucleotides is represented by one of four colored dyes [2]. Base-callers have been developed for other next-generation sequencing platforms, in particular Illumina and Roche 454 [1]. Most of them are based on explicit statistical models, and some are based on supervised learning with support vector machines [3,4]. Ours, however, is the first supervised learning method applied at large scale directly to color space, and the first applied at large scale to SOLiD. Moreover, we show that our methods require less training data, and hence train much faster, than previous methods.

Highlights

  • There is considerable ongoing effort towards making DNA sequencing machines faster and more affordable today

  • We show that our methods require less training data and our training times are much faster than previous methods

  • ... signal, a problem known as phasing

* Correspondence: shruthi@ices.utexas.edu
1 Department of Computer Science, University of Texas, Austin, Texas, USA
Full list of author information is available at the end of the article


Summary

Introduction

There is considerable ongoing effort towards making DNA sequencing machines faster and more affordable today. Base-calling is the process of determining the order of nucleotides in the read sequence. In SOLiD, base-calling involves the process of color calling, since the SOLiD platform uses an encoding system where each adjacent pair of nucleotides is represented by one of four colored dyes [2].
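The two-base color encoding mentioned above can be illustrated with a short sketch. It assumes the commonly described SOLiD scheme in which each nucleotide is given a 2-bit code (A=0, C=1, G=2, T=3) and the color index of an adjacent pair is the XOR of the two codes. The function names `encode_colors` and `decode_colors` are hypothetical, and the sketch illustrates only the encoding itself, not the color-calling method of this paper.

```python
# Assumed 2-bit codes for the four nucleotides (illustrative convention).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(bases):
    """Return one color index (0-3) per adjacent pair of nucleotides."""
    codes = [BASE_TO_BITS[b] for b in bases]
    # XOR of neighbouring codes: identical bases give 0, and so on.
    return [left ^ right for left, right in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Rebuild the nucleotide sequence from a known first base and its colors."""
    code = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # each color tells us how to step to the next base
        bases.append(BITS_TO_BASE[code])
    return "".join(bases)


read = "ATGGCA"
colors = encode_colors(read)
print(colors)                          # [3, 1, 0, 3, 1]
print(decode_colors(read[0], colors))  # ATGGCA
```

Because every decoded base depends on all preceding colors, a single wrong color call changes every base after it when the read is translated back, which is one reason color-calling accuracy matters on this platform.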

Results
Conclusion
