Abstract
Background: The assembly of next-generation short-read sequencing data can result in a fragmented non-contiguous set of genomic sequences. Therefore, a common step in a genome project is to join neighbouring sequence regions together and fill gaps. This scaffolding step is non-trivial and requires manually editing large blocks of nucleotide sequence. Joining these sequences together also hides the source of each region in the final genome sequence. Taken together, these considerations may make reproducing or editing an existing genome scaffold difficult.
Methods: The software outlined here, “Scaffolder,” is implemented in the Ruby programming language and can be installed via the RubyGems software management system. Genome scaffolds are defined using YAML, a data format that is both human- and machine-readable. Command line binaries and extensive documentation are available.
Results: This software allows a genome build to be defined in terms of the constituent sequences using a relatively simple syntax. This syntax further allows unknown regions to be specified and additional sequence to be used to fill known gaps in the scaffold. Defining the genome construction in a file makes the scaffolding process reproducible and easier to edit compared with large FASTA nucleotide sequences.
Conclusions: Scaffolder is easy-to-use genome scaffolding software which promotes reproducibility and continuous development in a genome project. Scaffolder can be found at http://next.gs.
Highlights
The assembly of next-generation short-read sequencing data can result in a fragmented non-contiguous set of genomic sequences
Plain-text scaffold files written in YAML Ain’t Markup Language (YAML) [23] specify how these sequences should be joined
Each scaffold file represents one scaffolded nucleotide sequence and as such separate scaffolds should be defined in separate files
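As a rough illustration of the idea in these highlights, the sketch below shows how an ordered list of scaffold entries could be turned into a single nucleotide sequence. It is written in Python for brevity, whereas Scaffolder itself is a Ruby tool driven by YAML files. The key names (`sequence`, `unresolved`, `inserts`, `open`, `close`, `reverse`), the 1-based inclusive coordinates, and the function `build_scaffold` are assumptions made for this example and should not be read as Scaffolder's documented syntax or implementation.

```python
# Minimal sketch of building a scaffold from a declarative description.
# All key names and coordinate conventions are illustrative assumptions.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def build_scaffold(entries, sequences):
    """Join the entries, in order, into one scaffold sequence.

    entries   -- list of dicts, each with a 'sequence' or 'unresolved' key
    sequences -- dict mapping a source name to its nucleotide string
    """
    parts = []
    for entry in entries:
        if "unresolved" in entry:
            # An unknown region is represented as a run of N characters.
            parts.append("N" * entry["unresolved"]["length"])
            continue

        spec = entry["sequence"]
        seq = sequences[spec["source"]]

        # Apply inserts right to left so earlier coordinates stay valid.
        inserts = sorted(spec.get("inserts", []),
                         key=lambda ins: ins["open"], reverse=True)
        for ins in inserts:
            filler = sequences[ins["source"]]
            # Replace positions open..close (1-based, inclusive).
            seq = seq[:ins["open"] - 1] + filler + seq[ins["close"]:]

        if spec.get("reverse"):
            seq = reverse_complement(seq)
        parts.append(seq)
    return "".join(parts)


# Hypothetical input sequences, e.g. as read from FASTA files.
sequences = {
    "contig1": "ATGCNNNNGGA",
    "contig2": "TTTCC",
    "gap_fill": "ACGT",
}

# Hypothetical scaffold description, mirroring what a YAML file might hold.
entries = [
    {"sequence": {"source": "contig1",
                  "inserts": [{"source": "gap_fill", "open": 5, "close": 8}]}},
    {"unresolved": {"length": 3}},
    {"sequence": {"source": "contig2", "reverse": True}},
]

scaffold = build_scaffold(entries, sequences)
print(scaffold)  # ATGCACGTGGANNNGGAAA
```

Because the description keeps each source name alongside its position in the build, the origin of every region stays visible, and changing the scaffold means editing a few short entries rather than a large block of nucleotide sequence.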
Summary
The assembly of next-generation short-read sequencing data can result in a fragmented non-contiguous set of genomic sequences. A common step in a genome project is to join neighbouring sequence regions together and fill gaps. This scaffolding step is non-trivial and requires manually editing large blocks of nucleotide sequence. Joining these sequences together hides the source of each region in the final genome sequence. Assembly software takes the nucleotide reads produced by sequencing hardware and, in the ideal case, outputs a single complete genome sequence composed of these individual fragments. An analogy for this process is a jigsaw puzzle: each nucleotide read represents a single piece, and the final genome sequence is the completed puzzle. In practice, fragmentation may be due to insufficient or multiple different overlaps between reads, which is analogous to missing pieces in the jigsaw or pieces that fit to multiple other pieces.