Abstract

Many research projects involve determining the position of genomic coordinates, intervals, single nucleotide variations (SNVs), insertions, deletions and translocations relative to genes, as well as their potential impact on protein translation. Because next-generation sequencing has tremendously increased throughput, investigators routinely need to annotate very large datasets. We present Segtor, a tool to annotate large sets of genomic coordinates, intervals, SNVs, indels and translocations. Rather than storing genomic features in a database management system, our tool builds segment trees from the start and end coordinates of the features the user wishes to use. The software also produces annotation statistics that let users visualize how many coordinates were found within various portions of genes. Our system can currently be made to work with any species available on the UCSC Genome Browser. Segtor is a suitable tool for groups, especially those with limited access to programmers or with an interest in analyzing large numbers of individual genomes, who wish to determine the relative position of very large sets of mapped reads and subsequently annotate the mutations observed between the reads and the reference. Segtor (http://lbbc.inca.gov.br/segtor/) is an open-source tool that can be freely downloaded for non-profit use. We also provide a web interface for testing purposes.

Highlights

  • The advent of next-generation sequencing (NGS) technologies has enabled a drastic growth in the number of sequencing projects by largely increasing the sequence output and by lowering overall costs

  • Once the mapping is completed, investigators are often left with the daunting task of identifying the relative position of a large number of single nucleotide variations (SNVs), insertions, deletions and genomic translocations to existing genomic features

  • We introduce Segtor, a rapid annotation tool aimed at NGS studies for genomic coordinates, intervals, SNVs, insertions, deletions and translocations

Summary

Introduction

The advent of next-generation sequencing (NGS) technologies has enabled drastic growth in the number of sequencing projects by greatly increasing sequence output and lowering overall costs. Certain projects involve the sequencing of an organism whose genome is already available. These projects, called resequencing projects, generally involve two steps: the mapping of reads onto the known genome and the subsequent analysis of features that diverge between the reference genome and the mapped sequences [1]. Among the genomic features that investigators might be interested in are high-quality mRNAs, ESTs or gene predictions already mapped to the reference sequence. Another task might involve identifying which genomic coordinates or genomic intervals overlap or span known genomic features (see [5] for an example). Research groups are sometimes faced with the challenge of identifying the closest transcription start site (TSS) for various genomic coordinates (see [6] for an example). All these problems essentially amount to the task of finding which genes overlap a given genomic coordinate or interval.
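To make the underlying query concrete, the following is a minimal Python sketch of a segment tree that answers "which features contain this coordinate?" and "which features overlap this interval?". It is an illustration only and not Segtor's actual implementation: the class name `FeatureSegmentTree`, the method names `stab` and `overlaps`, the gene names and the coordinates are all hypothetical. It also assumes a single chromosome and 1-based, closed intervals.

```python
from bisect import bisect_right


class FeatureSegmentTree:
    """Illustrative segment tree over (name, start, end) features.

    Assumptions: one chromosome, 1-based closed coordinates.
    """

    def __init__(self, features):
        # Closed [start, end] is handled as half-open [start, end + 1).
        self.pts = sorted({p for _, s, e in features for p in (s, e + 1)})
        # Elementary slot i covers [pts[i], pts[i + 1]).
        self.m = max(len(self.pts) - 1, 1)
        self.nodes = [[] for _ in range(2 * self.m)]
        index = {p: i for i, p in enumerate(self.pts)}
        for name, s, e in features:
            # Store the name on the O(log n) nodes that exactly cover its slots.
            lo, hi = index[s] + self.m, index[e + 1] + self.m
            while lo < hi:
                if lo & 1:
                    self.nodes[lo].append(name)
                    lo += 1
                if hi & 1:
                    hi -= 1
                    self.nodes[hi].append(name)
                lo >>= 1
                hi >>= 1
        # Features sorted by start, used only for interval queries.
        order = sorted(features, key=lambda f: f[1])
        self.starts = [f[1] for f in order]
        self.start_names = [f[0] for f in order]

    def stab(self, pos):
        """Return names of all features containing coordinate pos."""
        i = bisect_right(self.pts, pos) - 1
        if i < 0 or i >= len(self.pts) - 1:
            return []
        hits, node = [], i + self.m
        while node >= 1:  # walk from the leaf up to the root
            hits.extend(self.nodes[node])
            node >>= 1
        return sorted(hits)

    def overlaps(self, qs, qe):
        """Return names of all features overlapping closed interval [qs, qe]."""
        # A feature overlaps if it contains qs or starts inside (qs, qe].
        lo = bisect_right(self.starts, qs)
        hi = bisect_right(self.starts, qe)
        return sorted(set(self.stab(qs)) | set(self.start_names[lo:hi]))


# Hypothetical genes on a single chromosome.
genes = [("GENE_A", 100, 200), ("GENE_B", 150, 300), ("GENE_C", 400, 500)]
tree = FeatureSegmentTree(genes)
print(tree.stab(175))           # ['GENE_A', 'GENE_B']
print(tree.overlaps(180, 420))  # ['GENE_A', 'GENE_B', 'GENE_C']
```

Each point query touches one root-to-leaf path, so it costs time logarithmic in the number of features plus the number of hits. A real annotation tool would additionally need one tree per chromosome and strand-aware handling, which this sketch leaves out.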

Methods
Results
Discussion
Conclusion
