ORFer – retrieval of protein sequences and open reading frames from GenBank and storage into relational databases or text files

Konrad Büssow, Steve Hoffmann, Volker Sievert

doi:10.1186/1471-2105-3-40

Open Access

Abstract

Background

Functional genomics involves parallel experimentation with large sets of proteins. This requires management of large sets of open reading frames as a prerequisite for the cloning and recombinant expression of these proteins.

Results

A Java program was developed for retrieval of protein and nucleic acid sequences and annotations from NCBI GenBank, using the XML sequence format. Annotations retrieved by ORFer include sequence name, organism and the completeness of the sequence. The program has a graphical user interface, although it can also be used in a non-interactive mode. For protein sequences, the program also extracts the open reading frame sequence, if available, and checks that it translates correctly. ORFer accepts user input in the form of single GenBank GI identifiers or accession numbers, or lists of them. It can be used to extract complete sets of open reading frames and protein sequences from any kind of GenBank sequence entry, including complete genomes or chromosomes. Sequences are either stored with their features in a relational database or exported as text files in Fasta or tab-delimited format. The ORFer program is freely available at http://www.proteinstrukturfabrik.de/orfer.

Conclusion

The ORFer program allows for fast retrieval of DNA sequences, protein sequences and their open reading frames and sequence annotations from GenBank. Furthermore, storage of sequences and features in a relational database is supported. Such a database can supplement a laboratory information management system (LIMS) with appropriate sequence information.
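The translation check and Fasta export described in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity, whereas ORFer itself is a Java program, and it is not taken from the ORFer source code. The function names (`translate`, `orf_matches_protein`, `to_fasta`) are hypothetical, and the use of the standard genetic code is an assumption, since the text does not state which translation table ORFer applies.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
# Assumption: the text does not say which table ORFer uses.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str) -> str:
    """Translate a DNA open reading frame, codon by codon, into protein.

    Codons containing ambiguous bases are translated as 'X'.
    """
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    return "".join(
        CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf), 3)
    )


def orf_matches_protein(orf: str, protein: str) -> bool:
    """Return True if the ORF translates to the given protein sequence.

    A trailing stop codon in the ORF is ignored for the comparison.
    """
    return translate(orf).rstrip("*") == protein.upper()


def to_fasta(identifier: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a Fasta record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    orf = "ATGGCCAAATAA"  # made-up example sequence, not a GenBank entry
    print(orf_matches_protein(orf, "MAK"))
    print(to_fasta("example_orf", orf), end="")
```

In a pipeline of this kind, an ORF that fails such a check would typically be flagged rather than stored silently; how ORFer handles that case is not described in this section.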

Introduction

Functional genomics involves parallel experimentation with large sets of proteins. This requires management of large sets of open reading frames as a prerequisite for the cloning and recombinant expression of these proteins. The cloning and expression of large sets of open reading frames and proteins requires the management and analysis of significant amounts of data. In the Protein Structure Factory (http://www.proteinstrukturfabrik.de), a collaborative Structural Genomics project, a relational database system is used to store sequence information and experimental data on proteins chosen as targets for structure determination. These targets consist of human protein sequence entries of the GenBank protein database. The ORFer program has been developed to accomplish the input of large sets of protein sequence entries of the GenBank database (http://www.ncbi.nlm.nih.gov), together with the corresponding coding DNA sequences, into our relational

Methods

Results

Conclusion