Abstract
BackgroundHigh throughput technologies often require the retrieval of large data sets of sequences. Retrieval of EMBL or GenBank entries using keywords is easy using tools such as ACNUC, Entrez or SRS, but has some limitations, in particular when querying with complex keywords.ResultsWe show that Entrez has severe limitations with respect to retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and has problems with complex queries. ACNUC works well, but does not allow precise queries in the Feature qualifiers. We developed specific Perl scripts to precisely retrieve subsequences as defined by complex descriptors in the Features qualifiers of the EMBL entries. We improved parts of the bioPerl library to allow parsing of large data files, and we embedded these scripts in a user friendly interface (OS independent) for easy use.ConclusionAlthough not as fast as the public tools that use prebuilt indexes, parsing the complete entries using a script is often necessary in order to retrieve the exact data searched for. Embedding in a user friendly interface allows biologists to use the scripts, which can easily be modified, if necessary, by bioinformaticians for unforeseen needs.
Highlights
High throughput technologies often require the retrieval of large data sets of sequences
The entire list of keywords can be retrieved, parsed and painfully analysed to build a complete list of keywords. This task is more difficult with queries composed of complex keywords containing several words or numbers. Popular tools such as ACNUC [4], Entrez [5] or SRS [6] have been designed for the purpose of querying with keywords, but we show in this paper that they should be used with care and caution and that they still have flaws for precisely retrieving sequences according to a complex keyword
We became aware of a problem when we were not able to retrieve some sequences with SRS, yet we knew they were in the EMBL database
Summary
We became aware of a problem when we were not able to retrieve some sequences with SRS, yet we knew they were in the EMBL database. 4/Extraction using a dedicated script We used EmblEx to parse the EMBL fun.dat file, which contains all of the fungi sequences (release 84), for the presence of "internal transcribed spacer 1" in the following features: misc_feature, misc_rna, gene, rrna, intron or source, in that order This query took seven minutes and returned 33,696 entries. 3/ACNUC had no problem of memory of any sort and was fast as long as scanning the FT lines was not required It retrieved entries not found by the Perl script; they corresponded to keywords for other features (such as "snorna" and "precursor_rna"), since we did not use a search in specific features for reasons mentionned above. 4/If the data obtained are of importance, it is safer to query different servers and compare the results obtained
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.