Abstract

Background: High throughput technologies often require the retrieval of large data sets of sequences. Retrieval of EMBL or GenBank entries using keywords is easy using tools such as ACNUC, Entrez or SRS, but has some limitations, in particular when querying with complex keywords.

Results: We show that Entrez has severe limitations with respect to retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and has problems with complex queries. ACNUC works well, but does not allow precise queries in the feature qualifiers. We developed specific Perl scripts to precisely retrieve subsequences as defined by complex descriptors in the feature qualifiers of the EMBL entries. We improved parts of the BioPerl library to allow parsing of large data files, and we embedded these scripts in a user-friendly, OS-independent interface for easy use.

Conclusion: Although not as fast as the public tools that use prebuilt indexes, parsing the complete entries using a script is often necessary in order to retrieve the exact data searched for. Embedding the scripts in a user-friendly interface allows biologists to use them, and bioinformaticians can easily modify them, if necessary, for unforeseen needs.

Highlights

  • High throughput technologies often require the retrieval of large data sets of sequences

  • The entire list of keywords can be retrieved, parsed and painfully analysed to build a complete list of keywords. This task is more difficult with queries composed of complex keywords containing several words or numbers. Popular tools such as ACNUC [4], Entrez [5] or SRS [6] have been designed for the purpose of querying with keywords, but we show in this paper that they should be used with care and that they still have flaws when precisely retrieving sequences according to a complex keyword

  • We became aware of a problem when we were not able to retrieve some sequences with SRS, yet we knew they were in the EMBL database


Summary

Results

We became aware of a problem when we were unable to retrieve some sequences with SRS, even though we knew they were in the EMBL database.

4/ Extraction using a dedicated script. We used EmblEx to parse the EMBL fun.dat file, which contains all of the fungal sequences (release 84), for the presence of "internal transcribed spacer 1" in the following features: misc_feature, misc_rna, gene, rrna, intron or source, in that order. This query took seven minutes and returned 33,696 entries.

3/ ACNUC had no memory problems of any sort and was fast as long as scanning the FT lines was not required. It retrieved entries not found by the Perl script; these corresponded to keywords in other features (such as "snorna" and "precursor_rna"), since we did not restrict the search to specific features for the reasons mentioned above.

4/ If the data obtained are important, it is safer to query different servers and compare the results obtained.
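The feature-restricted search described above can be illustrated with a minimal Python sketch. This is not the authors' EmblEx script, which is written in Perl and is not shown in this summary. The function names (`iter_entries`, `parse_features`, `find_hit`, `search`) and the `Hit` record are hypothetical, and the parser assumes the usual EMBL flat-file layout: feature keys start in column 6 of `FT` lines, qualifiers start with `/`, and `//` ends an entry. It reads one entry at a time, checks the feature keys in the priority order quoted in the text, and reports the first feature whose qualifiers contain the whole phrase.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Optional, Pattern, Tuple

# Feature keys in the priority order quoted in the text above.
FEATURE_PRIORITY = ["misc_feature", "misc_rna", "gene", "rrna", "intron", "source"]


class Hit(NamedTuple):
    entry_id: str
    feature_key: str
    location: str


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the lines of one entry at a time; '//' terminates an entry."""
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:
        yield entry


def parse_features(entry: List[str]) -> List[Tuple[str, str, str]]:
    """Return (key, location, qualifier text) for every feature in the FT lines."""
    features: List[List[str]] = []
    for line in entry:
        if not line.startswith("FT"):
            continue
        body = line[5:]
        if body[:1].strip():  # assumption: a feature key starts in column 6
            parts = body.split(None, 1)
            location = parts[1].strip() if len(parts) > 1 else ""
            features.append([parts[0], location, ""])
        elif features:
            text = body.strip()
            if text.startswith("/") or features[-1][2]:
                features[-1][2] += " " + text  # qualifier, or its continuation
            else:
                features[-1][1] += text  # location wrapped onto another line
    return [(key, loc, quals) for key, loc, quals in features]


def compile_phrase(phrase: str) -> Pattern[str]:
    """Match the whole phrase, so 'spacer 1' does not also match 'spacer 12'."""
    words = r"\s+".join(re.escape(word) for word in phrase.split())
    return re.compile(r"(?<!\w)" + words + r"(?!\w)", re.IGNORECASE)


def find_hit(entry: List[str], pattern: Pattern[str]) -> Optional[Hit]:
    """Return the first matching feature, trying keys in priority order."""
    entry_id = "unknown"
    for line in entry:
        fields = line.split()
        if line.startswith("ID") and len(fields) > 1:
            entry_id = fields[1].rstrip(";")
            break
    features = parse_features(entry)
    for wanted in FEATURE_PRIORITY:
        for key, location, qualifiers in features:
            if key.lower() == wanted and pattern.search(qualifiers):
                return Hit(entry_id, key, location)
    return None


def search(lines: Iterable[str], phrase: str) -> Iterator[Hit]:
    """Stream over entries and yield one Hit per matching entry."""
    pattern = compile_phrase(phrase)
    for entry in iter_entries(lines):
        hit = find_hit(entry, pattern)
        if hit is not None:
            yield hit
```

Because `search` accepts any iterable of lines, an open file handle can be passed directly, so a large file such as fun.dat would be read one entry at a time rather than loaded into memory. Scanning every entry in this way is slower than querying a prebuilt index, which is the trade-off the abstract describes.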

