Querying the public databases for sequences using complex keywords contained in the feature lines

Olivier Croce, Richard Christen, Michaël Lamarre

doi:10.1186/1471-2105-7-45


Open Access


Abstract

Background: High throughput technologies often require the retrieval of large data sets of sequences. Retrieval of EMBL or GenBank entries using keywords is easy using tools such as ACNUC, Entrez or SRS, but has some limitations, in particular when querying with complex keywords.

Results: We show that Entrez has severe limitations with respect to retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and has problems with complex queries. ACNUC works well, but does not allow precise queries in the Feature qualifiers. We developed specific Perl scripts to precisely retrieve subsequences as defined by complex descriptors in the Feature qualifiers of the EMBL entries. We improved parts of the BioPerl library to allow parsing of large data files, and we embedded these scripts in a user-friendly, OS-independent interface for easy use.

Conclusion: Although not as fast as the public tools that use prebuilt indexes, parsing the complete entries using a script is often necessary in order to retrieve the exact data searched for. Embedding in a user-friendly interface allows biologists to use the scripts, which can easily be modified, if necessary, by bioinformaticians for unforeseen needs.

Highlights

High throughput technologies often require the retrieval of large data sets of sequences.
The entire list of keywords can be retrieved, parsed and painfully analysed to build a complete list of keywords. This task is more difficult with queries composed of complex keywords containing several words or numbers. Popular tools such as ACNUC [4], Entrez [5] or SRS [6] have been designed for querying with keywords, but we show in this paper that they should be used with caution and that they still have flaws when precisely retrieving sequences according to a complex keyword.
We became aware of a problem when we were not able to retrieve some sequences with SRS, yet we knew they were in the EMBL database.

Summary

Results

We became aware of a problem when we were not able to retrieve some sequences with SRS, yet we knew they were in the EMBL database.

4/ Extraction using a dedicated script. We used EmblEx to parse the EMBL fun.dat file, which contains all of the fungi sequences (release 84), for the presence of "internal transcribed spacer 1" in the following features: misc_feature, misc_rna, gene, rrna, intron or source, in that order. This query took seven minutes and returned 33,696 entries.

3/ ACNUC had no memory problems of any kind and was fast as long as scanning the FT lines was not required. It retrieved entries not found by the Perl script; they corresponded to keywords for other features (such as "snorna" and "precursor_rna"), since we did not restrict the search to specific features for reasons mentioned above.

4/ If the data obtained are of importance, it is safer to query different servers and compare the results obtained.
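The kind of feature-table scan described for the dedicated script can be sketched as follows. This is a minimal Python illustration, not the authors' EmblEx Perl code: the function names (`read_entries`, `subsequence`, `find_hits`) are hypothetical, the parser assumes the standard EMBL flat-file line codes (AC, FT, SQ, `//`), and only simple `start..end` locations are turned into subsequences. It reports every matching feature rather than applying any priority between feature keys.

```python
import re
from typing import Iterable, Iterator, List, Optional, Tuple

Feature = Tuple[str, str, str]  # (feature key, location, qualifier text)


def read_entries(lines: Iterable[str]) -> Iterator[Tuple[str, List[Feature], str]]:
    """Yield (accession, features, sequence) for each EMBL entry ending in '//'."""
    accession, features, seq_parts, in_seq = "", [], [], False
    current: Optional[List[str]] = None  # feature being assembled
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):  # end of entry
            if current:
                features.append((current[0], current[1], current[2]))
            yield accession, features, "".join(seq_parts)
            accession, features, seq_parts, in_seq = "", [], [], False
            current = None
        elif in_seq:
            # Sequence lines hold bases plus spaces and a position counter.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("AC") and not accession:
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("FT"):
            if line[5:6].strip():  # a feature key starts in column 6
                if current:
                    features.append((current[0], current[1], current[2]))
                parts = line[5:].split(None, 1)
                current = [parts[0], parts[1].strip() if len(parts) > 1 else "", ""]
            elif current is not None:
                rest = line[2:].strip()
                if rest.startswith("/") or current[2]:
                    current[2] += " " + rest  # qualifier, possibly wrapped
                else:
                    current[1] += rest  # wrapped location
        elif line.startswith("SQ"):
            in_seq = True


def subsequence(seq: str, location: str) -> Optional[str]:
    """Return the slice for a simple 'start..end' location, else None."""
    m = re.match(r"^<?(\d+)\.\.>?(\d+)$", location)
    if not m:
        return None  # join(), complement() and others are not handled here
    return seq[int(m.group(1)) - 1:int(m.group(2))]


def find_hits(lines: Iterable[str], phrase: str, keys: Iterable[str]):
    """Yield (accession, key, location, subsequence) for matching features."""
    wanted = {k.lower() for k in keys}
    needle = re.escape(" ".join(phrase.lower().split()))
    pattern = re.compile(r"\b" + needle + r"\b")  # whole phrase, not a prefix
    for accession, features, seq in read_entries(lines):
        for key, location, quals in features:
            text = " ".join(quals.lower().split())  # undo line wrapping
            if key.lower() in wanted and pattern.search(text):
                yield accession, key, location, subsequence(seq, location)
```

Called on an open flat file, for example `find_hits(open("fun.dat"), "internal transcribed spacer 1", ["misc_feature", "misc_rna", "gene", "rrna", "intron", "source"])`, the generator streams entries one at a time, so memory use stays independent of file size. Matching the whole phrase against re-joined qualifier text is what distinguishes this from a lookup in a prebuilt per-word index.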


Journal: BMC Bioinformatics	Publication Date: Jan 27, 2006
Citations: 12	License type: cc-by
