Abstract

Background: Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies. However, some GenBank dbEST sequences have proven to be “unclean”. Identifying cDNA termini/ends and their structures in raw ESTs not only facilitates data quality control and accurate delineation of transcription ends, but also furthers our understanding of the potential sources of data abnormalities/errors in the wet-lab procedures for cDNA library construction.

Results: After analyzing a total of 309,976 raw Pinus taeda ESTs, we uncovered many distinct variations of cDNA termini, some of which prove to be good indicators of wet-lab artifacts, and characterized each raw EST by its cDNA terminus structure patterns. In contrast to the expected patterns, many ESTs displayed complex and/or abnormal patterns that represent potential wet-lab errors, such as: a failure of one or both of the restriction enzymes to cut the plasmid vector; a failure of the restriction enzymes to cut the vector at the correct positions; the insertion of two cDNA inserts into a single vector; the insertion of multiple and/or concatenated adapters/linkers; and the presence of 3′-end terminal structures in designated 5′-end sequences or vice versa. A close examination of these artifacts shows that many problematic ESTs deposited into public databases by conventional bioinformatics pipelines or tools could be cleaned or filtered by our methodology. We developed a software tool for Abnormality Filtering and Sequence Trimming for ESTs (AFST, http://code.google.com/p/afst/) using a pattern analysis approach. To compare AFST with other pipelines that submitted ESTs into dbEST, we reprocessed 230,783 Pinus taeda and 38,709 Arachis hypogaea GenBank ESTs. We found that 7.4% of Pinus taeda and 29.2% of Arachis hypogaea GenBank ESTs are “unclean” or abnormal, all of which could be cleaned or filtered by AFST.

Conclusions: cDNA terminal pattern analysis, as implemented in the AFST software tool, can be used to reveal wet-lab errors such as restriction enzyme cutting abnormalities and chimeric EST sequences, detect various data abnormalities embedded in existing Sanger EST datasets, improve the accuracy of identifying and extracting bona fide cDNA inserts from raw ESTs, and therefore greatly benefit downstream EST-based applications.

Highlights

  • Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies

  • In order to significantly reduce the errors in public EST databases, we proposed a protocol that processes raw EST data based on cDNA termini/ends – a set of diagnostic sequence elements that can be used to delineate cDNA insert ends and facilitate extraction of bona fide cDNA insert sequences from raw ESTs [11,12]

  • 3TSS-1 represents the combination of a poly(A) tail and a XhoI site (CTCGAG, Enzyme2); 3TSS-2 denotes the combination of a XhoI site (CTCGAG, Enzyme2) and the adjacent plasmid vector fragment, marked as Vector fragment 2 (VF2); 3TSS-3 represents the poly(A) tail alone; 3TSS-4 denotes a poly(A) tail joined directly to a guanine (G) in place of the XhoI site (CTCGAG, Enzyme2), followed by VF2, an arrangement that is impossible in theory; and 3TSS-5 stands for the vector fragment VF2 alone
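
The 3TSS categories listed above can be illustrated with a small pattern-matching sketch. This is not the AFST implementation: the minimum poly(A) length, the VF2 sequence (a made-up placeholder, not a real vector fragment), the function name `classify_3tss`, and the most-specific-first rule order are all assumptions chosen for illustration. Only the XhoI site CTCGAG comes from the text.

```python
import re

MIN_POLYA = 12            # assumed minimum length of a poly(A) run
XHOI = "CTCGAG"           # XhoI site (Enzyme2), as given in the text
VF2 = "GGTACCCAATTCGC"    # hypothetical placeholder for Vector fragment 2

POLYA = "A{%d,}" % MIN_POLYA

# Rules are tried in order, most specific first (an assumed priority),
# so that a bare poly(A) or bare VF2 only matches when nothing else does.
RULES = [
    ("3TSS-4", re.compile(POLYA + "G" + VF2)),  # poly(A) + G + VF2, no XhoI site
    ("3TSS-1", re.compile(POLYA + XHOI)),       # poly(A) + XhoI site
    ("3TSS-2", re.compile(XHOI + VF2)),         # XhoI site + VF2
    ("3TSS-3", re.compile(POLYA)),              # poly(A) tail only
    ("3TSS-5", re.compile(VF2)),                # VF2 only
]


def classify_3tss(read):
    """Return the first matching 3TSS label for a read, or None."""
    seq = read.upper()
    for label, pattern in RULES:
        if pattern.search(seq):
            return label
    return None


if __name__ == "__main__":
    examples = {
        "poly(A) + XhoI": "ACGT" + "A" * 15 + XHOI,
        "XhoI + VF2": "ACGT" + XHOI + VF2,
        "poly(A) + G + VF2": "ACGT" + "A" * 15 + "G" + VF2,
    }
    for name, seq in examples.items():
        print(name, "->", classify_3tss(seq))
```

A real classifier would also need to tolerate sequencing errors and handle reads in reverse-complement orientation, neither of which this exact-match sketch attempts.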

Summary

Introduction

Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies. Double-termini adapters, the palindromic linker sequences that likely concatenate two different transcripts to form chimeric ESTs, were identified in many Pinus taeda ESTs [10]. In another case, we were able to identify a number of spurious sequence remnants (i.e. vector or adapter fragments) in a large portion of the GenBank ESTs and their clusters/contigs for Chlamydomonas reinhardtii [11], an artifact of undertrimming during raw EST cleanup.
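
As a rough illustration of how a palindromic linker buried inside a read might be flagged as a possible chimera junction, the sketch below checks whether a linker equals its own reverse complement and reports occurrences that lie away from both read ends. It assumes that "palindromic" means reverse-complement palindromic. The linker `GGAATTCC`, the 20-base margin, and the function names are hypothetical choices for this example, not values from the study.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindromic(seq):
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


def internal_linker_hits(read, linker, margin=20):
    """Start positions of exact linker matches at least `margin` bases
    from both ends of the read (candidate chimera junctions)."""
    read, linker = read.upper(), linker.upper()
    hits = []
    pos = read.find(linker)
    while pos != -1:
        if pos >= margin and pos + len(linker) <= len(read) - margin:
            hits.append(pos)
        pos = read.find(linker, pos + 1)
    return hits


if __name__ == "__main__":
    linker = "GGAATTCC"  # hypothetical palindromic linker
    read = "ACGT" * 10 + linker + "TGCA" * 10
    print(is_palindromic(linker))
    print(internal_linker_hits(read, linker))
```

An internal hit is only a hint that two inserts may have been joined; confirming a chimera would normally require further evidence, such as the two flanks aligning to different transcripts.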
