Abstract
Background: Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets.

Results: We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, Gene Ontology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be between 1% and 2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%.

Conclusion: AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in Perl and runs on Linux/UNIX platforms. AutoFACT is available at .
Highlights
Assignment of function to new molecular sequence data is an essential step in genomics projects
The bit score is derived from the raw alignment score, taking into account the statistical properties of the scoring system used
Figure 3: Comparison of AutoFACT annotations across four phylogenetically diverse organisms previously annotated by well-established automatic pipelines. Legend categories: AutoFACT annotation is similar; AutoFACT annotation is 'Unassigned protein'; AutoFACT annotation differs.
Summary
Methodology: AutoFACT takes a single FASTA-formatted sequence file as input, automatically recognizes the sequence type as nucleotide or protein, and then asks the user for preferences: which databases to use, the order of database importance, and the bit score cutoff. If there are no matches to UniRef terms, the informative terms from the informative hit of the database (nr, in this example) are queried in the same way as above, until a functionally informative description line has been assigned to the sequence.

AutoFACT yields an ~50% increase in informative annotations compared to top BLAST hits against NCBI's nr and the UniRef databases. [...] AutoFACT annotated as 'unassigned protein', either because the only BLAST hits were to other human sequences or because the informative terms could not be matched across database sources. Because AutoFACT considered hits to Saccharomyces cerevisiae as 'uninformative', 6 of 10 sequences were classified as '[domain name]-containing proteins'. AutoFACT annotations for each organism mentioned above can be viewed at http://megasun.bch.umontreal.ca/Software/AutoFACT.htm
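
The workflow described under Methodology can be illustrated with a minimal Python sketch. It is not AutoFACT's code (the tool itself is written in Perl) and it simplifies the procedure considerably: it guesses the sequence type from residue composition, then walks the databases in the user's order of importance and returns the description of the best-scoring informative hit at or above the bit score cutoff. It does not reproduce the matching of informative terms across databases. The names `Hit`, `guess_sequence_type`, `is_informative` and `annotate`, the 0.9 composition threshold, and the keyword list of uninformative terms are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical keyword list; the terms AutoFACT actually treats as
# uninformative are not given in this summary.
UNINFORMATIVE_TERMS = ("hypothetical", "unknown", "unnamed", "uncharacterized")


@dataclass
class Hit:
    """One BLAST hit, reduced to the fields this sketch needs."""
    database: str
    description: str
    bit_score: float


def guess_sequence_type(sequence: str, threshold: float = 0.9) -> str:
    """Return 'nucleotide' if most residues are A, C, G, T, U or N.

    The 0.9 threshold is an assumed heuristic, not AutoFACT's rule.
    """
    residues = sequence.upper()
    if not residues:
        raise ValueError("empty sequence")
    nucleotide_like = sum(residues.count(base) for base in "ACGTUN")
    if nucleotide_like / len(residues) >= threshold:
        return "nucleotide"
    return "protein"


def is_informative(description: str) -> bool:
    """A description is informative if it contains no uninformative keyword."""
    text = description.lower()
    return not any(term in text for term in UNINFORMATIVE_TERMS)


def annotate(hits, database_order, bit_score_cutoff):
    """Return the first informative description found, database by database."""
    for database in database_order:
        candidates = [
            hit for hit in hits
            if hit.database == database
            and hit.bit_score >= bit_score_cutoff
            and is_informative(hit.description)
        ]
        if candidates:
            # Within one database, prefer the highest-scoring informative hit.
            return max(candidates, key=lambda hit: hit.bit_score).description
    # No database produced an informative hit at or above the cutoff.
    return "Unassigned protein"


hits = [
    Hit("uniref90", "hypothetical protein", 250.0),
    Hit("nr", "ATP synthase subunit alpha", 120.0),
    Hit("nr", "cytochrome c oxidase subunit 1", 30.0),
]
print(guess_sequence_type("ATGCGTACGTTAGC"))
print(annotate(hits, ["uniref90", "nr"], bit_score_cutoff=40.0))
```

In this toy input the top-scoring hit overall is a 'hypothetical protein', so a plain top-BLAST-hit approach would return an uninformative description, whereas the ordered search falls through to the informative nr hit.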