SeqScreen: a biocuration platform for robust taxonomic and biological process characterization of nucleic acid sequences of interest

Dreycey Albin, Mihai Pop, Todd J Treangen, Jeremy Selengut, Gene Godbold, Adam Porter, R A Leo Elworth, Mikael Lindvall, Christian Diaz, Advait Balaji, Nidhi Shah, Krista Ternus, Jacob Lu, Pravin Muthu, Dan Nasko, Chris Hulme-Lowe, Madeline Diep

doi:10.1109/bibm47256.2019.8982987

Abstract

Rapid advancements in synthetic biology and nucleic acid synthesis, and in particular concerns about their intentional or accidental misuse, call for more sophisticated screening tools to identify genes of interest within short sequence fragments. One major gap in predicting genes of concern is the inadequacy of current tools and ontologies to describe the specific biological processes of pathogenic proteins. The objective of this work is to design software that sensitively assigns taxonomic classifications, functional annotations, and biological processes of interest to short nucleotide sequences of unknown origin (50 bp to 1,000 bp). The overarching goal is to perform sensitive characterization of short sequences and highlight specific pathogenic biological processes of interest (BPoIs). The SeqScreen software executes these tasks in analytical workflows with Nextflow and outputs results in a tab-delimited report. Local and global alignments differentiate hits to taxonomically related sequences from hits to similar but unrelated sequences, and an ensemble approach leverages multiple tools and databases to assign a variety of functional terms to each query sequence. Final biological process assessments are made from the predicted functional annotations, which leverage information in pre-existing databases as well as new custom biocurations. Machine learning models predict each biological process of interest on large protein databases before incorporation into the SeqScreen framework. Precomputing the predictions in this way streamlines computational efficiency, ensures reproducible results, allows for version control, and facilitates review of the automated predictions by expert biocurators. The SeqScreen source code is available at https://gitlab.com/treangenlab/seqscreen.
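
The ensemble annotation and tab-delimited reporting steps described above can be illustrated with a minimal Python sketch. It is not taken from the SeqScreen code base: the tool names (`tool_a`, `tool_b`), the term identifiers (`TERM:0001` and so on), the report columns, and both function names are hypothetical placeholders chosen only to show the general idea of pooling functional terms from several annotators and recording which tools support each term.

```python
import csv
import io
from collections import defaultdict


def merge_annotations(tool_hits):
    """Pool the functional terms that each tool assigned to each query.

    tool_hits maps a tool name to {query_id: set of term IDs}.
    Returns {query_id: {term: sorted list of tools that reported it}}.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for tool, hits in tool_hits.items():
        for query_id, terms in hits.items():
            for term in terms:
                merged[query_id][term].add(tool)
    return {
        query_id: {term: sorted(tools) for term, tools in terms.items()}
        for query_id, terms in merged.items()
    }


def write_report(merged, handle):
    """Write one tab-delimited row per (query, term) pair."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["query_id", "term", "n_tools", "tools"])
    for query_id in sorted(merged):
        for term in sorted(merged[query_id]):
            tools = merged[query_id][term]
            writer.writerow([query_id, term, len(tools), ";".join(tools)])


# Hypothetical annotator outputs for two short query sequences.
tool_hits = {
    "tool_a": {"read_1": {"TERM:0001", "TERM:0002"}},
    "tool_b": {"read_1": {"TERM:0001"}, "read_2": {"TERM:0003"}},
}

merged = merge_annotations(tool_hits)
buffer = io.StringIO()
write_report(merged, buffer)
report_text = buffer.getvalue()
print(report_text, end="")
```

Keeping the list of supporting tools alongside each term, rather than only the pooled set of terms, is one simple way to let a reviewer see how much agreement stands behind an annotation; the weighting or filtering actually used by SeqScreen is not specified in the abstract.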

Full Text