Abstract

Rapid advances in synthetic biology and nucleic acid synthesis, and in particular concerns about their intentional or accidental misuse, call for more sophisticated screening tools to identify genes of interest within short sequence fragments. One major gap in predicting genes of concern is that current tools and ontologies do not adequately describe the specific biological processes of pathogenic proteins. The objective of this work is to design software that sensitively assigns taxonomic classifications, functional annotations, and biological processes of interest (BPoIs) to short nucleotide sequences of unknown origin (50 bp to 1,000 bp). The overarching goal is to characterize short sequences sensitively and to highlight specific pathogenic BPoIs. The SeqScreen software executes these tasks as analytical workflows in Nextflow and outputs results in a tab-delimited report. Local and global alignments differentiate hits to taxonomically related sequences from hits to similar but unrelated sequences, and an ensemble approach leverages multiple tools and databases to assign a variety of functional terms to each query sequence. Final biological process assessments are made from the predicted functional annotations, which draw on information in pre-existing databases as well as new custom biocurations. Machine learning models predict each BPoI on large protein databases before incorporation into the SeqScreen framework, in order to improve computational efficiency, ensure reproducible results, allow for version control, and facilitate review of the automated predictions by expert biocurators. The SeqScreen source code is available at https://gitlab.com/treangenlab/seqscreen.
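
As an illustration only, the ensemble idea described above (pooling functional terms from several tools per query, then deriving biological processes of interest from those terms) can be sketched in Python. This is not the SeqScreen implementation: the function names, the tool outputs, the term labels, and the term-to-BPoI mapping below are all hypothetical placeholders, and only the 50 to 1,000 bp length range comes from the text.

```python
from collections import defaultdict

# Hypothetical outputs of two annotation tools: query ID -> set of functional terms.
# The term labels are placeholders, not real ontology identifiers.
tool_a = {"q1": {"term:toxin_activity"}, "q2": {"term:dna_binding"}}
tool_b = {"q1": {"term:toxin_activity", "term:secreted"}, "q3": set()}

# Hypothetical curated mapping from a functional term to a BPoI label.
term_to_bpoi = {"term:toxin_activity": "toxin"}


def within_length_range(seq, min_len=50, max_len=1000):
    """Return True if the sequence length falls in the range stated in the abstract."""
    return min_len <= len(seq) <= max_len


def merge_annotations(*tool_outputs):
    """Union the functional terms that each tool assigned to each query."""
    merged = defaultdict(set)
    for output in tool_outputs:
        for query_id, terms in output.items():
            merged[query_id] |= terms
    return dict(merged)


def assign_bpois(merged, mapping):
    """Derive a sorted list of BPoI labels per query from its pooled terms."""
    return {
        query_id: sorted({mapping[t] for t in terms if t in mapping})
        for query_id, terms in merged.items()
    }


merged = merge_annotations(tool_a, tool_b)
bpois = assign_bpois(merged, term_to_bpoi)
```

Taking the union of terms favors sensitivity over specificity, which matches the stated goal of sensitive characterization; a stricter design could instead require agreement between tools.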
