Staphopia: an analysis pipeline and Application Programming Interface focused on Staphylococcus aureus

Petit Robert A, Read Timothy D

doi:10.6084/m9.figshare.5488666.v2

Abstract

Rapid low-cost sequencing of clinically important bacterial pathogens has generated thousands of publicly available datasets, and many hundreds of thousands more will undoubtedly be generated soon. Analyzing these genomes and extracting relevant information for each pathogen and its associated clinical phenotypes requires not only resources and bioinformatic skills but also domain knowledge of the nuances of the organism. We have created an analysis pipeline and API focused on Staphylococcus aureus, which is a common human commensal and also of major public health interest, with MRSA (methicillin-resistant S. aureus) a major antibiotic-resistant hospital pathogen. Staphopia can be used both for basic science studies (e.g. patterns of evolution) and, potentially, as a platform for rapid clinical diagnostics. Written in Python, Staphopia’s analysis pipeline consists of submodules that run open-source tools under a pipeline manager. It accepts raw FASTQ reads as input, which undergo quality-control filtering, error correction and reduction to a maximum of 100x coverage. This data reduction helps manage load when processing thousands of genomes. From the preprocessed reads, the pipeline branches into de novo assembly-based analysis and mapping-based analysis. Modules running species-specific analyses, such as antibiotic resistance profiling and multi-locus sequence typing (MLST), use the assembled contigs. Genes are annotated from the contigs using PROKKA and the UniProt database. Mapping is used to call all variants (SNPs and InDels) against a single reference chromosome (S. aureus N315). From the processed reads, 31-mers are counted for each input sample. Depending on the size of the input file, analysis completes in 20 to 60 minutes.
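The coverage cap and the 31-mer counting step described above can be illustrated with a minimal Python sketch. This is not Staphopia's code: the function names, the random subsampling strategy, and the round genome size of 2.8 Mb are assumptions made for illustration only, and the sketch counts k-mers on the forward strand without canonicalization.

```python
import random
from collections import Counter

# Assumptions for illustration only (not taken from Staphopia's source):
GENOME_SIZE = 2_800_000  # rough S. aureus genome size in bp, a round figure
MAX_COVERAGE = 100       # coverage cap stated in the abstract
K = 31                   # k-mer length stated in the abstract


def keep_fraction(total_bases, genome_size=GENOME_SIZE, max_coverage=MAX_COVERAGE):
    """Return the fraction of reads to keep so expected coverage is at most max_coverage."""
    coverage = total_bases / genome_size
    if coverage <= max_coverage:
        return 1.0
    return max_coverage / coverage


def subsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (hypothetical strategy)."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


def count_kmers(reads, k=K):
    """Count every length-k substring across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k yield an empty range and contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts
```

Under these assumptions, a sample with 560 Mb of sequence sits at about 200x coverage, so `keep_fraction(560_000_000)` asks for half of the reads to be kept before k-mers are counted on the reduced set.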
Staphopia’s web application, built with the Django web framework, stores the analysis results for each genome in a PostgreSQL database, except for k-mers, which are stored in Elasticsearch. Users can access these results graphically through a web front end (staphopia.emory.edu) or programmatically through a web API. We have also written an R package (staphopia-R) to access the API. The pipeline has also been encapsulated in a Docker image, which simplifies installing and running it on local machines and in the cloud. Using publicly available sequencing projects from the SRA, we have loaded more than 20,000 genomes into Staphopia. Staphopia is primarily developed for Illumina data, but future updates will adapt the analyses to long-read data. More information about Staphopia is available at staphopia.emory.edu. All code is available in public GitHub repositories.

Full Text