FastqToGeneCounts: Converting Fastq Files to Gene Counts Matrices

Josh Loecker, Brandt Bessell

doi:10.4049/jimmunol.210.supp.249.30

Abstract

FastqToGeneCounts is a computational pipeline built on Snakemake that processes and analyzes bulk RNA sequencing data to determine gene expression. It accepts raw data from the Gene Expression Omnibus or local FastQ files and runs on a supercomputing cluster. Its primary function is to align FastQ files to a reference genome. To ensure accurate outputs, it also offers trimming, quality control, and contaminant screening options. FastqToGeneCounts addresses several issues commonly encountered with traditional alignment workflows. Such workflows require manual file naming and directory setup, which is error-prone and makes failed runs difficult to resume. They often request many resources at once, leading to long wait times for cluster access. Their parameters are also hard to modify, because most settings are not localized to a single file. In contrast, FastqToGeneCounts relies on Snakemake to improve resumability, minimize resource usage, and simplify parameter modification: jobs can be resumed from a failed state, only the minimum required resources are requested to reduce waiting time, and a single YAML file configures the parameters of the pipeline. FastqToGeneCounts has been used to analyze four immature natural killer cell datasets and successfully determined gene expression in these samples. It can analyze a single raw RNA-seq input file in approximately 15 compute minutes, and additional files do not increase runtime owing to Snakemake's close integration with the SLURM scheduler. FastqToGeneCounts removes poor-quality reads from downstream analysis and presents results in an easy-to-read report.
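The resume behavior described in the abstract can be illustrated with a minimal sketch. It is not FastqToGeneCounts code and does not call Snakemake; the step names and output file names are hypothetical. It only shows the general idea that a file-based workflow manager such as Snakemake works from: a step whose output already exists is skipped, and the first missing output marks where the run picks up again.

```python
from pathlib import Path
import tempfile

# Hypothetical step names and output files, chosen for illustration only.
STEPS = [
    ("trim", "trimmed.fastq"),
    ("align", "aligned.bam"),
    ("count", "gene_counts.tab"),
]


def steps_to_run(workdir, steps=STEPS):
    """Return the steps that still need to run, in order.

    A step is skipped only if its output exists and no earlier step
    has to be rerun; once one output is missing, that step and every
    step after it are scheduled.
    """
    workdir = Path(workdir)
    pending = []
    rerun = False
    for name, output in steps:
        if not (workdir / output).exists():
            rerun = True
        if rerun:
            pending.append(name)
    return pending


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Simulate a run that failed after trimming finished.
        (Path(tmp) / "trimmed.fastq").touch()
        print(steps_to_run(tmp))  # ['align', 'count']
```

In this simplified model only file existence is checked. Snakemake additionally compares file modification times, among other criteria, so the sketch should be read as an approximation of the concept rather than a description of the tool's internals.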
