ALFA: annotation landscape for aligned reads

Mathieu Bahin, Hervé Le Hir, Alice Lebreton, Charles Bernard, Auguste Genovesio, Leila Bastianelli, Valentine Murigneux, Benoit F Noël

doi:10.1186/s12864-019-5624-2

Abstract

Background: The last 10 years have seen the rise of countless functional genomics studies based on Next-Generation Sequencing (NGS). In the vast majority of cases, whatever the species and whatever the experiment, the first two steps of data analysis consist of a quality control of the raw reads followed by a mapping of those reads to a reference genome/transcriptome. Subsequent steps then depend on the type of study being conducted. While some tools have been proposed for investigating data quality after the mapping step, there is no commonly adopted framework that would be easy to use and broadly applicable to any NGS data type.

Results: We present ALFA, a simple but universal tool that can be used after the mapping step on any kind of NGS experiment data for any organism with available genomic annotations. In a single command line, ALFA can compute and display the distribution of reads by categories (exon, intron, UTR, etc.) and biotypes (protein coding, miRNA, etc.) for a given aligned dataset with nucleotide precision. We present applications of ALFA to Ribo-Seq and RNA-Seq on Homo sapiens, CLIP-Seq on Mus musculus, RNA-Seq on Saccharomyces cerevisiae, Bisulfite sequencing on Arabidopsis thaliana and ChIP-Seq on Caenorhabditis elegans.

Conclusions: We show that ALFA provides a powerful and broadly applicable approach for post-mapping quality control and for producing a global overview using common or dedicated annotations. It is made available to the community as an easy-to-install command line tool and from the Galaxy Tool Shed.
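To illustrate what counting "with nucleotide precision" can mean in practice, the following minimal Python sketch assigns each aligned base to the annotation feature it overlaps, rather than assigning whole reads. It is not ALFA's implementation: the toy annotation, the half-open coordinates, the function name `nucleotide_counts`, and the assumption that features do not overlap each other are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical toy annotation on a single chromosome (not from a real GTF file).
# Each entry is (start, end, category, biotype) with half-open coordinates.
ANNOTATION = [
    (0, 100, "five_prime_utr", "protein_coding"),
    (100, 300, "CDS", "protein_coding"),
    (300, 500, "intron", "protein_coding"),
    (600, 680, "exon", "miRNA"),
]


def nucleotide_counts(reads, annotation):
    """Count aligned nucleotides per (category, biotype).

    `reads` is a list of (start, end) half-open intervals. Bases that
    overlap no feature are counted as intergenic. Assumes the features
    in `annotation` do not overlap one another.
    """
    counts = Counter()
    for r_start, r_end in reads:
        covered = 0
        for a_start, a_end, category, biotype in annotation:
            overlap = min(r_end, a_end) - max(r_start, a_start)
            if overlap > 0:
                counts[(category, biotype)] += overlap
                covered += overlap
        uncovered = (r_end - r_start) - covered
        if uncovered > 0:
            counts[("intergenic", "none")] += uncovered
    return counts


# Three toy reads: one spanning the UTR/CDS boundary, one spanning the
# CDS/intron boundary, and one falling outside every feature.
example_reads = [(90, 110), (290, 310), (500, 520)]
example_counts = nucleotide_counts(example_reads, ANNOTATION)
for key, n in sorted(example_counts.items()):
    print(key, n)
```

A read that spans a boundary, such as the first toy read, contributes to both categories in proportion to the bases falling in each, which is the distinction between nucleotide-level and read-level counting.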

Highlights

The last 10 years have seen the rise of countless functional genomics studies based on Next-Generation Sequencing (NGS)
We introduce ALFA (Annotation Landscape For Aligned reads), a simple and broadly applicable tool that produces a global overview of the distribution of mapped reads, both in terms of genomic categories and biotypes with nucleotide precision
ALFA highlights laboratory-dependent differences between reads falling in coding sequences (CDS) (t-test significant at a 5% level with a p-value of 4 × 10⁻²)

Summary

Introduction

The last 10 years have seen the rise of countless functional genomics studies based on Next-Generation Sequencing (NGS). In ChIP-Seq [3] or CLIP-Seq [4] experiments, peaks will need to be detected prior to further processing; in RNA-Seq, a differential analysis will often be performed on aligned reads; in BS-Seq experiments, a dedicated analysis of sequence will be applied in order to [...]

Dedicated categorization of reads for specific kinds of NGS data was employed in past studies performing RNA-Seq [5], Ribosome profiling [6], ChIP-Seq [7] or miR-Seq [8]. These approaches were not designed to be broadly applicable. To the best of our knowledge, there is no available tool [...]

Results

Discussion

Conclusion