Abstract

As sequencing read length has increased, researchers have quickly adopted longer reads for their experiments. Here, we examine 14 pathogen or host–pathogen differential gene expression data sets to assess whether using longer reads is warranted. A variety of data sets was used to assess what genomic attributes might affect the outcome of differential gene expression analysis including: gene density, operons, gene length, number of introns/exons and intron length. No genome attribute was found to influence the data in principal components analysis, hierarchical clustering with bootstrap support, or regression analyses of pairwise comparisons that were undertaken on the same reads, looking at all combinations of paired and unpaired reads trimmed to 36, 54, 72 and 101 bp. Read pairing had the greatest effect when there was little variation in the samples from different conditions or in their replicates (e.g. little differential gene expression). But overall, 54 and 72 bp reads were typically most similar. Given differences in costs and mapping percentages, we recommend 54 bp reads for organisms with no or few introns and 72 bp reads for all others. In a third of the data sets, read pairing had absolutely no effect, despite paired reads having twice as much data. Therefore, single-end reads seem robust for differential-expression analyses, but in eukaryotes paired-end reads are likely desired to analyse splice variants and should be preferred for data sets that are acquired with the intent to be community resources that might be used in secondary data analyses.

Highlights

  • As sequencing throughput has increased and sequencing costs have decreased, measuring differential expression of genes using sequence data has become an increasingly powerful, effective and popular approach [9, 10]

  • The human, mouse, canine, Ixodes and Aspergillus reference genomes and annotations were downloaded from Ensembl; the Ehrlichia chaffeensis, Escherichia coli, Helicobacter pylori, Pseudomonas aeruginosa and wBm reference files were downloaded from the National Center for Biotechnology Information (NCBI); the Candida albicans reference files were downloaded from the Candida Genome Database; and the Brugia malayi reference files were downloaded from WormBase (Table S1)

  • The effect of the read aligner was tested using the E. coli and Wolbachia data sets by comparing the results above, which were obtained with Bowtie, with those from bwa mem (Additional Files S31–S36). These two data sets were selected because they represent the extremes of the results presented above: the E. coli results were unaffected by changes in read length or pairing status, whereas both read length and pairing influenced the Wolbachia results

Summary

Introduction

As sequencing throughput has increased and sequencing costs have decreased, measuring differential expression of genes using sequence data has become an increasingly powerful, effective and popular approach [9, 10]. We examine host–pathogen interaction studies to assess whether using longer reads is warranted, given their increased cost relative to using the same number of shorter reads. To this end, we compared the use of various read lengths and read pairing for 14 diverse host–pathogen data sets with varying genomic attributes, including gene density, operons, gene length, number of introns/exons, G+C content and intron length. For data sets that will be community resources, paired ends are likely desired to enable their use in other studies, but for differential-expression analyses, single-end reads yield robust results.
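
The comparison design described in the abstract (the same reads trimmed to 36, 54, 72 and 101 bp, analysed both paired and unpaired) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the tuple representation of a FASTQ record, and the choice to keep only read 1 in the single-end condition are all assumptions made for illustration.

```python
from itertools import product

# Read lengths (bp) and pairing states taken from the abstract.
READ_LENGTHS = (36, 54, 72, 101)
PAIRING = ("single", "paired")


def conditions():
    """Enumerate every (read length, pairing) combination to compare."""
    return list(product(READ_LENGTHS, PAIRING))


def trim_record(record, length):
    """Trim one FASTQ record to at most `length` bases.

    `record` is a hypothetical (header, sequence, separator, quality) tuple;
    sequence and quality strings are truncated from the 3' end together.
    """
    header, seq, sep, qual = record
    return (header, seq[:length], sep, qual[:length])


def reads_for_condition(read1, read2, length, pairing):
    """Return the trimmed read(s) used for one condition.

    Assumption: the single-end condition keeps read 1 only and discards read 2.
    """
    trimmed1 = trim_record(read1, length)
    if pairing == "single":
        return (trimmed1,)
    return (trimmed1, trim_record(read2, length))


if __name__ == "__main__":
    r1 = ("@read/1", "ACGT" * 26, "+", "I" * 104)
    r2 = ("@read/2", "TGCA" * 26, "+", "I" * 104)
    for length, pairing in conditions():
        reads = reads_for_condition(r1, r2, length, pairing)
        print(length, pairing, [len(r[1]) for r in reads])
```

Deriving every condition from the same source reads means that differences between conditions reflect read length and pairing only, not sampling differences between sequencing runs.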
