Abundance Estimation Error Research Articles

Background: Shotgun metagenomes are often assembled prior to annotation of genes, which biases the functional capacity of a community towards its most abundant members. For an unbiased assessment of community function, short reads need to be mapped directly to a gene or protein database. The ability to detect genes in short read sequences depends on pre- and post-sequencing decisions. The objective of the current study was to determine how library size selection, read length and format, protein database, e-value threshold, and sequencing depth impact gene-centric analysis of human fecal microbiomes when using DIAMOND, an alignment tool that is up to 20,000 times faster than BLASTX.

Results: Using metagenomes simulated from a database of experimentally verified protein sequences, we find that read length, e-value threshold, and the choice of protein database dramatically impact detection of a known target, with best performance achieved with longer reads, stricter e-value thresholds, and a custom database. Using publicly available metagenomes, we evaluated library size selection, paired end read strategy, and sequencing depth. Longer read lengths were achievable by merging paired ends when the sequencing library was size-selected to enable overlaps. When paired ends could not be merged, a congruent strategy in which both ends are independently mapped was acceptable. Sequencing depths of 5 million merged reads minimized the error of abundance estimates of specific target genes, including an antimicrobial resistance gene.

Conclusions: Shotgun metagenomes of DNA extracted from human fecal samples and sequenced on the Illumina platform should be size-selected to enable merging of paired end reads, and should be sequenced in the PE150 format with a minimum sequencing depth of 5 million merge-able reads to enable detection of specific target genes. Because the merged reads are expected to be 180-250 bp in length, the appropriate e-value threshold for DIAMOND needs to be stricter than the default. Accurate and interpretable results for specific hypotheses will be best obtained using small databases customized for the research question.
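To illustrate why sequencing depth affects the error of abundance estimates, the following minimal sketch assumes a simple binomial sampling model in which every read independently hits the target gene with a fixed probability. This model, the function name `relative_standard_error`, and the example target fraction are assumptions made for illustration only; they are not the method or values used in the study.

```python
import math


def relative_standard_error(depth: int, target_fraction: float) -> float:
    """Relative standard error of a read-count-based abundance estimate.

    Assumes (hypothetically) that each of `depth` reads independently maps
    to the target gene with probability `target_fraction` (binomial model).
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.sqrt((1 - target_fraction) / (depth * target_fraction))


if __name__ == "__main__":
    # Illustrative target fraction (1 read in 100,000); not taken from the study.
    fraction = 1e-5
    for depth in (500_000, 1_000_000, 5_000_000, 10_000_000):
        rse = relative_standard_error(depth, fraction)
        print(f"depth={depth:>10,d}  relative standard error={rse:.3f}")
```

Under this assumed model the relative error shrinks with the square root of the depth, so rare targets benefit most from deeper sequencing.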

Background: The main goal of whole transcriptome analysis is to correctly identify all transcripts expressed within a specific cell or tissue, at a particular stage and condition, to determine their structures, and to measure their abundances. RNA-seq data promise to allow identification and quantification of the transcriptome at an unprecedented level of resolution and accuracy, and at low cost. Several computational methods have been proposed to achieve these purposes. However, it is still not clear which promises have already been met and which challenges remain open and require further methodological development.

Results: We carried out a simulation study to assess the performance of five widely used tools: CEM, Cufflinks, iReckon, RSEM, and SLIDE. All of them were run with default parameters. In particular, we considered three scenarios: availability of complete annotation, incomplete annotation, and no annotation at all. Moreover, comparisons were carried out using the methods in three modes of action. In the first mode, the methods were forced to deal only with those isoforms that are present in the annotation; in the second mode, they were allowed to detect novel isoforms using the annotation as a guide; in the third mode, they operated in a fully data-driven way (although with the support of the alignment to the reference genome). In the third mode, precision and recall are quite poor. By contrast, results are better with the support of the annotation, even when it is incomplete. Finally, abundance estimation error often shows a very skewed distribution. Performance depends strongly on the true abundance of the isoforms. Lowly (and sometimes also moderately) expressed isoforms are poorly detected and estimated. In particular, lowly expressed isoforms are identified mainly if they are provided in the original annotation as potential isoforms.

Conclusions: Both detection and quantification of all isoforms from RNA-seq data are still hard problems, and they are affected by many factors. Overall, performance changes significantly depending on the mode of action and on the type of available annotation. Methods run with complete or partial annotation detect most of the expressed isoforms, even though the number of false positives is often high. Fully data-driven approaches require more attention, at least for complex eukaryotic genomes. Improvements are desirable especially for isoform quantification and for detection of low-abundance isoforms.
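The quantities discussed in this abstract (precision and recall of isoform detection, and abundance estimation error) can be sketched as follows. This is a minimal illustration under assumed definitions: the function names, the use of absolute relative error, and the toy isoform labels and abundances are hypothetical and are not taken from the study.

```python
from typing import Dict, Set, Tuple


def detection_metrics(true_isoforms: Set[str], predicted: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted isoforms against the simulated truth."""
    true_positives = len(true_isoforms & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_isoforms) if true_isoforms else 0.0
    return precision, recall


def relative_errors(true_abund: Dict[str, float], est_abund: Dict[str, float]) -> Dict[str, float]:
    """Absolute relative error per truly expressed isoform.

    Isoforms missing from the estimates are treated as estimated at 0.
    """
    return {
        iso: abs(est_abund.get(iso, 0.0) - truth) / truth
        for iso, truth in true_abund.items()
        if truth > 0
    }


if __name__ == "__main__":
    # Toy values, made up purely for illustration.
    truth = {"iso_A": 100.0, "iso_B": 1.0, "iso_C": 20.0}
    estimate = {"iso_A": 90.0, "iso_B": 3.0, "iso_D": 5.0}

    precision, recall = detection_metrics(set(truth), set(estimate))
    print(f"precision={precision:.2f} recall={recall:.2f}")
    for iso, err in relative_errors(truth, estimate).items():
        print(f"{iso}: relative error={err:.2f}")
```

In this toy input the same absolute deviation weighs far more on the low-abundance isoform, which is one way a skewed error distribution can arise.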

Articles published on Abundance Estimation Error

Pre- and post-sequencing recommendations for functional annotation of human fecal metagenomes

Models for assessing local‐scale co‐abundance of animal species while accounting for differential detectability and varied responses to the environment

Hybrid Spectral Unmixing: Using Artificial Neural Networks for Linear/Non-Linear Switching

Centralized Collaborative Sparse Unmixing for Hyperspectral Images

Abundance quantification by independent component analysis of hyperspectral imagery for oil spill coverage calculation

Computational approaches for isoform detection and estimation: good and bad news.

The use of an adaptive acoustic-survey design to estimate the abundance of highly skewed fish populations

The relationship between sampling intensity and sampling error—empirical results from acoustic surveys in Polish vendace lakes

Towards an optimal sampling strategy for Alexandrium catenella (Dinophyceae) benthic resting cysts
