Abstract

High-throughput biological data analysis commonly involves identifying features, such as genes, genomic regions, and proteins, whose values differ between two conditions from among numerous features measured simultaneously. The most widely used criterion for ensuring the reliability of such an analysis is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions about the data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control that does not rely on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.

Highlights

  • High-throughput technologies are widely used to measure system-wide biological features, such as genes, genomic regions, and proteins (“high-throughput” means the number of features is large, at least in thousands)

  • We benchmarked Clipper against bioinformatics tools in studies including peak calling from ChIP-seq data, peptide identification from mass spectrometry data, identification of differentially expressed genes (DEGs) from bulk and single-cell RNA-seq data, and identification of differentially interacting chromatin regions (DIRs) from Hi-C data

  • Clipper has broad applications in omics data analyses. We demonstrate the use of Clipper in four omics data applications: peak calling from ChIP-seq data, peptide identification from mass spectrometry (MS) data, DEG identification from bulk or single-cell RNA-seq data, and DIR identification from Hi-C data

Introduction

High-throughput technologies are widely used to measure system-wide biological features, such as genes, genomic regions, and proteins (“high-throughput” means that the number of features is large, at least in the thousands). The most common goal of analyzing high-throughput data is to contrast two conditions so as to reliably screen for “interesting features,” where “interesting” means “enriched” or “differential.” “Enriched features” are defined as those with higher expected measurements (without measurement errors) under the experimental (i.e., treatment) condition than under the background (i.e., negative control) condition. Typical enrichment analyses include calling protein-binding sites in a genome from chromatin immunoprecipitation sequencing (ChIP-seq) data [1, 2] and identifying peptides from mass spectrometry (MS) data [3]. “Differential features” are defined as those with different expected measurements between two conditions, and their detection is called “differential analysis.”
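
The two definitions above can be stated compactly. As a sketch in notation that is assumed here for illustration and not taken from the text above, let $\mu_{Xj}$ and $\mu_{Yj}$ denote the expected measurements of feature $j$ under the experimental and background conditions, and let $R$ and $V$ denote the total number of discoveries and the number of false discoveries; the FDR expression is the standard definition.

```latex
% Assumed notation: \mu_{Xj}, \mu_{Yj} = expected measurements of feature j
% under the experimental and background conditions; R = number of discoveries;
% V = number of false discoveries.
\begin{align*}
\text{feature } j \text{ is enriched} &\iff \mu_{Xj} > \mu_{Yj}, \\
\text{feature } j \text{ is differential} &\iff \mu_{Xj} \neq \mu_{Yj}, \\
\mathrm{FDR} &= \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right].
\end{align*}
```

Written this way, enrichment is a one-sided comparison of the two conditions, whereas differential analysis is a two-sided one.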
