Approximate distance correlation for selecting highly interrelated genes across datasets.

Qunlun Shen, Shihua Zhang

doi:10.1371/journal.pcbi.1009548

Abstract

With the rapid accumulation of biological omics datasets, decoding the underlying relationships of cross-dataset genes has become an important issue. Previous studies have attempted to identify differentially expressed genes across datasets, but such methods struggle to detect interrelated ones. Moreover, existing correlation-based algorithms can only measure the relationship between genes within a single dataset or between two multi-modal datasets from the same samples. It is still unclear how to quantify the strength of association of the same gene across two biological datasets with different samples. To this end, we propose Approximate Distance Correlation (ADC) to select interrelated genes with statistical significance across two different biological datasets. ADC first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across two datasets. ADC repeats this process for all genes and then performs the Benjamini-Hochberg adjustment to control the false discovery rate. We demonstrate the effectiveness of ADC with simulation data and four real applications to select highly interrelated genes across two datasets. These four applications include 21 cancer RNA-seq datasets of different tissues; six single-cell RNA-seq (scRNA-seq) datasets of mouse hematopoietic cells across six different cell types along the hematopoietic cell lineage; five scRNA-seq datasets of pancreatic islet cells across five different technologies; and coupled single-cell ATAC-seq (scATAC-seq) and scRNA-seq data of peripheral blood mononuclear cells (PBMC). Extensive results demonstrate that ADC is a powerful tool to uncover interrelated genes with strong biological implications and is scalable to large-scale datasets. Moreover, the number of such genes can serve as a metric to measure the similarity between two datasets, which could characterize the relative difference of diverse cell types and technologies.
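
The two statistical ingredients named in the abstract, distance correlation and the Benjamini-Hochberg adjustment, can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the rule for choosing the k approximate observations is not reproduced, and the toy layout (each of the k selected genes contributes one paired observation, namely its expression vector in each dataset) is an assumption about how the observations are formed. The useful property shown is that distance correlation only needs paired observations, so the two vectors in a pair may have different lengths.

```python
import math


def _centered_distances(points):
    """Pairwise Euclidean distances with row, column and grand means removed."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in d]  # the matrix is symmetric, so row means equal column means
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def distance_correlation(xs, ys):
    """Sample distance correlation between paired observations xs[i] and ys[i].

    Each observation is a sequence of floats. All xs share one dimension and
    all ys share one dimension, but the two dimensions may differ.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    a = _centered_distances(xs)
    b = _centered_distances(ys)

    def mean_product(u, v):
        return sum(u[i][j] * v[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)                        # squared distance covariance
    dvar = mean_product(a, a) * mean_product(b, b)    # product of squared distance variances
    if dvar <= 0.0:
        return 0.0                                    # one side is constant
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                      # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (assumed layout): k = 4 approximate observations for one target gene.
# Dataset A has 3 samples and dataset B has 2, so the vectors differ in length.
obs_a = [[1.0, 2.0, 0.5], [2.0, 3.5, 1.0], [3.0, 5.0, 1.5], [4.0, 7.5, 2.5]]
obs_b = [[0.9, 1.1], [2.1, 1.9], [2.9, 3.2], [4.2, 3.8]]
score = distance_correlation(obs_a, obs_b)

# Hypothetical per-gene p-values, adjusted to control the false discovery rate.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(score, adjusted)
```

How a p-value is attached to each gene's distance correlation is not stated in the abstract, so the sketch takes the p-values as given.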

Highlights

High-throughput sequencing technologies (e.g., RNA-seq, scRNA-seq, scATAC-seq) provide an unprecedented opportunity to analyze biological processes with large-scale data.
Detecting highly interrelated genes across datasets is hindered because the datasets contain different samples and may differ in the number of samples.
We present a new algorithm that can identify interrelated genes across datasets based on distance correlation.

Summary

Introduction

High-throughput sequencing technologies (e.g., RNA-seq, scRNA-seq, scATAC-seq) provide an unprecedented opportunity to analyze biological processes with large-scale data. Differential analysis plays a vital role in comparative studies, and many methods, such as limma [7] and edgeR [8], have been put forward to identify differentially expressed genes between two different datasets [9]. The problem of measuring the correlation between two genes in a single dataset, or in two multi-modal datasets from the same samples, has been well studied and can be addressed with the Pearson, Spearman, or Kendall correlation coefficients, among others. It should be noted that the performance of the maximal information coefficient (MIC) can be significantly reduced with a limited number of samples in practice [12].
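
As a small illustration of the within-dataset case mentioned above, the following sketch computes the Pearson and Spearman coefficients for two genes measured on the same samples. The function names and the toy expression values are made up for the example. Both coefficients require one value per gene for every shared sample, which is exactly what is unavailable when the two datasets contain different samples.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences of paired values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("both genes must be measured on the same samples")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy expression values for two genes across the same five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [math.exp(v) for v in gene_a]  # monotone but nonlinear in gene_a

r_pearson = pearson(gene_a, gene_b)
r_spearman = spearman(gene_a, gene_b)
print(r_pearson, r_spearman)
```

Because gene_b is a monotone transformation of gene_a, the rank-based coefficient reaches 1 while the Pearson coefficient stays below 1.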

Journal: PLOS Computational Biology
Publication Date: Nov 9, 2021
Citations: 4
License type: CC BY 4.0
