Integrative Clustering Analysis with Application in Multi-Source Gene Expression Data

Liuqing Yang, Qing Pan, Yunpeng Zhao

doi:10.6339/21-jds1028

Abstract

In omics studies, different sources of information about the same set of genes are often available. When the group structure (e.g., gene pathways) within the genes is of interest, we combine the normal hierarchical model with the stochastic block model, through an integrative clustering framework, to model gene expression and gene networks jointly. In extensive simulation studies, the integrative framework provides higher accuracy than separate clustering when one or both of the data sources contain noise or when different data sources provide complementary information. An empirical guideline for choosing between integrative and separate clustering models is proposed. The integrative clustering method is illustrated on mouse embryo single cell RNAseq and bulk cell microarray data, where it identifies not only the gene sets shared by both data sources but also the gene sets unique to one data source.

Highlights

Network analysis is the study of networks representing relationships between objects
The integrative clustering method is illustrated on mouse embryo single cell RNAseq and bulk cell microarray data, where it identifies the gene sets shared by both data sources and the gene sets unique to one data source
We examine the performance of integrative clustering versus separate clustering in the presence of contamination as well as orthogonal community structures in different data sources

Summary

Introduction

Network analysis is the study of networks representing relationships (i.e., links or edges) between objects (i.e., vertices or nodes). We dichotomize connectivity measures in single cell RNAseq data and apply the stochastic block model (SBM) to the resulting gene network data. The SBM is a popular community detection approach for binary network data, which assumes that the link probability of each pair of nodes depends only on the communities to which the two nodes belong. The proposed method falls into the category of probabilistic clustering models, since it combines the log likelihoods of the normal hierarchical model (NHM) and the SBM on two sets of data from independent data sources. The NHM describes the clustering structure on the mean values of gene expression levels, while the SBM extracts groups using mutual distances between genes.
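To make the idea of combining two log likelihoods concrete, the following is a minimal Python sketch, not the authors' implementation. It dichotomizes a similarity matrix into a binary network, evaluates a Bernoulli SBM log likelihood with block probabilities set to the observed block densities, and adds a Gaussian log likelihood for the expression data. The Gaussian part is a deliberately simplified stand-in (one mean per group, one shared variance) and is not the NHM of the paper; the function names, the threshold of 0.5, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def dichotomize(similarity, threshold):
    """Binary adjacency matrix from a symmetric similarity matrix (no self-loops)."""
    adj = (similarity > threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

def sbm_log_likelihood(adj, labels, n_groups, eps=1e-10):
    """Bernoulli SBM log likelihood; block probabilities are observed block densities."""
    labels = np.asarray(labels)
    iu, ju = np.triu_indices(adj.shape[0], k=1)  # each unordered pair once
    edges = adj[iu, ju]
    lo = np.minimum(labels[iu], labels[ju])
    hi = np.maximum(labels[iu], labels[ju])
    loglik = 0.0
    for k in range(n_groups):
        for l in range(k, n_groups):
            mask = (lo == k) & (hi == l)
            n_pairs = mask.sum()
            if n_pairs == 0:
                continue
            n_edges = edges[mask].sum()
            p = np.clip(n_edges / n_pairs, eps, 1 - eps)
            loglik += n_edges * np.log(p) + (n_pairs - n_edges) * np.log(1 - p)
    return float(loglik)

def normal_log_likelihood(expr, labels, n_groups):
    """Simplified stand-in for the expression model (NOT the paper's NHM):
    one mean per group and one shared variance, both set to their MLEs."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    resid = expr.copy()
    for k in range(n_groups):
        members = labels == k
        if members.any():
            resid[members] -= expr[members].mean()
    sigma2 = max(np.mean(resid ** 2), 1e-10)
    return float(-0.5 * expr.size * (np.log(2 * np.pi * sigma2) + 1))

def joint_log_likelihood(expr, adj, labels, n_groups):
    """Sum of the two log likelihoods, assuming the data sources are independent."""
    return (normal_log_likelihood(expr, labels, n_groups)
            + sbm_log_likelihood(adj, labels, n_groups))

# Toy data (hypothetical): 6 genes in 2 groups, 4 expression samples per gene.
true_labels = np.array([0, 0, 0, 1, 1, 1])
wrong_labels = np.array([0, 1, 0, 1, 0, 1])
similarity = np.where(true_labels[:, None] == true_labels[None, :], 0.9, 0.1)
adj = dichotomize(similarity, threshold=0.5)
rng = np.random.default_rng(0)
expr = 0.1 * rng.normal(size=(6, 4)) + 5.0 * true_labels[:, None]

joint_true = joint_log_likelihood(expr, adj, true_labels, n_groups=2)
joint_wrong = joint_log_likelihood(expr, adj, wrong_labels, n_groups=2)
print(joint_true > joint_wrong)
```

In this toy setting both data sources agree on the grouping, so the labeling that matches the planted groups scores a higher joint log likelihood than the mismatched one. A clustering procedure would search over labelings to maximize such a joint criterion.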

Normal Hierarchical Model for Gene Expression Data

Pseudo-Likelihood Method for the Stochastic Block Model

Integrative Method

Simulation

Empirical Guidelines Choosing Between Integrative and Separate Analyses

Performance Under Unequal Community Structures

BIC-Type Criteria to Choose K

Robustness to Correlated Data Sources

Clustering Analysis for Mouse Embryo Data

Discussion and Future

Journal: Journal of Data Science
Publication Date: Nov 9, 2021
License type: cc-by
