Abstract

In omics studies, different sources of information about the same set of genes are often available. When the group structure (e.g., gene pathways) within the genes is of interest, we combine the normal hierarchical model with the stochastic block model, through an integrative clustering framework, to model gene expression and gene networks jointly. In extensive simulation studies, the integrative framework provides higher accuracy when one or both of the data sources contain noise or when different data sources provide complementary information. An empirical guideline for choosing between integrative and separate clustering models is proposed. The integrative clustering method is illustrated on mouse embryo single-cell RNAseq and bulk-cell microarray data, where it identified not only the gene sets shared by both data sources but also the gene sets unique to one data source.

Highlights

  • Network analysis is the study of networks representing relationships between objects

  • The integrative clustering method is illustrated on mouse embryo single-cell RNAseq and bulk-cell microarray data, where it identified the gene sets shared by both data sources and the gene sets unique to one data source

  • We examine the performance of integrative clustering versus separate clustering in the presence of contamination as well as orthogonal community structures in different data sources

Summary

Introduction

Network analysis is the study of networks representing relationships (i.e., links or edges) between objects (i.e., vertices or nodes). We dichotomize connectivity measures in single-cell RNAseq data and apply the stochastic block model (SBM) to the resulting gene network data. The SBM is a popular community detection approach for binary network data, which assumes that the link probability of each pair of nodes depends only on the communities to which the two nodes belong. The proposed method falls into the category of probabilistic clustering models, since it combines the log likelihoods of the normal hierarchical model (NHM) and the SBM on two sets of data from independent data sources. The NHM describes the clustering structure in the mean values of gene expression levels, while the SBM extracts groups using mutual distances between genes.
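The idea of combining the two log likelihoods can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian term below is a simplified stand-in for the NHM (one mean vector per cluster, a shared variance, no hierarchical prior), the SBM term uses plug-in block probabilities rather than a pseudo-likelihood method, and all function names, the threshold, and the toy data are hypothetical.

```python
import numpy as np

def dichotomize(similarity, threshold):
    """Binary adjacency matrix from a symmetric similarity matrix (no self-loops)."""
    adj = (similarity > threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

def sbm_loglik(adj, labels, n_clusters, eps=1e-9):
    """Bernoulli SBM log likelihood; the link probability depends only on the block pair."""
    iu, ju = np.triu_indices(adj.shape[0], k=1)  # count each unordered pair once
    edges = adj[iu, ju]
    total = 0.0
    for a in range(n_clusters):
        for b in range(a, n_clusters):
            mask = ((labels[iu] == a) & (labels[ju] == b)) | \
                   ((labels[iu] == b) & (labels[ju] == a))
            n_pairs = mask.sum()
            if n_pairs == 0:
                continue
            n_edges = edges[mask].sum()
            # Plug-in estimate of the block link probability (an assumption of this sketch).
            p = np.clip(n_edges / n_pairs, eps, 1 - eps)
            total += n_edges * np.log(p) + (n_pairs - n_edges) * np.log(1 - p)
    return total

def gaussian_loglik(expr, labels, n_clusters):
    """Simplified stand-in for the NHM: cluster-specific means, shared variance."""
    resid = np.empty_like(expr, dtype=float)
    for k in range(n_clusters):
        members = labels == k
        if members.any():
            resid[members] = expr[members] - expr[members].mean(axis=0)
    sigma2 = max(np.mean(resid ** 2), 1e-9)
    n = resid.size
    # Gaussian log likelihood evaluated at the maximum-likelihood variance.
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def joint_loglik(expr, adj, labels, n_clusters):
    """Sum of the two log likelihoods, treating the data sources as independent."""
    return gaussian_loglik(expr, labels, n_clusters) + sbm_loglik(adj, labels, n_clusters)

# Toy data: 8 genes in two groups, 3 expression samples, block-structured network.
true_labels = np.array([0] * 4 + [1] * 4)
wrong_labels = np.array([0, 1] * 4)
similarity = (true_labels[:, None] == true_labels[None, :]).astype(float)
adj = dichotomize(similarity, threshold=0.5)
expr = np.where(true_labels[:, None] == 0, 0.0, 5.0) \
    + 0.01 * np.arange(8)[:, None] + np.zeros((8, 3))

print(joint_loglik(expr, adj, true_labels, 2) > joint_loglik(expr, adj, wrong_labels, 2))
```

In this toy setting both data sources agree on the grouping, so the labelling that matches the planted groups scores higher than an interleaved labelling. A clustering procedure would search over labellings to maximize such a joint objective.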

Normal Hierarchical Model for Gene Expression Data
Pseudo-Likelihood Method for the Stochastic Block Model
Integrative Method
Simulation
Empirical Guidelines Choosing Between Integrative and Separate Analyses
Performance Under Unequal Community Structures
BIC-Type Criteria to Choose K
Robustness to Correlated Data Sources
Clustering Analysis for Mouse Embryo Data
Discussion and Future