Abstract

Motivated by classes of problems frequently found in the analysis of gene expression data, we propose a semiparametric Bayesian model to detect biclusters, that is, subsets of individuals sharing similar patterns over a set of conditions. Our approach is based on the well-known plaid model by Lazzeroni and Owen (2002). By assuming a truncated stick-breaking prior, we also infer the number of biclusters present in the data. Evidence from a simulation study shows that the model is capable of correctly detecting biclusters and performs well compared to some competing approaches. The flexibility of the proposed prior is demonstrated with applications to the analysis of gene expression data (continuous responses) and histone modifications data (count responses).

Highlights

  • Assume we record measurements {y_ij} corresponding to a sample of i = 1, ..., n individuals on each of j = 1, ..., J conditions

  • Biclustering was first discussed by Hartigan (1972) in the context of creating a method for simultaneously grouping rows and columns of a matrix

  • A different approach was considered by Cheng and Church (2000), who proposed an algorithm based on identifying submatrices with similar entries, as measured by a mean squared residue
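
As a rough illustration of the mean squared residue mentioned in the last highlight, the following sketch computes the Cheng and Church score for a chosen submatrix: each entry has its row mean and column mean subtracted and the overall submatrix mean added back, and the squared results are averaged. The function name, the toy matrix, and the chosen row and column indices are assumptions made for illustration only; nothing here is taken from the paper's own code.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix picked out by `rows` and `cols`.

    The residue of an entry is: entry - row mean - column mean + overall mean,
    with all means taken within the submatrix.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])

    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical toy data: the top-left 3x3 block is exactly additive
# (row effect + column effect), so its score should be essentially zero.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [4.0, 5.0, 6.0, 7.0],
    [0.0, 8.0, 2.0, 5.0],
]

block_score = mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1, 2])
whole_score = mean_squared_residue(data, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
print(block_score, whole_score)
```

A low score indicates that the submatrix is well described by additive row and column effects, which is why the coherent block scores far lower than the full matrix.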


Summary

Introduction

Ni et al. (2020) proposed a feature allocation model that can produce overlapping biclusters of patient-disease and symptom-disease, but their approach, based on matrix factorizations, is completely different from ours. Their model is in fact designed for the case of categorical entries in the data matrix. The main novelties of this paper can be summarized as follows: (1) we free the traditional plaid model from the restriction of a pre-specified number of biclusters, while allowing for a wide range of possible sampling models to accommodate various data formats; (2) we define the binary indicator matrices by way of a novel approach based on thresholding a double stick-breaking prior that allows us to provide inference on the number and conformation (i.e., genes and conditions) of possibly overlapping biclusters; and (3) we introduce a penalty prior that controls the size of biclusters.
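
For readers unfamiliar with stick-breaking priors, the sketch below draws weights from a generic truncated stick-breaking construction: each break takes a Beta(1, alpha) fraction of the remaining stick, and the final break takes whatever is left so that the weights sum to one. This is only the standard textbook construction. The double stick-breaking prior and the thresholding step used in the paper are not described in this summary and are not reproduced here; the function name, `alpha`, and the truncation level `K` are assumptions chosen for illustration.

```python
import random


def truncated_stick_breaking(alpha, K, rng):
    """Draw K nonnegative weights that sum to 1.

    Break k takes a Beta(1, alpha) fraction of the stick that remains;
    the last break takes all of it, which truncates the process at K.
    """
    weights = []
    remaining = 1.0
    for k in range(K):
        fraction = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    return weights


rng = random.Random(0)  # fixed seed so the draw is reproducible
weights = truncated_stick_breaking(alpha=2.0, K=10, rng=rng)
print(weights)
```

Smaller values of `alpha` concentrate the mass on the first few weights, which is the usual mechanism by which such priors favor a small number of active components.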

The modeling approach
Sampling model
Hierarchical prior structure
Implementation
Simulation study
Data illustrations
Yeast cell data
Histone modifications data
Findings
Discussion