Single-Cell Transcriptome Data Clustering via Multinomial Modeling and Adaptive Fuzzy K-Means Algorithm.

Liang Chen, Minghua Deng, Yuyao Zhai, Weinan Wang

doi:10.3389/fgene.2020.00295

Open Access

Abstract

Single-cell RNA sequencing technologies have enabled us to study tissue heterogeneity at cellular resolution. Fast-developing platforms such as droplet-based sequencing make it feasible to process thousands of single cells in parallel. Although unique molecular identifiers (UMIs) can remove amplification bias to a certain extent, clustering such sparse, high-dimensional, large-scale discrete data remains challenging. Most existing deep learning-based clustering methods denoise single-cell UMI count data with a mean square error loss or a negative binomial distribution, with or without zero inflation, which may underfit or overfit the gene expression profiles. In addition, neglecting the molecule sampling mechanism and extracting representations by simple linear dimension reduction followed by a hard clustering algorithm may distort the data structure and lead to spurious analytical results. In this paper, we combined the deep autoencoder technique with statistical modeling and developed a novel and effective clustering method, scDMFK, for single-cell transcriptome UMI count data. scDMFK uses a multinomial distribution to characterize the data structure and draws on a neural network to facilitate model parameter estimation. In the learned low-dimensional latent space, we proposed an adaptive fuzzy k-means algorithm with entropy regularization to perform soft clustering. Various simulation scenarios and the analysis of 10 real datasets have shown that scDMFK outperforms other state-of-the-art methods with respect to data modeling and clustering algorithms. In addition, scDMFK scales well to large-scale single-cell datasets.
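To make the idea of soft clustering with entropy regularization more concrete, the following is a minimal NumPy sketch, not the scDMFK implementation. It assumes plain squared Euclidean distances in the latent space (the adaptive loss and the joint training with the autoencoder are not reproduced here), and the function names, the regularization weight `lam`, and the toy data are all hypothetical. Under these assumptions, minimizing the sum of membership-weighted distances plus `lam` times the sum of u log u, with each row of memberships summing to one, gives memberships that are a softmax of the negative distances.

```python
import numpy as np

def soft_memberships(z, centers, lam):
    """Entropy-regularized memberships: u_ik proportional to exp(-d_ik / lam)."""
    # Squared Euclidean distance between every point and every center.
    d = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -d / lam
    # Subtract the row maximum for numerical stability before exponentiating.
    logits -= logits.max(axis=1, keepdims=True)
    u = np.exp(logits)
    return u / u.sum(axis=1, keepdims=True)

def update_centers(z, u):
    """Each center is the membership-weighted mean of the points."""
    return (u.T @ z) / u.sum(axis=0)[:, None]

def fuzzy_kmeans(z, init_centers, lam=1.0, n_iter=50):
    """Alternate membership and center updates for a fixed number of steps."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        u = soft_memberships(z, centers, lam)
        centers = update_centers(z, u)
    return u, centers

# Toy latent representations (hypothetical): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
z = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 2)),
    rng.normal(5.0, 0.3, size=(20, 2)),
])

# Initialize with one point from each blob (an assumption for this demo).
u, centers = fuzzy_kmeans(z, init_centers=z[[0, -1]], lam=1.0)
labels = u.argmax(axis=1)
print(labels)
```

Smaller values of `lam` push the memberships toward hard assignments, while larger values spread each cell's membership more evenly across clusters.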

Highlights

In the past decade, high-throughput sequencing technology has been widely used in various fields of biology and medicine, greatly promoting research in related areas (Reuter et al., 2015).
We defined $p_{ij}$ as the relative abundance of the mRNA expressed by the j-th gene in the total mRNA of the i-th cell, namely $p_{ij} = y_{ij} / \sum_{j'=1}^{m} y_{ij'}$. Considering that $n_i \ll t_i$ and that the true transcript counts $y_{ij}$ are unknown, we supposed that the Unique Molecular Identifier (UMI) counts $X_{ij}$ are samples of $y_{ij}$ with the relative abundances remaining constant; the probability distribution function of $X_i = (X_{i1}, X_{i2}, \ldots, X_{im})$ is then a multinomial distribution with parameter vector $p_i = (p_{i1}, p_{i2}, \ldots, p_{im})$ to be estimated, $f_i(X_i) = \frac{n_i!}{X_{i1}! \, X_{i2}! \cdots X_{im}!} \prod_{j=1}^{m} p_{ij}^{X_{ij}}$.
Having finished the construction of the whole model, we summarize its two components: a denoising autoencoder based on multinomial modeling and fuzzy soft k-means clustering with an adaptive loss.
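As an illustration of the multinomial model in the highlights above, the sketch below evaluates the log of the multinomial probability for a single cell. It is a minimal example rather than the authors' code: it assumes that n_i is the total UMI count of the cell (the sum of its X_ij, as the multinomial form requires), and the function name and the toy counts are hypothetical.

```python
import numpy as np
from math import lgamma

def multinomial_log_likelihood(x, p):
    """Log multinomial probability of UMI counts x given relative abundances p."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    n = x.sum()  # assumed total UMI count n_i of the cell
    # log( n! / (x_1! x_2! ... x_m!) ), computed with log-gamma for stability.
    log_coef = lgamma(n + 1) - sum(lgamma(v + 1) for v in x)
    # Genes with zero counts contribute nothing, even when p_j is zero.
    mask = x > 0
    return log_coef + float(np.sum(x[mask] * np.log(p[mask])))

# Hypothetical true transcript counts y for one cell with m = 4 genes.
y = np.array([500, 300, 200, 0])
p = y / y.sum()              # relative abundances p_ij = y_ij / sum_j y_ij

# Hypothetical observed UMI counts, far fewer than the true transcripts.
x = np.array([5, 3, 2, 0])

ll = multinomial_log_likelihood(x, p)
print(ll, np.exp(ll))
```

In a denoising autoencoder of the kind described here, the negative of this log-likelihood could serve as the reconstruction loss, with the relative abundances produced by the network; the constant coefficient term does not depend on p and can be dropped during optimization.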

Summary

Introduction

High-throughput sequencing technology has been widely used in various fields of biology and medicine, greatly promoting research in related areas (Reuter et al., 2015). In recent years, single-cell transcriptome sequencing technology has developed rapidly, allowing researchers to measure genome-wide expression in individual cells and in turn facilitating research on cell heterogeneity and tissue differentiation (Shapiro et al., 2013; Patel et al., 2014; Kolodziejczyk et al., 2015; Wang and Navin, 2015). Single-cell sequencing technologies such as Smart-seq or MATQ-seq can measure the full length of transcripts, but they have low cell throughput and are somewhat expensive (Picelli et al., 2013; Sheng et al., 2017). Droplet-based sequencing technologies, such as 10x Chromium and Drop-seq, can efficiently profile a large number of cells in parallel in a single experiment (Svensson et al., 2017; Zheng et al., 2017). In this article, we have focused on the research and analysis of single-cell RNA-seq UMI count data.

Methods

Results

Conclusion