Dictionary learning allows model-free pseudotime estimation of transcriptomic data

Mona Rams, Tim O.F. Conrad

doi:10.1186/s12864-021-08276-9

Open Access

Journal: BMC Genomics
Publication Date: Jan 15, 2022
Citations: 2
License type: open-access

Affiliation: Freie Universität Berlin, Zuse Institute Berlin

Abstract

Background: Pseudotime estimation from dynamic single-cell transcriptomic data enables characterisation and understanding of the underlying processes, for example developmental processes. Various pseudotime estimation methods have been proposed in recent years. Typically, these methods start with a dimension reduction step because the low-dimensional representation is usually easier to analyse. Approaches such as PCA, ICA or t-SNE are among the most widely used methods for dimension reduction in pseudotime estimation. However, these methods usually make assumptions about the derived dimensions, which can result in important dataset properties being missed. In this paper, we suggest a new dictionary-learning-based approach, dynDLT, for dimension reduction and pseudotime estimation of dynamic transcriptomic data. Dictionary learning is a matrix factorisation approach that does not restrict the dependence of the derived dimensions. To evaluate its performance, we conduct a large simulation study and analyse 8 real-world datasets.

Results: The simulation studies reveal, firstly, that dynDLT preserves the simulated patterns in the low-dimensional representation and that the pseudotimes can be derived from this representation. Secondly, the results show that dynDLT is suitable for detecting genes that exhibit the simulated dynamic patterns, which facilitates the interpretation of the compressed representation and thus of the dynamic processes. For the real-world data analysis, we select datasets with samples taken at different time points throughout an experiment. The pseudotimes found by dynDLT correlate highly with the experimental times. We compare the results to other approaches that are used in pseudotime estimation or are methodologically closely related to dictionary learning: ICA, NMF, PCA, t-SNE, and UMAP. dynDLT has the best overall performance on the simulated and real-world datasets.

Conclusions: We introduce dynDLT, a method that is suitable for pseudotime estimation. Its main advantages are: (1) it is a model-free approach, meaning that it does not restrict the dependence of the derived dimensions; (2) genes that are relevant in the detected dynamic processes can be identified from the dictionary matrix; (3) by restricting the dictionary entries to positive values, the dictionary atoms become highly interpretable.
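To make the factorisation idea in the abstract concrete, the following is a minimal sketch and not the dynDLT implementation. It assumes a simplified alternating scheme (a few ISTA steps for sparse codes, then a clipped least-squares dictionary update), an L1 penalty `alpha`, and hypothetical toy data in which ten genes rise along a hidden time axis. All function names, parameter values and the rank correlation used for evaluation are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of the L1 norm: shrink every entry towards zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def sparse_codes(X, D, C, alpha, n_steps=20):
    """A few ISTA steps on 0.5 * ||X - C D||^2 + alpha * ||C||_1 with D fixed."""
    lipschitz = max(np.linalg.norm(D @ D.T, 2), 1e-12)
    for _ in range(n_steps):
        grad = (C @ D - X) @ D.T
        C = soft_threshold(C - grad / lipschitz, alpha / lipschitz)
    return C


def project_atoms(D):
    """Clip atoms to non-negative entries and scale each atom to unit length."""
    D = np.clip(D, 0.0, None)
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    return D / np.maximum(norms, 1e-12)


def learn_dictionary(X, n_atoms, alpha=0.1, n_iter=30, seed=0):
    """Factorise X (cells x genes) as C (cells x atoms) @ D (atoms x genes)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(X.shape[0], size=n_atoms, replace=False)
    D = project_atoms(X[start])  # initial atoms: randomly chosen cells
    C = np.zeros((X.shape[0], n_atoms))
    for _ in range(n_iter):
        C = sparse_codes(X, D, C, alpha)
        # Least-squares dictionary update, then projection onto the constraints
        D = project_atoms(np.linalg.lstsq(C, X, rcond=None)[0])
    # Final coding pass so that the codes match the final dictionary
    return sparse_codes(X, D, C, alpha), D


def ranks(v):
    """Simple ranks (ties are not averaged), enough for a rough rank correlation."""
    return np.argsort(np.argsort(v))


# Hypothetical toy data: 60 cells x 30 genes; genes 0-9 increase with hidden time.
rng = np.random.default_rng(1)
n_cells, n_genes = 60, 30
true_time = rng.uniform(0.0, 1.0, size=n_cells)
X = rng.normal(0.0, 0.1, size=(n_cells, n_genes))
X[:, :10] += 3.0 * true_time[:, None]

C, D = learn_dictionary(X, n_atoms=1)
pseudotime = C[:, 0]  # one code coefficient per cell, read as pseudotime

# Compare the pseudotime ordering with the hidden time via a rank correlation.
rho = np.corrcoef(ranks(pseudotime), ranks(true_time))[0, 1]
# Genes with the largest weights in the atom are the ones driving the pattern.
top_genes = np.sort(np.argsort(D[0])[-10:])
print(f"rank correlation with hidden time: {rho:.2f}")
print("genes with the largest atom weights:", top_genes)
```

In this sketch the code matrix `C` plays the role of the low-dimensional representation, and the rows of `D` are the dictionary atoms over genes, so sorting an atom's entries is one simple way to read off the genes associated with a pattern. The direction of a pseudotime axis is in general arbitrary, which is why the absolute value of the correlation is the quantity of interest.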

Highlights

Pseudotime estimation from dynamic single-cell transcriptomic data enables characterisation and understanding of the underlying processes, for example developmental processes
To gain a deep understanding of our method dynDLT for pseudotime estimation of transcriptomic data, a simulation study is performed as a first analysis
The results from dynDLT are compared to results from independent component analysis (ICA), non-negative matrix factorisation (NMF), principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP)

Summary

Introduction

Pseudotime estimation from dynamic single-cell transcriptomic data enables characterisation and understanding of the underlying processes, for example developmental processes. Various pseudotime estimation methods have been proposed in recent years. These methods start with a dimension reduction step because the low-dimensional representation is usually easier to analyse. Approaches such as PCA, ICA or t-SNE are among the most widely used methods for dimension reduction in pseudotime estimation. Gene expression profiling can help to understand the underlying mechanisms, identify key genes in these processes, distinguish and characterise variants of different subgroups, and more. The first approaches to studying this time-dynamic behaviour ordered bulk samples by expression similarity [1,2,3]. The data of each single cell is interpreted as a snapshot of the temporal development.
