Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets.

Alexander Chervov,Andrei Zinovyev,Jonathan Bac

doi:10.3390/e22111274

Abstract

Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods have been proposed to construct such approximating graphs, with some based on computation of minimum spanning trees and some based on principal graphs generalizing principal curves. In this article we propose a methodology to compare and benchmark these two graph-based data approximation approaches, as well as to define their hyperparameters. The main idea is to avoid comparing graphs directly, but at first to induce clustering of the data point cloud from the graph approximation and, secondly, to use well-established methods to compare and score the data cloud partitioning induced by the graphs. In particular, mutual information-based approaches prove to be useful in this context. The induced clustering is based on decomposing a graph into non-branching segments, and then clustering the data point cloud by the nearest segment. Such a method allows efficient comparison of graph-based data approximations of arbitrary topology and complexity. The method is implemented in Python using the standard scikit-learn library which provides high speed and efficiency. As a demonstration of the methodology we analyse and compare graph-based data approximation methods using synthetic as well as real-life single cell datasets.

Highlights

Graph theory based methods play a significant role in modern data science and its applications to various fields of science, including bioinformatics
In order to benchmark the graph-based data approximation (GBDA) methods, a standard strategy consists in creating synthetic datasets with known underlying ground truth graph topology as a generative model, applying the methods and comparing how far the reconstructions are from the ground truth graph model
In the paper we proposed an improvement for benchmarking and parameter tuning for graph-based data approximation methods

Summary

Introduction

Graph theory based methods play a significant role in modern data science and its applications to various fields of science, including bioinformatics. One of the applications of graphs in unsupervised machine learning is for dimensionality reduction, where the geometrical structure of a data point cloud is approximated by a system of nodes embedded in the data space and connected by edges in a more or less complex graph. This graph can represent a regular grid as in Self-Organizing. Maps (SOM) [1], Elastic Maps (ElMap) [2,3] or principal curve approaches [4,5] In this case, the graph can serve as a base for reconstructing a low-dimensional principal manifold [3,6]. Some large-scale clinical or single cell omics datasets can be modeled as a set of diverging and bifurcating trajectories connecting a root state

Methods

Results

Conclusion