Enhancing Understandability of Omics Data with SHAP, Embedding Projections and Interactive Visualisations

Zhonglin Qu, Quang Vinh Nguyen, Daniel R Catchpoole, Simeon J Simoff, Yezihalem Tegegne, Paul J Kennedy

doi:10.1007/978-981-19-8746-5_5

Abstract

Uniform Manifold Approximation and Projection (UMAP) is a new and effective non-linear dimensionality reduction (DR) method that has recently been applied in biomedical informatics analysis. However, UMAP’s data transformation process is complicated and lacks transparency. Principal component analysis (PCA), by contrast, is a conventional and essential DR method for analysing single-cell datasets, and its projection is linear and easy to interpret. UMAP is more scalable and accurate, but its complex algorithm makes it challenging to earn users’ trust. Another challenge is that some single-cell datasets have so many dimensions that the computation becomes inefficient and less accurate. This paper uses linked, interactive visualisations to help users understand UMAP results by comparing them with PCA results. An explainable machine learning method, SHapley Additive exPlanations (SHAP) applied to a Random Forest (RF) model, is used to optimise the input single-cell data so that the UMAP and PCA processes run more efficiently. We demonstrate that this approach can be applied to high-dimensional omics data exploration to visually validate informative molecular markers and cell populations identified in the UMAP-reduced space.
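
The idea of optimising the input before dimensionality reduction can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: "optimising" is read here as keeping only the top-ranked features; the Random Forest's built-in impurity-based importances stand in for SHAP values (which would need the separate shap package); only PCA is shown (UMAP would need the umap-learn package); and the data are synthetic. The helper name `select_top_features` and the value k=20 are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier


def select_top_features(X, y, k, seed=0):
    """Rank features with a Random Forest and return the indices of the top k.

    Assumption: impurity-based importances are used as a stand-in for the
    mean absolute SHAP value per feature.
    """
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    # Sort feature indices from most to least important.
    ranking = np.argsort(forest.feature_importances_)[::-1]
    return ranking[:k]


# Synthetic stand-in for a cells x genes matrix with cell-type labels
# (not real omics data).
X, y = make_classification(
    n_samples=300,
    n_features=500,
    n_informative=10,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Keep only the 20 highest-ranked features (k is an arbitrary choice here).
top = select_top_features(X, y, k=20)
X_reduced = X[:, top]

# Project the reduced matrix to two dimensions for plotting.
embedding = PCA(n_components=2, random_state=0).fit_transform(X_reduced)
print(X.shape, "->", X_reduced.shape, "->", embedding.shape)
```

The same `X_reduced` matrix could be passed to a UMAP implementation in place of PCA, so that both projections are computed from an identical, smaller input and can be compared side by side.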

Full Text