A Practical Guide to Sparse k-Means Clustering for Studying Molecular Development of the Human Brain.

Justin L Balsor, Desmond Singh, Ewalina Jeyanesan, Kathryn M Murphy, Jonathan Zaslavsky, Rachel Kwan, Keon Arbabi

doi:10.3389/fnins.2021.668293

Abstract

Studying the molecular development of the human brain presents unique challenges for selecting a data analysis approach. The rare and valuable nature of human postmortem brain tissue, especially for developmental studies, means that sample sizes (n) are small, while high-throughput genomic and proteomic methods measure the expression levels of hundreds or thousands of variables [e.g., genes or proteins (p)] in each sample. This leads to a data structure that is high dimensional (p ≫ n) and introduces the curse of dimensionality, which poses a challenge for traditional statistical approaches. In contrast, high dimensional analyses, especially cluster analyses developed for sparse data, have worked well for analyzing genomic datasets where p ≫ n. Here we explore applying a lasso-based clustering method developed for high dimensional genomic data with small sample sizes. Using protein and gene data from the developing human visual cortex, we compared clustering methods. We identified an application of sparse k-means clustering [robust sparse k-means clustering (RSKC)] that partitioned samples into age-related clusters that reflect lifespan stages from birth to aging. RSKC adaptively selects a subset of the genes or proteins that contribute to partitioning samples into age-related clusters that progress across the lifespan. This approach addresses a problem in current studies, which could not identify multiple postnatal clusters. Moreover, each cluster encompassed a range of ages, like a series of overlapping waves, illustrating that chronological age and brain age have a complex relationship. In addition, a recently developed workflow to create plasticity phenotypes (Balsor et al., 2020) was applied to the clusters and revealed neurobiologically relevant features that identified how the human visual cortex changes across the lifespan.
These methods can help address the growing demand for multimodal integration, from molecular machinery to brain imaging signals, to understand the human brain’s development.

Highlights

As molecular tools have become integrated with human neuroscience, there has been a renewed interest in mapping human brain development
The optimization of sparse k-means cluster analysis (RSKC) for small sample sizes provides another approach for analyzing the human brain’s molecular development that is sensitive to the subtle molecular changes that occur across the postnatal lifespan
The current study shows that the application of sparse clustering leverages the high dimensional nature of proteomic and transcriptomic data from human brain development to find age-related clusters that are spread across the lifespan

Summary

Introduction

As molecular tools have become integrated with human neuroscience, there has been a renewed interest in mapping human brain development. Other areas of human neuroscience are applying data-driven approaches such as principal component analysis (PCA) (Bray, 2017) or unsupervised clustering (Lebenberg et al., 2018) to identify age-related changes in brain development. Applying cluster analysis to studying the molecular development of the human brain is challenging because of the limited availability of developmental postmortem tissue samples. We apply one of those approaches, sparse k-means clustering (Witten and Tibshirani, 2010; Kondo et al., 2016), to illustrate a data-driven approach for studying brain development that uses the expression of many genes or proteins to partition samples into age-related clusters. We show that clustering can identify aspects of human visual cortex development that are not apparent in typical developmental ontologies.
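To make the idea of lasso-based sparse clustering concrete, the following is a minimal Python sketch of the sparse k-means iteration in the style of Witten and Tibshirani (2010). It is an illustration under stated assumptions, not the RSKC implementation used in this study: it omits the trimming step that makes RSKC robust to outliers, it uses a plain Lloyd's k-means written with NumPy, and the synthetic data, the function names, and the value of the L1 bound `s` are all hypothetical choices for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns an integer cluster label per sample."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # copy of k rows
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # leave an empty cluster's center as is
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def bcss_per_feature(X, labels):
    """Between-cluster sum of squares, computed separately for each feature."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return np.maximum(total - within, 0.0)

def update_weights(bcss, s, n_bisect=40):
    """Maximise sum_j w_j * bcss_j with ||w||_2 <= 1, ||w||_1 <= s, w >= 0."""
    w = bcss / np.linalg.norm(bcss)
    if w.sum() <= s:
        return w  # the L1 bound is not active, so no thresholding is needed
    # Fallback that is always feasible for s >= 1: all weight on one feature.
    best = np.zeros_like(bcss)
    best[bcss.argmax()] = 1.0
    lo, hi = 0.0, bcss.max()
    for _ in range(n_bisect):  # bisection on the soft-threshold level
        delta = (lo + hi) / 2
        w = np.maximum(bcss - delta, 0.0)
        w = w / np.linalg.norm(w)
        if w.sum() > s:
            lo = delta
        else:
            hi, best = delta, w
    return best

def sparse_kmeans(X, k, s, n_outer=10, seed=0):
    """Alternate between clustering weighted data and updating the weights."""
    p = X.shape[1]
    w = np.full(p, 1.0 / np.sqrt(p))  # start with equal feature weights
    for _ in range(n_outer):
        labels = kmeans(X * np.sqrt(w), k, seed=seed)
        w = update_weights(bcss_per_feature(X, labels), s)
    return labels, w

# Synthetic example (assumed, not study data): 30 samples, 100 features,
# of which only the first 10 differ between three groups of samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))
truth = np.repeat([0, 1, 2], 10)
X[:, :10] += 6.0 * truth[:, None]

s = 3.0  # hypothetical L1 bound; smaller values give sparser weights
labels, w = sparse_kmeans(X, k=3, s=s)
print("features with non-zero weight:", int((w > 0).sum()), "of", X.shape[1])
```

The L1 bound `s` controls how many features receive non-zero weight; in practice it is a tuning parameter that would need to be chosen for each dataset, for example by comparing results across a range of values.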

Methods

Results

Discussion

Conclusion