Lag penalized weighted correlation for time series clustering

Thevaa Chandereng, Anthony Gitter

doi:10.1186/s12859-019-3324-1

Open Access

Abstract

Background: The similarity or distance measure used for clustering can generate intuitive and interpretable clusters when it is tailored to the unique characteristics of the data. In time series datasets generated with high-throughput biological assays, measurements such as gene expression levels or protein phosphorylation intensities are collected sequentially over time, and the similarity score should capture this special temporal structure.

Results: We propose a clustering similarity measure called Lag Penalized Weighted Correlation (LPWC) to group pairs of time series that exhibit closely related behaviors over time, even if the timing is not perfectly synchronized. LPWC aligns time series profiles to identify common temporal patterns. It down-weights aligned profiles based on the length of the temporal lags that are introduced. We demonstrate the advantages of LPWC versus existing time series and general clustering algorithms. In a simulated dataset based on the biologically motivated impulse model, LPWC is the only method to recover the true clusters for almost all simulated genes. LPWC also identifies clusters with distinct temporal patterns in our yeast osmotic stress response and axolotl limb regeneration case studies.

Conclusions: LPWC achieves both of its time series clustering goals. It groups time series with correlated changes over time, even if those patterns occur earlier or later in some of the time series. In addition, it refrains from introducing large shifts in time when searching for temporal patterns by applying a lag penalty. The LPWC R package is available at https://github.com/gitter-lab/LPWC and CRAN under an MIT license.

Highlights

The similarity or distance measure used for clustering can generate intuitive and interpretable clusters when it is tailored to the unique characteristics of the data
The goal of Lag Penalized Weighted Correlation (LPWC) is to group genes that have similar shapes in their expression levels over time
Because the aligned time series can pair measurements that are temporally far apart, LPWC weights the pairs of timepoints to give stronger consideration to those that are close in time
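
The two ideas in the highlights above, penalizing the lag and weighting pairs of timepoints by how close they are in time, can be illustrated with a small sketch. This is not the LPWC R package's implementation: the function names, the Gaussian form of the timepoint weights, the exponential form of the lag penalty, and the `penalty` and `bandwidth` parameters are all assumptions chosen for illustration.

```python
import math
from typing import Sequence, Tuple


def weighted_corr(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    if total == 0:
        return 0.0
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / math.sqrt(vx * vy)


def lag_penalized_score(x: Sequence[float], y: Sequence[float],
                        times: Sequence[float], lag: int,
                        penalty: float = 0.1, bandwidth: float = 5.0) -> float:
    """Score the alignment that pairs x[i] with y[i + lag].

    Assumed forms (hypothetical, for illustration only):
      weight of a pair = exp(-((t_j - t_i) / bandwidth)^2)
      lag penalty      = exp(-penalty * lag^2)
    """
    n = len(x)
    pairs = [(i, i + lag) for i in range(n) if 0 <= i + lag < n]
    if len(pairs) < 3:
        return 0.0  # too few overlapping measurements to correlate
    xs = [x[i] for i, _ in pairs]
    ys = [y[j] for _, j in pairs]
    # Pairs of measurements taken close together in time get larger weights.
    ws = [math.exp(-((times[j] - times[i]) / bandwidth) ** 2) for i, j in pairs]
    return math.exp(-penalty * lag ** 2) * weighted_corr(xs, ys, ws)


def best_lag(x: Sequence[float], y: Sequence[float], times: Sequence[float],
             max_lag: int, penalty: float = 0.1) -> Tuple[int, float]:
    """Return the (lag, score) pair with the highest penalized score."""
    lags = range(-max_lag, max_lag + 1)
    scores = {k: lag_penalized_score(x, y, times, k, penalty) for k in lags}
    k_best = max(scores, key=scores.get)
    return k_best, scores[k_best]


# Toy profiles: y shows the same pulse as x, one timepoint later.
x = [0, 0, 1, 3, 1, 0, 0, 0]
y = [0, 0, 0, 1, 3, 1, 0, 0]
times = list(range(8))

print(best_lag(x, y, times, max_lag=2, penalty=0.1))
print(best_lag(x, y, times, max_lag=2, penalty=2.0))
```

With the small penalty, the sketch selects a lag of 1 because the shifted profiles match exactly. With the large penalty, the unshifted alignment wins even though its correlation is lower, which mirrors the stated goal of refraining from large shifts in time.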

Summary

Introduction

The similarity or distance measure used for clustering can generate intuitive and interpretable clusters when it is tailored to the unique characteristics of the data. In time series datasets generated with high-throughput biological assays, measurements such as gene expression levels or protein phosphorylation intensities are collected sequentially over time, and the similarity score should capture this special temporal structure. Time series data are collected extensively to study complex and dynamic biological systems [1, 2]. Many time series clustering algorithms have been introduced to understand the dynamics of biological processes. Some of these clustering approaches are hierarchical, iteratively merging small clusters or dividing large clusters. Others partition entities into clusters, which often requires specifying the number of clusters in advance.

Chandereng and Gitter, BMC Bioinformatics (2020) 21:21

Methods

Results

Discussion

Conclusion