Application of dynamic topic models to toxicogenomics data.

Mikyung Lee, Weida Tong, Zhichao Liu, Ruili Huang

doi:10.1186/s12859-016-1225-0

Abstract

Background: All biological processes are inherently dynamic. Biological systems evolve transiently or sustainably over sequential time points after perturbation by environmental insults, drugs and chemicals. Investigating the temporal behavior of molecular events is important for understanding the mechanisms that govern a biological system's response to perturbations such as drug treatment. The intrinsic complexity of time-series data requires appropriate computational algorithms for data interpretation. In this study, we propose, for the first time, the application of dynamic topic models (DTM) for analyzing time-series gene expression data.

Results: A large time-series toxicogenomics dataset was studied. It contains over 3144 microarrays of gene expression data corresponding to rat livers treated with 131 compounds (most are drugs) at two doses (control and high dose) in a repeated schedule containing four separate time points (4, 8, 15 and 29 days). We analyzed, with DTM, the topics (each consisting of a set of genes) and their biological interpretations over these four time points. We identified hidden patterns embedded in these time-series gene expression profiles. From the topic distribution for each compound-time condition, a number of drugs were successfully clustered by their shared mode of action, such as PPARα agonists and COX inhibitors. The biological meaning underlying each topic was interpreted using diverse sources of information, such as functional analysis of the pathways and therapeutic uses of the drugs. Additionally, we found that sample clusters produced by DTM are much more coherent in terms of functional categories than those produced by traditional clustering algorithms.

Conclusions: We demonstrated that DTM, a text mining technique, can be a powerful computational approach for clustering time-series gene expression profiles with a probabilistic representation of their dynamic features along sequential time frames. The method offers an alternative way of uncovering hidden patterns embedded in time-series gene expression profiles to gain an enhanced understanding of the dynamic behavior of gene regulation in the biological system.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1225-0) contains supplementary material, which is available to authorized users.

Highlights

All biological processes including perturbation-responses are inherently dynamic
This study consists of several steps: (1) generating documents of differentially expressed gene (DEG) lists for each compound at each time point; (2) building a generative probabilistic model using dynamic topic models (DTM) to maximize the posterior probability of the observed temporal DEGs; (3) assigning the topic with the largest conditional probability to each compound-time condition; (4) ranking DEGs by their conditional probability under each topic and assessing topic evolution over time; and (5) analyzing the topics in their biological context
This conditional probability can be used to assess the association between a specific condition and a specific topic. We used it to group the conditions by connecting them with topics. The results are provided in Additional file 1: Table S1, which includes Mode of Action (MoA) and therapeutic category information for the 131 drugs
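
The document-building, topic-assignment and gene-ranking steps listed above can be sketched on toy data. This is a minimal illustration, not the authors' implementation: the fold-change cutoff, the compound, gene and topic names, and all probability values are hypothetical, and the DTM fitting step itself is not shown (its outputs are replaced by hard-coded placeholder tables).

```python
# Toy sketch of steps (1), (3) and (4). Every name and number is hypothetical.

def build_documents(fold_changes, threshold=1.5):
    """Step (1): one 'document' of DEG names per (compound, day) condition.

    fold_changes maps (compound, day) -> {gene: fold change}. A gene counts as
    differentially expressed if its fold change is >= threshold or
    <= 1/threshold. The cutoff is an assumption, not taken from the paper.
    """
    docs = {}
    for condition, genes in fold_changes.items():
        docs[condition] = sorted(
            gene for gene, fc in genes.items()
            if fc >= threshold or fc <= 1.0 / threshold
        )
    return docs


def assign_topics(topic_given_condition):
    """Step (3): pick, per condition, the topic with the largest probability."""
    return {
        condition: max(probs, key=probs.get)
        for condition, probs in topic_given_condition.items()
    }


def group_by_topic(assignments):
    """Group compound-time conditions that share the same assigned topic."""
    groups = {}
    for condition, topic in assignments.items():
        groups.setdefault(topic, []).append(condition)
    return groups


def rank_genes(gene_given_topic, top_n=3):
    """Step (4): rank genes within one topic by conditional probability."""
    ranked = sorted(gene_given_topic.items(), key=lambda kv: kv[1], reverse=True)
    return [gene for gene, _ in ranked[:top_n]]


# Made-up fold changes for one compound at two of the four time points.
fold_changes = {
    ("compound_1", 4): {"gene_a": 2.3, "gene_b": 1.1, "gene_c": 0.4},
    ("compound_1", 29): {"gene_a": 3.0, "gene_b": 1.8, "gene_c": 0.9},
}
documents = build_documents(fold_changes)

# Placeholder for what a fitted topic model would provide (step 2 not shown).
topic_given_condition = {
    ("compound_1", 4): {"topic_1": 0.7, "topic_2": 0.3},
    ("compound_1", 29): {"topic_1": 0.2, "topic_2": 0.8},
    ("compound_2", 4): {"topic_1": 0.9, "topic_2": 0.1},
}
groups = group_by_topic(assign_topics(topic_given_condition))

print(documents)
print(groups)
print(rank_genes({"gene_a": 0.5, "gene_b": 0.2, "gene_c": 0.3}, top_n=2))
```

Keeping the condition key as a (compound, day) pair means the same compound can be assigned to different topics at different time points, which is one way the grouping of conditions by topic could be read.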

Summary

Introduction

All biological processes, including responses to perturbation, are inherently dynamic. Investigating the temporal behavior of these dynamic processes is an important part of biological research. Even if we permute the order of the time points, the results of these algorithms would not change. Another drawback of these approaches is that genes are treated as mutually exclusive with respect to their involvement in the biological processes responding to exposure. Schliep introduced the hidden Markov model, widely used in speech recognition, to account for time dependencies along the sequential timeline of time-series gene expression data [3]. These algorithms have been under constant improvement [4,5,6]. CAGED applies regression analysis to cluster genes on the basis of their trajectories over multiple time points, while STEM first defines a set of representative temporal profiles and then assigns each gene to one of these predefined temporal trajectories. We propose, for the first time, the application of dynamic topic models (DTM) for analyzing time-series gene expression data.
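
The remark about permuting time points can be illustrated with a small check. The passage does not name the algorithms it refers to, so this sketch assumes clustering driven by pairwise Euclidean distances between expression profiles (as in common k-means or hierarchical clustering setups); the gene names and expression values are made up.

```python
import math

# Assumption: clustering depends only on pairwise Euclidean distances between
# gene expression profiles. Shuffling the time-point columns identically for
# every gene leaves those distances, and hence the clusters, unchanged.


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(profiles):
    """Distances for every unordered pair of genes, keyed by sorted names."""
    names = sorted(profiles)
    return {
        (p, q): euclidean(profiles[p], profiles[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


def permute_time_points(profiles, order):
    """Apply the same reordering of time points to every profile."""
    return {name: [values[i] for i in order] for name, values in profiles.items()}


# Made-up expression values at days 4, 8, 15 and 29.
profiles = {
    "gene_a": [0.1, 0.9, 1.8, 2.5],
    "gene_b": [2.4, 1.7, 1.0, 0.2],
    "gene_c": [0.0, 1.0, 2.0, 2.4],
}
shuffled = permute_time_points(profiles, order=[2, 0, 3, 1])

original = pairwise_distances(profiles)
after = pairwise_distances(shuffled)

# The temporal order carries no information for a purely distance-based method.
assert all(math.isclose(original[pair], after[pair]) for pair in original)
```

A method that models transitions between consecutive time points would not pass this check in general, because reordering the columns changes which values are adjacent.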

Objectives

Methods

Results

Conclusion