Topic modeling for cluster analysis of large biological and medical datasets.

Weizhong Zhao,Wen Zou,James J Chen

doi:10.1186/1471-2105-15-s11-s11

Abstract

BackgroundThe big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets.ResultsIn this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths.ConclusionTopic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.

Highlights

The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets
The Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) database is the primary database for post-marketing safety surveillance of all approved drugs and therapeutic biologic products
In this study, three large datasets representing different types of large biological or medical were selected to illustrate the applications of topic modeling for cluster analysis

Summary

Introduction

The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Identification of pattern and structure among a large number of samples and/or the associated variables requires the development of more powerful statistical methods and data mining techniques. The FDA Adverse Event Reporting System (FAERS) database is the primary database for post-marketing safety surveillance of all approved drugs and therapeutic biologic products. The FAERS database consists of over 5,000 drugs and over 16,000 adverse events reported. Data mining methods have been proposed to detect signals of unexpected occurrences in FAERS [10,11,12]

Methods

Results

Discussion

Conclusion