Scalable non-negative matrix tri-factorization

Marinka žitnik,Blaž Zupan,Andrej Čopar

doi:10.1186/s13040-017-0160-6

Abstract

BackgroundMatrix factorization is a well established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining. Matrix factorization learns a latent data model that takes a data matrix and transforms it into a latent feature space enabling generalization, noise removal and feature discovery. However, factorization algorithms are numerically intensive, and hence there is a pressing challenge to scale current algorithms to work with large datasets. Our focus in this paper is matrix tri-factorization, a popular method that is not limited by the assumption of standard matrix factorization about data residing in one latent space. Matrix tri-factorization solves this by inferring a separate latent space for each dimension in a data matrix, and a latent mapping of interactions between the inferred spaces, making the approach particularly suitable for biomedical data mining.ResultsWe developed a block-wise approach for latent factor learning in matrix tri-factorization. The approach partitions a data matrix into disjoint submatrices that are treated independently and fed into a parallel factorization system. An appealing property of the proposed approach is its mathematical equivalence with serial matrix tri-factorization. In a study on large biomedical datasets we show that our approach scales well on multi-processor and multi-GPU architectures. On a four-GPU system we demonstrate that our approach can be more than 100-times faster than its single-processor counterpart.ConclusionsA general approach for scaling non-negative matrix tri-factorization is proposed. The approach is especially useful parallel matrix factorization implemented in a multi-GPU environment. We expect the new approach will be useful in emerging procedures for latent factor analysis, notably for data integration, where many large data matrices need to be collectively factorized.

Highlights

Matrix factorization is a well established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining
Non-negative matrix factorization considers as input a data matrix X and learns two latent factors, U and V, such that their product UV approximates X, X ≈ UV, under some criterion of approximation error
We develop a principled mathematical approach and an algorithmic solution to latent factor learning for non-negative matrix tri-factorization

Summary

Introduction

Matrix factorization is a well established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining. Dimensionality reduction approaches address challenges in modern biomedical data analytics by learning useful projections of data into a smaller, compact and pattern-rich latent space. One class of non-negative matrix factorization approaches is non-negative matrix tri-factorization, which extends a two-factor model by introducing a third latent factor S, such that X ≈ USVT [11] This representation is more appropriate for non-square data because it explicitly models data interactions through a latent factor S [12]. Several optimization techniques for parallel non-negative (two-factor) matrix factorization have recently been proposed [13,14,15] These techniques first partition matrix X into blocks and exploit the block-matrix multiplication when learning U and V. Such a straightforward approach does not apply to matrix trifactorization because, as we show in the Methods section, the learning of any block of U and V depends on factor S

Methods

Results

Discussion

Conclusion