Abstract

Matrix tri-factorization subject to binary constraints is a versatile and powerful framework for the simultaneous clustering of observations and features, also known as biclustering. Applications for biclustering encompass the clustering of high-dimensional data and explorative data mining, where the selection of the most important features is relevant. Unfortunately, due to the lack of suitable methods for the optimization subject to binary constraints, the powerful framework of biclustering is typically constrained to clusterings which partition the set of observations or features. As a result, overlap between clusters cannot be modelled and every item, even outliers in the data, have to be assigned to exactly one cluster. In this paper we propose Broccoli, an optimization scheme for matrix factorization subject to binary constraints, which is based on the theoretically well-founded optimization scheme of proximal stochastic gradient descent. Thereby, we do not impose any restrictions on the obtained clusters. Our experimental evaluation, performed on both synthetic and real-world data, and against 6 competitor algorithms, show reliable and competitive performance, even in presence of a high amount of noise in the data. Moreover, a qualitative analysis of the identified clusters shows that Broccoli may provide meaningful and interpretable clustering structures.

Highlights

  • In the field of clustering, and more generally in data mining, one of the biggest open problems is the optimization subject to binary constraints

  • Binary constraints arise from a request for interpretable and definite data mining results: is a picture showing a cat? Should a movie be recommended to this user? Should the chess move be this one? Binary results provide definite yes or no answers to the questions arising when solving data mining tasks

  • For the movie genre case, and for movie recommendation in general, the exclusivity assumption is inappropriate: there is no one-to-one relation between genres and groups of movies, or between groups of users and movies. This scenario provides an ideal motivation for the fundamental contribution of the biclustering task: the simultaneous clustering of users and movies, where a bicluster is a selection of users and movies, such that the users give similar ratings for the movies in the group, and the movies have similar ratings from the users in the group

Read more

Summary

Introduction

In the field of clustering, and more generally in data mining, one of the biggest open problems is the optimization subject to binary constraints. We are assuming that if a picture shows a cat, it cannot show a dog; if a movie is assigned to one cluster (e.g., a genre), it cannot belong to another cluster (i.e., to another genre); there should be only one possible chess move, which is the optimal one From these examples, we can observe that the exclusivity assumption may make sense or not depending on the specific application. This motivation only works for the philosophy behind biclustering; many algorithms for biclustering impose the exclusivity constraint, even though it is no necessarily required by the task definition Provided that they are not constrained by such an exclusivity assumption, biclusters could, e.g., represent a group of science-fiction fans and science-fiction movies. Given a data matrix D ∈ Rm×n, there exist suitable numerical methods to optimize a nonnegative matrix tri-factorization: min D − Y C X 2

Objectives
Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call