CoClust: A Python Package for Co-Clustering

François Role, Stanislas Morbieu, Mohamed Nadif

doi:10.18637/jss.v088.i07

Abstract

Co-clustering (also known as biclustering) is an important extension of cluster analysis since it allows objects and features in a matrix to be grouped simultaneously, resulting in row and column clusters that are both more accurate and easier to interpret. This paper presents the theory underlying several effective diagonal and non-diagonal co-clustering algorithms, and describes CoClust, a package which provides implementations for these algorithms. The quality of the results produced by the implemented algorithms is demonstrated through extensive tests performed on datasets of various sizes and balance. CoClust has been designed to complement and easily interface with popular Python machine learning libraries such as scikit-learn.
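As a rough illustration of what grouping objects and features simultaneously means, the sketch below takes a toy document-term count matrix together with one cluster label per row and per column, and rearranges the matrix so that the co-clusters appear as blocks. The data, the labels, and the helper names `reorder` and `block_sums` are all invented for this example; the labels stand in for the output of some co-clustering algorithm, and nothing here reflects the CoClust API.

```python
from collections import defaultdict

# Toy document-term count matrix (rows: documents, columns: terms).
# Values are invented for illustration only.
X = [
    [3, 0, 2, 0],  # doc 0
    [0, 4, 0, 1],  # doc 1
    [2, 0, 5, 0],  # doc 2
    [0, 3, 0, 2],  # doc 3
]

# Hypothetical result of a co-clustering algorithm:
# one cluster label per row and one per column.
row_labels = [0, 1, 0, 1]
col_labels = [0, 1, 0, 1]


def reorder(matrix, row_labels, col_labels):
    """Permute rows and columns so that members of a cluster are adjacent."""
    row_order = sorted(range(len(row_labels)), key=lambda i: row_labels[i])
    col_order = sorted(range(len(col_labels)), key=lambda j: col_labels[j])
    return [[matrix[i][j] for j in col_order] for i in row_order]


def block_sums(matrix, row_labels, col_labels):
    """Total count inside each (row cluster, column cluster) block."""
    sums = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            sums[(row_labels[i], col_labels[j])] += value
    return dict(sums)


reordered = reorder(X, row_labels, col_labels)
sums = block_sums(X, row_labels, col_labels)

for row in reordered:
    print(row)
print(sums)
```

With these labels, all nonzero counts of the reordered matrix fall in the two diagonal blocks, which is the kind of structure a diagonal co-clustering is meant to reveal: each document cluster is paired with the term cluster that characterizes it.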

Highlights

Since the seminal work of Hartigan (1972), co-clustering has found applications in many areas such as bio-informatics (Cheng and Church 2000; Madeira and Oliveira 2004; Tanay, Sharan, and Shamir 2005; Cho and Dhillon 2008; Gupta and Aggarwal 2010; Hanczar and Nadif 2011, 2012), web mining (Xu, Zong, Dolog, and Zhang 2010; Charrad, Lechevallier, Ahmed, and Saporta 2009; George and Merugu 2005; Deodhar and Ghosh 2010), and text mining (Dhillon 2001; Dhillon, Mallela, and Modha 2003), and various co-clustering algorithms have been proposed over the years. Recent surveys can be found in Freitas, Ayadi, Elloumi, Oliveira, and Hao (2012), Eren, Deveci, Kucuktunc, and Catalyurek (2013), and Henriques, Antunes, and Madeira (2015).

Summary

Introduction

In the era of data science, clustering various kinds of objects (documents, genes, customers) has become a key activity, and high-quality implementations are provided for this purpose by many popular packages such as stats (R Core Team 2013), skmeans (Hornik, Feinerer, Kober, and Buchta 2012), kernlab (Karatzoglou, Smola, Hornik, and Zeileis 2004), NbClust (Charrad, Ghazzali, Boiteau, and Niknafs 2014), Cluto (Karypis 2003), scikit-learn (Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, and Duchesnay 2011), SciPy (scipy.cluster module) (Jones, Oliphant, and Peterson 2001–), nltk (nltk.cluster module) (Bird, Klein, and Loper 2009), Weka (Hall, Frank, Holmes, Pfahringer, Reutemann, and Witten 2009), etc. A natural extension of standard cluster analysis is co-clustering, where objects and features are simultaneously grouped into meaningful blocks called co-clusters or biclusters, making large data sets easier to handle and interpret. While quite a large number of implementations of co-clustering algorithms have been developed for gene expression data, such as biclust (Kaiser and Leisch 2008), bicat (Barkow, Bleuler, Prelic, Zimmermann, and Zitzler 2006) and bibench (Eren et al. 2013), far fewer implementations are available for co-clustering co-occurrence matrices, such as the document-term matrices used in text mining applications. The CoClust package presented in this paper provides implementations of algorithms designed to efficiently handle such matrices. Depending on the method used, algorithms for co-clustering co-occurrence matrices can broadly be divided into several categories:



Journal: Journal of Statistical Software	Publication Date: Jan 1, 2019
Citations: 28	License type: cc-by


