Abstract

Co-clustering (also known as biclustering) is an important extension of cluster analysis because it simultaneously groups the objects and features of a matrix, resulting in row and column clusters that are both more accurate and easier to interpret. This paper presents the theory underlying several effective diagonal and non-diagonal co-clustering algorithms, and describes CoClust, a package that provides implementations of these algorithms. The quality of the results produced by the implemented algorithms is demonstrated through extensive tests performed on datasets of various sizes and balance. CoClust has been designed to complement and easily interface with popular Python machine learning libraries such as scikit-learn.

Highlights

  • Since the seminal work of Hartigan (1972), co-clustering has found applications in many areas, such as bioinformatics (Cheng and Church 2000; Madeira and Oliveira 2004; Tanay, Sharan, and Shamir 2005; Cho and Dhillon 2008; Gupta and Aggarwal 2010; Hanczar and Nadif 2011, 2012), web mining (Xu, Zong, Dolog, and Zhang 2010; Charrad, Lechevallier, Ahmed, and Saporta 2009; George and Merugu 2005; Deodhar and Ghosh 2010), and text mining (Dhillon 2001; Dhillon, Mallela, and Modha 2003), and various co-clustering algorithms have been proposed over the years. Recent surveys can be found in Freitas, Ayadi, Elloumi, Oliveira, and Hao (2012), Eren, Deveci, Kucuktunc, and Catalyurek (2013), and Henriques, Antunes, and Madeira (2015).

Introduction

In the era of data science, clustering various kinds of objects (documents, genes, customers) has become a key activity, and high-quality implementations are provided for this purpose by many popular packages such as stats (R Core Team 2013), skmeans (Hornik, Feinerer, Kober, and Buchta 2012), kernlab (Karatzoglou, Smola, Hornik, and Zeileis 2004), NbClust (Charrad, Ghazzali, Boiteau, and Niknafs 2014), Cluto (Karypis 2003), scikit-learn (Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, and Duchesnay 2011), SciPy (scipy.cluster module) (Jones, Oliphant, and Peterson 2001–), nltk (nltk.cluster module) (Bird, Klein, and Loper 2009), Weka (Hall, Frank, Holmes, Pfahringer, Reutemann, and Witten 2009), etc. A natural extension of standard cluster analysis is co-clustering, where objects and features are simultaneously grouped into meaningful blocks called co-clusters or biclusters, making large data sets easier to handle and interpret. While a fairly large number of implementations of co-clustering algorithms have been developed for gene expression data, such as biclust (Kaiser and Leisch 2008), bicat (Barkow, Bleuler, Prelic, Zimmermann, and Zitzler 2006) and bibench (Eren et al. 2013), far fewer implementations are available for co-clustering co-occurrence matrices, such as the document-term matrices used in text mining applications. The CoClust package presented in this paper provides implementations of algorithms designed to efficiently handle such matrices. Depending on the method used, algorithms for co-clustering co-occurrence matrices can broadly be divided into several categories:
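To make the notion of a co-cluster from the paragraph above concrete, the following minimal sketch takes a toy document-term matrix together with a row partition and a column partition, then reorders the matrix and sums each block. Everything in it is an assumption made for illustration: the counts, the two label vectors, and the helper names block_sums and reorder are hypothetical, the labels are assigned by hand rather than computed by any algorithm, and none of it is part of the CoClust API.

```python
# Toy document-term matrix: rows are documents, columns are terms.
# The counts and both label vectors are made up for illustration only.
X = [
    [3, 0, 2, 0],
    [0, 4, 0, 1],
    [2, 0, 5, 0],
    [0, 3, 1, 2],
]
row_labels = [0, 1, 0, 1]  # assumed cluster of each document
col_labels = [0, 1, 0, 1]  # assumed cluster of each term


def block_sums(matrix, row_labels, col_labels):
    """Sum the entries of each (row cluster, column cluster) block."""
    n_row = max(row_labels) + 1
    n_col = max(col_labels) + 1
    sums = [[0] * n_col for _ in range(n_row)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            sums[row_labels[i]][col_labels[j]] += value
    return sums


def reorder(matrix, row_labels, col_labels):
    """Permute rows and columns so members of a cluster are adjacent."""
    row_order = sorted(range(len(matrix)), key=lambda i: row_labels[i])
    col_order = sorted(range(len(matrix[0])), key=lambda j: col_labels[j])
    return [[matrix[i][j] for j in col_order] for i in row_order]


sums = block_sums(X, row_labels, col_labels)
reordered = reorder(X, row_labels, col_labels)

for row in reordered:
    print(row)
print("block sums:", sums)
```

With these hand-picked labels, most of the mass falls in the two diagonal blocks once the rows and columns are reordered. This is the kind of block structure a co-clustering algorithm tries to recover from the data alone.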
