Community identification in World Wide Web sites

Ελευθέριος Μωυσιάδης

doi:10.12681/eadd/20556

Abstract

Clustering is an important issue in the analysis and exploration of data. Its applications span a wide area, including information retrieval, image segmentation, character recognition, VLSI design, computer graphics and gene analysis. In particular, applications of graph-clustering algorithms include monitoring computer networks for administration purposes, visualizing knowledge bases to support human understanding of complex data structures, clustering metric data, clustering web data and identifying web communities. In addition, there is growing interest in network analysis, where graph clustering is also known as community mining. Community mining algorithms have been applied to the study of several networks, including networks of email messages as well as social, metabolic and gene networks. The problem of graph clustering or community mining is therefore well studied, and a variety of related algorithms has been presented in the literature. Many graph-clustering or community mining algorithms are based on the intuitive notion of intra-cluster (within-cluster) density versus inter-cluster (between-cluster) sparsity. More precisely, they assume that a community (cluster) is a vertex subset such that, for each of its vertices, the number of links connecting the vertex to its community is higher than the number of links connecting it to the remainder of the graph. Henceforth, we call such a community a conventional community, and a clustering consisting of conventional communities a conventional clustering. We argue that conventional clustering is limited, as it cannot identify a cluster containing vertices that are more strongly connected to the cluster's graph complement than to the cluster itself.
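The conventional-community condition stated above can be illustrated with a minimal sketch. It assumes an undirected, unweighted graph stored as adjacency sets; the helper names `build_graph` and `is_conventional_community` are hypothetical and are not taken from the thesis.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected graph as a mapping from vertex to neighbour set."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def is_conventional_community(graph: Graph, community: Set[Hashable]) -> bool:
    """Check that every vertex has strictly more links inside the community
    than links to the rest of the graph (assumed reading of the definition)."""
    for v in community:
        internal = len(graph[v] & community)  # links into the community
        external = len(graph[v] - community)  # links to the remaining graph
        if internal <= external:
            return False
    return True


# Two triangles joined by the single edge (3, 4).
graph = build_graph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])
left_triangle_ok = is_conventional_community(graph, {1, 2, 3})
bridge_ok = is_conventional_community(graph, {3, 4})
```

In this toy graph each triangle satisfies the condition, whereas the pair `{3, 4}` does not, because each of its vertices has one internal link and two external ones.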
To overcome this limitation of conventional clustering, we propose the refined community, which requires that, for each of its vertices, the number of links connecting the vertex to its community is higher than the number of links connecting it to any other single community. Correspondingly, we call a clustering consisting of refined communities a refined clustering. Refined communities arise naturally in many real-world applications, and the communities of a web site are a prominent example. Most information exchange tasks between users and a site require access to a community of pages. For example, subscription to a site may require four pages: a welcome page from which the subscription procedure is initiated, a page where the newcomer submits their personal details, a page containing the license agreement and, finally, a page that provides the newcomer with an appropriate identification number. A user visiting a page that belongs to a web-site community is likely to want to visit the remaining pages of the same community. Web-site communities may therefore be utilized for automatic site adaptation or navigation help. However, hyperlinks between pages of different web-site communities are frequent in a web site. Thus, a web-site community often contains pages that have more hyperlinks to the remaining site graph than to the community they belong to. A web-site community therefore follows the model of a refined cluster rather than that of a conventional one. As a result, most existing graph-clustering or community-mining algorithms, when applied to a graph representing a web site, typically produce a few large clusters containing disparate pages. In this context, the main contributions of our work are summarized as follows. We clarify and formalise the notions of refined and conventional communities and present their basic properties.
We show that many well-known graph-clustering and community-mining algorithms typically fail to extract the community structure of a web site, and we highlight the limitations of each examined algorithm that lead to this failure. For the study and exploration of communities, we propose three graph-clustering algorithms based on the distance between clusters of vertices. We propose two community mining approaches that overcome the limitations of alternative approaches and are efficient in the exploration of refined communities. We evaluate the proposed approaches experimentally against well-known community mining approaches from the literature. Earlier experiments in the field use benchmark graphs to evaluate algorithms. Each benchmark graph encapsulates a pre-specified clustering, and the clustering solutions produced by the algorithms under evaluation are compared with it. However, the pre-specified clustering that is typically used is assumed to be a conventional one. Hence, such an evaluation is limited to clustering solutions that consist only of conventional clusters. Our experiments involve benchmark graphs that overcome this limitation, as well as three real datasets, each originating from the pages of a web site. Although our initial intention was to partition a graph representing a web site, our approach can be useful in many of the graph-clustering applications mentioned in this abstract. Here, we demonstrate an application that supports plagiarism detection in students' programming assignments. Experimental results show that our approach performs favourably in comparison with alternative community-mining algorithms.
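The difference between the refined and the conventional definition given in this abstract can be illustrated with a small sketch. It assumes an undirected, unweighted graph and a clustering that assigns every vertex to exactly one community; the function names and the toy site graph (a hub page `h` that links into three other communities) are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def build_graph(edges):
    """Build an undirected graph as a mapping from vertex to neighbour set."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def links_per_community(graph, clustering):
    """Yield (vertex, links to own community, Counter of links to each other community)."""
    label = {v: i for i, community in enumerate(clustering) for v in community}
    for v, neighbours in graph.items():
        counts = Counter(label[u] for u in neighbours)
        own = counts.pop(label[v], 0)
        yield v, own, counts


def is_refined_clustering(graph, clustering):
    """Every vertex has more links to its own community than to any other single community."""
    return all(
        all(own > other for other in counts.values())
        for _, own, counts in links_per_community(graph, clustering)
    )


def is_conventional_clustering(graph, clustering):
    """Every vertex has more links to its own community than to the rest of the graph."""
    return all(
        own > sum(counts.values())
        for _, own, counts in links_per_community(graph, clustering)
    )


# Hub page "h" belongs to community A but also links once into B, C and D.
edges = [
    ("h", "a1"), ("h", "a2"), ("a1", "a2"),
    ("h", "b1"), ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("h", "c1"), ("c1", "c2"), ("c2", "c3"), ("c1", "c3"),
    ("h", "d1"), ("d1", "d2"), ("d2", "d3"), ("d1", "d3"),
]
graph = build_graph(edges)
clustering = [
    {"h", "a1", "a2"},
    {"b1", "b2", "b3"},
    {"c1", "c2", "c3"},
    {"d1", "d2", "d3"},
]
refined = is_refined_clustering(graph, clustering)
conventional = is_conventional_clustering(graph, clustering)
```

In this toy graph the hub has two links inside its own community and three links outside it, so the clustering is not conventional. No single other community receives more than one of those links, so the clustering is still refined.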
