Web data clustering techniques

Βασιλική Κουτσονικόλα

doi:10.12681/eadd/19424

Abstract

The World Wide Web is now the largest, most open, most democratic publishing system in the world. It has profoundly influenced many aspects of our lives, changing the ways we communicate, conduct business, shop, seek entertainment, and so on. The number of users interacting with the web is constantly increasing, resulting in significant growth in the size and heterogeneity of web data. However, this abundant web information is not stored in any systematically structured way, so new models, techniques, and technologies for web data management are required, upon which efficient and effective services can be built. In this context, the contribution of the dissertation focuses on the following subjects. Chapter 3 presents a clustering-driven framework for organizing Directory Services data. Directory Services have proliferated as the appropriate storage framework for diverse, heterogeneous data sources, operating under a wide range of applications and services. Due to the increased amount and heterogeneity of LDAP (Lightweight Directory Access Protocol) data, there is an emerging requirement for appropriate data organization schemes. The proposed structure-based clustering algorithm can be used to define the LDAP Directory Information Tree. A thorough study of the algorithm’s performance is provided, which demonstrates its efficiency. Moreover, a query framework is presented which, by taking the clustering-based LDAP data organization into account, improves the LDAP server’s performance. Chapter 4 deals with the problem of clustering web users in order to identify similarities in the users’ navigational behaviour. Two clustering frameworks are proposed, both based on usage data, to capture users’ common interests. The clustering results can then be used in web-user-oriented applications (web personalization, recommendation engines, and so on).
The first method proposes Kullback-Leibler divergence (KL-divergence), an information-theoretic distance, as an alternative for measuring distances between web users. The performance of KL-divergence is compared with that of other well-known distance measures, focusing on their tolerance to noisy environments such as the Web. The second proposed clustering framework emphasizes the need to discover similarities in users’ accessing behaviour with respect to the time locality of their navigational acts. In this context, two time-aware clustering approaches are presented for tuning and binding the page-visiting and time-visiting criteria. The two tracks of the proposed algorithms define clusters of users that show similar visiting behaviour during the same time period, varying the priority given to page or time visiting. In Chapter 5, three co-clustering frameworks are studied in order to identify relations between elements of two different datasets. Two of the proposed frameworks are based on information that describes the data (content), while the third is based on usage data. Specifically, the first framework aims to reveal hidden dependencies between two time-related datasets whose elements influence one another over time. Thus, the co-clustering approach operates on the basis of two distinct criteria: the direction and the duration of their impact. Analysis of the results can be particularly beneficial for prediction systems. The second approach is applied to data derived from Social Tagging Systems. Its goal is to exploit joint groups of related tags and social data sources, in which both the social (in terms of co-occurrence) and the semantic aspects of tags are considered simultaneously. The expected benefit of the whole process is that the collective activity of tagging will isolate erroneous tags and highlight the dominant tags in each cluster, thereby expressing the community’s point of view on the corresponding topic.
The third proposed co-clustering approach is based on usage data, which capture users’ navigational behaviour, in order to identify groups of related web users and pages. It is a three-step process that relies on the principles of spectral clustering analysis and provides a relation scheme for the revealed user and page clusters. Analysis of the obtained results can prove particularly beneficial for a variety of applications, such as web personalization and profiling, caching and prefetching, and content delivery networks. Finally, Chapter 6 concludes the dissertation and outlines extensions and directions for future work.
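
As an illustration of the KL-divergence-based distance between web users summarized for Chapter 4, the following is a minimal sketch and not the dissertation's implementation. It assumes each user is represented by page-visit counts, which are turned into a probability distribution over pages. The additive smoothing (`alpha`), the symmetrized form of the divergence, and the page names and counts are all assumptions introduced here for illustration; the abstract does not specify them.

```python
import math


def visit_distribution(counts, pages, alpha=1.0):
    """Convert raw page-visit counts into a smoothed probability distribution.

    Additive smoothing (alpha) is an assumption of this sketch: it keeps every
    probability strictly positive so the divergence stays finite.
    """
    total = sum(counts.get(p, 0) for p in pages) + alpha * len(pages)
    return [(counts.get(p, 0) + alpha) / total for p in pages]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same pages."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrized KL divergence, so that d(a, b) == d(b, a)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical usage data: number of visits per page for three users.
pages = ["/home", "/news", "/sports", "/shop"]
users = {
    "u1": {"/home": 10, "/news": 8, "/sports": 1},
    "u2": {"/home": 9, "/news": 9},
    "u3": {"/shop": 12, "/home": 2},
}

dists = {u: visit_distribution(c, pages) for u, c in users.items()}
d12 = symmetric_kl(dists["u1"], dists["u2"])
d13 = symmetric_kl(dists["u1"], dists["u3"])
print(f"d(u1, u2) = {d12:.4f}")
print(f"d(u1, u3) = {d13:.4f}")
```

In this toy data, u1 and u2 both concentrate on the home and news pages, so their distance comes out smaller than the distance between u1 and u3. A pairwise distance matrix built this way could then be passed to any distance-based clustering procedure.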
