Abstract

The extensive amount of data stored in medical documents requires methods that help users find what they are looking for effectively by organizing large amounts of information into a small number of meaningful clusters. The resulting clusters contain groups of objects that are more similar to each other than to the members of any other group. Thus, the aim of high-quality document clustering algorithms is to determine a set of clusters in which inter-cluster similarity is minimized and intra-cluster similarity is maximized. The most important feature of many clustering algorithms is that they treat the clustering problem as an optimization process, that is, maximizing or minimizing a particular clustering criterion function defined over the whole clustering solution. The only real difference between agglomerative algorithms is how they choose which clusters to merge. The main purpose of this paper is to compare hierarchical agglomerative clustering algorithms that use different criterion functions, based on the quality of the clusters they produce for the problem of clustering medical documents. Our experimental results showed that the agglomerative algorithm that uses I1 as its criterion function for choosing which clusters to merge produced better cluster quality than the other criterion functions in terms of entropy and purity as external measures.
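As a rough illustration of the two external measures named above, the following minimal sketch computes entropy and purity from cluster assignments and known class labels. It assumes the commonly used definitions (size-weighted cluster entropy normalized by the log of the number of classes, and the fraction of documents that belong to the majority class of their cluster); the paper's exact formulas are not given in this excerpt, and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy_and_purity(cluster_labels, class_labels):
    """Return (entropy, purity) of a clustering against known classes.

    Assumed definitions (not taken from the paper):
      entropy = sum_r (n_r / n) * E_r, with E_r normalized by log(q)
      purity  = sum_r (1 / n) * max_i n_r^i
    where q is the number of classes and n_r^i counts the documents
    of class i inside cluster r.
    """
    n = len(class_labels)
    q = len(set(class_labels))

    # Count how many documents of each class fall into each cluster.
    clusters = {}
    for cluster, label in zip(cluster_labels, class_labels):
        clusters.setdefault(cluster, Counter())[label] += 1

    entropy = 0.0
    purity = 0.0
    for counts in clusters.values():
        n_r = sum(counts.values())
        if q > 1:
            e_r = -sum((m / n_r) * math.log(m / n_r)
                       for m in counts.values()) / math.log(q)
        else:
            e_r = 0.0  # a single class cannot be mixed
        entropy += (n_r / n) * e_r
        purity += max(counts.values()) / n
    return entropy, purity


# Hypothetical toy data: four documents, two clusters, two classes.
perfect = entropy_and_purity([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = entropy_and_purity([0, 0, 1, 1], ["a", "b", "a", "b"])
```

Under these definitions, lower entropy and higher purity both indicate clusters that align more closely with the known classes.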

Highlights

  • Large quantities of information about patients and their medical conditions are available within clinical documents

  • This paper focuses on comparing different agglomerative algorithms that use criterion functions to choose which clusters to merge for the problem of clustering medical documents

  • The second way is the averaging relative, which is recommended by Jain et al. [7] and is calculated by dividing the entropy obtained by a particular criterion function for each dataset and value of k (5, 10, 15, or 20) by the smallest entropy (the best entropy) obtained for that particular dataset and value of k over the different criterion functions
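The relative-entropy calculation described in the last highlight can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the criterion names other than I1, the dataset name, and all entropy values are hypothetical, and taking the mean of the ratios over all (dataset, k) pairs is an assumption about how the "averaging" step is carried out.

```python
def average_relative_entropy(entropies):
    """Average each criterion's entropy relative to the best one.

    entropies maps criterion name -> {(dataset, k): entropy}.
    Every criterion is assumed to cover the same (dataset, k) pairs,
    and the best entropy for each pair is assumed to be nonzero.
    """
    keys = set()
    for per_run in entropies.values():
        keys.update(per_run)

    # Smallest (best) entropy for each dataset and value of k,
    # taken over the different criterion functions.
    best = {key: min(per_run[key] for per_run in entropies.values())
            for key in keys}

    # Divide each entropy by the best one, then average the ratios.
    return {
        criterion: sum(per_run[key] / best[key] for key in keys) / len(keys)
        for criterion, per_run in entropies.items()
    }


# Made-up values for illustration only; they are not results from the paper.
example = {
    "I1": {("dataset_a", 5): 0.40, ("dataset_a", 10): 0.30},
    "other": {("dataset_a", 5): 0.50, ("dataset_a", 10): 0.30},
}
relative = average_relative_entropy(example)
```

A ratio of 1 means the criterion function achieved the best entropy for that dataset and k, so an average close to 1 indicates a criterion that is consistently among the best.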


Summary

Introduction

Large quantities of information about patients and their medical conditions are available within clinical documents. To enhance the understanding of disease progression and management, an evaluation of stored clinical data may lead to the discovery of trends and patterns hidden within the data. Methods are needed to facilitate searching such large quantities of clinical documents [1]. Clustering the medical documents into a small number of meaningful clusters is one of the methods that facilitate discovering trends and patterns hidden within these documents, because working only with the cluster that contains the relevant documents should improve effectiveness and efficiency.
