A Study on Improving Retrieval Effectiveness with Agglomerative Nesting (AGNES): The Case of Chinese News

宋永杰

doi:10.6846/tku.2007.00927

Abstract

The document ranking returned by the traditional vector space model of an information retrieval system is usually unorganized: related documents often do not occupy adjacent ranks. To avoid missing needed information, the user has to read through several unrelated documents before reaching the next related one. In this research, we cluster the documents retrieved by the traditional vector space model using the binary tree hierarchy constructed by the AGglomerative NESting (AGNES) algorithm. The clusters are ranked by the average of the coupling and cohesion measures proposed in this thesis, and the documents within each cluster are ranked by their similarity to the query. We aim to improve precision through this ranking adjustment. For evaluation, we used a Chinese news dataset and carried out word segmentation, vector representation, AGNES clustering, query-based document retrieval, and the final ranking adjustment. As a result, our system improves precision by 20.9% to 24.0% compared with the traditional vector space model. We also tested the results with the Wilcoxon Signed Ranks Test, which shows that our system is significantly better than the traditional vector space model for queries of one or two keywords.
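The ranking adjustment described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not define the coupling and cohesion measures, so the functions `cohesion` (mean pairwise cosine similarity inside a cluster) and `coupling` (mean cosine similarity between the query and the cluster's documents) are hypothetical stand-ins. The average-linkage merge rule, the choice to cut the hierarchy at `k` clusters, and all function names are likewise assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def agnes(vectors, k):
    """Bottom-up (agglomerative) merging until k clusters remain.

    Assumption: average linkage on cosine similarity; the abstract
    does not say which linkage the thesis uses.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Pick the most similar pair of clusters (x < y by construction).
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def cohesion(cluster, vectors):
    """Hypothetical stand-in: mean pairwise similarity within a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def coupling(cluster, vectors, query):
    """Hypothetical stand-in: mean similarity between query and cluster."""
    return sum(cosine(query, vectors[i]) for i in cluster) / len(cluster)


def rerank(vectors, query, k):
    """Rank clusters by the average of coupling and cohesion, then rank
    documents inside each cluster by their similarity to the query."""
    clusters = agnes(vectors, k)
    clusters.sort(
        key=lambda c: (cohesion(c, vectors) + coupling(c, vectors, query)) / 2,
        reverse=True,
    )
    ranking = []
    for c in clusters:
        ranking.extend(
            sorted(c, key=lambda i: cosine(query, vectors[i]), reverse=True)
        )
    return ranking


# Toy data (invented for illustration): five documents over three terms.
docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [0.8, 0.2, 0]]
query = [1, 0, 0]
print(rerank(docs, query, k=2))
```

In this toy run, the three documents that resemble the query end up in one cluster and are listed together ahead of the other cluster, which is the adjacency effect the abstract aims for. A real system would replace the toy vectors with segmented Chinese news text weighted by a vector space model.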
