Persian Web Pages Clustering Improvement: Customizing the STC Algorithm

Mohammad Azadnia, Sina Rezagholizadeh, Alireza Yari

doi:10.1109/iccit.2009.295

Abstract

Today the Internet reaches almost every ethnic group and culture, and Web pages are growing rapidly in most countries and languages. The volume and incoherence of the information available on the Internet make search engines a necessity. Because search engines pay little attention to the linguistic and content features of documents in different languages and cultures, and rely only on the surface content similarity of pages, they are not very successful at meeting users' needs. For more effective retrieval and clustering of Web pages, search engines should therefore consider the linguistic features, content, and properties of each language. Moreover, they should develop ways to reduce the complexity of languages and use linguistic features to cluster Web pages more effectively. In this paper, a method is developed for clustering and ranking Persian-language Web pages that takes their content and linguistic properties into account. The clustering scheme is based on the STC (Suffix Tree Clustering) algorithm, one of the best algorithms for clustering text documents. The main idea of the method is to add pre-processing phases that overcome the complexity of the linguistic features of Persian. Open-source tools are available for these pre-processing steps, so there is no need to implement them from scratch; only some changes to their modules may be needed. The pre-processing steps include extracting phrases, parsing sentences, removing stop words, and adding terms from linked neighbor pages to the collection of phrases. All steps in the method run in linear time and can be applied to large data sets, which means the proposed method scales to massive document sources such as the Web.
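As a rough illustration of the pipeline outlined in the abstract, the following minimal Python sketch removes stop words, extracts phrases, and groups documents that share a phrase, in the spirit of STC base clusters. It is not the authors' implementation. Everything in it is an assumption made for illustration: the names `STOP_WORDS`, `tokenize`, `phrases`, and `base_clusters` are hypothetical, the stop-word list is a tiny English placeholder rather than a Persian one, a plain dictionary of word n-grams stands in for the suffix tree, and the score (number of documents times phrase length) is a simplified stand-in for the STC scoring function.

```python
from collections import defaultdict

# Hypothetical placeholder list; a real system would load a Persian stop-word list.
STOP_WORDS = {"the", "a", "of", "in", "and", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (assumed pre-processing)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def phrases(tokens, max_len=3):
    """Yield every contiguous word sequence of length 1..max_len."""
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n <= len(tokens):
                yield tuple(tokens[i:i + n])


def base_clusters(docs, max_len=3, min_docs=2):
    """Group documents by shared phrase and rank the groups.

    Returns a list of (score, phrase, doc_ids) tuples, highest score first.
    The score len(doc_ids) * len(phrase) is a simplification, not the
    scoring function of the original STC algorithm.
    """
    index = defaultdict(set)  # phrase -> ids of the documents containing it
    for doc_id, text in enumerate(docs):
        for p in phrases(tokenize(text), max_len):
            index[p].add(doc_id)

    clusters = []
    for p, ids in index.items():
        if len(ids) >= min_docs:
            clusters.append((len(ids) * len(p), p, sorted(ids)))
    # Highest score first; ties broken alphabetically by phrase.
    clusters.sort(key=lambda c: (-c[0], c[1]))
    return clusters


docs = [
    "web page clustering in persian",
    "persian web page ranking",
    "suffix tree clustering",
]
clusters = base_clusters(docs)
```

In this toy input the highest-ranked group is the phrase "web page", shared by the first two documents. A faithful STC implementation would build a generalized suffix tree over the sentences to keep phrase discovery linear in the input size, and would then merge overlapping base clusters; both steps are omitted here for brevity.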

Full Text