Punjabi Documents Clustering System

Vishal Gupta, Saurabh Sharma

doi:10.4304/jetwi.5.2.171-187

Abstract

Text document clustering draws on Natural Language Processing, Machine Learning and Information Retrieval. It is an effective method for unsupervised document organization, automatic topic extraction, fast information filtering and accurate retrieval. Many clustering algorithms are available for unsupervised document organization and retrieval. Traditional approaches treat each document merely as an assortment of words. The semantic relationships between words should form the basis for clustering, yet this information is generally ignored even though it is vital for the purpose. This paper discusses a new method that generates frequent phrases by analyzing the semantic relations between the words in a sentence. These relations are captured by a Karaka list, a list of grammatical connectors that link nouns, pronouns and verbs in a sentence. The new clustering method combines the ideas behind the Karaka Analyzer, Frequent Item Sets and Frequent Word Sequences. The results indicate that the new hybrid approach performs better in terms of the number of clusters, the meaningfulness of cluster labels and the effectiveness of clustering for documents whose frequent phrases do not carry the desired information. The use of semantic features is the key to these better results.

Full Text