OPTIMIZING CLUSTERS ALIGNMENT FOR BILINGUAL MALAY-ENGLISH CORPORA

Rayner Alfred et al.

doi:10.3844/jcssp.2012.1970.1978

Abstract

Bilingual corpora, which contain the same documents in two different languages, are becoming an essential resource for natural language processing. Clustering bilingual corpora provides insight into the differences between languages when term frequency-based Information Retrieval (IR) tools are used. It also allows Natural Language Processing (NLP) and IR tools developed for one language to be used to implement IR for another. This study reports on applying Hierarchical Agglomerative Clustering (HAC) to a large corpus of documents, each of which appears in both Malay and English. The documents are clustered separately for each language, and the two results are compared with respect to the content of the clusters produced. The effect of the method used to compute the inter-cluster distance (Single, Complete and Average link) on the clustering results is also studied. Finally, this study describes an experiment that employs a genetic algorithm to fine-tune the weights of individual terms in order to reproduce a predefined set of clusters more closely. In this way, clustering becomes a supervised learning technique that is trained to better reproduce known clusters in Malay when applied to the corresponding documents in English. On the data available, the results of clustering one language resemble those of the other, provided the number of clusters required is relatively small. The method used to compute the inter-cluster distance also influences the clustering results. The percentage of aligned clusters increased when the genetic algorithm was applied to fine-tune the weights of the terms considered in clustering the bilingual Malay-English corpora. This study concludes that with a small number of clusters, k = 5, all of the clusters from the English texts can be mapped onto clusters of the Malay texts when the Complete link distance measure is used to cluster the bilingual parallel corpus. In contrast, with a larger number of clusters, fewer clusters from the English texts can be mapped onto clusters of the Malay texts.
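The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy term-frequency vectors (`english_docs`, `malay_docs`), the use of cosine distance, and the definition of "aligned" (two clusters are each other's best match by shared documents) are all assumptions made for illustration. Only the three linkage rules (Single, Complete, Average) come from the text.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0


def hac(vectors, k, linkage="complete"):
    """Agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def cluster_distance(a, b):
        pairwise = [cosine_distance(vectors[i], vectors[j]) for i in a for j in b]
        if linkage == "single":      # closest pair of members
            return min(pairwise)
        if linkage == "complete":    # farthest pair of members
            return max(pairwise)
        if linkage == "average":     # mean over all member pairs
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unknown linkage: {linkage}")

    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [frozenset(c) for c in clusters]


def aligned_fraction(english, malay):
    """Share of English clusters whose best Malay match maps back to them.

    Assumed definition of alignment, chosen for this sketch only.
    """
    def best_match(cluster, candidates):
        return max(candidates, key=lambda other: len(cluster & other))

    aligned = sum(
        1 for e in english if best_match(best_match(e, malay), english) == e
    )
    return aligned / len(english)


# Hypothetical parallel corpus: document i is the same text in both languages,
# but each language has its own vocabulary and term counts.
english_docs = [[3, 1, 0, 0], [2, 2, 0, 0], [3, 0, 1, 0],
                [0, 0, 3, 2], [0, 1, 2, 3], [0, 0, 1, 3]]
malay_docs = [[3, 0, 0], [2, 1, 0], [3, 1, 0],
              [0, 2, 3], [0, 3, 2], [0, 1, 3]]

for method in ("single", "complete", "average"):
    en = hac(english_docs, k=2, linkage=method)
    ms = hac(malay_docs, k=2, linkage=method)
    print(method, aligned_fraction(en, ms))
```

Because both languages are clustered over the same document indices, the overlap between an English cluster and a Malay cluster is simply the number of documents they share. On a real corpus the vectors would come from a term-weighting step, and the share of aligned clusters would be expected to vary with both `k` and the linkage rule.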

Summary

Introduction

When labeling articles in both languages, an appropriate clustering technique must be applied to obtain an efficient and effective representation of the articles. Effective and efficient document clustering algorithms are also needed to support intuitive navigation and browsing by organizing large amounts of information into a small number of meaningful clusters. Rayner Alfred et al. (Journal of Computer Science 8 (12) (2012) 1970-1978) examine the impact of clustering corpora when the weights of terms are tuned with a genetic algorithm in order to optimize the clustering results. Clustering the corpora based on fine-tuned term weights may increase the quality of the clustering results, since the weights are tuned according to a predefined fitness function implemented in the optimization algorithm (e.g., an evolutionary algorithm).
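The weight-tuning idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the fitness function (Rand index against the target clusters), the genetic operators (truncation selection, uniform crossover, Gaussian mutation, elitism), the parameter values, and the toy data `docs` and `target` are all hypothetical choices. The only elements taken from the text are that term weights are evolved and that fitness measures how closely the resulting clusters reproduce a predefined clustering.

```python
import math
import random
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0


def complete_link_labels(vectors, k):
    """Complete-link agglomerative clustering; returns one label per document."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: max(cosine_distance(vectors[i], vectors[j])
                              for i in clusters[p[0]] for j in clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)


def fitness(weights, docs, target, k):
    """Cluster the re-weighted documents and score them against the target."""
    weighted = [[w * x for w, x in zip(weights, doc)] for doc in docs]
    return rand_index(complete_link_labels(weighted, k), target)


def tune_weights(docs, target, k, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    n_terms = len(docs[0])
    # Start from the untuned weights plus random candidates in [0, 1].
    population = [[1.0] * n_terms] + [
        [rng.random() for _ in range(n_terms)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        scored = sorted(((fitness(w, docs, target, k), w) for w in population),
                        key=lambda s: s[0], reverse=True)
        parents = [w for _, w in scored[:pop_size // 2]]  # truncation selection
        next_gen = [list(w) for w in parents[:2]]         # elitism: keep best two
        while len(next_gen) < pop_size:
            mum, dad = rng.sample(parents, 2)
            child = [rng.choice(genes) for genes in zip(mum, dad)]  # uniform crossover
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1)))      # Gaussian mutation
                     if rng.random() < 0.2 else g for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=lambda w: fitness(w, docs, target, k))


# Hypothetical data: the third term is noise shared by documents 0 and 3.
docs = [[3, 0, 5], [2, 1, 0], [3, 0, 0], [0, 3, 5], [1, 2, 0], [0, 3, 0]]
target = [0, 0, 0, 1, 1, 1]  # e.g. cluster labels obtained from the other language

baseline_score = fitness([1.0, 1.0, 1.0], docs, target, k=2)
best_weights = tune_weights(docs, target, k=2)
tuned_score = fitness(best_weights, docs, target, k=2)
print(baseline_score, tuned_score, best_weights)
```

Seeding the population with the all-ones vector and carrying the best individuals forward means the tuned score can never fall below the untuned baseline. How much it improves depends on the data, the fitness function, and the random seed.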

Journal: Journal of Computer Science	Publication Date: Dec 1, 2012
Citations: 3	License type: cc-by
