Abstract

In information retrieval, removing stop words is key to effective indexing. Although many theories and algorithms exist for constructing stop word lists in various languages, most Malay stop word lists are either borrowed from English stop word lists or constructed manually by different researchers, a process that is costly, time-consuming, and error-prone. In other words, no standard stop word list has yet been constructed for the Malay language. In this study, we propose an aggregation technique that combines three approaches for the automatic construction of a general Malay stop word list. The first approach is statistical: inspired by Zipf's law, it considers word frequencies (highest and lowest) against their ranks. The second approach considers the distribution of words across documents using a variance measure. The third approach computes how informative a word is using an entropy measure. As a result, a total of 339 Malay stop words were produced. The discussion and implications of these findings are further elaborated.
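
The three measures named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the exact definitions used here (raw corpus frequency, population variance of per-document counts, Shannon entropy of a word's occurrences across documents), the assumption that high values flag stop word candidates for all three measures, and the majority-vote aggregation are all illustrative choices. The function names `word_measures`, `top_k`, and `aggregate_stop_words` and the parameters `k` and `min_votes` are hypothetical.

```python
import math
from collections import Counter


def word_measures(docs):
    """Return {word: (frequency, variance, entropy)} for tokenized documents.

    Assumed definitions (not taken from the paper):
    - frequency: total occurrences in the corpus (the quantity ranked in Zipf's law)
    - variance: population variance of the word's per-document counts
    - entropy: Shannon entropy of the word's occurrences across documents
    """
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*counts)
    n_docs = len(docs)
    measures = {}
    for word in vocab:
        per_doc = [c[word] for c in counts]
        total = sum(per_doc)
        mean = total / n_docs
        variance = sum((x - mean) ** 2 for x in per_doc) / n_docs
        probs = [x / total for x in per_doc if x > 0]
        entropy = -sum(p * math.log2(p) for p in probs)
        measures[word] = (total, variance, entropy)
    return measures


def top_k(measures, index, k):
    """Pick the k highest-scoring words for one measure (ties broken alphabetically)."""
    ranked = sorted(measures, key=lambda w: (-measures[w][index], w))
    return set(ranked[:k])


def aggregate_stop_words(docs, k, min_votes=2):
    """Keep words selected by at least min_votes of the three measures.

    Assumption: majority voting stands in for the paper's aggregation step,
    whose actual rule is not described in the abstract.
    """
    measures = word_measures(docs)
    votes = Counter()
    for index in range(3):
        votes.update(top_k(measures, index, k))
    return {word for word, n in votes.items() if n >= min_votes}


# Toy corpus: "yang" and "dan" occur in every document, content words occur once.
docs = [
    "yang dan kucing".split(),
    "yang yang yang dan dan rumah".split(),
    "yang yang yang yang yang dan dan dan buku".split(),
]
print(sorted(aggregate_stop_words(docs, k=2)))  # → ['dan', 'yang']
```

On a real corpus, `k` and `min_votes` would need tuning, and the low-frequency end of the rank distribution that the abstract mentions would need a separate cutoff that this sketch does not cover.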
