Abstract

Text generation has increased dramatically in this era. Text data comprises both structured and unstructured text. The enormous amount of unstructured text is easily understood by humans but cannot be readily processed by computers; efficient techniques are needed to reduce the information into more valuable vectors. In this article, we introduce a text clustering method that uses Malay linguistic information to reduce the unstructured semantic information derived from Wikipedia Bahasa Melayu articles. The proposed method uses linguistic features of the Malay language to handle the morphological issues of Malay words. We incorporate semantic information from a semantic lexical resource for Malay, Wikipedia Bahasa Melayu (WikiBM). An experiment was then conducted to evaluate the effect of text clustering on the semantic similarity values computed from the gloss definitions of WikiBM articles. We used Jaccard similarity to measure the overlap between vectors derived from the WikiBM text, and then computed the correlation using Pearson's correlation. The scores obtained with the original text definitions were compared with those obtained with the new text definitions produced by the text clustering method. The experiment shows that the correlation increased from 0.39 to 0.43 after the semantic information was reduced to more valuable vectors using the text clustering method.
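
The two measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `jaccard` and `pearson`, the whitespace tokenization, and the two example glosses are assumptions made purely for illustration, and no Malay morphological processing or clustering step is included.

```python
import math


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Two empty glosses share nothing measurable; return 0 by convention.
        return 0.0
    return len(set_a & set_b) / len(union)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical glosses (not taken from WikiBM), tokenized on whitespace.
gloss_a = "kucing ialah haiwan mamalia".split()
gloss_b = "anjing ialah haiwan mamalia".split()
similarity = jaccard(gloss_a, gloss_b)  # 3 shared tokens out of 5 distinct
```

In an evaluation of the kind described, `jaccard` would be applied to each pair of gloss definitions, and `pearson` would then be applied to the resulting list of similarity scores paired with a list of reference scores.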

