The Effectiveness of Bottom Up Technique with Probabilistic Approach for A Malay Parser

Muhammad Azhar Fairuzz Hiloh, Lailatul Qadri Zakaria, Mohd Juzaiddin Ab Aziz

doi:10.17576/gema-2018-1802-09

Open Access

https://doi.org/10.17576/gema-2018-1802-09

Abstract

Parsing is the process of analyzing an input sentence to determine its syntactic structure according to the rules of a grammar. This task is performed by a parser, which produces a parse tree as output. However, a problem arises when parsing produces two or more parse trees and the parser is unable to identify a single precise one. This limitation is caused by ambiguity in sentence structure. Ambiguity occurs when a word can be classified under more than one syntactic category, and the category chosen affects the semantics of the sentence. Thus, the parser needs an approach that resolves the ambiguity and selects the most appropriate parse tree to represent a sentence. Like other languages, Malay, the national language of Malaysia, is not exempt from the ambiguity problem. However, because its grammar is a context-free grammar, the probabilistic context-free grammar approach can be used to help the parser determine a more accurate parse tree. This study focuses on the development of a statistical parser for the Malay language using a bottom-up technique. The training data, in the form of simple Malay sentences, were collected from various sources. Based on this training data, a statistical lexical corpus of the Malay language, consisting of vocabulary, grammar rules and their probabilities, was developed. Bottom-up parsing is implemented with the Cocke–Younger–Kasami (CYK) algorithm. The parser's performance is evaluated on its effectiveness in overcoming ambiguity by suggesting a more precise parse tree. In conclusion, the Malay Language Parser can help users identify the appropriate parse tree and resolve ambiguity issues in the Malay language.
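The combination of a probabilistic context-free grammar with bottom-up CYK parsing described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy grammar, the probabilities, the example sentence and the function name `cyk_best_parse` are hypothetical and are not taken from the authors' corpus or implementation. The sentence "saya nampak budak dengan teropong" ("I see the child with the telescope") has two parse trees under this toy grammar, and the chart keeps the more probable one.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form. Rules and probabilities
# are invented for illustration, not drawn from the paper's corpus.
BINARY_RULES = [  # (parent, left child, right child, probability)
    ("S", "NP", "VP", 1.0),
    ("VP", "V", "NP", 0.6),
    ("VP", "VP", "PP", 0.4),
    ("NP", "NP", "PP", 0.3),
    ("PP", "P", "NP", 1.0),
]
LEXICAL_RULES = [  # (category, word, probability)
    ("NP", "saya", 0.3),
    ("NP", "budak", 0.2),
    ("NP", "teropong", 0.2),
    ("V", "nampak", 1.0),
    ("P", "dengan", 1.0),
]


def cyk_best_parse(words, start="S"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(words)
    # best[(i, j)][A] = (probability, tree) of the best A covering words[i:j]
    best = defaultdict(dict)

    # Fill the diagonal of the chart from the lexical rules.
    for i, word in enumerate(words):
        for cat, w, p in LEXICAL_RULES:
            if w == word and p > best[(i, i + 1)].get(cat, (0.0, None))[0]:
                best[(i, i + 1)][cat] = (p, (cat, word))

    # Build longer spans bottom-up from pairs of shorter spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for parent, left, right, p in BINARY_RULES:
                    if left in best[(i, k)] and right in best[(k, j)]:
                        lp, ltree = best[(i, k)][left]
                        rp, rtree = best[(k, j)][right]
                        prob = p * lp * rp
                        # Keep only the most probable analysis per category.
                        if prob > best[(i, j)].get(parent, (0.0, None))[0]:
                            best[(i, j)][parent] = (prob, (parent, ltree, rtree))

    return best[(0, n)].get(start)


if __name__ == "__main__":
    result = cyk_best_parse("saya nampak budak dengan teropong".split())
    if result is not None:
        probability, tree = result
        print(probability)
        print(tree)
```

With these assumed probabilities, the reading that attaches "dengan teropong" to the verb phrase scores higher than the reading that attaches it to "budak", so the former is returned. Different rule probabilities could reverse that choice, which is why the probabilities would need to be estimated from training data.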

Highlights

Malay is a formal language that is widely used in administration, education and business in Malaysia
The results show that the Malay Parser is able to suggest the most appropriate parse tree based on the probability values of words matched with the grammar rules
An evaluation of the parser's performance shows its ability to propose the most precise parse tree when a sentence's syntactic structure produces more than one parse tree

Summary

Introduction

Malay is a formal language that is widely used in administration, education and business in Malaysia. It has attracted many researchers to carry out Natural Language Processing (NLP) studies from both linguistic and computational perspectives (Sabrina et al., 2011; Ahmad et al., 2007; Rozana et al., 2011; Yusmita & Zulaikha, 2011; Noor Hafhizah & Karim, 2012; Nik Safiah, 1975). NLP can be divided into several components: phonology, morphology, syntax, semantics, discourse and pragmatics. Syntax is the study of how words are put together to form correct sentences. Semantics is about analyzing meaning: what words mean and how these meanings combine to form the meaning of a sentence. Discourse concerns how the immediately preceding sentences affect the interpretation of a sentence, and pragmatics describes the relationship of meaning to the goals and intentions of the speaker.

Objectives

Methods

Results

Conclusion