Dependency based Multiword Expression Extraction towards NLP applications

P Sanjanaashree, M Anand Kumar, K P Soman

doi:10.1145/2660859.2660928

Abstract

This paper explores a fully supervised machine learning approach to the automatic extraction of lexical chunks, commonly called Multi-Word Expressions (MWEs). MWEs cover a variety of constructions in everyday language, such as idioms, phrasal verbs and noun compounds. Because MWEs are pervasive in real text, NLP tasks such as Machine Translation and Information Retrieval need adequate MWE treatment; without it, a system will fail to generate high-quality natural output. Here, we extract phrasal verbs from an English movie subtitle corpus based on their linguistic patterns and standard association scores. The extracted phrasal verbs are then used to train several machine learning algorithms to discriminate MWEs. Two methods of linguistic pattern extraction are implemented, one of which proves effective. We report two major findings: 1) MWE extraction based on dependency information together with POS tags gives better accuracy than extraction based on POS tag patterns alone; 2) of the three machine learning classifiers trained on the extraction results, the Random Forest classifier is found to be the most suitable for this application.
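As a rough illustration of the extraction step described above, the following Python sketch selects verb-particle candidates using a dependency label together with POS tags, then ranks them by pointwise mutual information (PMI). Everything specific here is an assumption rather than the authors' implementation: the toy parses are hand-written, the `prt` label follows Stanford typed dependencies, PMI stands in for the unspecified "standard association scores", and the names `SENTENCES`, `extract_candidates` and `pmi_scores` are hypothetical.

```python
import math
from collections import Counter

# Toy, hand-annotated parses (hypothetical data). Each token is
# (word, POS tag, index of head token, dependency label).
# The "prt" label for verb particles follows Stanford typed dependencies;
# the parser and label set used in the paper are assumptions here.
SENTENCES = [
    [("give", "VB", -1, "root"), ("up", "RP", 0, "prt")],
    [("give", "VB", -1, "root"), ("up", "RP", 0, "prt")],
    [("look", "VB", -1, "root"), ("up", "RP", 0, "prt")],
    [("look", "VB", -1, "root"), ("out", "RP", 0, "prt")],
]


def extract_candidates(sentences):
    """Return (verb, particle) pairs linked by a 'prt' dependency."""
    pairs = []
    for sent in sentences:
        for word, pos, head, rel in sent:
            # Require both the dependency label and a verbal POS tag on the head.
            if rel == "prt" and head >= 0 and sent[head][1].startswith("VB"):
                pairs.append((sent[head][0], word))
    return pairs


def pmi_scores(pairs):
    """Compute PMI (base 2) for each distinct (verb, particle) pair."""
    total = len(pairs)
    pair_counts = Counter(pairs)
    verb_counts = Counter(v for v, _ in pairs)
    particle_counts = Counter(p for _, p in pairs)
    scores = {}
    for (verb, particle), n in pair_counts.items():
        p_joint = n / total
        p_verb = verb_counts[verb] / total
        p_particle = particle_counts[particle] / total
        scores[(verb, particle)] = math.log2(p_joint / (p_verb * p_particle))
    return scores


if __name__ == "__main__":
    candidates = extract_candidates(SENTENCES)
    for pair, score in sorted(pmi_scores(candidates).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

In a setup like the one the abstract describes, scores of this kind could serve as features for a downstream classifier; with only four toy sentences the numbers themselves carry no linguistic meaning.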

Full Text