Keynote speaker I: Development of Indonesian natural language processing tools and its usage in text applications

Ayu Purwarianti

doi:10.1109/icaicta.2015.7335344

Abstract

Natural language processing has attracted growing interest in recent years, driven both by the rapid growth of the internet, which makes it possible to extract information automatically from unstructured text in documents (text analytics), and by the increasing demand for interacting with computers as simply as possible (conversation). The same holds for Indonesian, which is used by about 250 million people and is also understood in neighboring countries. Unlike major languages such as English or Japanese, Indonesian has very limited data resources for language processing, and most of them were developed by individual researchers. Here, we describe our ongoing research on building a set of Indonesian natural language processing tools, which we named INANLP. INANLP consists of several tools covering lexical, syntactic, and semantic processing. Because data resources and expert knowledge are both limited, we employ both statistical and rule-based methods in building the tools. For tools where adequate expert knowledge is available, such as tokenization, stemming, word formalization, and semantic analysis, we employ only a rule-based method. For other tools, such as the POS tagger and the named entity tagger, we employ a statistical method with additional knowledge to handle out-of-vocabulary (OOV) words. For the parser, we have so far applied only a statistical method, since the POS tagger it relies on is already designed to handle OOV words. Despite these limitations, we have been able to use INANLP to build text applications, including text analytics, text understanding, and text conversation. In text analytics, we use INANLP to build text classification and information extraction systems; one example is a complaint management system that aims to automatically extract complaint information that citizens write on social media. Here, we use the tokenization, formalization, and named entity tagger modules of INANLP to build the system. Another example is in text understanding, where we tried to generate a mind map from a text composed of simple sentences. The INANLP modules used here are the tokenization, POS tagger, parser, and semantic analyzer modules. We also built a question answering system that aims to find answers to a given question from unstructured text or structured data. Here we employed the tokenization, POS tagger, named entity tagger, and parser modules to build the question answering system.
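
The idea of combining a statistical tagger with additional knowledge for OOV words can be illustrated with a minimal sketch. This is not INANLP's implementation or API: the unigram (most-frequent-tag) model, the affix rules, the tag names, and the toy corpus below are all assumptions chosen only to show how a statistical lookup might fall back to hand-written rules when a word was never seen in training.

```python
import re
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Statistical part: learn the most frequent tag for each lower-cased word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def guess_oov_tag(token):
    """Rule-based part: illustrative fallback rules (hypothetical, not INANLP's)."""
    if re.fullmatch(r"\d+([.,]\d+)*", token):
        return "NUM"    # digits, optionally with separators
    if token[0].isupper():
        return "PROPN"  # capitalized unknown word: guess a proper noun
    if re.match(r"(me|ber|di|ter)\w{3,}", token.lower()):
        return "VERB"   # common Indonesian verbal prefixes
    return "NOUN"       # default guess for anything else

def tag(tokens, model):
    """Use the learned tag when the word is known, otherwise apply the rules."""
    return [(t, model.get(t.lower()) or guess_oov_tag(t)) for t in tokens]

# Toy training data (invented for this example).
TOY_CORPUS = [
    [("saya", "PRON"), ("membaca", "VERB"), ("buku", "NOUN")],
    [("dia", "PRON"), ("menulis", "VERB"), ("surat", "NOUN")],
]

model = train_unigram(TOY_CORPUS)
result = tag(["Ayu", "membeli", "2", "buku"], model)
print(result)
```

In this sketch only "buku" is tagged from the learned model; "Ayu", "membeli", and "2" are out of vocabulary and receive their tags from the fallback rules. A real tagger would use a richer statistical model and far more careful morphological knowledge.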

Full Text