Abstract

This paper describes the development of InaNLP, a natural language processing (NLP) toolkit for Indonesian formal text and social media text. Several NLP modules were integrated into InaNLP to make it easier to build NLP systems for the Indonesian language. The toolkit contains a sentence splitter, tokenizer, part-of-speech (POS) tagger, phrase chunker, named entity (NE) tagger, syntactic parser, semantic analyzer, and word normalizer. Some modules were built with a rule-based approach, whereas others use a statistical approach. The accuracy of the POS tagger, NE tagger, syntactic parser, and semantic analyzer is reported here. The NE tagger uses a five-word window with POS, orthographic, and word-list features. In the feature-evaluation experiment for the NE tagger, using the SMO algorithm on 1,500 sentences with 15 NE classes, it achieved a token classification accuracy of 93.419%, which outperforms related work. The POS tagger, with 12,000 tokens as training data and 3,000 tokens as testing data, achieved an accuracy of 96.50%. The syntactic parser, using the CYK algorithm with 100 sentences as training data and 36 sentences as testing data, achieved an accuracy of 47.22%. The semantic analyzer, with 200 sentences as training data, achieved an accuracy of 62.50%. This paper also presents an example of building an Indonesian NLP system with InaNLP: complaint tweet classification. In that experiment, using a dataset of 7,440 items, the classifier achieved an average F-measure of 0.892.
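As a rough illustration of the five-word window mentioned for the NE tagger, the sketch below builds POS, orthographic, and word-list features for a token and its two neighbors on each side. It is not InaNLP's code: the function names, the orthographic classes, the POS tags, and the word list are all hypothetical, chosen only to show the shape such a feature vector might take.

```python
from typing import Dict, List, Set

PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence


def orthography(token: str) -> str:
    """Map a token to a coarse orthographic class (illustrative classes only)."""
    if token == PAD:
        return "pad"
    if token.isdigit():
        return "digit"
    if token.isupper():
        return "allcaps"
    if token[:1].isupper():
        return "initcap"
    return "other"


def window_features(tokens: List[str], pos_tags: List[str], index: int,
                    word_list: Set[str], size: int = 5) -> Dict[str, str]:
    """Collect POS, orthography, and word-list features over a centered window."""
    half = size // 2
    feats: Dict[str, str] = {}
    for offset in range(-half, half + 1):
        j = index + offset
        inside = 0 <= j < len(tokens)
        tok = tokens[j] if inside else PAD
        feats[f"pos[{offset}]"] = pos_tags[j] if inside else PAD
        feats[f"orth[{offset}]"] = orthography(tok)
        feats[f"inlist[{offset}]"] = (
            "yes" if inside and tok.lower() in word_list else "no"
        )
    return feats


# Hypothetical sentence, POS tags, and location word list (assumptions).
tokens = ["Presiden", "Joko", "Widodo", "tiba", "di", "Jakarta"]
pos_tags = ["NN", "NNP", "NNP", "VB", "IN", "NNP"]
locations = {"jakarta"}

for name, value in window_features(tokens, pos_tags, 5, locations).items():
    print(name, value)
```

Each token then yields one fixed-length feature dictionary (three feature types times five positions), which could be passed to a token classifier such as the SMO-trained model the abstract refers to.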
