Abstract

As bilateral relations between Indonesia and Japan strengthen, consistent term usage across the two languages becomes increasingly important. This paper presents a new method for Indonesian-Japanese term extraction. The method has three steps: (1) n-gram extraction for each language, (2) cross-pairing of n-grams between the two languages, and (3) classification. It is designed to handle term extraction from both parallel and comparable corpora. Using the method requires first building a classification model with machine learning. We consider four types of features: dictionary-based features, cognate-based features, combined features, and statistical features. The first three are linguistic features. Dictionary-based features check whether a word pair exists in a predefined dictionary, cognate-based features measure morpheme-level similarity, combined features take both dictionary-based and cognate-based evidence into account, and statistical features are used when the first three fail. The only statistical feature we use is context heterogeneity similarity, which considers the variety of words that can precede or follow a term. The learning algorithm is a support vector machine (SVM). In the experiments, we compared several scenarios: linguistic features only, statistical features only, and both combined. The classification model was built from parallel corpora, since plenty of term pairs can be extracted from them. The training data consisted of 5,000 term pairs. The best result was achieved using only linguistic features and no preprocessing step, with accuracy of up to 90.98% and recall of 92.14%. We also tested on comparable corpora, using 37,392 term pairs of which 94 were equivalent translations and 37,298 were not. Evaluation on this test set gave an accuracy of 98.63% but a low recall of 24.47%.
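The first two steps and the statistical feature can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_n` limit, the assumption that both texts are already tokenized, and the mapping from distance to similarity are all hypothetical. Context heterogeneity is taken here to be the number of distinct left (or right) neighbours of a term divided by its number of occurrences, which is one common definition and may differ from the one used in the paper.

```python
from itertools import product
import math


def extract_ngrams(tokens, max_n=2):
    """Step 1: return all contiguous n-grams (as tuples) of length 1..max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def cross_pair(src_ngrams, tgt_ngrams):
    """Step 2: pair every distinct source n-gram with every distinct target n-gram."""
    return list(product(sorted(set(src_ngrams)), sorted(set(tgt_ngrams))))


def context_heterogeneity(tokens, term):
    """Assumed definition: (distinct left neighbours, distinct right neighbours) / occurrences."""
    n = len(term)
    left, right, count = set(), set(), 0
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == term:
            count += 1
            if i > 0:
                left.add(tokens[i - 1])
            if i + n < len(tokens):
                right.add(tokens[i + n])
    if count == 0:
        return (0.0, 0.0)
    return (len(left) / count, len(right) / count)


def heterogeneity_similarity(h_src, h_tgt):
    """Hypothetical similarity: Euclidean distance mapped into (0, 1]."""
    return 1.0 / (1.0 + math.dist(h_src, h_tgt))


# Toy, pre-tokenized sentences (illustrative only, not from the paper's corpora).
id_tokens = "kerja sama ekonomi dan kerja sama budaya".split()
ja_tokens = ["経済", "協力", "と", "文化", "協力"]

candidates = cross_pair(extract_ngrams(id_tokens), extract_ngrams(ja_tokens))
h_id = context_heterogeneity(id_tokens, ("kerja", "sama"))
h_ja = context_heterogeneity(ja_tokens, ("協力",))
print(len(candidates), heterogeneity_similarity(h_id, h_ja))
```

In step 3, each candidate pair would be turned into a feature vector (dictionary-based, cognate-based, combined, and statistical features) and passed to the trained SVM, which decides whether the pair is an equivalent translation.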
