Background: Examinations using essay questions assess students' learning achievement more deeply than those using multiple-choice questions. When a class has many students, however, examinations using essay questions become hard to conduct and time-consuming to evaluate. Automatic essay evaluation has therefore become a promising approach in this situation. Various methods have been proposed; however, optimal solutions for such evaluation in the Indonesian language remain underexplored. Furthermore, with the rapid development of machine learning, and deep learning in particular, investigating such optimal solutions has become increasingly necessary.

Method: To address this issue, this study investigated text representation approaches for optimal automatic evaluation of Indonesian essay answers. The investigation compared pre-trained word embedding methods, namely Word2vec, GloVe, FastText, and RoBERTa, as well as text encoding methods, namely long short-term memory (LSTM) networks and transformers. LSTMs capture temporal semantics through their state variables, while transformers capture long-range dependencies between parts of their input sequences. Additionally, we investigated classification-based and similarity-based training to build the text encoders, expecting these training approaches to let the encoders extract different views of the information. We compared the classification results produced by individual text encoders and by combinations of text encoders.

Result: We evaluated the text representation approaches on the UKARA dataset. Our experiments showed that the FastText word embedding method outperformed the Word2vec, GloVe, and RoBERTa methods: FastText achieved an F1-score of 75.43% on the validation sets, while Word2vec, GloVe, and RoBERTa achieved F1-scores of 69.56%, 74.53%, and 72.87%, respectively. The experiments also showed that combinations of text encoders outperformed individual encoders. The combination of the LSTM encoder, the transformer encoder, and the TF-IDF encoder obtained an F1-score of 77.22% in the best case, outperforming the best individual LSTM encoder (75.35%), the best combination of transformer encoders (71.49%), and the individual TF-IDF encoder (76.69%). We observed that LSTM encoders performed better when built using classification-based training, whereas transformer encoders performed better when built using similarity-based training.

Novelty: The novelty of this research is an optimal combination of text encoders constructed specifically for evaluating essay answers in the Indonesian language. Our experiments showed that the combination of three encoders, namely the LSTM encoder built using classification-based training, the transformer encoder built using classification-based and similarity-based training, and the TF-IDF encoder, obtained the best classification performance.
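To make the two training regimes concrete, the sketch below contrasts, in PyTorch, a classification objective with a similarity objective for an encoder. It is a minimal illustration rather than the paper's implementation: the toy encoder, the dimensions, and the choice of cosine-embedding loss for the similarity objective are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for either the LSTM or transformer encoder described above;
# the architecture, sizes, and random data are illustrative assumptions.
encoder = nn.Sequential(nn.Linear(300, 64), nn.Tanh())

answers = torch.randn(8, 300)        # hypothetical student-answer embeddings
references = torch.randn(8, 300)     # hypothetical reference answers
labels = torch.randint(0, 2, (8,))   # correct / incorrect labels

# Classification-based training: a linear head plus cross-entropy on labels.
head = nn.Linear(64, 2)
cls_loss = nn.CrossEntropyLoss()(head(encoder(answers)), labels)

# Similarity-based training: pull the encodings of matching answer/reference
# pairs together, here with a cosine-embedding loss (target 1 = similar).
sim_loss = nn.CosineEmbeddingLoss()(
    encoder(answers), encoder(references), torch.ones(8))

print(float(cls_loss), float(sim_loss))
```

In practice one objective would be chosen per encoder, which is how the two regimes could yield encoders that capture different views of the same answers.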
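The combination of encoders can be sketched in a similar spirit. The abstract does not specify the fusion mechanism, so the example below assumes simple feature concatenation of pre-computed LSTM, transformer, and TF-IDF representations followed by a single classifier; the random placeholder arrays stand in for real encoder outputs over the essay answers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Placeholders for pre-computed representations; in practice these would come
# from the trained LSTM encoder, the trained transformer encoder, and a
# TF-IDF vectorizer run over the Indonesian essay answers.
lstm_vecs = rng.normal(size=(n, 128))
trf_vecs = rng.normal(size=(n, 128))
tfidf_vecs = rng.normal(size=(n, 300))
labels = rng.integers(0, 2, size=n)

# Combine the encoders by concatenating their feature vectors (an assumed
# fusion strategy), then train one classifier on the joint representation.
X = np.concatenate([lstm_vecs, trf_vecs, tfidf_vecs], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```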