Abstract

This paper presents a novel approach for finding the most semantically similar conversational sentences in Korean and English. Our method trains a separate embedding model for each language and uses a hybrid algorithm that selects the appropriate model based on the language of the query. For the Korean model, we fine-tuned KLUE-RoBERTa-small on publicly available semantic textual similarity datasets and applied Principal Component Analysis (PCA) to reduce the dimensionality of the resulting embedding vectors. For English, we selected a high-performing embedding model from the available SBERT models. We compared our approach against existing multilingual models on both human-generated and large language model-generated conversational datasets. Our experimental results demonstrate that the hybrid approach outperforms state-of-the-art multilingual models in accuracy, sentence-embedding time, and nearest-neighbor search time, regardless of whether a GPU is used. These findings highlight the potential benefits of training separate embedding models for different languages, particularly for tasks involving finding the most semantically similar conversational sentences. We expect our approach to be applicable across diverse natural language processing fields, including machine learning education.
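To make the described pipeline concrete, the sketch below illustrates the hybrid idea under stated assumptions: queries are routed by script (Hangul vs. non-Hangul) to a per-language SBERT-style encoder, the Korean vectors are PCA-reduced, and the nearest neighbor is retrieved by cosine similarity. The model identifiers are placeholders, since the abstract does not name the fine-tuned Korean checkpoint or the selected English model: the base klue/roberta-small (which sentence-transformers wraps with mean pooling) and all-MiniLM-L6-v2 stand in for them. This is a minimal sketch, not the authors' released implementation.

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Placeholder checkpoints (assumptions): the paper fine-tunes
# KLUE-RoBERTa-small on STS data and selects a strong English SBERT model;
# neither released checkpoint is named in the abstract.
ko_model = SentenceTransformer("klue/roberta-small")
en_model = SentenceTransformer("all-MiniLM-L6-v2")

HANGUL = re.compile(r"[\uac00-\ud7a3]")  # Hangul syllables block


def is_korean(text: str) -> bool:
    # Script-based routing; the paper's exact language-detection rule may differ.
    return bool(HANGUL.search(text))


def build_indexes(corpus: list[str], pca_dim: int = 128):
    """Split a corpus by language, embed each half, PCA-reduce the Korean side."""
    ko = [s for s in corpus if is_korean(s)]
    en = [s for s in corpus if not is_korean(s)]
    ko_vecs = ko_model.encode(ko)
    # PCA needs at least pca_dim samples; skip reduction on tiny corpora.
    pca = PCA(n_components=pca_dim).fit(ko_vecs) if len(ko) >= pca_dim else None
    if pca is not None:
        ko_vecs = pca.transform(ko_vecs)
    en_vecs = en_model.encode(en)
    return (ko, ko_vecs, pca), (en, en_vecs)


def nearest(query: str, ko_index, en_index) -> str:
    """Hybrid lookup: pick the per-language encoder/index, then cosine 1-NN."""
    if is_korean(query):
        sents, vecs, pca = ko_index
        q = ko_model.encode([query])
        if pca is not None:
            q = pca.transform(q)
    else:
        sents, vecs = en_index
        q = en_model.encode([query])
    # Cosine similarity = dot product of L2-normalized vectors (PCA output
    # is not normalized, so normalize here rather than at encode time).
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return sents[int(np.argmax(v @ q[0]))]
```

One design point this sketch reflects: routing costs only a regex search per query, and each query touches a single monolingual index rather than one large multilingual one, which is consistent with the embedding- and search-time advantages reported in the abstract.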
