The rapid development of Natural Language Processing (NLP) has enabled the detection of bilingual and multilingual textual similarity. One of the main challenges of a Textual Similarity Detection (TSD) system lies in learning effective text representations. This research focuses on identifying similar texts between Indonesian and English across a broad spectrum of semantic similarity. The primary challenge is generating English and Indonesian dense vector representations, i.e., embeddings that share a single vector space. Based on empirical trials, the research proposes using the Universal Sentence Encoder (USE) model to construct bilingual embeddings and FAISS to index the bilingual dataset. Query vectors are compared against index vectors using two approaches: a heuristic comparison based on Euclidean distance and a clustering-based Approximate Nearest Neighbors (ANN) search. The system is tested with four semantic granularities, two text granularities, and evaluation metrics at cutoff values of k = {2, 10}. The four semantic granularities are highly similar or near duplicate, Semantic Entailment (SE), Topically Related (TR), and Out of Topic (OOT), while the text granularities are the sentence and paragraph levels. The experimental results demonstrate that the proposed system successfully ranks similar texts in different languages within the top ten, as evidenced by the highest F1@2 score of 0.96 for the near-duplicate category at the sentence level. In contrast to the near-duplicate category, the SE and TR categories reach highest F1 scores of 0.77 and 0.89, respectively. The experimental results also show a strong correlation between text granularity and semantic granularity.
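The retrieval pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the multilingual USE model URL, the exact-L2 index choice, and the sample sentences are assumptions, and the ANN variant would substitute an IVF-style FAISS index.

```python
# Minimal sketch: embed Indonesian and English sentences into one vector space
# with the multilingual Universal Sentence Encoder, index them with FAISS, and
# retrieve the top-k most similar texts by Euclidean (L2) distance.
import numpy as np
import faiss
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers SentencePiece ops required by the model)

# Multilingual USE maps sentences from several languages, including English and
# Indonesian, into a shared 512-dimensional embedding space.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

corpus = [
    "The government announced a new economic policy.",   # English document
    "Pemerintah mengumumkan kebijakan ekonomi baru.",     # Indonesian near duplicate
    "Cuaca hari ini sangat cerah.",                       # Indonesian, out of topic
]
corpus_vecs = np.asarray(encoder(corpus), dtype="float32")

# Exact Euclidean-distance index; an IVF index (e.g., IndexIVFFlat) would give
# the clustering-based ANN alternative.
index = faiss.IndexFlatL2(corpus_vecs.shape[1])
index.add(corpus_vecs)

query_vecs = np.asarray(encoder(["new economic policy announcement"]), dtype="float32")
distances, ids = index.search(query_vecs, 2)  # top-k retrieval with k = 2
for rank, (i, d) in enumerate(zip(ids[0], distances[0]), start=1):
    print(f"{rank}. dist={d:.3f}  {corpus[i]}")
```

In this sketch the near-duplicate Indonesian sentence is expected to rank ahead of the off-topic one for the English query, mirroring the cross-lingual ranking behaviour the abstract reports.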