FarSick: A Persian Semantic Textual Similarity And Natural Language Inference Dataset

Zahra Ghasemi, Mohammad Ali Keyvanrad

doi:10.1109/iccke54056.2021.9721521

Abstract

Semantic textual similarity (STS) and natural language inference (NLI) are important tasks in natural language processing (NLP), with applications such as information retrieval, text classification, subject extraction, text summarization, machine translation, and plagiarism detection. The lack of appropriate datasets in the Persian language is a major obstacle to progress in this area. Therefore, in this paper, we present FarSick, a new dataset for STS and NLI tasks in the Persian language. FarSick is the first relatively large-scale STS dataset for the Persian language. It includes 9804 pairs of Persian sentences, each labeled for similarity and inference. The dataset was collected by translating and editing the sentences of the SICK dataset. We also measured the performance of traditional, statistical, and deep learning models on it, e.g. transformers, Convolutional Neural Networks, Bidirectional LSTMs, and a weighted average of word vectors. We used different pre-trained embeddings: word2vec, GloVe, fastText, and a BERT sentence transformer. We used accuracy to evaluate the NLI task and Pearson correlation to evaluate the STS task. The dataset is available at https://github.com/ZahraGhasemi-AI/FarSick.
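
The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy labels, and the toy scores below are hypothetical, and the label names and the 1 to 5 score range are assumed from the original SICK convention rather than taken from the FarSick release.

```python
import math


def accuracy(gold_labels, predicted_labels):
    """Fraction of sentence pairs whose predicted NLI label matches the gold label."""
    if not gold_labels or len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


def pearson(gold_scores, predicted_scores):
    """Pearson correlation between gold and predicted similarity scores."""
    if len(gold_scores) != len(predicted_scores) or len(gold_scores) < 2:
        raise ValueError("score lists must have equal length of at least 2")
    n = len(gold_scores)
    mean_g = sum(gold_scores) / n
    mean_p = sum(predicted_scores) / n
    # Covariance and variances are left unnormalized; the 1/n factors cancel.
    cov = sum((g - mean_g) * (p - mean_p)
              for g, p in zip(gold_scores, predicted_scores))
    var_g = sum((g - mean_g) ** 2 for g in gold_scores)
    var_p = sum((p - mean_p) ** 2 for p in predicted_scores)
    if var_g == 0 or var_p == 0:
        raise ValueError("Pearson correlation is undefined for constant scores")
    return cov / math.sqrt(var_g * var_p)


# Hypothetical toy data (assumed SICK-style labels and 1-5 relatedness scores).
gold_nli = ["ENTAILMENT", "NEUTRAL", "CONTRADICTION", "NEUTRAL"]
pred_nli = ["ENTAILMENT", "NEUTRAL", "NEUTRAL", "NEUTRAL"]
gold_sts = [4.5, 3.2, 1.0, 2.8]
pred_sts = [4.1, 3.0, 1.6, 3.3]

nli_accuracy = accuracy(gold_nli, pred_nli)
sts_pearson = pearson(gold_sts, pred_sts)
```

Accuracy suits NLI because the labels are categorical, while Pearson correlation suits STS because it measures how well predicted scores track the gold scores linearly, regardless of their absolute scale.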

Full Text