Developing a Cross-lingual Semantic Word Similarity Corpus for English–Urdu Language Pair

Ghazeefa Fatima, Muhammad Salman Khan, Ali Saeed, Rao Muhammad Adeel Nawab

doi:10.1145/3472618

Abstract

Semantic word similarity is a quantitative measure of how contextually similar two words are. Evaluating semantic word similarity models requires a benchmark corpus. However, despite Urdu's millions of speakers and the large amount of Urdu digital text on the Internet, benchmark corpora for the Cross-lingual Semantic Word Similarity task are lacking for Urdu. This article reports our efforts in developing such a corpus. The newly developed corpus is based on the SemEval-2017 Task 2 English dataset and contains 1,945 cross-lingual English–Urdu word pairs. For each word pair, semantic similarity scores were assigned by 11 native Urdu speakers. In addition to corpus generation, this article reports the evaluation results of a baseline approach, "Translation Plus Monolingual Analysis", for automatically identifying the semantic similarity between English–Urdu word pairs. The results showed that the path length similarity measure performs better for words translated with Google and Bing. The newly created corpus and evaluation results are freely available online for further research and development.
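The abstract names the baseline but does not spell out its steps. The sketch below is one plausible reading, not the authors' implementation: translate the Urdu word into English, then score the resulting English pair with a monolingual path length measure. Everything in it is an assumption made for illustration: the toy taxonomy `HYPERNYM` stands in for a lexical resource such as WordNet, the lookup table `UR_TO_EN` stands in for a machine translation system such as Google or Bing, and the formula 1 / (1 + shortest path length) is the common WordNet-style definition of path similarity, which may differ from the formula used in the article.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent), standing in for WordNet.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}

# Hypothetical translation table, standing in for a machine translation system.
UR_TO_EN = {"کتا": "dog", "بلی": "cat", "گاڑی": "car"}


def build_graph(hypernym):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernym.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


GRAPH = build_graph(HYPERNYM)


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the edge count or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(graph, word_a, word_b):
    """Assumed definition: 1 / (1 + shortest path length)."""
    dist = shortest_path_length(graph, word_a, word_b)
    return None if dist is None else 1.0 / (1.0 + dist)


def cross_lingual_similarity(english_word, urdu_word):
    """Translate the Urdu word, then apply the monolingual measure."""
    translated = UR_TO_EN.get(urdu_word)
    if translated is None:
        return None  # no translation available for this word
    return path_similarity(GRAPH, english_word, translated)
```

With this toy data, `cross_lingual_similarity("dog", "بلی")` follows the path dog, canine, carnivore, feline, cat (four edges), so the score is 1 / (1 + 4). Scores produced this way would then be compared against the human-assigned similarity scores in the corpus.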


Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Nov 18, 2021
Citations: 1

Similar Papers

Developing a Large Benchmark Corpus for Urdu Semantic Word Similarity
Iqra Muneer ... Ghazeefa Fatima
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 22 | 13 Mar 2023

Short Text Similarity Calculation Using Semantic Information
Haoyu Pu ... Hailin Zhao
01 Aug 2017

An approach to reduce part of speech ambiguity using semantically annotated lexicon definitions
Andrei Minca ... Stefan Diaconescu
01 Sep 2012

An Approach to Reduce Part of Speech Ambiguity Using Semantically Annotated Lexicon Definitions
Andrei Minca ... Stefan Diaconescu
01 Jan 2013
