Experimental data for computing semantic similarity between concepts using multiple inheritances in Wikipedia category graph

Muhammad Jawad Hussain, Shahbaz Hassan Wasti, Guangjian Huang, Yuncheng Jiang

doi:10.1016/j.dib.2020.105377

Open Access

https://doi.org/10.1016/j.dib.2020.105377

Journal: Data in Brief
Publication Date: Mar 10, 2020
License type: cc-by

Affiliation: South China Normal University

Abstract

This data article compiles the detailed experimental data for a Wikipedia-based semantic similarity approach called Neighbourhood Aggregated Semantic Contribution (NASC), presented in Hussain et al. [1]. The JWPL (Java Wikipedia Library)-DataMachine and JWPL WikipediaAPI are used to extract the required Wikipedia features from the Wikipedia dump. The dataset presents the disambiguated Wikipedia concepts of the gold-standard word similarity benchmarks MC30 (English), RG65es (Spanish), and RG65fr (French), together with their associated sets of categories in the corresponding Wikipedia category graph (WCG). The dataset also contains the number of ancestors, common ancestors, pages, and common pages in the k-neighbourhood of the associated categories for different values of the parameter k in the English, Spanish, and French WCGs. The dataset can be used to assess the semantic similarity between Wikipedia concepts in the English (MC30), Spanish (RG65es), and French (RG65fr) benchmarks. It will also be useful for further analysis and comparison of the taxonomic structures of the English, Spanish, and French WCGs.

Highlights

This data article compiles the detailed experimental data for a Wikipedia-based semantic similarity approach called Neighbourhood Aggregated Semantic Contribution (NASC), presented in Hussain et al. [1].
These functions are used to compute, respectively, the semantic weight and the semantic value of a category according to its k-neighbourhood in the corresponding Wikipedia category graph (WCG).
JWPL is an open-source, Java-based application programming interface that allows access to all the information contained in Wikipedia.

Summary

Data accessibility

JWPL (Java Wikipedia Library)-DataMachine and JWPL WikipediaAPI were used to extract the required information from Wikipedia. The data cover the Wikipedia concept pair (Coast, Forest) from MC30 (English) and its equivalent pairs (Costa, Bosque) from RG65es (Spanish) and (Cote geographic, Foret) from RG65fr (French) at different values of the parameter k. These tables highlight the structural differences among the English, Spanish, and French WCGs. These functions are used to compute, respectively, the semantic weight and the semantic value of a category according to its k-neighbourhood in the corresponding WCG.
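
To make the per-k counts concrete, the following is a minimal Python sketch of how ancestors and common ancestors within a k-neighbourhood could be counted. It is not the authors' implementation and does not use JWPL. It assumes that the k-neighbourhood of a category consists of the categories reachable by following at most k parent links, and that the category graph is available as a plain dictionary. The graph contents, the category names, and the helper `ancestors_within_k` are hypothetical and chosen only for illustration.

```python
from collections import deque

def ancestors_within_k(parents, start_categories, k):
    """Collect categories reachable from the start categories
    by following at most k parent links (breadth-first)."""
    seen = set()
    frontier = deque((category, 0) for category in start_categories)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not climb beyond k levels
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, depth + 1))
    return seen

# Toy category graph (hypothetical, NOT real Wikipedia data).
# A category may have several parents, i.e. multiple inheritance.
toy_parents = {
    "Coasts": ["Coastal geography", "Landforms"],
    "Forests": ["Ecosystems", "Landforms"],
    "Coastal geography": ["Physical geography"],
    "Landforms": ["Physical geography"],
    "Ecosystems": ["Ecology"],
    "Physical geography": ["Geography"],
    "Ecology": ["Biology"],
}

def neighbourhood_counts(parents, categories_a, categories_b, k):
    """Return (ancestors of a, ancestors of b, common ancestors) for a given k."""
    anc_a = ancestors_within_k(parents, categories_a, k)
    anc_b = ancestors_within_k(parents, categories_b, k)
    return len(anc_a), len(anc_b), len(anc_a & anc_b)

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, neighbourhood_counts(toy_parents, ["Coasts"], ["Forests"], k))
```

Because a category can have more than one parent, the search keeps a set of visited categories so that an ancestor reached along several paths is counted once. Counting pages and common pages would follow the same pattern with a category-to-pages mapping, which is omitted here.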

Data extraction

The parameter k and implementation of our methods