JWSAN: Japanese word similarity and association norm

Keisuke Inohara, Akira Utsumi

doi:10.1007/s10579-021-09543-7

Open Access

https://doi.org/10.1007/s10579-021-09543-7

Abstract

We present a new Japanese dataset, the Japanese Word Similarity and Association Norm (JWSAN), comprising human ratings of similarity and association for 2145 word pairs, with a clear distinction between word similarity and word association. Computational models of human semantic memory or the mental lexicon, such as distributed semantic models, must predict not only association but also similarity, since people can distinguish between the two. However, although the SimLex-999 dataset is publicly available for English, no Japanese similarity dataset clearly distinguishes between these two types of word relatedness. JWSAN is the first large Japanese dataset with both similarity and association ratings, and it contains noun, verb, and adjective word pairs. The ratings were collected through a web-based survey of 6450 native speakers of Japanese, a sufficiently large pool of age- and gender-controlled assessors. We also examined the effects of rater gender and age, factors that have received scant consideration in the past. This dataset can serve as a benchmark for improving distributed semantic models of Japanese.

Full Text