Japanese adverbs are difficult to classify, with little progress made since the 1930s. In the age of large language models, linguists need a framework for lexical grouping that incorporates quantitative, evidence-based relationships rather than purely theoretical categorization. Here we address this need for the case of Japanese adverbs by developing a semantic positioning approach that combines large language model embeddings with fuzzy set theory to achieve empirical Japanese adverb groupings. To perform semantic positioning, we (i) obtained multi-dimensional embeddings for a list of Japanese adverbs using a BERT or RoBERTa model pre-trained on Japanese text, (ii) reduced the dimensionality of each embedding by principal component analysis (PCA), (iii) mapped the relative position of each adverb in a 3D plot using K-means clustering with an initial cluster count of n=3, (iv) performed silhouette analysis to determine the optimal cluster count, (v) performed PCA and K-means clustering on the adverb embeddings again to generate 2D semantic position plots, and finally (vi) generated a centroid distance matrix. Fuzzy set theory informs our workflow at the embedding step, where the meanings of words are treated as quantifiable vague data. Silhouette analysis suggests that Japanese adverbs cluster optimally into n=4 rather than n=3 groups. We also observe a lack of consistency between adverb semantic positions and conventional classification. The 3D/2D semantic position plots and centroid distance matrices were simple to generate and did not require special hardware. Our novel approach offers advantages over conventional adverb classification, including an intuitive visualization of semantic relationships in the form of semantic position plots, as well as a quantitative clustering “fingerprint” for Japanese adverbs that expresses vague language data as a centroid distance matrix.
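Steps (ii) through (vi) of the workflow can be illustrated with a minimal Python sketch using NumPy and scikit-learn. This is not the authors' code. Several things are assumptions made for illustration: the synthetic vectors from the hypothetical helper `make_placeholder_embeddings` stand in for real BERT or RoBERTa adverb embeddings, the 768-dimensional size and the candidate range of 2 to 7 clusters are arbitrary, silhouette analysis is run on the 3D PCA coordinates, and Euclidean distance is used for the centroid distance matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances


def make_placeholder_embeddings(n_words=80, dim=768, n_groups=4, seed=0):
    """Synthetic stand-in for adverb embeddings (ASSUMPTION: not real BERT output)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=5.0, size=(n_groups, dim))
    groups = rng.integers(0, n_groups, size=n_words)
    return centers[groups] + rng.normal(scale=1.0, size=(n_words, dim))


def semantic_positions(embeddings, n_components, n_clusters, seed=0):
    """Reduce embeddings with PCA, then cluster the reduced coordinates with K-means."""
    coords = PCA(n_components=n_components, random_state=seed).fit_transform(embeddings)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(coords)
    return coords, km.labels_, km.cluster_centers_


def best_cluster_count(points, candidates=range(2, 8), seed=0):
    """Return the candidate k with the highest mean silhouette score, plus all scores."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(points)
        scores[k] = silhouette_score(points, labels)
    return max(scores, key=scores.get), scores


# (i) placeholder for the embedding step
embeddings = make_placeholder_embeddings()

# (ii)-(iii) PCA to 3 components, K-means with the initial count n=3
coords3d, labels3d, _ = semantic_positions(embeddings, n_components=3, n_clusters=3)

# (iv) silhouette analysis over candidate cluster counts (ASSUMPTION: on 3D coordinates)
best_k, scores = best_cluster_count(coords3d)

# (v) PCA to 2 components, K-means with the selected cluster count
coords2d, labels2d, centroids = semantic_positions(embeddings, n_components=2, n_clusters=best_k)

# (vi) pairwise Euclidean distances between cluster centroids
centroid_distances = euclidean_distances(centroids)

print("selected cluster count:", best_k)
print(np.round(centroid_distances, 2))
```

With real data, `embeddings` would be replaced by one vector per adverb taken from the pre-trained model, and `coords3d` or `coords2d` together with the cluster labels would be passed to a plotting library to draw the semantic position plots.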