Abstract

The present study examined how the Kansai and Chugoku dialects of Japanese are classified by hierarchical density-based spatial clustering of applications with noise (HDBSCAN) and Random Forest (RF). We obtained the pronunciations, in written form, of 48 words from 1450 Japanese speakers over the age of 50 who had been residing in their birthplaces. We calculated the phonetic distance (ALINE distance) between each dialectal pronunciation and its standard Japanese counterpart, and ran HDBSCAN and RF models with 1000 bootstrap samples. The optimal HDBSCAN model identified two groups of speakers, located in the northern and southern rural areas of the Kansai region. The RF models classified speakers by prefecture with widely varying accuracy (F1 = 0.73). Kansai-region speakers in the urban area where Osaka, Kyoto, and Nara prefectures share borders were poorly classified. Similarly, Chugoku-region speakers living near the borders of Hiroshima, Shimane, and Tottori prefectures were poorly classified. The remaining participants were generally classified well. These results suggest that each prefecture generally has its own dialect, but that the dialect's distribution extends beyond prefectural borders. This is the first study to reveal such classification patterns of Japanese dialects using machine learning approaches.
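
The feature-construction step described above (one distance per word between a speaker's form and the standard form) can be illustrated with a minimal sketch. It rests on several assumptions: a length-normalized Levenshtein distance stands in for the ALINE distance used in the study, which additionally weights substitutions by phonetic features; the word list, the romanized forms, and the function names are hypothetical and are not taken from the study's data or code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (a stand-in for ALINE, not ALINE itself)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(dialect: str, standard: str) -> float:
    """Scale the edit distance to [0, 1] by the longer of the two forms."""
    longest = max(len(dialect), len(standard))
    return edit_distance(dialect, standard) / longest if longest else 0.0


def speaker_features(responses: dict, standard_forms: dict) -> list:
    """One distance per word, in the order given by standard_forms."""
    return [normalized_distance(responses[word], standard)
            for word, standard in standard_forms.items()]


# Hypothetical romanized items; the study used 48 words per speaker.
standard_forms = {"thank_you": "arigatou", "really": "hontou", "no_good": "dame"}
speaker = {"thank_you": "ookini", "really": "honma", "no_good": "akan"}

features = speaker_features(speaker, standard_forms)
print(features)
```

Each speaker is thereby reduced to a fixed-length numeric vector, which is the form of input that a clustering model or a classifier trained on prefecture labels would take.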
