A person’s voice serves as an indicator of age, as it changes with anatomical and physiological influences throughout life. Although age prediction is a subject of interest across various disciplines, age-prediction studies using Korean voices are limited, and the few that have been conducted have limitations such as the absence of certain age groups or of detailed age categories. Therefore, this study proposes an optimal combination of speech features and deep-learning models for recognizing detailed age groups using a large Korean-speech dataset. From the speech dataset, recorded by speakers ranging in age from their teens to their 50s, four speech features were extracted: the Mel spectrogram, log-Mel spectrogram, Mel-frequency cepstral coefficients (MFCCs), and ΔMFCCs. Using these speech features, four deep-learning models were trained: ResNet-50, 1D-CNN, 2D-CNN, and a vision transformer. A performance comparison of the speech-feature-extraction methods and models indicated that the combination of MFCCs and ΔMFCCs performed best for both sexes when used to train the 1D-CNN model, achieving an accuracy of 88.16% for males and 81.95% for females. The results of this study are expected to contribute to the future development of Korean speaker-recognition systems.
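The best-performing feature set, MFCCs with their deltas, can be illustrated with a minimal NumPy/SciPy sketch of the standard extraction pipeline (framing, log-Mel filterbank, DCT, and a first-order difference for the deltas). The parameter choices below (16 kHz sample rate, 512-sample frames, 40 Mel bands, 13 coefficients) and the gradient-based delta are illustrative assumptions, not the settings reported in the paper.

```python
import numpy as np
from scipy.fft import dct


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the Mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb


def mfcc_with_delta(signal, sr, n_fft=512, hop=256, n_mels=40, n_mfcc=13):
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Power spectrum -> Mel spectrogram -> log-Mel spectrogram.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mel_spec = power @ mel_filterbank(sr, n_fft, n_mels).T
    log_mel = np.log(mel_spec + 1e-10)
    # A DCT of the log-Mel energies yields the MFCCs.
    mfcc = dct(log_mel, type=2, norm="ortho", axis=1)[:, :n_mfcc]
    # Delta-MFCCs: frame-to-frame first difference (a simple variant;
    # regression over a sliding window is also common).
    delta = np.gradient(mfcc, axis=0)
    return np.concatenate([mfcc, delta], axis=1)  # (n_frames, 2 * n_mfcc)


sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
signal = np.sin(2 * np.pi * 220.0 * t)  # stand-in for a speech recording
features = mfcc_with_delta(signal, sr)
print(features.shape)  # one row per frame, MFCCs then deltas per row
```

The resulting per-frame feature vectors (MFCCs concatenated with ΔMFCCs) are the kind of input a 1D-CNN, as used in the study, can convolve over along the time axis.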