Abstract While recent advances in sociophonetic data processing have made it possible to analyze large datasets and audio not originally intended for linguistic analysis, overlapping speech in recordings with multiple speakers continues to be an issue that results in lost data. We evaluate whether current source separation models produce audio that is clean enough to produce reliable measurements for sociophonetic analysis. We compare formant estimates from a pair of pristine recordings and merged-and-separated versions of those same recordings using the Libri2mix, Whamr16K, and WSJ02mix source separation models. Based on auditory inspection of the separated files, visualization of vowel formant estimates, and statistical analysis, Libri2 performed best and WSJ02 was worst. While the mean formant measurements per vowel were usually small, differences for each observation were larger in unpredictable ways. We are cautiously optimistic about using these tools in sociophonetic analysis, so long as analysis is conducted on vowel means. We conclude with recommendations that researchers can implement when using source separation in sociophonetic research.
Read full abstract