Abstract

Department of Electrical and Electronics Engineering, Sophia University, 7-1 Kioi-cho, Chiyoda-ku, Tokyo, 102-8554 Japan
(Received 27 June 2006, Accepted for publication 28 July 2006)

Keywords: Education in acoustics, Speech science, Acoustic phonetics
PACS number: 43.70.Bk, 43.10.Sv [doi:10.1250/ast.27.393]

1. Introduction

Pattern playback, a device that converts a spectrographic representation back into a speech signal, was developed by Cooper and his colleagues at Haskins Laboratories in the late 1940s [1] and has contributed tremendously to the rapid development of research in speech science [2-4]. By converting a spectrogram into a sound, we can test which acoustic cues projected on the spectrogram are important for speech perception. Furthermore, we can simplify an acoustic cue and/or systematically change one of its aspects, redraw the spectrographic representation, and synthesize stimulus sounds. Many studies have been conducted in this way, such as those on the locus theory, which accounts for the importance of the second-formant trajectory of a following vowel for the perception of a preceding stop consonant [5].

Today, we can easily implement a modern pattern playback with digital technology, and this is valuable for pedagogical applications. Thus, in this study, we implement a digital pattern playback and explore its usefulness for education [6].

2. Principle

In the original "pattern playback" [1], a light source and a tone wheel generate an optical set of harmonics at 120 Hz, and the amplitudes of the harmonics are modulated by a given spectrogram. The spectrogram is placed on top of a belt moving at a constant speed, and the amplitude-modulated signal is output from a loudspeaker.

This analog version of pattern playback can easily be implemented with modern digital technology. In fact, Nye et al. reported a digital version of the pattern playback from Haskins Laboratories using a PDP-11 computer system [7]. In this study, we propose two simple but versatile algorithms for digital pattern playback.

The first algorithm, the AM method, is based on the concept of amplitude modulation (AM). In this algorithm, the amplitudes of harmonics are modulated by the darkness pattern of a spectrogram, as shown in Fig. 1. This is somewhat similar to the original pattern playback and is grounded in the source-filter theory of speech production. Changing the fundamental frequency of the harmonics yields a variation in pitch, which eventually allows us to put intonation onto the output sounds. Alternatively, we can use a noise source instead of the harmonic source to produce unvoiced sounds.

Many studies discuss how to reconstruct the original phase components from a spectrographic representation (e.g., [8]). However, the original pattern playback, even without the reconstruction of phase components, is still extremely powerful for educational purposes because it demonstrates the importance of formant transitions and other acoustic cues. Furthermore, we want to implement a simple digital system that everybody can use. For these reasons, our system neither reconstructs the phase components nor changes the fundamental frequency during playback.
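To make the AM method concrete, the following is a minimal sketch in Python with NumPy. The array name spec, its layout (frequency bins by frames, darkness values in [0, 1] spanning 0 to fs/2), and the default parameter values are illustrative assumptions, not details taken from the system described here.

    # A minimal sketch of the AM method: one sinusoid per harmonic of f0,
    # each amplitude-modulated by the darkness of its nearest frequency bin.
    # The spectrogram layout and parameter defaults are assumptions.
    import numpy as np

    def am_playback(spec, fs=8000, f0=100.0, frame_shift_s=0.010):
        num_bins, num_frames = spec.shape
        n_samples = int(num_frames * frame_shift_s * fs)
        t = np.arange(n_samples) / fs
        frame_pos = t / frame_shift_s          # time in units of frames
        frame_idx = np.arange(num_frames)
        out = np.zeros(n_samples)
        k = 1
        while k * f0 < fs / 2:                 # all harmonics below Nyquist
            # Bin nearest to the k-th harmonic, assuming bins span 0..fs/2
            bin_idx = int(round(k * f0 / (fs / 2) * (num_bins - 1)))
            # Interpolate that bin's darkness to the sample rate
            amp = np.interp(frame_pos, frame_idx, spec[bin_idx])
            out += amp * np.sin(2 * np.pi * k * f0 * t)
            k += 1
        return out / max(np.max(np.abs(out)), 1e-9)   # normalize, avoid clipping

The noise-source variant mentioned above for unvoiced sounds would replace the bank of sinusoids with bands of filtered noise modulated by the same darkness pattern; that variant is omitted from this sketch.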
The second algorithm, the FFT method, is based on the fast Fourier transform (FFT). In this algorithm, a time slice of a given spectrogram is treated as the logarithmic spectrum of that time frame, and the spectrum is converted back into the time domain by the inverse FFT, as shown in Fig. 2. Because we are not reconstructing the original phase, we simply set the phase components to zero.

Because our aim is a simple algorithm with no pitch change during playback, we have carefully chosen a frame shift that depends on the fundamental period. In other words, we use a frame shift that exactly matches the desired fundamental period. To do this, we first reduce the frequency resolution of the spectrum to obtain only the spectral envelope (especially for a spectrogram obtained by narrow-band analysis), which reflects the vocal-tract filter. Then, by taking the inverse FFT, we obtain the impulse response of the filter for that time frame. Finally, we place the impulse responses along the time axis frame by frame, with a time interval equal to the frame shift, which is in turn equal to the fundamental period. Technically, we could vary the intervals at which the impulse responses are placed according to an instantaneous pitch contour, although here we maintain a constant fundamental period.
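The FFT method can be sketched in the same style. The moving-average smoothing used below to reduce the frequency resolution to a spectral envelope is one simple choice among many, and the input convention (spec holding dB-like log-magnitude values) and parameter defaults are again assumptions for illustration.

    # A minimal sketch of the FFT method: each spectral slice is smoothed to
    # an envelope, converted to linear magnitude with zero phase, inverse-FFT'd
    # to an impulse response, and overlap-added one fundamental period apart.
    import numpy as np

    def fft_playback(spec, fs=16000, f0=100.0, nfft=512):
        num_bins, num_frames = spec.shape
        shift = int(round(fs / f0))        # frame shift = fundamental period
        out = np.zeros(num_frames * shift + nfft)
        win = np.ones(5) / 5               # crude envelope smoother (assumption)
        for m in range(num_frames):
            log_mag = np.convolve(spec[:, m], win, mode="same")
            # Resample the slice onto the nfft/2+1 bins of an FFT buffer
            half = np.interp(np.linspace(0, num_bins - 1, nfft // 2 + 1),
                             np.arange(num_bins), log_mag)
            mag = 10.0 ** (half / 20.0)    # log -> linear magnitude
            # Zero phase: real, even spectrum -> symmetric impulse response
            h = np.fft.irfft(mag, n=nfft)
            h = np.roll(h, nfft // 2)      # center the response in the buffer
            out[m * shift : m * shift + nfft] += h
        return out / max(np.max(np.abs(out)), 1e-9)

Because each zero-phase spectrum is real and even, the inverse FFT yields a symmetric impulse response; shifting it to the middle of the buffer turns it into a linear-phase pulse before the responses are placed one fundamental period apart.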
In theory, a variety of values can be used for each parameter. In practice, we use the following. For the sampling frequency, 8 to 16 kHz is preferable. For the frame length, 256 or 512 points is optimal. We can use a frame shift of 3-13 ms; this range is suitable for producing speech uttered by an adult male or female, because the fundamental period is set to the frame shift. We often use a frame shift of 10 ms, which corresponds to a fundamental frequency of 100 Hz. We can reconstruct an intelligible speech sound as long as the spectrum within a frame is represented by about 40 points or more up to 8 kHz. A non-linear transformation of ...
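Because the frame shift doubles as the fundamental period, the numeric relations above are easy to spell out. A short worked example using the typical values quoted in the text, with variable names chosen here for illustration:

    # Frame shift doubles as the fundamental period (T0 = frame shift).
    fs = 16000                       # sampling frequency, within the preferred 8-16 kHz
    frame_shift_ms = 10.0            # within the suitable 3-13 ms range
    f0 = 1000.0 / frame_shift_ms     # implied fundamental frequency: 100 Hz
    period_samples = int(fs * frame_shift_ms / 1000.0)   # 160 samples between impulses
    # The 3-13 ms frame-shift range corresponds to f0 of roughly 77-333 Hz.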