Data-driven production models for speech processing
This dissertation investigates articulatory models for speech processing, demonstrating that low-dimensional models can accurately capture real speech production data, enable synthesis and inversion of articulatory movements from audio, and improve speech recognition by incorporating articulatory information.
When difficult computations are to be performed on sensory data, it is often advantageous to employ a model of the underlying process which produced the observations. Because such generative models capture information about the set of possible observations, they can help to explain complex variability naturally present in the data and are useful in separating signal from noise. In the case of neural and artificial sensory processing systems, generative models are learned directly from environmental input, although they are often rooted in the underlying physics of the modality involved. One effective use of learned models is made by performing model inversion or state inference on incoming observation sequences to discover the underlying state or control parameter trajectories which could have produced them. These inferred states can then be used as inputs to a pattern recognition or pattern completion module. In the case of human speech perception and production, the models in question are called articulatory models and relate the movements of a talker's mouth to the sequence of sounds produced. Linguistic theories and substantial psychophysical evidence argue strongly that articulatory model inversion plays an important role in speech perception and recognition in the brain. Unfortunately, despite potential engineering advantages and evidence for being part of the human strategy, such inversion of speech production models is absent in almost all artificial speech processing systems. This dissertation presents a series of experiments which investigate articulatory speech processing using real speech production data from a database containing simultaneous audio and mouth movement recordings. I show that it is possible to learn simple low-dimensional models which accurately capture the structure observed in such real production data.
I discuss how these models can be used to learn a forward synthesis system which generates spectral sequences from articulatory movements. I also describe an inversion algorithm which estimates movements from an acoustic signal. Finally, I demonstrate the use of articulatory movements, both true and recovered, in a simple speech recognition task, showing the possibility of doing true articulatory speech recognition in artificial systems.
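The state-inference idea described above can be illustrated with a deliberately small sketch. It assumes a scalar linear-Gaussian model, with a hidden articulatory position x and an acoustic observation y = c*x + noise, and recovers the hidden trajectory with a Kalman filter. The model form, the function name `infer_states`, and all parameter values are illustrative assumptions; the abstract does not specify the dissertation's actual models or inversion algorithm.

```python
def infer_states(observations, a=1.0, c=1.0, q=0.01, r=0.1,
                 mean0=0.0, var0=1.0):
    """Scalar Kalman filter (hypothetical helper, not the author's code).

    Assumed model: x[t] = a * x[t-1] + process noise (variance q)
                   y[t] = c * x[t]   + observation noise (variance r)
    Returns the filtered estimate of x at every time step.
    """
    mean, var = mean0, var0
    estimates = []
    for y in observations:
        # Predict the next hidden state from the dynamics.
        mean_pred = a * mean
        var_pred = a * a * var + q
        # Correct the prediction using the new observation.
        gain = var_pred * c / (c * c * var_pred + r)
        mean = mean_pred + gain * (y - c * mean_pred)
        var = (1.0 - gain * c) * var_pred
        estimates.append(mean)
    return estimates


# Toy check: a static hidden position of 2.0 observed through c = 0.5,
# so every noiseless observation equals 1.0.
trajectory = infer_states([1.0] * 20, a=1.0, c=0.5, q=0.0, r=0.01)
```

With enough observations the estimate moves from the prior mean toward the hidden value that explains the acoustics, which is the sense in which the filter "inverts" the forward model.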
- Research Article
- 10.1111/j.1749-818x.2010.00225.x
- Aug 1, 2010
- Language and Linguistics Compass
Teaching and Learning Guide for: Mirror Neurons, the Motor System, and Language – From the Motor Theory to Embodied Cognition and Beyond
- Conference Article
4
- 10.1109/isspa.1999.818096
- Aug 22, 1999
Summary form only given, as follows. Quantitative models of human speech production and perception mechanisms provide important insights into our cognitive abilities and can lead to high-quality speech synthesis, robust automatic speech recognition and coding schemes, and better speech and hearing prostheses. Some of our research activities in these two areas are described. Our speech production work involved collecting and analyzing magnetic resonance images (MRI), acoustic recordings, and electropalatography (EPG) data from talkers of American English during speech production. The articulatory database is the largest of its kind in the world and contains the first images of liquids (such as /l/ and /r/) and fricatives (such as /s/ and /sh/) for both male and female talkers. MR images are useful for characterizing the 3D geometry of the vocal tract (VT) and for measuring lengths, area functions, and volumes. EPG is used to study inter- and intra-speaker variabilities in the articulatory dynamics, while acoustic recordings are necessary for modeling. Inter- and intra-speaker characteristics of the VT and tongue shapes will be illustrated for various speech sounds, as well as results of acoustic modeling based on the MRI and acoustic data. The implications of our findings for vocal-tract normalization schemes and speech synthesis are also discussed. In the speech perception area, aspects of auditory signal processing and speech perception are parameterized and implemented in a speech recognition system. Our models parameterize the sensitivity to spectral dynamics and local peak frequency positions in the speech signal. These cues remain robust when listening to speech in noise. Recognition evaluations using the dynamic model with a stochastic hidden Markov model (HMM) recognition system showed increased robustness to noise over other state-of-the-art representations. The applications of auditory modeling to speech coding are discussed.
We developed an embedded and perceptually-based speech and audio coder. Perceptual metrics are used to ensure that encoding is optimized to the human listener and is based on calculating the signal-to-mask ratio in short-time frames of the input signal. An adaptive bit allocation scheme is employed and the subband energies are then quantized. The coder is variable-rate, noise-robust and suitable for wireless communications.
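As a rough illustration of how a signal-to-mask ratio can drive adaptive bit allocation, the sketch below uses a generic greedy scheme: each extra bit is given to the subband whose quantization noise currently sits furthest above its masking threshold, on the standard approximation that one bit lowers uniform quantization noise by about 6.02 dB. This is not the coder described in the abstract; the function name `allocate_bits` and the example values are assumptions.

```python
def allocate_bits(smr_db, total_bits, db_per_bit=6.02):
    """Greedy bit allocation across subbands (hypothetical sketch).

    smr_db     : signal-to-mask ratio of each subband for one frame, in dB
    total_bits : bit budget for the frame
    Returns the number of bits assigned to each subband.
    """
    bits = [0] * len(smr_db)
    # With zero bits, the noise-to-mask ratio equals the SMR.
    nmr = list(smr_db)
    for _ in range(total_bits):
        # Spend the next bit where the noise is most audible.
        worst = max(range(len(nmr)), key=lambda i: nmr[i])
        bits[worst] += 1
        nmr[worst] -= db_per_bit
    return bits


# Three subbands: strongly unmasked, mildly unmasked, fully masked.
frame_allocation = allocate_bits([20.0, 5.0, -10.0], total_bits=4)
```

In this toy frame the fully masked subband receives no bits, since its quantization noise is already below the masking threshold.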
- Research Article
- 10.1055/s-0031-1272745
- Mar 1, 2011
- Klinische Neurophysiologie
Introduction: Current thinking divides speech processing into speech perception and speech recognition. A dual-stream model of speech processing proposes a left-lateralized dorsal stream mapping acoustic speech signals to frontal articulatory networks aiding in speech perception. The bilaterally organized ventral stream is supposed to subserve speech recognition (Hickok and Poeppel, Nature Reviews Neuroscience 2007). Yet, overt articulation does not engage the dorsal stream, at least for overt reading (Kell et al., Brain 2009 and Kell et al., Cerebral Cortex 2010). We set out to disentangle the functional roles of the dorsal and ventral speech processing stream and their contribution to lateralization by comparing preparation for and execution of speech processing with speech production.
- Conference Article
22
- 10.1109/ijsis.1998.685467
- Mar 21, 1998
The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. Originally, synthesis systems were architected around a system of rules and models that were based on research on human language and speech production and perception processes. The quality of speech produced by such systems is inherently limited by the quality of the rules and the models. Given that our knowledge of human speech processes is still incomplete, the quality of text-to-speech is far from natural-sounding. Hence, today's interest in high quality speech for applications, in combination with advances in computer resources, has caused the focus to shift from rules and model-based methods to corpus-based methods that presumably bypass rules and models. For example, many systems now rely on large word pronunciation dictionaries instead of letter-to-phoneme rules and large prerecorded sound inventories instead of rules predicting the acoustic correlates of phonemes. Because of the need to analyze large amounts of data, this approach relies on automated techniques such as those used in automatic speech recognition.
- Research Article
245
- 10.1016/j.tics.2006.09.002
- Sep 25, 2006
- Trends in Cognitive Sciences
What is the relationship between phonological short-term memory and speech processing?
- Research Article
72
- 10.1016/j.cub.2011.11.052
- Dec 22, 2011
- Current Biology
Children's Development of Self-Regulation in Speech Production
- Research Article
1
- 10.1121/1.422537
- May 1, 1998
- The Journal of the Acoustical Society of America
Two of the goals of research in speech communication are to develop models of normal speech production and normal speech perception. Related objectives are to uncover the process by which children acquire the knowledge implicit in these models and to determine how the models are modified for disordered speech. Even partial achievement of these goals can have significant practical consequences, including machine recognition and synthesis of speech, and improved methods for diagnosis and remediation of speech disorders. In this paper, a current view of a framework for models of speech production and perception will be described, and some of the steps that have led to refinement of these models over the past 50-odd years will be described. Advances have been made in quantifying acoustic mechanisms of speech production and in specifying the nature of the discrete linguistic representation of an utterance in memory. From studies of speech perception and speech motor control, some understanding has been gained of how properties of the sound are related to the linguistic representation. There are large deficiencies, however, in our understanding of variability in speech due to speaker differences, speaking style, and context. [Work supported in part by NIH Grants DC00075 and DC02525.]
- Research Article
- 10.5075/epfl-thesis-3637
- Jan 1, 2006
- Infoscience (Ecole Polytechnique Fédérale de Lausanne)
The goal of this thesis is to develop and design new feature representations that can improve automatic speech recognition (ASR) performance in clean as well as noisy conditions. One of the main shortcomings of fixed scale (typically 20-30 ms long analysis windows) envelope based features such as MFCC is their poor handling of the non-stationarity of the underlying signal. In this thesis, a novel stationarity-synchronous speech spectral analysis technique has been proposed that sequentially detects the largest quasi-stationary segments in the speech signal (typically of variable lengths varying from 20-60 ms), followed by their spectral analysis. In contrast to a fixed scale analysis technique, the proposed technique provides better time and frequency resolution, thus leading to improved ASR performance. Moving a step forward, this thesis then outlines the development of theoretically consistent amplitude modulation and frequency modulation (AM-FM) techniques for a broad band signal such as speech. AM-FM signals have been well defined and studied in the context of communications systems. Borrowing upon these ideas, several researchers have applied AM-FM modeling to speech signals with mixed results. These techniques have varied in their definition and consequently in the demodulation methods used therein. In this thesis, we carefully define AM and FM signals in the context of ASR. We show that for a theoretically meaningful estimation of the AM signals, it is important to constrain the companion FM signal to be narrow-band. Due to the Hilbert relationships, the AM signal induces a component in the FM signal which is fully determinable from the AM signal and hence forms the redundant information. We present a novel homomorphic filtering technique to extract the leftover FM signal after suppressing the redundant part of the FM signal. The estimated AM message signals are then down-sampled and their lower DCT coefficients are retained as speech features.
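The final feature step described above, down-sampling the AM envelope and keeping its lower DCT coefficients, might look like the following sketch. The unnormalized DCT-II, the decimation step of 4, the choice of 3 retained coefficients, and the function names are assumptions for illustration; the abstract gives none of these details, and a practical system would low-pass filter before decimating.

```python
import math


def dct2(x):
    """Unnormalized DCT-II of a real sequence (direct O(n^2) form)."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi / n * (t + 0.5) * k) for t in range(n))
        for k in range(n)
    ]


def envelope_features(envelope, step=4, keep=3):
    """Decimate an AM envelope and keep its lowest DCT coefficients.

    Hypothetical helper: plain decimation without anti-alias filtering,
    used only to show the shape of the computation.
    """
    decimated = envelope[::step]
    return dct2(decimated)[:keep]


# A flat envelope carries no modulation, so only the k = 0 term survives.
flat_features = envelope_features([1.0] * 32, step=4, keep=3)
```

Keeping only the low-order coefficients retains the slow modulations of the envelope and discards fine detail, which is the usual motivation for truncating a DCT.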
We show that this representation is, in fact, the exact dual of the real cepstrum and hence is referred to as the fepstrum. While the fepstrum captures amplitude modulations (AM) occurring within a single frame of 100 ms, the MFCC feature provides the static energy in the Mel-bands of each frame and its variation across several frames (the deltas). Together these two features complement each other, and the ASR experiments (hidden Markov model and Gaussian mixture model (HMM-GMM) based) indicate that the fepstrum feature in conjunction with the MFCC feature achieves significant ASR improvement when evaluated over several speech databases. The second half of this thesis deals with noise robust feature extraction techniques. We have designed an adaptive least squares filter (LeSF) that enhances a speech signal corrupted by broad band noise that can be non-stationary. This technique exploits the fact that the autocorrelation coefficients of a broad-band noise decay much more rapidly with increasing time lag than those of the speech signal. This is especially true for voiced speech, as it consists of several sinusoids at multiples of the fundamental frequency. Hence the autocorrelation coefficients of voiced speech are themselves periodic with period equal to the pitch period. On the other hand, the autocorrelation coefficients of a broad band noise decay rapidly with increasing time lag. Therefore, a high order (typically 100 tap) least squares filter that has been designed to predict a noisy speech signal (speech + additive broad band noise) will predict more of the clean speech components than the broad band noise. This has been analytically proved in this thesis and we have derived analytic expressions for the noise rejection achieved by such a least squares filter. This enhancement technique has led to significant gains in ASR accuracy in the presence of real life noises such as factory noise and aircraft cockpit noise.
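The autocorrelation argument above can be checked numerically with a small sketch. A single sinusoid stands in for voiced speech and seeded Gaussian samples stand in for broad band noise; both stand-ins, the period of 40 samples, the signal length, and the function name are assumptions for illustration only, not the thesis's experimental setup.

```python
import math
import random


def normalized_autocorr(x, lag):
    """Biased autocorrelation estimate at `lag`, normalized by lag 0."""
    energy = sum(v * v for v in x)
    cross = sum(x[n] * x[n + lag] for n in range(len(x) - lag))
    return cross / energy


PERIOD = 40   # assumed pitch period, in samples
N = 2000      # assumed signal length, in samples

# Stand-in for voiced speech: one sinusoid at the fundamental.
voiced = [math.sin(2 * math.pi * n / PERIOD) for n in range(N)]

# Stand-in for broad band noise: seeded white Gaussian samples.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(N)]

r_voiced = normalized_autocorr(voiced, PERIOD)
r_noise = normalized_autocorr(noise, PERIOD)
```

At a lag of one pitch period the periodic signal is still almost perfectly correlated with itself, while the noise correlation is close to zero. A long linear predictor can therefore exploit the former and not the latter.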
Finally, the last two chapters of this thesis deal with feature level noise robustness techniques. Unlike least squares filtering, which enhances the speech signal itself (in the time domain), the feature level noise robustness techniques do not enhance the speech signal but rather boost the noise-robustness of the speech features, which usually are non-linear functions of the speech signal's power spectrum. The techniques investigated in this thesis provided a significant improvement in ASR performance in clean as well as noisy acoustic conditions.
- Research Article
3
- 10.1044/leader.ftr2.10042005.8
- Mar 1, 2005
- The ASHA Leader
Aural Habilitation Update: The Role of Speech Production Skills of Infants and Children With Hearing Loss
Sheila R. Pratt
It is well known that the development of speech is extremely limited without adequate auditory input and feedback. An obvious example is that hearing loss in infancy and early childhood usually affects all aspects of speech production unless there is early and consistent use of sensory aids as well as substantive sensorimotor and linguistic training. The speech development of infants and children with hearing loss hinges on their abilities to use audition not only to learn the sounds of their language, but also to use their articulators to produce those sounds and make use of auditory feedback to refine their speech over time. As such, the speech of children with prelingual hearing loss is particularly susceptible to delay and disorder, especially if the severity of the hearing loss is substantial and intervention is delayed or inadequate.
Speech Development
During the first six months of life (and possibly in utero) auditory perceptual learning is vital for acquiring oral language and speech, although the maturation timeline for speech production in normal-hearing children is relatively lengthy. This protracted timeline may account for the long-term training and treatment needs of many children with hearing loss, even those identified and fitted early with sensory aids (Yoshinaga-Itano & Sedey, 2000). Young children with normal hearing typically begin babbling around 5–6 months of age and start verbal expression around 12 months of age. However, their speech production skills continue to be refined through the school-age years and well beyond when their basic phonological inventories have been established.
For example, vowel space, voice-onset times, and vocal control adjust throughout early childhood (Assmann & Katz, 2000; Koenig, 2001; Lee, Potamianos, & Narayanan, 1999). Furthermore, substantial acoustic variability is a hallmark of children’s speech production until late childhood. Although the research is somewhat mixed on the development of coarticulation, children appear to be less able than adults to coarticulate their speech gestures in a consistent manner, and as a consequence, their speech is less intelligible than that of adults (Katz, Kripke, & Tallal, 1991; Nittrouer, 1993). The refinement of auditory processing of speech has a similar developmental timeline. Children may apply different rules or weights to speech cues than adults, and these weights change throughout childhood (Nittrouer, 2003; Nittrouer, Crowther, & Miller, 1998). Their auditory processing of speech also appears to be more susceptible to acoustic and linguistic perturbations than is observed with adults. Children are more adversely affected than adults by background noise, reverberation, talker variability, reductions in signal bandwidth, and the number of signal channels (Eisenberg et al., 2000; Ryalls & Pisoni, 1997; Kortekaas & Stelmachowicz, 2000).
The Role of Audition in Speech Development and Production
For mature speakers, audition acts as an error detector and a means of monitoring speaking conditions. It is considered to be slower than other forms of sensory information (i.e., proprioception) generated during speech, and therefore is likely limited to a feedback role (Perkell et al., 1997). Speakers use audition to determine if their articulators have produced sounds that are acoustically off-target. Audition also provides information for corrective adjustments, and as a consequence, is a contributor to the maintenance of speech integrity.
Studies of frequency and spectrally shifted speech feedback have shown that adults rapidly adjust to minor acoustic perturbations with compensatory and/or matching strategies (Bauer & Larson, 2003; Houde & Jordan, 2002; Jones & Munhall, 2002, 2003). They appear to adjust their articulators so that their speech productions match their internal representations. In addition to acting as an error detector, hearing is used by mature speakers to determine how they should adjust their speech in various acoustic, linguistic, and social environments. For example, adults know when to speak slower, louder, softer, or more precisely in order to accommodate their listener or the environmental conditions (Perkell et al., 1997). In contrast, many young children are unable to adjust the clarity of their speech, even when explicitly directed to do so (Ide-Helvie et al., 2004). Audition also allows the development of articulatory organization by providing information about how to position, move, and coordinate the articulators for speech, movements that can differ from those associated with vegetative functions of the mechanisms (Moore & Ruark, 1996). For example, infants use audition to learn how to shift from a vegetative breathing pattern to a pattern that can support speech. They learn how to position and move their tongues and to judge the acoustic consequences of those gestures. Coordination of the larynx with the vocal tract and upper airway articulators is refined over years but requires an intact auditory system (Koenig, 2001; Tye-Murray, 1992). The lip and jaw movements associated with speech in infants and young children are highly variable but distinct from sucking, chewing, and smiling (Green et al., 2000; Green, Moore, & Reilly, 2002; Moore & Ruark, 1996). The implication is that although the same peripheral mechanisms are used across oral and respiratory functions, the differing goals require substantially distinct coordination and feedback efforts.
The coordination needed to chew and swallow efficiently develops over early childhood but is largely independent of hearing, whereas the coordination required to move between vowel and consonant gestures, particularly in a coordinated and coarticulated manner, is strongly influenced by hearing (Baum & Waldstein, 1991; Guenther, 1995; Tye-Murray, 1992; Waldstein & Baum, 1991). Audition has a primary sensorimotor role in the development of speech, but it also is fundamental to infants and young children learning the sounds of their language. Furthermore, it helps them learn how specific speech events relate to their phonology, so that with development, young children become more able to use their hearing to inform them about the sequencing of speech gestures and the correctness of subsequent productions. Over time children learn to use audition to monitor ongoing speech, detect errors, and make corrective adjustments.
Hearing Loss and Speech Production
Hearing loss is common in the general population, but its effects on speech production are most pronounced in individuals whose hearing loss is congenital or acquired in early childhood. Most adults who acquire their hearing losses later in life suffer little or no deterioration in intelligibility, likely because their residual hearing provides sufficient feedback and because their mature speech production systems rely more on orosensory than auditory information to maintain proper control (Guenther, 1995; Goehl & Kaufman, 1984; Perkell et al., 1997). The speech differences that they do exhibit are subtle and usually imperceptible, even in cases of complete or nearly complete adventitious hearing loss. Nonetheless, some adventitiously deafened adults exhibit reduced speaking rate, and compromised articulatory and phonatory precision (Kishon-Rabin et al., 1999; Lane & Webster, 1991; Lane et al., 1995; Leder et al., 1987; Waldstein, 1990; Perkell et al., 1992).
These speech differences are similar in nature, but not in severity, to those observed with prelingually deafened speakers. Most infants and young children with hearing loss demonstrate disordered phonation and articulation, as well as delays in the acquisition of sound categories. The entire speech production system can be affected, from respiratory support to the coarticulation of ongoing speech (Pratt & Tye-Murray, 1997). This is especially true if the hearing loss is identified late or after a period of protracted hearing loss. Furthermore, the overlap and interaction of disordered sound production and linguistic delay contribute to poor speech integrity and restricted speech development. Babbling generally does not appear before 12 months of age (Oller & Eilers, 1988; Oller et al., 1985) and canonical babbling has been observed as late as 31 months in this population (Lynch, Oller, & Steffens, 1989). Infants also produce fewer instances of canonical babble and include a more limited range of consonants in their babble (Stoel-Gammon, 1988; Stoel-Gammon & Otomo, 1986; Wallace, Menn, & Yoshinaga-Itano, 2000). However, later speech intelligibility is better predicted by the consonant inventory used in emerging spoken language during the second year of life than during babble (Obenchain, Menn, & Yoshinaga-Itano, 2000). The phonetic repertoires of infants with severe-to-profound hearing loss often are restricted when compared to their normal-hearing peers, although there is abundant individual variability (Lach, Ling, Ling & Ship, 1970; Stoel-Gammon & Otomo, 1986; Wallace et al., 2000; Yoshinaga-Itano & Sedey, 2000). The early speech inventories of infants with severe-to-profound hearing loss predominately consist of motorically easy sounds such as vowels and bilabial consonants. The sounds of their inventories also contain more low frequency information, which is more audible. 
For example, the babbling of infants with hearing loss often has a high concentration of nasals and glides, which include low-frequency continuant cues (Stoel-Gammon & Otomo, 1986). Without early intervention and appropriate fitting of sensory aids the speech-sound inventories of many children with hearing loss usually do not attain full maturity. Yoshinaga-Itano and Sedey (2000) found that children with moderate-to-severe hearing losses did not reach an age-appropriate complement of vowel and consonant sounds until about 4 and 5 years respectively, and many children with profound hearing loss had restricted inventories even at 5 years of age. Children with profound hearing loss often reach an early plateau in their speech skill development. For instance, the speech characteristics of many children with severe-to-profound hearing loss demonstrate little improvement in sound inventory and intelligibility after 8 years of age, even with the initiation of extensive training (Hudgins & Numbers, 1942; McGarr, 1987; Smith, 1975). Such results imply that, like auditory and language interventions, speech production therapy should be an important component of early intervention, and that the common practice of delaying speech training in children with hearing loss until they have functional language is developmentally untenable if the goal is for them to be oral communicators. In addition to the relationship between age-of-onset and speech impairment severity, there also is a moderately positive relationship between the severity of hearing loss and the extent of the associated speech difficulties (Boothroyd, 1969; Levitt, 1987; Smith, 1975). For example, children with mild-to-moderate hearing loss, particularly if well aided, tend to exhibit speech differences that are mild (Elfenbein, Hardin-Jones, & Davis, 1994; Oller & Kelly, 1974; West & Weber, 1973).
Elfenbein and colleagues found that children with mild-to-moderate hearing loss exhibited good intelligibility but had higher than normal rates of affricate and fricative substitutions. Mild hoarseness and resonance problems also are present in 20% to 30% of this group of children. Moreover, they tend to have increased rates of voicing irregularities, difficulties with /r/ production, and omissions of back and word-final consonants. Early studies of children with profound prelingual hearing loss showed that most rarely acquired speech skills sufficient to interact easily using spoken language. On average, less than 20% of their words were intelligible to listeners who were not familiar with their speech (Hudgins & Numbers, 1942; Markides, 1970; Smith, 1975). Smith (1975) evaluated 40 children with varying levels of hearing loss and, on average, only 18.7% (0% to 76%) of their words could be identified by inexperienced listeners. As expected, overall intelligibility was inversely related to the frequency of segmental and suprasegmental errors. However, with early identification of hearing loss and early intervention (i.e., fitting of sensory devices, behavioral training, and parent counseling), the number of children with severe-to-profound hearing loss and intelligible speech has increased (Uchanski & Geers, 2003). Many more children are developing sufficient speech perception to support development of speech production and oral language, but these advances may have added to the overall heterogeneity of the population (Higgins et al., 2003). Other factors contribute to the diversity of speech production skills observed with these children. For instance, cognitive skill (particularly nonverbal intelligence) has been found to be an important predictor of functional speech and oral language in children with hearing loss (Geers et al., 2002; Tobey et al., 2003).
Auditory experience in infancy and early childhood, even of limited duration, positively influences the speech production skills of children who have severe-to-profound hearing loss (Geers, 2004). The use of sensory aids has a substantial impact on speech outcomes, but somewhat surprisingly, the age at which infants and young children are fitted with cochlear implants has not surfaced in studies of speech production as a significant predictor of later speech intelligibility (Geers et al., 2002; Tobey et al., 2003). Early implantation (less than 2 years) is, however, related to more normal oral communication development as a whole (both speech and oral language) (Geers, 2004). It may be that the age of implantation is not easily separated from other influences of intervention, like the orientation of the habilitation program and parent involvement, which relate strongly to children being auditory perceptual learners and users of auditory feedback. Another consideration is that many early-implanted children may be implanted too late to observe a clear impact on speech production. The critical ages at which hearing aids should be fitted have not been investigated, but, as with cochlear implants, it is assumed that earlier is better. Oromotor integrity and language skills are additional factors that often are neglected in studies of speech development in children with hearing loss. A substantial number of infants and children with hearing loss present with secondary handicapping conditions, such as neurological disorders. When these neurological disorders include the speech mechanism, the development of functional speech is difficult even if audition is optimized. As such, it is not unusual for a child with hearing loss to have a coexisting dysarthria along with the speech impairment secondary to the hearing loss.
A subset of children with hearing loss also may have an apraxia of speech, but separating the impact of hearing loss from an apraxia of speech is difficult because the associated speech characteristics overlap (McNeil, Robin & Schmidt, 1997). Language disorders also are commonly observed in children with hearing loss, and are frequently evidenced in phonological disorder and lexical delay. As a result, extricating the sensorimotor impact of hearing loss on speech production from the influences of language disorder in individual children is not always straightforward (Peng et al., 2004).
Habilitation: Sensory Aids and Treatment
Most speech training approaches are dependent on optimizing the use of residual hearing although some approaches use other modalities (Pratt, Heintzelman, & Deming, 1993; Pratt & Tye-Murray, 1997). Correspondingly, it is generally believed that speech is learned most easily if infants and children learn and monitor their speech through their auditory systems. Therefore, the proper and early fitting, and consistent use of sensory aids, along with auditory and language training are important components of speech production training. In support of this auditory-based approach is the relationship between the severity of prelingual hearing loss and the extent of speech delay/disorder found in children (Boothroyd, 1969; Levitt, 1987; Smith, 1975), as well as any history of previous hearing (Geers, 2004). The relationship between audiometric configuration and speech intelligibility also argues for the importance of audition if the goal for a child is oral communication (Levitt, 1987; Osberger, Maso, & Sam, 1993). There is a growing literature supporting the positive impact of cochlear implants on speech development, as well as the role that auditory-oral-based training programs play in communication outcomes of children fitted with cochlear implants (Geers et al., 2002; Tobey et al., 2003).
There is, however, limited efficacy data for children with less severe hearing loss who are typically fitted with hearing aids. The lack of research in this area is glaring because wearable electroacoustic hearing aids have been available for more than 50 years (Lybarger, 1988) and are a fundamental component of treatment approaches for most children with hearing loss. Furthermore, more infants and children are fitted with hearing aids than cochlear implants. Preliminary data reported by Stelmachowicz and her colleagues (2004) on three infants fitted early with hearing aids suggested delays in sound category acquisition consistent with patterns previously reported in the literature. Sound inventories were impoverished, consonants were more affected than vowels, and sounds containing high-frequency cues were particularly limited. Additional data from Pittman and colleagues (2003) indicated that the amplitude of high-frequency speech cues directed to and produced by children wearing hearing aids may not be sufficient, although they did not connect their results directly to speech production outcomes. Pratt, Grayhack, Palmer, and Sabo (2003) found that differences in hearing aid configuration could alter vowel spacing of children even though the children in their study had intelligible speech, and the speech tokens measured were limited to acceptable productions. Their data indicated that hearing aids could alter the speech of children, but provided little information about the impact that hearing aids may have on speech development. Given the paucity of data, as well as the expansion of universal infant hearing screening programs, it is critical that more research be done in this area. Increasing numbers of infants with hearing loss will be identified shortly after birth and, if we are to effectively treat them, more should be known about the impact that hearing aids and other sensory aids have on speech and auditory system development.
Aural Habilitation
References
Assmann P. F., & Katz W. F. (2000). Time-varying spectral change in the vowels of children and adults. Journal of the Acoustical Society of America, 108, 1856–1866.
Baum S., & Waldstein R. (1991). Perseveratory coarticulation in the speech of profoundly hearing-impaired and normally hearing children. Journal of Speech and Hearing Research, 34, 1286–1292.
Bauer J. J., & Larson C. R. (2003). Audio-vocal responses to repetitive pitch-shift stimulation during a sustained vocalization: Improvements in methodology for the pitch-shifting technique. Journal of the Acoustical Society of America, 114, 1048–1054.
Boothroyd A. (1969). Distribution of hearing levels in the student population of the Clarke School for the Deaf. Northampton, MA: Clarke School for the Deaf.
Elfenbein J., Hardin-Jones M., & Davis J. (1994). Oral communication skills of children who are hard of hearing. Journal of Speech and Hearing Research, 37, 216–226.
Eisenberg L., Shannon R., Martinez A. S., & Wygonski J. (2000). Speech recognition with reduced spectral cues as a function of age. Journal of the Acoustical Society of America, 107, 2704–2710.
Geers A., Brenner C., Nicholas J., Uchanski R., Tye-Murray N., & Tobey E. (2002). Rehabilitation factors contributing to implant benefit in children. Annals of Otology, Rhinology, and Laryngology (Supplement), 189, 127–130.
Goehl H., & Kaufman D. (1984). the effects of adventitious include disordered of Speech and Hearing
J. R., Moore C. A., M., & R. W. (2000). The development of speech and jaw of & Hearing Research,
J. R., Moore C. A., & J. (2002). The development of jaw and lip control for of and Hearing Research,
F. Speech sound coarticulation, and effects in a of speech
E. A., A. & (2003). in children’s speech and after cochlear and
Houde J. F., & of speech and of and Hearing Research,
C., & Numbers F. An of the intelligibility of speech of the
D. L., W. A., C., A. J., & used to speech clarity by normal-hearing children. Journal of the Acoustical Society of America,
Jones J. A., & (2002). The role of auditory feedback during Studies of of
Jones J. A., & (2003). to produce speech with an vocal The role of auditory of the Acoustical Society of America,
Katz W. F., C., & P. (1991). coarticulation in the speech of adults and young perceptual and of Speech and Hearing Research, 34,
L., R., & The of hearing on speech production of deafened adults with cochlear of the Acoustical Society of America,
characteristics of in children’s and of developmental of Speech and Hearing Research,
Kortekaas R., & P. (2000). effects on children’s perception of the Acoustical auditory and clarity of and Hearing Research,
R., Ling Ling L., & Early speech development in of the
Lane H., & J. W. (1991). Speech deterioration in deafened adults. Journal of the Acoustical Society of America,
Lane H., J., M., M., & Perkell J. in the speech of cochlear implant An of of the Acoustical Society of America,
Leder S., J., J. C., C., & F. of adventitiously cochlear implant of the Acoustical Society of
S., A., & Acoustic of children’s of and spectral of the Acoustical Society of America,
the speech and language H., N., & D. Development of language and communication skills of hearing-impaired children. ASHA
A R. of hearing aid MA:
M., Oller & Development of in a child with congenital of The of
skills of hearing-impaired children in for the H., N., & D. Development of language & communication in hearing children. ASHA
R., Robin D. A., & R. A. of and of sensorimotor speech disorders
A. The speech of and hearing children with to factors of of
Moore C. A., & J. speech from earlier oral of Speech and Hearing Research,
(2002). to fricative perception and how it the of the Acoustical Society of America,
S., C. S., & E. The of acoustic in the perception of by children and &
L., & Yoshinaga-Itano C. (2000). speech development at months in children with hearing loss be predicted from information available in the second year of
Oller R., & A. of a A with normal of Speech and Hearing Research,
Oller & C. of a of Speech and Hearing
J., M., & Speech intelligibility of children with cochlear implants, aids, or hearing of Speech and Hearing Research,
S., A. L., H., & production and language skills in children with cochlear of and
Perkell J., Lane H., M., & J. Speech of cochlear implant A study of vowel of the Acoustical Society of America,
Perkell J. S., L., Lane H., F. H., R., J., & P. Speech Acoustic auditory feedback and internal
Pittman A. L., Stemachowicz P. D. & (2003). characteristics of speech at the for in children. Journal of and Hearing Research,
Pratt R., A. & E. The efficacy of using the to treat young children with hearing of Speech and Hearing Research,
Pratt R., & Tye-Murray A. Speech impairment secondary to hearing of sensorimotor speech disorders
Smith C. hearing and speech production in the of Speech and Hearing Research,
P. Pittman A. L., M., D. & P. The importance of high-frequency in the speech and language development of children with hearing of and
Stoel-Gammon C. of hearing-impaired & normally hearing A of of Speech and Hearing
Stoel-Gammon C., & Babbling development of hearing-impaired and normally hearing of Speech and Hearing
Tobey E. A., Geers A. Brenner C., & (2003). associated with development of speech production skills in children implanted by age and
Uchanski R. M., & Geers A. E. (2003). Acoustic characteristics of the speech of young cochlear implant A with normal-hearing and
Waldstein R. of on speech for the role of auditory of the Acoustical Society of America,
Waldstein R., & Baum (1991). coarticulation in the speech of profoundly hearing-impaired and normally hearing children. Journal of Speech and Hearing Research, 34,
Wallace L., & Yoshinaga-Itano C. (2000). babble the to speech for all A study of children who are or hard of
Sheila R. Pratt, is in the of & at the of her at With With Additional to in Mar &
- Supplementary Content
4
- 10.17635/lancaster/thesis/938
- Apr 21, 2020
- University of Lancaster
Research has shown that phonetic features can index social meaning, yet less is known about whether this phenomenon occurs in the same way in speech production and speech perception. In particular, one of the factors that most seems to affect variables’ capacity for social meaning-making is the notion of salience. This thesis addresses the question of how phonetic variation points to social meaning in speech production and perception and what role salience plays in influencing this process. I investigate these issues using a sociophonetic study of two phonetic variables currently undergoing change in the South of England – /t/-glottalling and GOOSE-fronting – as produced and perceived by adolescents at a state school and a private school in Hampshire, UK. While the former is reported to be highly salient with strong socio-indexical relations, the latter is said not to be very salient and to lack associations with speakers’ social characteristics. The production results show that /t/-glottalling displays macro-sociological variation in the community, while GOOSE-fronting varies between peer groups within the private school. Both features can be used to index stances in interaction, but this effect is much stronger for /t/-glottalling. In perception, listeners were easily able to notice glottal /t/ in auditory stimuli and consistently associated it with a set of related social meanings, yet this was not the case for fronted GOOSE. The findings have implications for our understanding of how the social meanings of phonetic variables are produced and perceived by the same individuals, especially in the contexts of adolescent peer groups at school and social stratification between different types of school. I argue that researchers employing the construct of salience in sociolinguistics should acknowledge the limitations and different dimensions of the concept and operationalise these in their study design.
- Book Chapter
18
- 10.1145/3015783.3015797
- Apr 24, 2017
Chances are that most of us have experienced difficulty in listening to our interlocutor during face-to-face conversation in highly noisy environments, such as next to heavy traffic or against a background of high-intensity speech babble or loud music. On such occasions, we may have found ourselves looking at the speaker's lower face while our interlocutor articulates speech, in order to enhance speech intelligibility. In fact, what we resort to in such circumstances is known as lipreading or speechreading, namely the recognition of the so-called visual speech modality and its combination (fusion) with the available noisy audio data. Similar to humans, automatic speech recognition (ASR) systems also face difficulties in noisy environments. In recent years, ASR technology has made remarkable strides following the adoption of deep-learning techniques [Hinton et al. 2012, Yu and Deng 2015]. This has led to advanced ASR systems bridging the gap with human performance [Xiong et al. 2017], compared to their significant lag 20 years earlier, as established by Lippmann [1997]. Nevertheless, the quest for ASR noise robustness, particularly when noise is non-stationary and mismatched to training data, remains an active research topic [Li et al. 2015]. To help mitigate this problem, the question naturally arises as to whether machines can be designed to mimic human speech perception in noise. Namely, can they successfully incorporate visual speech into the ASR pipeline, especially since this represents an additional information source unaffected by the acoustic environment? At Bell Labs, Petajan [1984] was the first to develop and implement an early audio-visual automatic speech recognition (AVASR) system. Since then, the area has witnessed significant research activity, paralleling the advances in traditional audio-only ASR, while also utilizing progress in the computer vision and machine learning fields. 
Not surprisingly, adoption of deep learning techniques has created renewed interest in the field, resulting in remarkable progress on challenging domains, even surpassing human lipreading performance [Chung et al. 2017]. Since the very early works in the field [Stork and Hennecke 1996], design of AVASR systems has generally followed the basic architecture of Figure 12.1. There, a visual front-end module is depicted to provide speech-informative features that are extracted from the video of the speaker's face. These are subsequently fused with acoustic features into the speech recognition process. Clearly, compared to audio-only ASR, visual speech information extraction and audio-visual fusion (or integration) constitute two additional distinct components on which to focus. Indeed, their robustness under a wide range of audio-visual conditions and their efficient implementation represent significant challenges that, to date, remain the focus of active research. It should be noted that rapid recent advances, leading to so-called end-to-end AVASR systems [Assael et al. 2016, Chung et al. 2017], have somewhat blurred the distinction between these two components. Nevertheless, this division remains valuable to both the systematic exposure of the relevant material, as well as to the research and development of new systems. In this chapter, we concentrate on AVASR while also addressing other related problems, namely audio-visual speech activity detection, diarization, and synchrony detection. In order to address such subjects, we first provide additional motivation in Section 12.2, discussing bimodality of human speech perception and production. In Section 12.3, we overview AVASR research in view of its potential application scenarios to multimodal interfaces, visual sensors employed, and audio-visual databases typically used. 
In Section 12.4, we cover visual feature extraction and, in Section 12.5, we discuss audio-visual fusion for ASR, also providing examples of experimental results achieved by AVASR systems. In Section 12.6, we offer a glimpse into additional audio-visual speech applications. We conclude the chapter by enumerating Focus Questions for further study. In addition, we provide a brief Glossary of the chapter's core terminology, serving as a quick reference.
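The fusion step described in this abstract, in which visual speech features are combined with acoustic features before recognition, can be illustrated with a minimal sketch of feature-level fusion by concatenation. This is only one possible scheme and is not taken from the chapter: the function names, the toy feature values, and the assumption that the audio stream has an integer multiple of the visual frame rate are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def upsample_repeat(frames: Sequence[Vector], factor: int) -> List[Vector]:
    """Repeat each visual frame `factor` times (hypothetical helper).

    Assumes the audio feature rate is an integer multiple of the video
    frame rate, which is an illustrative simplification.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [list(frame) for frame in frames for _ in range(factor)]


def fuse_features(audio: Sequence[Vector],
                  visual: Sequence[Vector],
                  factor: int) -> List[Vector]:
    """Feature-level fusion: concatenate time-aligned audio and visual vectors."""
    visual_up = upsample_repeat(visual, factor)
    # Truncate to the shorter stream so every fused frame has both parts.
    n_frames = min(len(audio), len(visual_up))
    return [list(audio[t]) + visual_up[t] for t in range(n_frames)]


# Toy example: four 2-dimensional audio frames, one 1-dimensional visual frame.
audio_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
visual_feats = [[9.0]]
fused = fuse_features(audio_feats, visual_feats, factor=4)
```

Each fused frame here holds the audio values followed by the visual values, and a recognizer would then consume the fused sequence in place of the audio-only one. Concatenation is chosen only because it is the simplest way to show the idea; other integration strategies exist and are not sketched here.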
- Book Chapter
1
- 10.1016/b978-0-12-374825-6.00003-4
- Jan 1, 2010
- Multimodal Signal Processing
Chapter 3 - Speech Processing
- Research Article
- 10.1121/1.2029117
- Nov 1, 1990
- The Journal of the Acoustical Society of America
Six hot topics will be described, two topics in each of the following three areas: speech perception, speech production, and signal processing by machines. (1) In speech perception, new studies on the representation of speech suggest that listeners' stored descriptions of phonetic units take the form of a prototype of the speech category. Adult data address the language specificity of prototypes and the effects of contextual factors on prototypes. Infant data address the developmental issue of whether phonetic prototypes are innate or whether they derive from linguistic experience. (2) Also in the area of speech perception, there are new data on the effects of talker variability. Adults, and even young infants, are capable of normalizing the speech produced by different talkers, but listeners at both ages are nonetheless adversely affected by talker variability, suggesting that attentional factors play a role in speech perception. (3) In the area of speech production, there is new emphasis on the study of articulatory timing in speech motor control. The pattern of phase relations between different articulators and its relation to linguistic structure, as well as the general coordination of speech movements, are being investigated. (4) Also in speech production, there are new advances in our understanding of laryngeal mechanisms. These advances stem from models of the glottal volume-velocity source, findings on aerodynamic phenomena at the glottis, and analyses of the mechanical properties of vocal-fold tissue. (5) In the area of signal processing, neural networks are being applied to difficult problems in speech recognition, such as the recognition of phonetic units. Computational models composed of large numbers of interconnected processing elements (nodes) have been successfully applied to the analysis of dynamic information distributed over several phonetic units. 
(6) Also in signal processing, the application of speech coding to hearing aids and auditory prostheses has resulted in improved speech reception. Cross-fertilization across areas in speech communication has furthered our research endeavors.
- Single Book
319
- 10.1002/9780470757024
- Jan 1, 2005
List of Contributors. Preface: Michael Studdert-Kennedy (Haskins Laboratories). Introduction: David B. Pisoni (Indiana University) and Robert E. Remez (Barnard College). Part I: Sensing Speech. 1. Acoustic Analysis and Synthesis of Speech: James R. Sawusch (University at Buffalo). 2. Perceptual Organization of Speech: Robert E. Remez (Barnard College). 3. Primacy of Multimodal Speech Perception: Lawrence D. Rosenblum (University of California, Riverside). 4. Phonetic Processing by the Speech Perceiving Brain: Lynne E. Bernstein (House Ear Institute). 5. Event-related Evoked Potentials (ERPs) in Speech Perception: Dennis Molfese, Alexandra P. Fonaryova Key, Mandy J. Maguire, Guy O. Dove and Victoria J. Molfese (all University of Louisville). Part II: Perception of Linguistic Properties. 6. Features in Speech Perception and Lexical Access: Kenneth N. Stevens (Massachusetts Institute of Technology). 7. Speech Perception and Phonological Contrast: Edward Flemming (Stanford University). 8. Acoustic Cues to the Perception of Segmental Phonemes: Lawrence J. Raphael (Adelphi University). 9. Clear Speech: Rosalie M. Uchanski (CID at Washington University School of Medicine). 10. Perception of Intonation: Jacqueline Vaissiere (Laboratoire de Phonetique et de Phonologique, Paris). 11. Lexical Stress: Anne C. Cutler (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands). 12. Slips of the Ear: Z. S. Bond (Ohio University). Part III: Perception of Indexical Properties. 13. Perception of Dialect Variation: Cynthia Clopper and David B. Pisoni (both Indiana University). 14. Perception of Voice Quality: Jody Kreiman (UCLA), Diana Vanlancker-Sidtis (New York University) and Bruce R. Gerratt (UCLA). 15. Speaker Normalization in Speech Perception: Keith A. Johnson (Ohio State University). 16. Perceptual Integration of Linguistic and Non-Linguistic Properties of Speech: Lynne C. Nygaard (Emory University). Part IV: Speech Perception by Special Listeners. 17. 
Speech Perception in Infants: Derek M. Houston (Indiana University School of Medicine). 18. Speech Perception in Childhood: Amanda C. Walley (University of Alabama, Birmingham). 19. Age-related Changes in Spoken Word Recognition: Mitchell S. Sommers (Washington University). 20. Speech Perception in Deaf Children with Cochlear Implants: David B. Pisoni (Indiana University). 21. Speech Perception following Focal Brain Injury: William Badacker (Johns Hopkins University). 22. Cross-Language Speech Perception: Nuria Sebastian-Galles (Parc Cientific de Barcelona - Hospital de San Joan de Deu). 23. Speech Perception in Specific Language Impairment: Susan Ellis Weismer (University of Wisconsin, Madison). Part V: Recognition of Spoken Words. 24. Spoken Word Recognition: The Challenge of Variation: Paul A. Luce and Conor T. McLennan (State University of New York, Buffalo). 25. Probabilistic Phonotactics in Spoken Word Recognition: Edward T. Auer, Jr. (House Ear Institute) and Paul A. Luce (State University of New York, Buffalo). Part VI: Theoretical Perspectives. 26. The Relation of Speech Perception and Speech Production: Carol A. Fowler and Bruno Galantucci (both Haskins Laboratories). 27. A Neuroethological Perspective on the Perception of Vocal Communication Signals: Timothy Gentner (University of Chicago) and Gregory F. Ball (Johns Hopkins University). Index
- Front Matter
1
- 10.3389/fnhum.2015.00305
- May 27, 2015
- Frontiers in Human Neuroscience
This Research Topic consists of 14 manuscripts discussing the cognitive and neural organization of speech processing. The contributions are grouped around four themes: (1) Spoken language comprehension under difficult listening conditions; (2) Sub-lexical processing; (3) Sensorimotor processing of speech; (4) Speech production. Seven papers addressed speech perception under challenging listening conditions. Van Engen and Peelle (2014) discuss the effects of processing speech in an unfamiliar regional or foreign accent. They argue that, as perceiving accented speech incurs a processing cost, just like other types of distortions such as background noise, it should also be regarded as representing a challenging listening condition. Neger et al. (2014) focused on plasticity of speech processing in statistical and perceptual learning tasks in aging. They conclude that perceptual and statistical learning share mechanisms of implicit regularity detection, but that the ability to detect statistical regularities is impaired in older adults for fast visual sequences. Dekerle et al. (2014) examined whether speech perception in a multi-speaker background relies on semantic interference between the background and target speaker using a semantic priming paradigm in three experiments. Their results indicate that higher-level linguistic processes such as semantic priming may not be as automatic as commonly thought but are subject to the limits of cognitive resources such as working memory and attention. Yi et al. (2014) evaluate how processing of foreign-accented speech relates to social cognition. They conclude that foreign-accented speech perception engages greater activation of neural systems underlying speech perception, and that implicit Asian-foreign association is related to decreased neural efficiency in early spectrotemporal processing. Vitello et al. (2014) used fMRI to address the question of how semantic ambiguities are resolved during speech comprehension. 
Straus et al. (2014) examined through literature review whether neural oscillations in the alpha frequency range (~10 Hz) act as a neural mechanism to selectively inhibit the processing of noise to improve auditory selective attention to task-relevant speech signals. Ding and Simon (2014) discuss whether cortical entrained activity is related more closely to speech perception or to auditory encoding that is not specific to speech, by reviewing evidence regarding various hypotheses about the functional roles of cortical entrainment to speech. Three papers focused on perception of speech at sub-lexical levels. Deschamps and Tremblay (2014) studied perception of sub-lexical information by examining the neural bases of processing of simple syllables and more complex syllabic structures using fMRI, while Yu et al. (2014) used MEG to study the neural processing of disgust in anterior insula by presenting listeners with syllables with different intended emotional meanings. Finally, Chen et al. (2014) investigated processing of acoustic and phonological information in lexical tones in Mandarin Chinese using EEG. Two papers addressed sensorimotor processing of speech. Komeilipoor et al. (2014) report higher motor excitability as measured using Transcranial Magnetic Stimulation (TMS) in the tongue area during the presentation of meaningful gestures (noun-associated). Sowman et al. (2014) demonstrate that appropriately timed TMS to the hand area, paired with auditorily mediated excitation of the motor cortex, induces an enhancement of motor cortex excitability that lasts beyond the time of stimulation. Two papers focused on speech production. Etchell et al. (2014) provide a review of the stuttering literature and Hernandez-Pavon et al. (2014) present a neuronavigated TMS study exploring the neural locus of aspects of picture naming in healthy participants. 
This Frontiers Research Topic allows new insights into the neurobiology of speech perception and production, and demonstrates how the field of speech science is now addressing issues at its very core. We believe that the future of the research in the field lies in the effective combination of research methods, e.g., EEG and TMS, or fMRI and EEG, as research will benefit from the strengths of each method. In conclusion, this Research Topic consists of 14 excellent contributions, and we are convinced the Topic will provide readers with novel ideas for future studies that will elucidate the cognitive and neural architecture of speech processing.