A classic problem in spoken language comprehension is how listeners perceive speech as being composed of discrete words, given the variable time-course of information in continuous signals. We propose a syllable inference account of spoken word recognition and segmentation, according to which alternative hierarchical models of syllables, words, and phonemes are dynamically posited, which are expected to maximally predict incoming sensory input. Generative models are combined with current estimates of context speech rate drawn from neural oscillatory dynamics, which are sensitive to amplitude rises. Over time, models which result in local minima in error between predicted and recently experienced signals give rise to perceptions of hearing words. Three experiments using the visual-world eye-tracking paradigm with a picture-selection task tested hypotheses motivated by this framework. Materials were sentences that were acoustically ambiguous in the numbers of syllables, words, and phonemes they contained (cf. English plural constructions, such as "saw (a) raccoon(s) swimming," which have two loci of grammatical information). Time-compressing, or expanding, speech materials permitted determination of how temporal information at, or in the context of, each locus affected looks to, and selection of, pictures with a singular or plural referent (e.g., one or more than one raccoon). Supporting our account, listeners probabilistically interpreted identical chunks of speech as consistent with a singular or plural referent to a degree that was based on the chunk's gradient rate in relation to its context. We interpret these results as evidence that arriving temporal information, judged in relation to language model predictions generated from context speech rate evaluated on a continuous scale, informs inferences about syllables, thereby giving rise to perceptual experiences of understanding spoken language as words separated in time.