Adults heard recordings of two spatially separated speakers reading newspaper and magazine articles. They were asked to attend to one speaker and ignore the other while EEG was recorded to assess their neural processing. Machine learning extracted neural sources that tracked the target and distractor speakers at three levels: the acoustic envelope of speech (delta- and theta-band modulations), the lexical frequency of individual words, and the contextual predictability of individual words as estimated by GPT-4 and by earlier lexical models. To provide a broader view of speech perception, half of the subjects completed a simultaneous visual task, and the listeners included both native and non-native English speakers. Distinct neural components were extracted for these levels of auditory and lexical processing, demonstrating that native English speakers showed greater target-distractor separation than non-native English speakers on most measures, and that lexical processing was reduced by the visual task. Moreover, there was a novel interaction of lexical predictability and frequency with auditory processing: acoustic tracking was stronger for lexically harder words, suggesting that listeners attended more closely to the acoustics when this was needed for lexical selection. This demonstrates that speech perception is not simply a feedforward process from acoustic processing to the lexicon. Rather, the adaptable, context-sensitive processing long known to occur at the lexical level has broader consequences for perception, coupling with the acoustic tracking of individual speakers in noise.

Significance Statement

In challenging listening conditions, people use focused attention to help understand individual talkers and ignore others, which changes their neural processing of speech from auditory through lexical levels. However, lexical processing for natural materials (e.g., conversations, audiobooks) has been difficult to measure because of the limitations of tools for estimating the predictability of individual words in longer discourses. The present investigation uses a contemporary large language model, GPT-4, to estimate word predictability and demonstrates that listeners make online adaptations to their auditory neural processing in accord with these predictions: neural activity more closely tracks the acoustics of the target talker when words are less predictable from their context.
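As a point of reference for how contextual predictability of individual words can be quantified with a language model, the sketch below computes per-word surprisal (negative log probability of a word given its preceding context) from a causal language model. This is only an illustration of the general technique, not the authors' pipeline: it uses the openly available GPT-2 as a stand-in for GPT-4, whose scoring interface is not described in this abstract, and the function name word_surprisals is hypothetical.

```python
# Minimal sketch: per-word surprisal, -log2 p(word | preceding context),
# under a causal language model. GPT-2 stands in for GPT-4 here; this is an
# illustration of the general approach, not the paper's actual pipeline.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def word_surprisals(text):
    """Return (word, surprisal in bits) pairs; the first word has no context and is skipped."""
    enc = tokenizer(text, return_tensors="pt")
    ids = enc.input_ids[0]
    with torch.no_grad():
        logits = model(enc.input_ids).logits[0]
    # Log-probability of each token given all preceding tokens.
    logprobs = torch.log_softmax(logits[:-1], dim=-1)
    token_surprisal = -logprobs[torch.arange(len(ids) - 1), ids[1:]] / math.log(2.0)
    # Sum subword-token surprisals back into whole-word surprisals
    # (GPT-2's tokenizer marks word-initial tokens with a leading space).
    words, current, total = [], None, 0.0
    for tok_id, s in zip(ids[1:].tolist(), token_surprisal.tolist()):
        piece = tokenizer.decode([tok_id])
        if piece.startswith(" ") and current is not None:
            words.append((current.strip(), total))
            current, total = piece, s
        else:
            current = (current or "") + piece
            total += s
    if current is not None:
        words.append((current.strip(), total))
    return words

print(word_surprisals("The cat sat on the mat."))
```

In this framing, "lexically harder" words are those with higher surprisal (less predictable from context) or lower corpus frequency, which is the sense in which word-level predictability can be related to the strength of acoustic tracking.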