Objectives: The mental parsing of linguistic hierarchy is crucial for language comprehension. While there is growing interest in the cortical tracking of auditory speech, the neurophysiological substrates for tracking written language remain unclear.

Methods: We recorded electroencephalographic (EEG) responses from participants exposed to auditory and visual streams of either random syllables or tri-syllabic real words. Using a frequency-tagging approach, we analyzed the neural representations of physically presented (i.e., syllables) and mentally constructed (i.e., words) linguistic units and compared them between the two sensory modalities.

Results: We found that the tracking of syllables is partially modality dependent, with anterior and posterior scalp regions more involved in the tracking of spoken and written syllables, respectively. In contrast, the cortical tracking of spoken and written words involved a shared anterior region to a similar degree, suggesting a modality-independent process for word tracking.

Conclusion: Our study suggests that, during the online processing of continuous language input, basic linguistic units are represented in a sensory modality-specific manner, whereas more abstract units are represented in a modality-independent manner.

Significance: The present methodology may be used in future research to examine the development of reading skills, particularly the impairments in fluent reading observed in individuals with dyslexia.
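To make the frequency-tagging logic in Methods concrete, the following is a minimal Python sketch of the analysis principle. The specific numbers are illustrative assumptions, not values reported in this study: the sketch assumes syllables presented at 4 Hz (so tri-syllabic words recur at ~1.33 Hz), a 250 Hz EEG sampling rate, and a simple neighbor-bin signal-to-noise measure for quantifying spectral peaks.

```python
# Illustrative frequency-tagging analysis (assumed parameters, not the study's).
# With syllables at a fixed rate, cortical tracking of syllables and of mentally
# constructed tri-syllabic words appears as EEG spectral peaks at the syllable
# rate and at one third of it, respectively.
import numpy as np

fs = 250.0                     # assumed EEG sampling rate (Hz)
syllable_rate = 4.0            # assumed syllable presentation rate (Hz)
word_rate = syllable_rate / 3  # tri-syllabic words -> ~1.33 Hz

# Simulated single-channel EEG: responses at both rates embedded in noise.
t = np.arange(0, 60, 1 / fs)   # one 60 s trial
eeg = (0.5 * np.sin(2 * np.pi * syllable_rate * t)
       + 0.3 * np.sin(2 * np.pi * word_rate * t)
       + np.random.randn(t.size))

# A long-window FFT gives the frequency resolution (1/60 Hz here) needed to
# separate the word-rate peak from neighboring noise bins.
spectrum = np.abs(np.fft.rfft(eeg)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

def peak_snr(target_hz, n_neighbors=10):
    """Amplitude at the target frequency relative to the mean of its
    neighboring bins -- one common way to quantify a tagged response."""
    idx = np.argmin(np.abs(freqs - target_hz))
    neighbors = np.r_[spectrum[idx - n_neighbors:idx],
                      spectrum[idx + 1:idx + 1 + n_neighbors]]
    return spectrum[idx] / neighbors.mean()

print(f"syllable-rate SNR: {peak_snr(syllable_rate):.2f}")
print(f"word-rate SNR:     {peak_snr(word_rate):.2f}")
```

In the real paradigm, a word-rate peak for the word streams but not for the random-syllable streams is the signature that word-level units are being mentally constructed, since nothing in the physical stimulus repeats at that rate.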