Student presentations delivered with PowerPoint are inherently multimodal communicative events, combining visuals on slides with spoken commentary. However, despite their ubiquity in higher education, few studies have investigated the rhetorical relations between their auditory and visual modes. This study attempts to address this gap by applying theoretical frameworks for logico-semantics and image–text relations to student presentations conducted online at a private university in Japan. Analysis of over 5 hours of recorded data revealed that clauses in students’ spoken commentary related to visible entities on the screen primarily through exposition and, to a lesser extent, specification, summary, extension and enhancement. A comparison of different stages within the presentations showed that summary and expansion played a bigger role whenever students provided background information or concluded the presentation. In addition, qualitative discussion of selected excerpts sheds light on how the reading path alternated between exposition of visual text through repetition or synonymy and embellishment through specification or enhancement. However, a comparison across students also indicated that genre-specific conventions are not yet established for all students. The selection and configuration of logico-semantic relations influenced how slides compensated for verbal deficiencies, which has implications for the English as a Foreign Language context.