Abstract

Students' on‐task engagement during adaptive learning activities has a significant effect on their performance; at the same time, how these activities influence students' behavior is reflected in the effort they exert. Capturing and explaining effortful (or effortless) behavior and aligning it with learning performance within contemporary adaptive learning environments holds the promise of providing timely, proactive and actionable feedback to students. Using sophisticated machine learning (ML) algorithms and rich learner data facilitates inference‐making about several behavioral aspects (including effortful behavior) and the prediction of learning performance in any learning context. Researchers have been using ML methods in a “black‐box” approach, ie, as a tool where the input is the learner data and the output is a given class of the chosen construct. This work proposes a methodological shift from the “black‐box” approach to a “grey‐box” approach that bridges the hypothesis/literature‐driven (feature extraction) “white‐box” approach with the computation/data‐driven (feature fusion) “black‐box” approach. This shift allows us to use data features that are educationally and contextually meaningful. This paper aims to extend current methodological paradigms and puts the proposed approach into practice in an adaptive self‐assessment case study, taking advantage of new, cutting‐edge, interdisciplinary work on building pipelines for educational data using innovative tools and techniques.

Practitioner Notes

What is already known about this topic
Capturing and measuring learners' engagement and behavior using physiological data has been explored in recent years and shows great potential.
Effortless behavioral patterns commonly exhibited by learners, such as “cheating,” “guessing” or “gaming the system,” counterfeit the learning outcome.
Multimodal data can accurately predict learning engagement, performance and processes.

What this paper adds
Generalizes a methodology for building machine learning pipelines for multimodal educational data, using a modularized approach, namely the “grey‐box” approach.
Shows that the fusion of eye‐tracking, facial expression and arousal data provides the best prediction of effort and performance in adaptive learning settings.
Highlights the importance of fusing data from different channels to obtain the most suitable combinations of the multimodal data streams to predict and explain effort and performance in terms of pervasiveness, mobility and ubiquity.

Implications for practice and/or policy
Learning analytics researchers will be able to use an innovative methodological approach, namely the “grey‐box” approach, to build machine learning pipelines from multimodal data, taking advantage of artificial intelligence capabilities in any educational context.
Learning design professionals will have the opportunity to fuse specific features of the multimodal data to drive the interpretation of learning outcomes in terms of physiological learner states.
The constraints of educational contexts (eg, ubiquity, low cost) can be catered for using the modularized grey‐box approach, which can also be used with standalone data sources.