This study attempted to optimize a computer-based learning environment designed to teach learners how to solve word problems by incorporating an animated pedagogical agent. The agent was programmed to deliver instructional explanations either textually or aurally, while simultaneously using gaze and gesture to direct the learners to focus their attention on the relevant part of the example. In Experiment 1, learners presented with an agent delivering explanations aurally (voice plus agent) outperformed their control peers on measures of transfer. In Experiment 2, learners in the voice-plus-agent condition outperformed their peers presented with textual explanations on a variety of measures, including far transfer. In sum, an animated agent programmed to deliver instructions aurally can help optimize learning from examples.
A worked example is an instructional device that provides a model for solving a particular type of problem by presenting the solution in a step-by-step fashion. It is intended to provide the learner with an expert’s solution, which the learner can use as a model for his or her own problem solving. To date, in most experiments, worked examples have been visually fixed; that is, the examples simultaneously presented a problem and an expert’s solution steps. As such, these worked examples are similar to those found in traditional mathematics and science texts; however, instructional materials delivered on multimedia computer systems need not be limited in this way. For example, Stark (1999) and Renkl (1997) suggested that example processing can be enhanced by sequentially presenting problem states. According to Stark and Renkl, this type of presentation encourages learners to explain the examples to themselves by anticipating the next step in an example solution, then checking to determine whether the predicted step corresponded to the actual step, a phenomenon Renkl termed anticipative reasoning.
According to Catrambone (1994, 1996, 1998), worked examples should be structured so they emphasize conceptually related solution steps (i.e., subgoals) by visually isolating them, by labeling them, or both. With regard to presenting examples that require learners to reference multiple sources of information, Mousavi and his colleagues (Mousavi, Low, & Sweller, 1995) offer a simple solution: Some segments of instructional information should be presented visually, whereas other segments should be presented aurally (i.e., mixed-mode format).
One advantage of using the computer to deliver instruction is that it enables instructional designers to combine multiple instructional principles or components in a worked example, which may enhance its efficacy. According to Mayer’s (1997) generative theory of multimedia learning, computers, in contrast to a book-based medium, also provide a more favorable environment in which to implement some forms of effective instruction, such as the coordination of the visual presentation of sequential problem states with an auditory description of each of those states. For example, to maximize learning within a computer-based multimedia environment, Atkinson and Derry (2000) created a multicomponent worked example that (a) was sequential, in that it consisted of a sequential presentation of problem states; (b) was constructed to emphasize problem subgoals (i.e., it was subgoal oriented); and (c) incorporated a second modality that was coordinated with the sequential presentation of problem states (i.e., visually presented steps coupled with verbal instructional explanations). Learners exposed to these sequential, subgoal-oriented (SE–SO) examples with dual modes outperformed learners who were exposed to more traditional, simultaneous, non-subgoal-oriented examples on conceptually based measures of problem-solving transfer. Moreover, this difference occurred despite the fact that the examples in the latter condition were also dual mode.