When we open our eyes, we do not see a jumble of light or colorful patterns. There is a great distance between the raw inputs sensed at our retinas and what we experience as the contents of our perception. How does the brain transform incoming sense inputs into rich, discrete structures that we can think about and plan with? These "world models" include representations of objects with kinematic and dynamical properties, scenes with navigational affordances, and events with temporally demarcated dynamics. Real-world scenes are complex, but given a momentary task, only a fraction of this complexity is relevant to the observer. Attention allows us to selectively form these world models as task-driven, simulatable state spaces that drive flexible action. How do the mind and brain build and use such internal models of the world from raw visual inputs? In this talk, I will begin to address this question by presenting two new computational modeling frameworks. First, in high-level vision, I will show how population-level neural activity in the macaque visual cortex can be reverse-engineered in the language of three-dimensional objects and computer graphics, by combining generative models with deep neural networks. Second, I will present a novel account of attention based on adaptive computation, which situates vision in the broader context of an agent with goals, and show how it explains the internal representations and implicit goals underlying the selectivity of scene perception.