In this paper, we propose a method for 3D model retrieval and pose estimation in indoor scenes from a single image. By simulating the scene context of the input image, our method handles challenging scenarios featuring cluttered backgrounds and severe occlusions. To use our system, the user only needs to draw a few semantic bounding boxes around the query objects. The proposed approach then retrieves the most similar 3D models from the ShapeNet repository and aligns them with the corresponding objects automatically. To this end, each 3D model is represented by calibrated view-dependent visual elements learned from its rendered views. Guided by the estimated occlusion relationships, the rendered views are stacked at the corresponding locations to simulate the scene context. By matching these composed scenes against the input image, the most similar 3D models, together with their approximate poses, are retrieved. Experimental results on public datasets demonstrate the effectiveness of the proposed method. Moreover, we show that the retrieval time can be significantly reduced with a novel greedy algorithm. Extensive experiments on synthesized images show that the proposed algorithm achieves high retrieval accuracy, outperforming state-of-the-art methods.