There has been an increasing demand for housework robots that can handle a variety of objects. However, object-shape-oriented tasks remain difficult in conventional research because they require dealing with multiple surfaces, invisible areas, and occlusion; robots must perceive shapes and adjust their movements even when the surfaces cannot be seen directly. Humans usually tackle such problems by integrating multiple sensory modalities. Inspired by this human perception mechanism, in this study we considered the effective utilization of image, force, and tactile data in constructing a multimodal deep neural network (DNN) model for object-shape perception and motion generation. As an example, we set up a task in which a robot wipes around the outside of objects imitating lampshades. The wiping motions include moments when the robot's hand must leave the surface, as well as the turning directions required to wipe the next surface, even though some parts of the surfaces, such as the backside or parts occluded by the robot's arm, cannot be seen directly. If the DNN model uses visual information continuously, its performance is degraded by the occluded images. Hence, the best-performing DNN model uses an image from only the initial time step to roughly perceive the object's shape and size, and then generates motions by integrating this perception with tactile and force sensing. We conclude that an effective approach to object-shape-oriented manipulation is to first use vision to outline the target shape and thereafter use force and tactile sensing to capture concrete surface features while performing the task.
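To illustrate the architecture concept described above (an initial-frame image feature fused with per-step force/tactile input in a recurrent motion generator), a minimal sketch is shown below. This is an assumption-laden illustration rather than the paper's actual implementation: the framework (PyTorch), the module name InitialImageWipingNet, and all layer sizes and input dimensions are hypothetical. The key design point it demonstrates is that the image encoder is applied only once, at the first time step, so later occluded frames never enter the model.

```python
# Minimal sketch (assumptions: PyTorch; names, layer sizes, and input
# dimensions are illustrative and not taken from the paper).
# The image at t=0 is encoded once into a shape/size feature; an LSTM then
# fuses that fixed feature with per-step force/tactile input to predict motions.
import torch
import torch.nn as nn


class InitialImageWipingNet(nn.Module):
    def __init__(self, force_tactile_dim=16, motion_dim=7, img_feat_dim=32, hidden_dim=64):
        super().__init__()
        # CNN encoder applied only to the initial-frame image
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, img_feat_dim),
        )
        # Recurrent core: fuses the fixed image feature with force/tactile input at every step
        self.rnn = nn.LSTM(img_feat_dim + force_tactile_dim, hidden_dim, batch_first=True)
        self.motion_head = nn.Linear(hidden_dim, motion_dim)

    def forward(self, initial_image, force_tactile_seq):
        # initial_image:     (B, 3, H, W)              image at the first time step
        # force_tactile_seq: (B, T, force_tactile_dim) per-step force/tactile readings
        shape_feat = self.img_encoder(initial_image)           # (B, img_feat_dim)
        T = force_tactile_seq.size(1)
        shape_seq = shape_feat.unsqueeze(1).expand(-1, T, -1)  # repeat over time
        fused = torch.cat([shape_seq, force_tactile_seq], dim=-1)
        out, _ = self.rnn(fused)
        return self.motion_head(out)                           # (B, T, motion_dim) motion commands


# Example usage with dummy tensors
model = InitialImageWipingNet()
img0 = torch.randn(2, 3, 64, 64)   # initial-frame camera images
ft = torch.randn(2, 50, 16)        # 50 steps of force/tactile data
motions = model(img0, ft)          # (2, 50, 7) predicted motion sequence
```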