In order to perform complex tasks in realistic human environments, robots need to be able to learn new concepts in the wild, incrementally, and through their interactions with humans. This article presents an end-to-end pipeline to learn object models incrementally during human–robot interaction (HRI). The pipeline we propose consists of three parts: 1) recognizing the interaction type; 2) detecting the object that the interaction is targeting; and 3) incrementally learning object models from data recorded by the robot's sensors. Our main contributions lie in the target object detection, which is guided by the recognized interaction, and in the incremental object learning. The novelty of our approach is its focus on natural, heterogeneous, and multimodal HRIs to incrementally learn new object models. Throughout the article, we highlight the main challenges associated with this problem, such as high degrees of occlusion and clutter, domain change, low-resolution data, and interaction ambiguity. This article shows the benefits of using multiview approaches and of combining visual and language features, and our method outperforms standard baselines in our experiments.

Note to Practitioners—This article was motivated by the challenges of recognition tasks in dynamic and varying scenarios. Our approach learns to recognize new user interactions and objects. To do so, it uses multimodal data from the user–robot interaction: visual data are used to learn the objects, while speech is used to learn the labels and to help recognize the interaction type. We use state-of-the-art deep learning (DL) models to segment the user and the objects in the scene. Our algorithm for incremental learning is based on a classic incremental clustering approach. The proposed pipeline works with all sensors mounted on the robot, which keeps the system mobile. This article uses data recorded with a Baxter robot, which will enable the use of its manipulator arms in future work, but the pipeline would work with any robot on which the same sensors can be mounted: two RGB-D cameras and a microphone. The two DL-based steps currently impose high computational requirements; we have tested the pipeline on a desktop computer with a GTX 1060 GPU and 32 GB of RAM.
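For practitioners, a minimal sketch may help make the third pipeline stage concrete. The article states only that the incremental learner is based on a classic incremental clustering approach; the sketch below assumes a leader-style scheme in which a visual feature embedding joins the nearest existing object model if it falls within a distance threshold and otherwise seeds a new model labeled from the speech channel. All names (IncrementalObjectLearner, observe, threshold) and values here are illustrative assumptions, not the authors' actual implementation.

from dataclasses import dataclass

import numpy as np


@dataclass
class ObjectModel:
    """One incremental cluster: a spoken label plus a running mean of features."""
    label: str
    centroid: np.ndarray
    count: int = 1

    def update(self, feature: np.ndarray) -> None:
        # Running-mean update: the centroid stays the mean of all samples seen.
        self.count += 1
        self.centroid += (feature - self.centroid) / self.count


class IncrementalObjectLearner:
    """Leader-style incremental clustering over visual feature embeddings.

    A sample joins the nearest existing model if it lies within `threshold`;
    otherwise it seeds a new model. This is one classic incremental-clustering
    scheme, assumed here purely for illustration.
    """

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold
        self.models: list[ObjectModel] = []

    def observe(self, feature: np.ndarray, label: str) -> ObjectModel:
        # Find the closest existing model by Euclidean distance.
        best, best_dist = None, float("inf")
        for model in self.models:
            dist = float(np.linalg.norm(feature - model.centroid))
            if dist < best_dist:
                best, best_dist = model, dist
        if best is not None and best_dist <= self.threshold:
            best.update(feature)
            return best
        # No sufficiently close model: start a new one, named via speech.
        new_model = ObjectModel(label=label, centroid=feature.astype(float))
        self.models.append(new_model)
        return new_model


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    learner = IncrementalObjectLearner(threshold=1.0)
    # Simulate embeddings of two objects whose labels come from speech.
    for _ in range(5):
        learner.observe(rng.normal(0.0, 0.1, size=8), label="mug")
        learner.observe(rng.normal(3.0, 0.1, size=8), label="book")
    print(len(learner.models))  # expected: 2, one model per object

The running-mean update keeps per-model memory constant, which matters when the learner must run on-robot alongside two computationally demanding DL models.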