This paper presents a human–robot interaction system (HRIS) that uses human perception and action recognition to enable a robot to understand human intentions and interact flexibly with people. First, a monocular multi-person three-dimensional (3D) pose estimation method is proposed to estimate the two-dimensional (2D) and 3D poses of multiple people in interaction scenarios. A 3D skeleton pose tracking approach is then adopted to maintain the identity of each person across consecutive frames and improve interaction stability. Next, an action recognition model is developed that exploits the tracked pose features to recognize human intentions. An action-controlled interaction system is built in a modular fashion so that it can be adapted to multiple task requirements. Within the system, a distance-based safety solution is designed to avoid collisions between humans and robots. Finally, experimental results are presented to demonstrate the feasibility and effectiveness of the proposed methods and system.
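As an illustration of the distance-based safety idea mentioned above, the following is a minimal sketch and not the paper's implementation. It assumes that the tracked 3D human joints and a set of sampled points on the robot are expressed in a common frame in metres, and that the robot speed is scaled linearly between a stop distance and a slow-down distance. The function names `min_distance` and `speed_scale` and the thresholds `stop_dist` and `slow_dist` are hypothetical choices made only for this example.

```python
import math


def min_distance(human_joints, robot_points):
    """Smallest Euclidean distance between any human joint and any robot point.

    Both arguments are iterables of (x, y, z) tuples in the same frame (assumed).
    """
    return min(math.dist(h, r) for h in human_joints for r in robot_points)


def speed_scale(distance, stop_dist=0.3, slow_dist=1.0):
    """Map a separation distance to a speed factor in [0, 1].

    stop_dist and slow_dist are illustrative thresholds, not values from the paper.
    """
    if distance <= stop_dist:
        return 0.0  # too close: stop the robot
    if distance >= slow_dist:
        return 1.0  # far enough: full speed
    # Linear ramp between the two thresholds.
    return (distance - stop_dist) / (slow_dist - stop_dist)


if __name__ == "__main__":
    human = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]  # two tracked joints
    robot = [(0.65, 0.0, 1.0)]                   # one sampled robot point
    d = min_distance(human, robot)
    print(f"separation = {d:.2f} m, speed factor = {speed_scale(d):.2f}")
```

A linear ramp is used here only because it is simple to reason about; a real system might instead use stepped safety zones or account for relative velocity.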