Abstract

Construction robots play a pivotal role in enabling intelligent processes in the construction industry. User-friendly interfaces that facilitate efficient human–robot collaboration are essential for promoting robot adoption. However, most existing interfaces do not consider contextual information in the collaborative environment. When humans and robots work together on the same jobsite, a unique environmental context arises, and overlooking this context limits the potential to optimize interaction efficiency. This paper proposes a novel context-aware method that uses a two-stream network to enhance human–robot interaction in construction settings. In the proposed network, a first-person-view stream attends to the relevant spatiotemporal regions for context extraction, while a motion-sensor stream extracts features related to hand motions. By fusing the visual context with the motion data, the method recognizes gestures for efficient communication between construction workers and robots. Experimental evaluation on a dataset collected from five construction sites demonstrates an overall classification accuracy of 92.6%, underscoring the practicality and potential benefits of the proposed method.
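As a rough illustration of the two-stream fusion idea summarized above (not the architecture reported in the paper), the sketch below concatenates a vision feature vector with motion-sensor features before gesture classification. All layer choices, feature dimensions, the IMU channel count, and the class count are hypothetical assumptions, written in PyTorch.

```python
import torch
import torch.nn as nn

class TwoStreamGestureNet(nn.Module):
    """Minimal sketch of a two-stream network with late fusion for
    gesture recognition. Backbones and sizes are illustrative, not
    the configuration used in the paper."""

    def __init__(self, num_gestures: int = 8):
        super().__init__()
        # Vision stream: a small 3D convolution over first-person video,
        # globally pooled (a stand-in for the spatiotemporal attention
        # described in the abstract).
        self.vision_stream = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),            # -> (batch, 16)
        )
        # Motion stream: an LSTM over wearable sensor readings
        # (assumed 6 channels: 3-axis accelerometer + 3-axis gyroscope).
        self.motion_stream = nn.LSTM(input_size=6, hidden_size=32,
                                     batch_first=True)
        # Late fusion: concatenate both feature vectors, then classify.
        self.classifier = nn.Linear(16 + 32, num_gestures)

    def forward(self, video: torch.Tensor, imu: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, frames, H, W); imu: (batch, steps, 6)
        v = self.vision_stream(video)
        _, (h, _) = self.motion_stream(imu)
        m = h[-1]                    # final hidden state, (batch, 32)
        return self.classifier(torch.cat([v, m], dim=1))

# Example forward pass on random stand-in data.
model = TwoStreamGestureNet()
logits = model(torch.randn(2, 3, 16, 64, 64), torch.randn(2, 50, 6))
print(logits.shape)  # torch.Size([2, 8])
```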