Safety in the construction industry has long been a focus of attention. Existing approaches to detecting workers' unsafe behavior relied primarily on manual inspection, which not only consumed significant time and money but also inevitably produced omissions. Current automated techniques judge behavior using cues from the workers themselves alone, making it difficult to recognize unsafe behaviors in complex scenes. To address these problems, this study proposed a method that automatically extracts workers' unsafe behaviors by incorporating information from complex scenes: attention-based image captioning. First, three image-captioning models were constructed using convolutional neural networks (CNNs), which are widely used in AI; these models extract key information from complex scenes. Then, two datasets dedicated to the construction domain were created for method validation. Finally, three sets of experiments were conducted by combining the datasets with the three models. The results showed that the method could detect a worker's job type and output the interaction behavior between the worker and the target (unsafe behavior) based on the environmental information in construction images. We introduced environmental information into the determination of workers' unsafe behaviors for the first time, and the method outputs not only the worker's job type but also the worker's behavior. This makes the model output better suited to ergonomic analysis.
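The attention mechanism at the core of such a captioning model can be illustrated with a minimal sketch: alignment scores are computed between the decoder state and the spatial feature vectors produced by the CNN encoder, normalized with a softmax, and used to form a context vector as a weighted sum. All names, dimensions, and the dot-product scoring function below are illustrative assumptions, not details from the paper:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def soft_attention(features, hidden):
    # features: list of spatial feature vectors from a CNN encoder (hypothetical)
    # hidden: current decoder hidden state (hypothetical)
    # alignment score: dot product (one common choice; the paper's may differ)
    scores = [sum(f_i * h_i for f_i, h_i in zip(f, hidden)) for f in features]
    weights = softmax(scores)
    # context vector: attention-weighted sum of the spatial features
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return context, weights

# toy example: 3 spatial locations with 2-dimensional features
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [1.0, 0.0]
ctx, w = soft_attention(feats, h)
```

The decoder would recompute these weights at each word-generation step, letting the caption attend to different scene regions (e.g., the worker versus surrounding equipment) as it describes the interaction.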