WiFi-based human activity recognition (HAR) plays an essential role in applications such as security surveillance, health monitoring, and smart homes. Existing HAR methods, though they perform well in indoor scenarios, depend heavily on massive labeled datasets for training, which are extremely difficult to acquire in practice. In this paper, we present AutoDLAR, a system for automatic data labeling and HAR. Built around a semi-supervised cross-modal learning framework with a hybrid loss function, AutoDLAR transfers rich visual information to automatically label WiFi signals for WiFi-based HAR. Specifically, we devise a lightweight, multi-view WiFi sensing model with a parallel feature embedding method that identifies activities accurately and speeds up recognition. We then use the video data to fine-tune a well-established visual HAR model, which generates pseudo-labels that guide the training of the WiFi model. We also build a synchronized Video-WiFi dataset covering seven types of human activities under different scenarios to train and validate the semi-supervised HAR system. Extensive experiments on our collected activity dataset and an emotion recognition benchmark demonstrate that AutoDLAR attains an average accuracy of over 95.89% without manual labeling and requires an inference time of only 3.35 ms, outperforming state-of-the-art (SOTA) methods.
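The cross-modal labeling idea summarized above can be illustrated with a minimal sketch. It is not AutoDLAR's implementation: the abstract does not specify the hybrid loss, so the combination below (cross-entropy on a confident hard pseudo-label plus a KL term on the visual model's soft output) is an assumption, as are the names `pseudo_label` and `hybrid_loss`, the confidence `threshold`, and the weight `alpha`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_probs, threshold=0.9):
    """Turn the visual (teacher) model's class probabilities into a hard
    pseudo-label. Returns (class_index, confidence), or None when the
    teacher is not confident enough (threshold is a hypothetical choice)."""
    conf = max(teacher_probs)
    if conf < threshold:
        return None
    return teacher_probs.index(conf), conf


def hybrid_loss(student_logits, teacher_probs, alpha=0.5, threshold=0.9,
                eps=1e-12):
    """Assumed hybrid loss for one synchronized video/WiFi sample:
    alpha * CE(student, hard pseudo-label) + (1 - alpha) * KL(teacher || student).
    The CE term is skipped when no confident pseudo-label exists."""
    student_probs = softmax(student_logits)

    # Soft term: KL divergence from the teacher to the WiFi (student) output.
    kl = sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

    # Hard term: cross-entropy against the confident pseudo-label, if any.
    label = pseudo_label(teacher_probs, threshold)
    ce = -math.log(student_probs[label[0]] + eps) if label is not None else 0.0

    return alpha * ce + (1 - alpha) * kl
```

For a seven-class setting, a teacher output such as `[0.94, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]` yields a pseudo-label for class 0, and a WiFi model whose logits favor class 0 incurs a smaller loss than one favoring any other class. A uniform teacher output produces no hard label, so only the soft term contributes for that sample.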