As cameras and Wi-Fi access points become widely deployed in public places, new mobile applications and services can be built by connecting live video analytics to the Wi-Fi-enabled mobile devices of the relevant users. A critical challenge is to associate each person appearing in the video with the network ID, e.g., the MAC address, of the mobile device they carry. To address this challenge, we propose RFCam, a system for human identification that fuses Wi-Fi and camera data. RFCam uses a multi-antenna Wi-Fi radio to collect the channel state information (CSI) of Wi-Fi packets sent by mobile devices, and a camera to monitor users in the area. From low-sampling-rate CSI data, RFCam derives heterogeneous embedding features on location, motion, and user activity for each device over time, and fuses them with visual user features generated by video analytics to find the best matches. To mitigate the impact of multi-user environments on wireless sensing, we develop video-assisted learning models for the different features, quantify their uncertainties, and combine them with video analytics to rank moments and features for robust and efficient fusion. RFCam was implemented and tested in indoor environments for over 800 minutes with 25 volunteers, and extensive evaluation results show that RFCam achieves an average real-time identification accuracy of 97.01% across all experiments with up to ten users, significantly outperforming existing solutions.
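The device-to-user matching described above can be illustrated with a minimal sketch. This is not RFCam's actual algorithm; it assumes hypothetical per-device CSI-derived embeddings and per-user visual embeddings in a shared feature space, and uses an uncertainty-weighted distance with Hungarian assignment as one plausible way to find the best one-to-one matches.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_devices_to_users(device_feats, user_feats, uncertainties):
    """Match each Wi-Fi device to a camera-tracked user (illustrative sketch).

    device_feats:  (D, F) array, CSI-derived embeddings per device
                   (e.g., location/motion/activity features over time)
    user_feats:    (U, F) array, visual embeddings per tracked user
    uncertainties: (F,) array, per-feature uncertainty estimates;
                   lower-uncertainty features get higher weight
    Returns a dict mapping device index -> matched user index.
    """
    # Convert uncertainties into normalized weights (hypothetical scheme).
    w = 1.0 / (uncertainties + 1e-6)
    w = w / w.sum()
    # Pairwise uncertainty-weighted Euclidean distance, shape (D, U).
    diff = device_feats[:, None, :] - user_feats[None, :, :]
    cost = np.sqrt(((diff ** 2) * w).sum(axis=-1))
    # Hungarian algorithm finds the minimum-cost one-to-one assignment.
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))

# Example: three devices, three users; each device's embedding is close
# to exactly one user's embedding, so matching recovers the pairing.
devices = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
users = np.array([[2.1, 2.0], [0.1, 0.0], [1.0, 1.1]])
unc = np.array([0.5, 0.5])
print(match_devices_to_users(devices, users, unc))  # {0: 1, 1: 2, 2: 0}
```

In a real multi-user setting, the cost matrix would be accumulated over the ranked high-confidence moments rather than a single frame, so that transient sensing errors do not corrupt the assignment.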