Tracking people in dense indoor environments where several thousand people move around is an extremely challenging problem. In this paper, we present DenseTrack, a system for tracking people in such environments. DenseTrack leverages data from the sensing modalities that are already present in these environments: Wi-Fi (from enterprise network deployments) and video (from surveillance cameras). We combine Wi-Fi information with video data to overcome the errors that each modality induces on its own. More precisely, locations derived from video are used to overcome the localization errors inherent in Wi-Fi signals, while precise Wi-Fi MAC IDs are used to locate the same devices across different levels and locations inside a building. Localization in dense environments is typically computationally expensive when done with video data alone, and hence hard to scale. DenseTrack combines Wi-Fi and video data to improve the accuracy of tracking people who are represented by video objects from non-overlapping video feeds. DenseTrack is a scalable and device-agnostic solution, as it requires neither app installation on user smartphones nor modifications to the Wi-Fi system. At the core of DenseTrack is our algorithm, inCremental Association of Independent Variables under Uncertainty (CAIVU). CAIVU is inspired by the multi-armed bandit model and is designed to handle various complex features of practical real-world environments. Using connectivity information, CAIVU matches the devices reported by an off-the-shelf Wi-Fi system to specific video blobs obtained through a computationally efficient analysis of video data. By exploiting data from heterogeneous sources, DenseTrack offers an effective real-time solution for individual tracking in heavily populated indoor environments. We emphasize that no previous system has targeted, or been validated in, such dense indoor environments.
We tested DenseTrack extensively using both simulated data and two real-world validations, with data from an extremely dense convention center and a moderately dense university environment. Our simulation results show that DenseTrack achieves an average video-to-Wi-Fi matching accuracy of up to 90% in dense environments, with a matching latency of 60 s on the simulator. When tested in a real-world, extremely dense environment with over 500,000 people moving between different non-overlapping camera feeds, DenseTrack achieved an average match accuracy of 83% to within a distance of two people, with an average latency of 48 s.
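
To make the idea of incrementally associating Wi-Fi devices with video blobs concrete, the following is a minimal illustrative sketch. It is not the CAIVU algorithm, whose details are not given here; it only shows how evidence for a (device, blob) pairing could accumulate over time windows. The class `IncrementalMatcher`, the zone labels, and the assumptions that each Wi-Fi device and each video blob can be assigned to a coarse zone per time window and that blob identifiers persist across windows are all hypothetical and introduced purely for illustration.

```python
from collections import defaultdict


class IncrementalMatcher:
    """Toy co-occurrence accumulator (hypothetical; not the CAIVU algorithm)."""

    def __init__(self):
        # score[mac][blob_id] = number of time windows in which the device
        # and the blob were observed in the same zone.
        self.score = defaultdict(lambda: defaultdict(int))

    def observe(self, wifi_zones, blob_zones):
        """Update evidence from one time window.

        wifi_zones: {mac: zone} as reported by the Wi-Fi system (assumed).
        blob_zones: {blob_id: zone} as derived from video (assumed).
        """
        for mac, zone in wifi_zones.items():
            for blob_id, blob_zone in blob_zones.items():
                if blob_zone == zone:
                    self.score[mac][blob_id] += 1

    def best_match(self, mac):
        """Return the blob with the most co-occurrences, or None if unseen."""
        candidates = self.score.get(mac)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


matcher = IncrementalMatcher()
# Window 1: both devices and both blobs share a zone, so the pairing is ambiguous.
matcher.observe({"aa:01": "hall_A", "aa:02": "hall_A"},
                {"blob1": "hall_A", "blob2": "hall_A"})
# Window 2: one device and one blob move together, which resolves the ambiguity.
matcher.observe({"aa:01": "hall_B", "aa:02": "hall_A"},
                {"blob1": "hall_B", "blob2": "hall_A"})
print(matcher.best_match("aa:01"), matcher.best_match("aa:02"))
```

In this toy setting a single window is ambiguous, and the match only becomes clear once the two people separate. That is the intuition behind incremental association; a practical system would additionally have to cope with noisy Wi-Fi locations, people carrying no device or several devices, and blob identities that do not persist across non-overlapping cameras.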