The proliferation of Internet of Things (IoT) and camera-equipped mobile devices generates a tremendous amount of video data at the edge of the network. At the same time, we have witnessed the fast deployment of many video-based application services, such as plate recognition for public safety, intelligent transportation, and Industry 4.0. The success of these services, in turn, requires that large-scale video data be learned, stored, and retrieved more efficiently. However, a generic software and hardware framework for large-scale IoT video analysis and service support is still missing. To address this challenge, we present π-Hub, PerceptIn’s robotic cloud solution, which supports large-scale video data analysis, storage, and query by implementing the learn-store-retrieve paradigm. Interestingly, we found that the learning, storage, and retrieval services each stress a different type of resource on heterogeneous computing servers: GPU, CPU, and memory, respectively. It is therefore highly cost-efficient to co-locate these services so that the resources are fully utilized. In addition, we propose and evaluate several optimization techniques for data writing, data reading, and data reduction. The evaluation results show that these techniques significantly improve the performance of the learning, storage, and retrieval services and notably reduce the cost of the system. We also verify π-Hub’s scalability by reliably running a 1000-machine deployment that supports up to one million users. Finally, we conclude the paper by discussing several lessons learned from this study and directions for future work.