Abstract. Traditional indoor positioning technologies mostly require hardware devices to be installed in advance, resulting in high costs and long-term maintenance burdens. With advances in image recognition and deep learning, indoor visual positioning based on image recognition has become increasingly mature. This approach is low-cost and requires no additional hardware installation, but it still has inherent drawbacks, such as cumbersome data collection, complex algorithms, and limited universality. To minimize the cost of pre-collecting indoor information, improve versatility, and enable rapid deployment on low-performance mobile devices, this paper proposes a lightweight indoor positioning system based on multiple self-learning features and key-frame classification. The system operates in two stages: preprocessing and real-time positioning. In the preprocessing stage, image information is collected for the entire indoor environment, a key-frame recognizer is trained on this image information, and an environmental feature information database is built at the same time. In the real-time positioning stage, the system obtains a real-time video stream from a mobile device such as a smartphone. First, a key-frame recognizer based on a convolutional neural network identifies key frames in the video stream, yielding an approximate location for coarse positioning. Second, feature points are extracted from each frame of the video stream and matched against the location-tagged feature points in the environmental feature information database to compute a precise location for fine positioning. Compared with conventional visual positioning solutions, the system significantly reduces preprocessing data collection and algorithmic overhead while improving versatility.
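
To make the two-stage pipeline concrete, the following minimal Python sketch illustrates the coarse-to-fine loop described above. It assumes OpenCV ORB descriptors and a dictionary-style feature database; the function classify_keyframe, the database fields (descriptors, location), and the choice of ORB are illustrative assumptions rather than the paper's exact method, and the final pose-solving step is only indicated in a comment.

```python
import cv2
import numpy as np

# Shared feature extractor and matcher (ORB is one possible binary feature;
# the paper's actual self-learning feature set may differ).
orb = cv2.ORB_create(nfeatures=500)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)


def classify_keyframe(frame_gray):
    """Placeholder for the trained CNN key-frame recognizer.

    In the real system this would run a lightweight CNN and return the
    label of the best-matching key frame (an approximate region)."""
    return "region_demo"


def coarse_position(frame_gray):
    """Coarse positioning: map the live frame to a key-frame label."""
    return classify_keyframe(frame_gray)


def fine_position(frame_gray, db_entry):
    """Fine positioning: match the frame's descriptors against the
    location-tagged descriptors stored for the coarse region."""
    _, desc = orb.detectAndCompute(frame_gray, None)
    if desc is None or db_entry["descriptors"] is None:
        return None
    matches = sorted(matcher.match(desc, db_entry["descriptors"]),
                     key=lambda m: m.distance)[:50]
    # The real system would feed the matched correspondences into a pose
    # solver (e.g. PnP); here we only return the stored location and a
    # crude confidence based on the number of good matches.
    return db_entry["location"], len(matches)


if __name__ == "__main__":
    # Synthetic demo: a random "database" frame and a random "live" frame.
    db_frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    live_frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    _, db_desc = orb.detectAndCompute(db_frame, None)
    database = {"region_demo": {"descriptors": db_desc, "location": (12.3, 4.5)}}

    region = coarse_position(live_frame)
    print("coarse:", region)
    print("fine:  ", fine_position(live_frame, database[region]))
```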