Existing indoor localization methods suffer from bottleneck constraints: Wi-Fi based methods are degraded by multipath effects, ultra-wide-band based methods are costly, and Bluetooth-based methods have poor anti-interference performance. To avoid these problems, a vision-based indoor localization method is proposed. First, the whole deployment environment is divided into several regions, and each region is assigned a location center. Then, in offline mode, the VGG16NET is pre-trained on the ImageNet dataset and fine-tuned on a custom image dataset for indoor localization. In online mode, the fully trained and converged VGG16NET takes as input a video stream captured by the front RGB camera of a mobile robot and outputs features specific to the current location. These features are then fed to an ArcFace classifier, which outputs the current location of the mobile robot. Experimental results show that our method can accurately estimate the location of a mobile object with imaging capability in cluttered, unstructured scenes without any additional device. The localization accuracy reaches 94.7%.
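The pipeline above classifies a location from CNN features using an ArcFace head. As a rough illustration of the ArcFace idea (not the authors' implementation), the sketch below computes additive-angular-margin logits in NumPy; the function name and the scale `s` and margin `m` values are illustrative assumptions:

```python
import numpy as np

def arcface_logits(features, weights, labels, s=64.0, m=0.5):
    """Additive angular margin logits (ArcFace-style), illustrative only.

    features: (batch, dim) CNN feature vectors
    weights:  (num_classes, dim) class-center weight vectors
    labels:   (batch,) ground-truth class indices
    """
    # L2-normalize features and class centers so their dot product
    # equals the cosine of the angle between them.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)          # (batch, num_classes)
    theta = np.arccos(cos)
    # Penalize only the ground-truth class by adding the margin m
    # to its angle, which tightens intra-class feature clusters.
    theta[np.arange(len(labels)), labels] += m
    return s * np.cos(theta)                   # scaled margin logits
```

Because the margin increases the angle to the true class center, the ground-truth logit is always smaller than the plain scaled cosine, forcing the network to learn more separable location features during fine-tuning.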