Location information plays a key role in pervasive computing and its applications, especially indoor location-based services. Although many systems have been proposed, an accurate and practical indoor localization system remains an open problem. To tackle this issue, we present SITE, a new localization scheme that combines acoustic Signals and Images to achieve accurate and robust indoor locaTion sErvice. SITE relies on a pre-deployed platform of acoustic sources operating at different frequencies and uses proactively generated Doppler-effect signals to track the relative directions between the phone and the sources. Given m () relative directions, SITE uses the angle differences to compute a set of locations, each corresponding to a different subset of sources. It then builds on a key observation: when the locations estimated simultaneously from different sets of acoustic anchors fall within a small circle, the results converge to a point near the true location. Based on this observation, SITE applies a decision scheme that checks whether these locations satisfy the required localization accuracy and can be used to determine the user’s location. If not, SITE uses the Visual Structure from Motion (VSFM) technique to obtain a set of relative locations from images captured by the phone’s camera. By exploiting the synergy between these relative locations and the initial locations computed from the relative directions, SITE obtains an optimal transformation and applies it to refine the initial results; the refined result is taken as the user’s location. For evaluation, we implemented a prototype and deployed a real platform of acoustic sources in different scenarios. Experimental results show that SITE achieves high localization accuracy and robustness and is feasible in practical applications.
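The decision step and the refinement step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify either computation, so everything here is an assumption: the "small circle" test is approximated by a circle centered at the centroid of the candidate locations, the "optimal transformation" is assumed to be a least-squares 2D similarity transform (scale, rotation, translation), and the names `within_circle`, `fit_similarity`, `apply_similarity`, and the `radius` threshold are hypothetical.

```python
import math


def within_circle(points, radius):
    """Return True if every candidate location lies within `radius`
    of the centroid of all candidates (assumed convergence test)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return all(math.hypot(x - cx, y - cy) <= radius for x, y in points)


def fit_similarity(src, dst):
    """Least-squares 2D similarity transform mapping src onto dst.

    Points are treated as complex numbers, so the transform is
    z -> a * z + b, where a encodes scale and rotation and b the
    translation. Reflections are not modeled.
    """
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    denom = sum(abs(z - p_mean) ** 2 for z in p)
    if denom == 0:
        raise ValueError("source points are all identical")
    a = sum((w - q_mean) * (z - p_mean).conjugate()
            for z, w in zip(p, q)) / denom
    b = q_mean - a * p_mean
    return a, b


def apply_similarity(a, b, points):
    """Map points through z -> a * z + b and return (x, y) tuples."""
    mapped = [a * complex(x, y) + b for x, y in points]
    return [(z.real, z.imag) for z in mapped]


# Hypothetical data: relative locations (image-based, arbitrary frame)
# and noisy initial locations (acoustic, world frame).
relative = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
initial = [(3.1, 4.0), (2.9, 6.1), (1.0, 5.9), (1.1, 4.1)]

if not within_circle(initial, radius=0.5):
    # Candidates are too spread out: align the relative locations to
    # the initial ones and use the aligned points as refined estimates.
    a, b = fit_similarity(relative, initial)
    refined = apply_similarity(a, b, relative)
```

Writing the points as complex numbers keeps the least-squares fit in closed form without a linear-algebra library; a 3D setting or one that allows reflections would need a different solver.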