This work introduces a paradigm for acoustic imaging in which a metasurface converts the acoustic waves scattered by remote scenes into coherent light that conventional optical cameras focus into images. The metasurface is composed of acousto-optical unit cells that sense the local acoustic pressure and use the resulting signal to modulate the amplitude of the electric field produced by a laser on an optical aperture. We derive the general design requirements for image reconstruction in the optical domain and validate the concept in two steps: acoustic field measurements of the ultrasound scattered from an object submerged in water, followed by numerical simulations predicting how this field is processed by the metasurface and camera. We show that this approach has two main advantages over traditional acoustic imaging systems. First, the acoustic-to-optical wavelength down-conversion yields effective acoustic apertures much larger than their physical size. Second, the unit cells are not synchronized electronically, so the complexity of the metasurface increases only linearly with the number of unit cells, significantly more slowly than in conventional synchronized arrays. These advantages lead to compact acoustic cameras that provide image resolutions higher than is possible with conventional acoustic imaging methods.
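As a rough illustration of the acoustic-to-optical conversion described above, the following Python sketch models a one-dimensional row of unit cells illuminated by a single plane acoustic wave. It is a toy model under stated assumptions, not the design derived in this work: each cell is assumed to scale a uniform laser field by a factor proportional to the instantaneous local pressure, and the camera lens is idealized as a discrete Fourier transform of the aperture field. All function names and numerical values (64 cells, 0.5 mm pitch, 1.5 mm acoustic wavelength, 633 nm laser, modulation depth 0.5) are hypothetical and chosen only for illustration.

```python
import cmath
import math


def aperture_field(n_cells, pitch, acoustic_wavelength, angle_rad, mod_depth=0.5):
    """Optical field amplitude on each unit cell at one instant in time.

    Assumed model: every cell senses the local pressure of a plane acoustic
    wave arriving at angle_rad and uses it to modulate the amplitude of a
    uniform laser field.
    """
    k_a = 2 * math.pi / acoustic_wavelength
    return [
        1.0 + mod_depth * math.cos(k_a * math.sin(angle_rad) * i * pitch)
        for i in range(n_cells)
    ]


def far_field_intensity(field):
    """Intensity in the focal plane of an ideal lens, modeled as a plain DFT."""
    n = len(field)
    return [
        abs(sum(field[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))) ** 2
        for k in range(n)
    ]


def sideband_bin(intensity):
    """Index of the strongest non-DC peak in the lower half of the spectrum."""
    half = len(intensity) // 2
    return max(range(1, half), key=lambda k: intensity[k])


def optical_sideband_angle(acoustic_wavelength, optical_wavelength, angle_rad):
    """First-order diffraction angle of the light leaving the modulated aperture.

    The pressure pattern acts as a grating of period
    acoustic_wavelength / sin(angle_rad), so the grating equation gives
    sin(theta_opt) = (optical_wavelength / acoustic_wavelength) * sin(angle_rad).
    """
    ratio = optical_wavelength / acoustic_wavelength
    return math.asin(ratio * math.sin(angle_rad))


if __name__ == "__main__":
    # Hypothetical values: 1 MHz ultrasound in water (about 1.5 mm wavelength)
    # and a 633 nm laser.
    for sin_theta in (0.1875, 0.375):
        theta = math.asin(sin_theta)
        field = aperture_field(64, 0.5e-3, 1.5e-3, theta)
        peak = sideband_bin(far_field_intensity(field))
        theta_opt = optical_sideband_angle(1.5e-3, 633e-9, theta)
        print(f"sin(theta)={sin_theta}: sideband bin {peak}, optical angle {theta_opt:.3e} rad")
```

In this toy model the position of the optical sideband tracks the acoustic arrival direction, while the optical deflection angle is reduced by the ratio of the optical to the acoustic wavelength. Because the modulation is real-valued, a mirrored sideband also appears in the upper half of the spectrum; the sketch ignores it.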