The growing demand for accessible ways to interact with smart devices has driven the increasing popularity of voice user interfaces (VUIs). However, VUIs face interpretation challenges stemming from the variability of natural speech, including unclear pronunciation, linguistic diversity, and speech impairments. Non-verbal, sound-based interaction techniques offer a promising alternative for smart device control that mitigates these inherent challenges. In this article, we introduce EchoTap, a novel audio interface that harnesses the distinctive sound responses generated by knock and tap gestures on target objects. Using deep neural networks, EchoTap recognizes both the type and location of these gestures from their unique sound signatures. In offline evaluation, EchoTap achieved competitive classification accuracy (88% on average) and localization precision (93% on average). Moreover, a user study with 12 participants validated EchoTap's practical effectiveness and user-friendliness in real-world scenarios. This study highlights EchoTap's potential for everyday interaction contexts and discusses further design implications for auditory interfaces based on simple gestures.
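The abstract describes the recognition pipeline only at a high level. As an illustrative aside, the sketch below shows one plausible shape for such a system: a small convolutional network over log-mel spectrograms of the impact sound, with separate classification heads for gesture type and tap location. The layer sizes, class counts (e.g., two gesture types, nine candidate locations), and feature choice are assumptions for illustration, not EchoTap's published architecture.

```python
# Illustrative sketch (assumed design, not the paper's architecture): a small
# CNN with two heads -- one for gesture type (e.g., knock vs. tap) and one for
# tap location -- operating on log-mel spectrograms of the recorded sound.
import torch
import torch.nn as nn

class TapNet(nn.Module):
    def __init__(self, n_gesture_types: int = 2, n_locations: int = 9):
        super().__init__()
        # Shared convolutional trunk over a (1, n_mels, time) spectrogram.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool to (batch, 32, 1, 1)
            nn.Flatten(),
        )
        # Two classification heads share the same acoustic embedding.
        self.gesture_head = nn.Linear(32, n_gesture_types)
        self.location_head = nn.Linear(32, n_locations)

    def forward(self, spec: torch.Tensor):
        z = self.trunk(spec)
        return self.gesture_head(z), self.location_head(z)

# Usage: a batch of 64-mel spectrogram clips (random placeholder values).
model = TapNet()
dummy = torch.randn(8, 1, 64, 44)
gesture_logits, location_logits = model(dummy)
```

A shared trunk with task-specific heads is a common choice when both predictions depend on the same acoustic cues; the two logits would typically be trained jointly with a cross-entropy loss per head.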