Audio-based violence detection is an important research area for public safety and security. This paper compares two families of machine learning models for the task: Convolutional Neural Networks (CNNs) and Shallow Networks. We evaluate these models under different training set configurations and data augmentation techniques, analyzing their effect on performance and robustness in real-world conditions. In particular, we address domain shift by examining how the models perform under different types of noise and reverberation. Our results identify scenarios in which Shallow Networks, despite their lower computational cost, perform nearly on par with the more expensive CNNs. Tailored data augmentation significantly improves the models' performance and stability under domain shift, offering a promising direction for improving system robustness. These findings underscore the importance of careful model selection for real-world audio-based violence detection, where computational cost must be traded off against performance, especially in resource-constrained settings. They also offer guidance to researchers and practitioners developing more efficient, robust, and accurate audio-based violence detection systems.