Among the categories of sensitive media, violence is one of the hardest to define objectively and, consequently, a significant challenge to detect automatically. While many studies have addressed the detection of specific aspects of violence, very few attempt to approach the general concept. We propose a method that aims to enable machines to understand the high-level concept of violence by first breaking it down into smaller, more objective sub-concepts, such as fights, explosions, blood, and gunshots, and later combining them, leading to a better understanding of the scene. To this end, we leverage the characteristics of each individual sub-concept of violence, relying upon custom-tailored convolutional neural networks, to guide how each should be described. For example, a fight scene should incorporate temporal features that a scene with blood does not need, whereas a scene with explosions or gunshots should rely more heavily on its audio features. Following this multimodal approach, we trained visual and auditory feature detectors and then combined them in a decision neural network, yielding a violence detector that considers several different aspects of the problem. This robust and modular approach allows different cultures and users to adapt the detector to their specific needs.
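
The combination step described above can be illustrated with a minimal late-fusion sketch. It is not the method's actual implementation: the decision neural network is replaced here by a single logistic unit, the per-concept scores are assumed to come from already-trained detectors and to lie in [0, 1], and the names `SUB_CONCEPTS` and `fuse_scores`, as well as all weights and scores, are hypothetical placeholders chosen for illustration.

```python
import numpy as np

# Sub-concepts named in the text; the fixed ordering is an assumption of this sketch.
SUB_CONCEPTS = ["fights", "explosions", "blood", "gunshots"]


def sigmoid(x):
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def fuse_scores(scores, weights, bias):
    """Late fusion: combine per-concept scores into one violence probability.

    A single logistic unit stands in for the decision neural network.
    """
    x = np.array([scores[name] for name in SUB_CONCEPTS], dtype=float)
    return float(sigmoid(weights @ x + bias))


# Placeholder parameters; a real decision network would learn them from labeled data.
weights = np.array([2.0, 1.5, 1.0, 2.5])
bias = -2.0

# Hypothetical detector outputs for two scenes.
calm = {"fights": 0.05, "explosions": 0.02, "blood": 0.01, "gunshots": 0.03}
violent = {"fights": 0.90, "explosions": 0.10, "blood": 0.70, "gunshots": 0.80}

p_calm = fuse_scores(calm, weights, bias)
p_violent = fuse_scores(violent, weights, bias)

# Modularity: a user who does not consider blood violent can zero out its weight.
no_blood = weights.copy()
no_blood[SUB_CONCEPTS.index("blood")] = 0.0
p_violent_no_blood = fuse_scores(violent, no_blood, bias)

print(f"calm: {p_calm:.3f}, violent: {p_violent:.3f}, "
      f"violent (blood ignored): {p_violent_no_blood:.3f}")
```

Because each sub-concept enters the fusion step as a separate input, adapting the detector to a different notion of violence amounts, in this simplified setting, to changing or retraining only the fusion parameters while the individual detectors stay untouched.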