Physical violence in the educational environment by students often occurs and leads to criminal acts. Apart from that, repeated acts of physical violence can be considered non-verbal bullying. This bullying can hurt the victim, causing physical disorders, mental health, impaired social relationships and decreased academic performance. However, monitoring activities against acts of violence currently being carried out have weaknesses, namely weak supervision by the school. A deep Learning-based physical violence detection system, namely LSTM Network, is the solution to this problem. In this research, we develop a Convolutional Neural Network to detect acts of violence. Convolutional Neural Network extracts features at the frame level from videos. At the frame level, the feature uses long short-term memory in the convolutional gate. Convolutional Neural Networks and convolutional short-term memory can capture local spatio-temporal features, enabling local video motion analysis. The performance of the proposed feature extraction pipeline is evaluated on standard benchmark datasets in terms of recognition accuracy. A comparison of the results obtained with state-of-the-art techniques reveals the promising capabilities of the proposed method for recognising violent videos. The model that has been trained and tested will be integrated into a violence detection system, which can provide ease and speed in detecting acts of violence that occur in the school environment.