Video Anomaly Detection (VAD) aims to identify events in videos that deviate from typical patterns. Given the scarcity of anomalous samples, previous research has primarily focused on learning regular patterns from datasets containing only normal behaviors and treating deviations from these patterns as anomalies. However, most of these methods rely on coarse-grained modeling that renders them incapable of learning the highly discriminative features needed to distinguish the subtle differences between normal and abnormal behaviors. To better capture these features, we propose a novel method. First, pseudo-anomalous samples for appearance and motion are generated through geometric transformations (2D rotations) and the scrambling of video sequences. Second, a dual-branch network with spatio-temporal decoupling is proposed, in which the spatial and temporal branches each handle a dedicated proxy task. These tasks are designed to distinguish normal from pseudo-anomalous samples and involve predicting patch-based 2D rotation angles and classifying video frame triplets as total-anomaly, left-anomaly, right-anomaly, or non-anomaly. Our approach is trained end to end, without relying on pre-trained models (except for the object detector). Evaluations on the UCSD Ped2, Avenue, and ShanghaiTech datasets show that our method achieves AUC scores of 99.1%, 91.9%, and 81.1%, respectively, demonstrating its effectiveness. The code is publicly accessible at the following link: https://spatio-temporal-tasks.
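The two pseudo-anomaly generators described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the patch size, the restriction to 90-degree rotations, and the exact mapping of the four triplet classes (non-, left-, right-, and total-anomaly) to swap patterns are assumptions made here for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate_patch(patch: np.ndarray, k: int) -> np.ndarray:
    # Appearance pseudo-anomaly: rotate an object patch by k * 90 degrees.
    # The spatial branch's proxy task would predict k from the rotated patch.
    return np.rot90(patch, k)

def scramble_triplet(triplet: list, label: int) -> list:
    # Motion pseudo-anomaly (one plausible reading of the four classes):
    #   0 = non-anomaly   (temporal order kept)
    #   1 = left-anomaly  (first pair swapped)
    #   2 = right-anomaly (last pair swapped)
    #   3 = total-anomaly (order fully reversed)
    f1, f2, f3 = triplet
    if label == 0:
        return [f1, f2, f3]
    if label == 1:
        return [f2, f1, f3]
    if label == 2:
        return [f1, f3, f2]
    return [f3, f2, f1]

# Build one toy training pair for each proxy task.
patch = rng.random((32, 32, 3))        # hypothetical 32x32 object crop
k = int(rng.integers(0, 4))            # rotation class in {0, 1, 2, 3}
x_spatial, y_spatial = rotate_patch(patch, k), k

frames = [rng.random((64, 64)) for _ in range(3)]
label = int(rng.integers(0, 4))        # triplet class in {0, 1, 2, 3}
x_temporal, y_temporal = scramble_triplet(frames, label), label
```

Training then reduces to two classification losses: the spatial branch is supervised with `y_spatial` and the temporal branch with `y_temporal`, so no real anomalous footage is needed.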