Intelligent video surveillance remains a vibrant research domain within computer vision. However, existing representation learning frameworks focus primarily on extracting static, frame-by-frame information such as appearance features, overlooking the valuable dynamic information inherent in video data, such as optical flow, which is among the most essential characteristics of sequential data. To mine dynamic features and bridge this gap, this paper introduces a novel anomaly detection framework that balances dynamic and static information and constructs a relationship between appearance features and their corresponding optical flow features. The framework imposes strong consistency constraints that reduce the discrepancy between dynamic information and the corresponding static information, and it leverages a collaborative teaching (co-teaching) network to ensure a consistent representation of both static and dynamic information for prediction. The proposed framework consists of two encoder–decoder pairs complemented by a memory storage module. Operating in parallel with the dual-encoder network is a co-teaching network, with the shared memory module serving as the cornerstone of collaborative training. The consistency constraint guarantees strong consistency between the dynamic and static information in the learned representations. In our experiments, we present compelling results demonstrating the superior performance of our algorithm on three publicly available datasets.
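To make the dual-stream design concrete, the sketch below shows one plausible PyTorch instantiation of the described architecture: an appearance encoder–decoder and an optical-flow encoder–decoder sharing a single memory module, with a consistency loss tying the two latent representations together. The layer sizes, the cosine-based memory addressing, the L2 consistency term, and the 0.1 loss weight are illustrative assumptions, not the paper's exact design, and the parallel co-teaching branch is omitted for brevity.

```python
# A minimal sketch of the dual-stream, shared-memory idea; all hyperparameters
# and module internals here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)  # (B, 64, H/4, W/4)

class Decoder(nn.Module):
    def __init__(self, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, z):
        return self.net(z)

class SharedMemory(nn.Module):
    """Memory items shared by both streams; each spatial feature is
    re-expressed as a soft combination of the stored items (cosine
    addressing is an assumed choice)."""
    def __init__(self, n_items=100, dim=64):
        super().__init__()
        self.items = nn.Parameter(torch.randn(n_items, dim))

    def forward(self, z):                                    # z: (B, C, H, W)
        b, c, h, w = z.shape
        q = z.permute(0, 2, 3, 1).reshape(-1, c)             # (B*H*W, C)
        attn = F.softmax(F.normalize(q, dim=1) @
                         F.normalize(self.items, dim=1).t(), dim=1)
        out = attn @ self.items                              # (B*H*W, C)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)

class DualStreamAD(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_app, self.enc_flow = Encoder(3), Encoder(2)
        self.dec_app, self.dec_flow = Decoder(3), Decoder(2)
        self.memory = SharedMemory()  # shared by both streams

    def forward(self, frame, flow):
        z_app, z_flow = self.enc_app(frame), self.enc_flow(flow)
        rec_app = self.dec_app(self.memory(z_app))
        rec_flow = self.dec_flow(self.memory(z_flow))
        # Consistency constraint: pull the static (appearance) and dynamic
        # (flow) latents together; L2 is one assumed instantiation.
        loss_consist = F.mse_loss(z_app, z_flow)
        return rec_app, rec_flow, loss_consist

# Usage: per-stream reconstruction losses plus the consistency term.
model = DualStreamAD()
frame, flow = torch.randn(4, 3, 64, 64), torch.randn(4, 2, 64, 64)
rec_app, rec_flow, loss_c = model(frame, flow)
loss = F.mse_loss(rec_app, frame) + F.mse_loss(rec_flow, flow) + 0.1 * loss_c
loss.backward()
```

At test time, a large reconstruction error or a large consistency residual would flag an anomaly; the sharing of one memory across both streams is what forces the static and dynamic representations to stay aligned.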