This paper proposes TASTNet, a robust fingerprinting scheme for video content authentication that combines a two-dimensional attention mechanism with spatio-temporal weighted fusion. TASTNet automatically extracts key spatio-temporal features from the input video and maps them to the corresponding fingerprint. Specifically, the two-dimensional attention mechanism is applied to enhance robustness against different kinds of digital manipulations. To incorporate perceptual characteristics, a spatio-temporal weighted fusion method based on LSTM integrates frame-level features into video-level features while retaining the temporal order. During fusion, key frames are assigned larger weights according to inter-frame correlation. With these two steps, we obtain representative video features that contain the principal perceptual information. In addition, the proposed scheme is trained with deep metric learning, and we design multiple constraints to make the generated fingerprint more compact and discriminative. Extensive experiments demonstrate that our scheme achieves superior robustness and discrimination compared with several state-of-the-art schemes.
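
The idea of assigning larger weights to key frames according to inter-frame correlation can be illustrated with a deliberately simplified sketch. It is not the TASTNet implementation: it omits the LSTM (so it does not model temporal order), and it assumes that a frame's weight is the softmax of its mean cosine similarity to the other frames. The function names `cosine`, `frame_weights`, and `fuse` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def frame_weights(features):
    """Assumed weighting rule: softmax over each frame's mean similarity to the others."""
    n = len(features)
    if n == 1:
        return [1.0]
    scores = [
        sum(cosine(features[i], features[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features):
    """Weighted sum of frame-level features, giving one video-level feature vector."""
    weights = frame_weights(features)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Toy frame-level features: two similar frames and one dissimilar frame.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = frame_weights(frames)
video_feature = fuse(frames)
```

Under this assumed rule, the two mutually similar frames receive equal weights that exceed the weight of the dissimilar frame, so the fused vector is dominated by the content the frames share.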