Training temporal action localization (TAL) models relies heavily on large amounts of manually annotated data, and annotating video is more tedious and time-consuming than annotating images. Semi-supervised methods that train jointly on labeled and unlabeled data have therefore attracted increasing attention from academia and industry. This study proposes pseudo-label refining (PLR), a method based on the teacher-student framework that consists of three key components. First, we propose pseudo-label self-refinement, which uses temporal region-of-interest pooling to improve the boundary accuracy of TAL pseudo labels. Second, we design a boundary synthesis module that further refines the temporal intervals of pseudo labels using multiple inferences. Finally, an adaptive weight learning strategy is tailored to learn progressively from pseudo labels of differing quality. The proposed method uses ActionFormer and BMN as detectors and achieves significant improvements on the THUMOS14 and ActivityNet v1.3 datasets. Experimental results show that it significantly improves localization accuracy over other advanced semi-supervised TAL (SSTAL) methods at label rates from 10% to 60%. Ablation experiments further show the effectiveness of each module and confirm that PLR improves the accuracy of the pseudo labels produced by teacher model inference.