Sound event detection (SED) is a core technology in smart home projects that rely on detected sound events to trigger specific actions. SED systems face two major challenges: high labeling costs and complex acoustic environments. To reduce labeling costs, some semi-supervised systems extract both global and local features for classification. However, these methods treat global and local features equally, without accounting for their varying importance in recognizing different types of sound events. Furthermore, to address complex acoustic environments, some studies adopt multitask learning frameworks that introduce SED-related tasks as auxiliaries to improve detection performance. However, these methods fail to align the tasks within the framework, leading to conflicting outputs that may limit system performance. To address these issues, in this paper we propose a semi-supervised SED system based on "two-to-one" teacher-student learning. The system employs a gating mechanism to selectively enhance global and local features, improving adaptability to different types of sound events, and incorporates a cross-task alignment module that enables interaction between SED and its related tasks, reducing the risk of performance degradation caused by conflicting outputs. Experimental results on two datasets demonstrate that our system achieves the best performance on all metrics, with event-based F1 (EB-F1) scores of 48.1% and 64.7%, representing improvements of 15.3% and 10.6% over the baseline ConformerSED system, respectively. Our work offers an effective SED solution for smart home projects by providing a semi-supervised system that performs well while reducing labeling costs.
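To make the gated fusion idea concrete, the sketch below shows one common way a gating mechanism can weight global and local feature streams per frame before classification. This is a minimal illustration only, not the paper's implementation; the module name, feature dimensions, and the convex-combination form of the gate are assumptions.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Illustrative gating module that adaptively mixes global and local
    features per frame (assumed form; the paper's module may differ)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # The gate is predicted from the concatenation of both feature streams.
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
        # global_feat, local_feat: (batch, time, feat_dim)
        g = self.gate(torch.cat([global_feat, local_feat], dim=-1))
        # Convex combination: g emphasizes global context, (1 - g) local detail.
        return g * global_feat + (1.0 - g) * local_feat


if __name__ == "__main__":
    fusion = GatedFeatureFusion(feat_dim=256)
    global_feat = torch.randn(4, 100, 256)   # e.g. output of a Conformer-style encoder
    local_feat = torch.randn(4, 100, 256)    # e.g. output of a CNN branch
    fused = fusion(global_feat, local_feat)  # shape: (4, 100, 256)
    print(fused.shape)
```

Such a learned, event-dependent weighting is what allows the system to rely more on global context for long, stationary events and more on local detail for short, transient ones.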