Most previous work on sound event detection (SED) relies on binary hard labels of sound events, leaving information at other scales underexplored. To address this problem, we introduce multiple granularities of knowledge into the system to perform hierarchical acoustic information fusion for SED. Specifically, we present an interactive dual-conformer (IDC) module that adaptively fuses medium-grained and fine-grained acoustic information based on the hard and soft labels of sound events. In addition, we propose a scene-dependent mask estimator (SDME) module that extracts coarse-grained information from acoustic scenes, introducing scene-event relationships into the SED system. Experimental results show that the proposed IDC and SDME modules effectively fuse acoustic information at different scales and thereby further improve SED performance. The proposed system achieved the top rank in DCASE 2023 Challenge Task 4B.