Referring image segmentation has recently attracted wide attention owing to its great potential in human-robot interaction. To identify the referred region, a network must develop a deep understanding of both image and language semantics. To this end, existing works design various cross-modality fusion mechanisms, such as tile-and-concatenation or vanilla non-local operations. However, such plain fusion is usually either too coarse or constrained by prohibitive computational overhead, ultimately yielding an insufficient understanding of the referent. In this work, we propose a fine-grained semantic funneling infusion (FSFI) mechanism to address this problem. FSFI imposes a constant spatial constraint on the querying entities from different encoding stages and dynamically infuses the gleaned language semantics into the vision branch. Moreover, it decomposes the features of both modalities into finer components, allowing fusion to take place in multiple low-dimensional subspaces. This is more effective than fusing in a single high-dimensional space, as it distills more representative information along the channel dimension. Another difficulty of the task is that instilling highly abstract semantics blurs the details of the referent. To alleviate this, we propose a multiscale attention-enhanced decoder (MAED). We design a detail enhancement operator (DeEh) and apply it in a multiscale, progressive manner: higher-level features generate attention guidance that directs lower-level features to attend more to detail regions. Extensive experiments on challenging benchmarks show that our network performs favorably against the state of the art.
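
The abstract does not spell out implementation details, but the channel-decomposed fusion idea can be illustrated with a minimal sketch. The module below, `GroupedLanguageFusion`, along with its group count, channel sizes, and the use of a pooled sentence embedding, is an illustrative assumption rather than the authors' exact FSFI design; it simply shows language features being fused with visual features group by group in several low-dimensional subspaces instead of once in the full high-dimensional space.

```python
# Illustrative sketch only: grouped vision-language fusion in the spirit of
# "fusion in multiple low-dimensional subspaces". Names, shapes, and the
# group count are assumptions, not the authors' exact FSFI design.
import torch
import torch.nn as nn

class GroupedLanguageFusion(nn.Module):
    def __init__(self, vis_dim=256, lang_dim=768, groups=8):
        super().__init__()
        assert vis_dim % groups == 0
        self.groups = groups
        self.sub_dim = vis_dim // groups              # size of each low-dimensional subspace
        self.lang_proj = nn.Linear(lang_dim, vis_dim) # project language into the visual space
        # grouped 1x1 conv keeps the fusion separate per subspace
        self.gates = nn.Conv2d(vis_dim, vis_dim, kernel_size=1, groups=groups)

    def forward(self, vis_feat, lang_feat):
        # vis_feat:  (B, C, H, W) visual features from one encoding stage
        # lang_feat: (B, lang_dim) pooled sentence embedding
        B, C, H, W = vis_feat.shape
        lang = self.lang_proj(lang_feat).view(B, C, 1, 1)     # broadcast over space
        fused = self.gates(vis_feat * torch.sigmoid(lang))    # per-subspace modulation
        return vis_feat + fused                               # residual infusion

# usage sketch
if __name__ == "__main__":
    m = GroupedLanguageFusion()
    v = torch.randn(2, 256, 32, 32)
    l = torch.randn(2, 768)
    print(m(v, l).shape)  # torch.Size([2, 256, 32, 32])
```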
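
The decoder's detail-enhancement idea, where higher-level features produce attention guidance that re-weights lower-level features, can likewise be sketched. The `DetailEnhance` module below and its 1x1-conv attention head with bilinear upsampling are assumed choices for how such guidance could be realized, not the published DeEh operator.

```python
# Illustrative sketch only: higher-level features generate a spatial attention
# map that highlights detail regions in lower-level features. Module name and
# layer choices are assumptions, not the published DeEh operator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailEnhance(nn.Module):
    def __init__(self, high_dim=256, low_dim=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(high_dim, low_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(low_dim, 1, kernel_size=1),
        )

    def forward(self, high_feat, low_feat):
        # high_feat: (B, high_dim, h, w)  coarser, more semantic
        # low_feat:  (B, low_dim,  H, W)  finer, more detailed (H > h, W > w)
        guide = self.attn(high_feat)                           # (B, 1, h, w)
        guide = F.interpolate(guide, size=low_feat.shape[-2:],
                              mode="bilinear", align_corners=False)
        return low_feat * torch.sigmoid(guide) + low_feat      # emphasize detail regions
```

In a multiscale, progressive decoder, such an operator would be applied stage by stage, each time using the next-higher decoder feature to guide the corresponding encoder feature.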