Although people can recognize their environment through soundscapes that substitute auditory signals for vision, whether they perceive the soundscape as a visual or visual-like sensation remains in question. In this study, we investigated hierarchical processing to elucidate the mechanism by which soundscape stimuli recruit visual areas in blindfolded subjects. Twenty-two healthy subjects were repeatedly trained to recognize soundscape stimuli converted from the visual shape information of letters. An effective connectivity method, dynamic causal modeling (DCM), was employed to reveal how the brain is hierarchically organized to recognize soundscape stimuli. The visual mental imagery model reproduced the cortical source signals of five regions of interest better than the auditory bottom-up, cross-modal perception, and mixed models. Spectral couplings between brain areas in the visual mental imagery model were then analyzed. Within-frequency coupling is apparent in bottom-up processing, in which sensory information is transmitted, whereas cross-frequency coupling is prominent in top-down processing, which corresponds to the expectation and interpretation of information. Sensory substitution in the brains of blindfolded subjects gave rise to visual mental imagery by combining bottom-up and top-down processing.