As important recreational spaces for city dwellers, urban microgreen parks enhance the urban living environment and relieve residents' psychological stress. Vision, hearing, and smell are key channels through which people perceive and interact with nature, and the sustainable perceptual design of microgreen parks based on the interaction of these senses has recently become a research hotspot. This study investigated the effects of the visual, acoustic, and olfactory (e.g., aromatic green vegetation) environments on human perception in urban microgreen parks. Participants were evenly divided into eight groups: single-sensory groups, multi-sensory interaction groups, and a control group. Eye tracking, blood pressure monitoring, the Semantic Differential (SD) scale, and the Profile of Mood States (POMS) were used to assess physiological and psychological recovery in each group. The results revealed that in an urban microgreen park environment with relatively low ambient noise, visual–auditory, visual–olfactory, and visual–auditory–olfactory interactive stimuli promoted the recovery of visual attention more effectively than visual stimuli alone. In addition, visual–auditory–olfactory interactive stimuli improved the quality of spatial perception, with positive sensory inputs effectively masking negative experiences. Environments with a high proportion of natural sounds produced the strongest effects: in the visual–auditory group, systolic blood pressure at S7 and heart rate at S9 decreased significantly (p < 0.05), by 18.60 mmHg and 20.15 bpm, respectively. Aromatic olfactory sources promoted physical and mental relaxation more effectively than other olfactory sources, reducing systolic blood pressure by 24.40 mmHg (p < 0.01) for marigolds, 23.35 mmHg (p < 0.01) for small-leaved boxwood, and 27.25 mmHg (p < 0.05) for camphor trees.
Specific auditory and olfactory conditions could also guide visual focus: birdsong directed attention to trees, insect sounds drew attention to herbaceous plants, floral scents attracted focus to flowers, and leaf scents prompted observation of a wider range of natural vegetation. In summary, single-sensory and multi-sensory modes of spatial perception and interaction in urban microgreen parks differ significantly. Compared with a silent, odorless environment, integrating acoustic and olfactory elements broadened the scope of visual attention. Under visual–auditory–olfactory interaction, the combination of natural sounds and the scent of aromatic camphor trees had the best effect on attention recovery, thereby improving the quality of spatial perception in urban microgreen parks.