When a machine operates abnormally, extracting fault features from a single sensor is not sufficient for fault detection, because latent fault information may be scattered across multiple sensors. Multi-sensory fusion techniques built on deep learning frameworks have attracted increasing attention from researchers because they exploit and integrate fault information from multiple sensors. Nevertheless, most existing multi-sensory fusion techniques have two shortcomings. (1) Most existing fusion methods fuse multi-sensory information only in the time domain or the frequency domain to achieve fault diagnosis, and their performance is often unsatisfactory in strong-noise environments. (2) Previous work has generally considered collaborative fusion among several vibration sensors, whereas complementary information fusion of heterogeneous multi-sensory vibro-acoustic data has rarely been studied. To address these deficiencies, this paper proposes a novel coarse-to-fine dual-scale time-frequency attention fusion network (CDTFAFN) for machinery fault diagnosis, which not only fully accounts for the complementary information fusion of vibro-acoustic signals but also learns features robustly in noisy scenarios. First, a signal-to-image encoding unit (SIEU) containing the improved constant-Q non-stationary Gabor transform (ICQ-NSGT) is introduced to convert the collected raw heterogeneous vibro-acoustic signals into time-frequency representations (TFRs) and to achieve coarse-grained feature fusion. Second, a time-frequency attention feature fusion unit (TFA-FFU) is designed to concurrently learn, at two scales, fine-grained features that are meaningful for fault diagnosis from the low-level fused features.
Finally, the coarse-to-fine features are sequentially concatenated and fed into a softmax classifier to improve the learning performance of the network and to perform fault classification automatically. The performance of the proposed approach is validated against state-of-the-art results on two groups of multi-sensory vibro-acoustic data collected from different experimental platforms. Experimental results show that the proposed method, with a diagnosis accuracy above 99 %, outperforms several other representative fusion methods (i.e., 2MNet, MFF-GBFD, MSCNN-BiLSTM, MFAN-VAF, 1D-CNN-VAF, MI-CNN-TFT and TFFN-VAF) on the original data with no added noise. Moreover, the average testing accuracy of the proposed method still exceeds 97 % in noisy scenarios with added Gaussian white noise, which demonstrates its competitiveness and strong robustness to noise in machinery fault diagnosis. According to five macro-average performance evaluation metrics (i.e., accuracy, precision, sensitivity, specificity and F1-score) and receiver operating characteristic (ROC) analysis, the proposed method also outperforms the same seven fusion methods for machinery fault diagnosis under colored noise.
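
The five macro-average metrics named above can be illustrated with a minimal sketch. It assumes the common one-vs-rest definitions, in which each metric is computed per fault class from a multi-class confusion matrix and then averaged with equal class weights; the abstract does not state the exact formulas used in the paper, so the function name `macro_metrics` and the per-class definition of accuracy are assumptions made for illustration only.

```python
def macro_metrics(confusion):
    """Macro-averaged metrics from a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j. Each metric is computed one-vs-rest for
    every class and then averaged with equal class weights (assumed
    definition, not taken from the paper).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    names = ("accuracy", "precision", "sensitivity", "specificity", "f1")
    sums = dict.fromkeys(names, 0.0)

    for k in range(n):
        tp = confusion[k][k]                                # class k predicted as k
        fn = sum(confusion[k]) - tp                         # class k predicted as other
        fp = sum(confusion[i][k] for i in range(n)) - tp    # other predicted as k
        tn = total - tp - fn - fp                           # everything else

        # Guard every ratio against an empty denominator.
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        pr_sum = precision + sensitivity
        f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
        accuracy = (tp + tn) / total if total else 0.0

        for name, value in zip(names, (accuracy, precision, sensitivity, specificity, f1)):
            sums[name] += value

    # Equal-weight average over classes.
    return {name: value / n for name, value in sums.items()}


if __name__ == "__main__":
    # Hypothetical two-class confusion matrix, not data from the paper.
    example = [[8, 2],
               [1, 9]]
    for name, value in macro_metrics(example).items():
        print(f"{name}: {value:.4f}")
```

Because every class contributes equally to a macro average, a fault class with few samples affects the reported score as much as a class with many, which is why macro averaging is often preferred when fault classes are imbalanced.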