Although existing reconstruction-based multivariate time series anomaly detection (MTSAD) methods have achieved strong performance, most assume that the training data are clean. When the training data contain noise or contamination, these methods may reconstruct anomalies as faithfully as normal data, weakening the distinction between normal and anomalous samples. Probabilistic generation-based methods have been applied to this problem because of their implicit robustness to noise, but their training is unstable and their suppression of anomaly generalization is unreliable. The recently proposed explicit approach based on a memory module sacrifices reconstruction quality on normal patterns, yielding only limited performance improvement. Moreover, most existing MTSAD methods take a single fixed-length window as input, which weakens their ability to capture long-term dependencies. This paper proposes a robust multi-scale feature extraction framework with a dual memory module that comprehensively extracts features fusing different levels of semantic information and different lengths of temporal dependency. First, consecutive neighboring windows are designed as inputs, allowing the model to extract both local and long-term dependency information. Second, a dual memory-augmented encoder is proposed to capture global typical patterns and local common features; it preserves the reconstruction of normal data while suppressing the generalization to anomalies. Finally, a multi-scale fusion module is proposed to fuse latent variables representing different levels of semantic information, and the reconstructed latent variables are used to reconstruct samples for anomaly detection. Experimental results on five datasets from diverse domains show that the proposed method outperforms 16 typical baseline methods.
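The abstract does not specify how the consecutive neighboring windows are constructed; the sketch below is one plausible reading, in which the input at each step is the last `num_windows` adjacent windows of length `window`, so the model sees the local window alongside its longer-range context. The function name and parameters are hypothetical, not from the paper.

```python
import numpy as np

def neighboring_windows(series, window=16, num_windows=4):
    """Slice a multivariate series of shape (T, D) into `num_windows`
    consecutive neighboring windows of length `window`, ending at the
    latest timestep. Returns an array of shape (num_windows, window, D).

    This is an illustrative assumption about the input construction,
    not the authors' exact scheme.
    """
    T = series.shape[0]
    total = window * num_windows
    if T < total:
        raise ValueError("series too short for the requested windows")
    segment = series[T - total:]           # the most recent `total` steps
    # Each row of the result is one neighboring window, oldest first.
    return segment.reshape(num_windows, window, -1)

# Toy usage: 64 timesteps of a 3-variable series.
x = np.arange(64 * 3, dtype=float).reshape(64, 3)
wins = neighboring_windows(x, window=16, num_windows=4)
print(wins.shape)  # (4, 16, 3)
```

Under this reading, the shortest-scale features come from the final window alone, while concatenating all four windows supplies the long-term dependency information the framework fuses across scales.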