Sleep stage classification plays a crucial role in sleep quality assessment and sleep disorder prevention. Many algorithms have been developed for this purpose, but they still face two challenges. The first is noise in physiological signals acquired by various devices. The second is that most studies simply concatenate multi-modal features without modeling their correlations. To address these challenges, we propose a framework, namely Diff-SleepNet, to efficiently classify sleep stages from multi-modal input. The framework begins with a diffusion model trained with a peak signal-to-noise ratio (PSNR) loss function that adaptively filters noise. The filtered signals are then transformed into multi-view spectra through data pre-processing. These spectra are processed by a transformer-based backbone to extract multi-modal features. The extracted features are fed into a multi-scale attention module for robust feature fusion, and a fully connected layer finally determines the sleep stage category. Our framework is trained and validated on three widely used datasets, i.e., SHHS, Sleep-EDF-SC, and Sleep-EDF-X. Experimental results demonstrate that it is effective and outperforms peer methods.
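
To make the stage ordering of the pipeline concrete (denoise, spectral pre-processing, transformer feature extraction, multi-scale attention fusion, classification), the following is a minimal PyTorch sketch. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' implementation; in particular, the diffusion-based denoiser is replaced by an identity placeholder so the sketch runs end to end.

```python
import torch
import torch.nn as nn

class MultiScaleAttentionFusion(nn.Module):
    """Fuse a feature sequence with self-attention at several temporal scales.
    Illustrative stand-in for the paper's multi-scale attention module."""
    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AvgPool1d(s, stride=s) for s in scales])
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=4, batch_first=True) for _ in scales]
        )
        self.proj = nn.Linear(dim * len(scales), dim)

    def forward(self, x):  # x: (batch, seq, dim)
        outs = []
        for pool, attn in zip(self.pools, self.attns):
            xs = pool(x.transpose(1, 2)).transpose(1, 2)  # downsample along time
            fused, _ = attn(xs, xs, xs)                   # self-attention at this scale
            outs.append(fused.mean(dim=1))                # pool each scale to one vector
        return self.proj(torch.cat(outs, dim=-1))         # (batch, dim)

class DiffSleepNetSketch(nn.Module):
    """Hypothetical end-to-end skeleton following the abstract's pipeline order."""
    def __init__(self, n_modalities=2, spec_bins=128, dim=128, n_classes=5):
        super().__init__()
        # Placeholder for the diffusion model trained with a PSNR loss.
        self.denoiser = nn.Identity()
        # Per-frame embedding of the concatenated multi-view spectra.
        self.embed = nn.Linear(spec_bins * n_modalities, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.fusion = MultiScaleAttentionFusion(dim)
        self.classifier = nn.Linear(dim, n_classes)  # e.g., W / N1 / N2 / N3 / REM

    def forward(self, spectra):  # spectra: (batch, seq, spec_bins * n_modalities)
        x = self.denoiser(spectra)
        x = self.embed(x)
        x = self.backbone(x)
        x = self.fusion(x)
        return self.classifier(x)

model = DiffSleepNetSketch()
logits = model(torch.randn(4, 30, 128 * 2))  # 4 samples of 30-frame multi-view spectra
print(logits.shape)                          # torch.Size([4, 5])
```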