To acquire voice remotely, a 9.75-GHz microwave Doppler radar with a high-sensitivity coherent homodyne demodulator is used to detect voice signals. However, the voice captured by the Doppler audio radar is contaminated by combined noise sources, including background noise and shock noise, which seriously degrade speech quality. In this work, a multi-stage method is used to enhance the Doppler audio radar-captured voice. The first stage is dictionary learning, which uses the K-singular value decomposition (K-SVD) algorithm to train speech and noise dictionaries. The noise training data are shock noise signals captured by the Doppler audio radar under different relative-motion modulations. The second stage generates a position mask of the shock noise. The third stage removes the background noise and the shock noise. Background noise is removed with our proposed constrained low-rank and sparse matrix decomposition algorithm based on power spectral density. The final speech spectrum, with the shock noise removed, is obtained through sparse restoration using the LARC algorithm. The experimental results show that the 9.75-GHz Doppler audio radar can effectively acquire human voice signals, and that the proposed multi-stage enhancement method effectively removes the background noise and shock noise in the radar speech, thereby improving the quality of the audio radar voice. These results suggest that practical Doppler audio radar systems for human voice signal acquisition are feasible.
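To make the sensing principle concrete, the following minimal sketch illustrates how a small surface vibration could modulate the phase of a 9.75-GHz continuous-wave return and how an idealized coherent receiver could recover it. It is not the authors' demodulator: it assumes a noise-free quadrature (I/Q) output and arctangent demodulation, and the 200-Hz, 10-micrometre vibration, the 8-kHz sampling rate, and all function names are hypothetical values chosen only for illustration. It relies on the standard round-trip relation, phase = 4πx/λ.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def wavelength(carrier_hz):
    """Free-space wavelength in metres for a given carrier frequency."""
    return C / carrier_hz


def phase_shift(displacement_m, carrier_hz):
    """Round-trip phase change in radians for a target displacement."""
    return 4.0 * math.pi * displacement_m / wavelength(carrier_hz)


def baseband_iq(displacements_m, carrier_hz, static_phase=0.0):
    """Idealized, noise-free I/Q samples (assumed quadrature receiver)."""
    samples = []
    for x in displacements_m:
        phi = static_phase + phase_shift(x, carrier_hz)
        samples.append((math.cos(phi), math.sin(phi)))
    return samples


# Hypothetical test signal: 200 Hz vibration, 10 micrometre amplitude,
# sampled at 8 kHz for 10 ms.
FS = 8000
F_VIB = 200.0
AMP = 10e-6
CARRIER = 9.75e9

x = [AMP * math.sin(2 * math.pi * F_VIB * n / FS) for n in range(FS // 100)]
iq = baseband_iq(x, CARRIER)

# Arctangent demodulation: valid here because the phase stays far below pi,
# so no unwrapping is needed.
recovered = [math.atan2(q, i) for i, q in iq]
```

Under these assumptions the carrier wavelength is roughly 3 cm, so a micrometre-scale vibration produces a phase excursion of only a few milliradians. This is consistent with the need for a high-sensitivity demodulator and with the captured voice being easily corrupted by noise.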