In masking-based deep neural network (DNN) speech enhancement, the time–frequency mask value cannot be estimated accurately because the underlying structural information of speech is ignored. This paper proposes a speech enhancement method that combines adaptive sparse non-negative matrix factorization (NMF) feature extraction with a soft mask to optimize the DNN, exploiting the ability of the sparse matrix to capture the salient structure of speech and combining it with optimized masking-based prediction. First, considering that speech and noise interference dominate to different degrees in different noisy speech signals, a new method for estimating the soft mask value is proposed, in which the initial soft mask value is estimated from the speech cochleagram and the noise cochleagram. Then, the speech cochleagram and the noise cochleagram are learned separately by sparse NMF (SNMF) to obtain a joint dictionary. The noisy speech is sparsely represented on the joint dictionary, and an adaptive adjustment factor related to the changes in the speech and noise dictionaries is added to obtain the sparse coefficient. The sparse coefficient is used as the input of the DNN model, and the initial soft mask value is used as the learning label to estimate the final soft mask value. Finally, the estimated soft mask value is combined with the noisy speech cochleagram to obtain the enhanced speech. Compared with other methods, the proposed method increases the average signal-to-noise ratio (SNR) by 1.6039 dB, the average perceptual evaluation of speech quality (PESQ) score by 0.1994, and the average short-time objective intelligibility (STOI) score by 0.0271, which demonstrates the superiority of the proposed algorithm.
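The feature and label preparation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`snmf`, `initial_soft_mask`, `joint_dictionary`, `sparse_coefficients`) are hypothetical, the SNMF uses generic Euclidean multiplicative updates with an L1 penalty, and a plain ratio mask stands in for the paper's new soft mask, whose exact formula is not given here. The adaptive adjustment factor and the DNN itself are omitted for the same reason.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def snmf(V, rank, sparsity=0.1, n_iter=200, W=None, seed=0):
    """Sparse NMF, V ~= W @ H, with an L1 penalty on H (assumed variant).

    If W is supplied it is kept fixed and only H is estimated.
    """
    rng = np.random.default_rng(seed)
    learn_W = W is None
    if learn_W:
        W = rng.random((V.shape[0], rank)) + EPS
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        if learn_W:
            W *= (V @ H.T) / (W @ H @ H.T + EPS)
            # Unit-norm atoms, so the L1 penalty cannot be dodged by rescaling.
            W /= np.linalg.norm(W, axis=0, keepdims=True) + EPS
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + EPS)
    return W, H


def initial_soft_mask(speech_coch, noise_coch):
    """Ratio mask in [0, 1]; a stand-in for the paper's soft mask (assumption)."""
    return speech_coch / (speech_coch + noise_coch + EPS)


def joint_dictionary(speech_coch, noise_coch, rank=8, sparsity=0.1):
    """Learn speech and noise dictionaries separately, then concatenate them."""
    W_speech, _ = snmf(speech_coch, rank, sparsity, seed=0)
    W_noise, _ = snmf(noise_coch, rank, sparsity, seed=1)
    return np.hstack([W_speech, W_noise])


def sparse_coefficients(noisy_coch, W, sparsity=0.1):
    """Sparse representation of the noisy cochleagram on a fixed dictionary."""
    _, H = snmf(noisy_coch, W.shape[1], sparsity, W=W)
    return H


if __name__ == "__main__":
    # Synthetic non-negative matrices stand in for real cochleagrams
    # (64 channels x 100 frames); mixing by addition is also an assumption.
    rng = np.random.default_rng(0)
    speech = rng.random((64, 100))
    noise = rng.random((64, 100))
    noisy = speech + noise

    labels = initial_soft_mask(speech, noise)  # DNN training target
    W = joint_dictionary(speech, noise)        # joint dictionary [W_speech, W_noise]
    features = sparse_coefficients(noisy, W)   # DNN input
    enhanced = labels * noisy                  # mask applied to the noisy cochleagram
    print(features.shape, labels.shape, enhanced.shape)
```

In this sketch the DNN would map each column of `features` to the matching column of `labels`; at test time, its predicted mask would replace `labels` in the final element-wise product.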