In IP audio systems, audio quality is degraded by environmental noise, poor network conditions, and encoding–decoding algorithms, so continuous, automatic evaluation of the transmitted audio quality is needed. Speech quality monitoring in VoIP systems enables autonomous system adaptation. Furthermore, IP audio transmitters and receivers are diverse, ranging from high-performance computers and mobile phones to embedded systems with limited memory and computing capacity. This paper proposes MiniatureVQNet, a single-ended speech quality evaluation method for VoIP audio applications based on a lightweight deep neural network (DNN) model. The proposed model can predict audio quality independently of the source of degradation, whether environmental noise or network impairment, and is light enough to run on embedded systems. Two variations of the proposed model were evaluated: MiniatureVQNet-Noise, trained on a dataset containing environmental noise only, and MiniatureVQNet-Noise-Network, trained on both noise and network distortions. The proposed MiniatureVQNet model outperforms the traditional ITU-T P.563 method in accuracy under all tested network conditions and environmental noise parameters. The mean squared error (MSE) of the predictions relative to the PESQ score was 2.19 for ITU-T P.563, 0.34 for MiniatureVQNet-Noise, and 0.21 for MiniatureVQNet-Noise-Network. The performance of both MiniatureVQNet-Noise-Network and MiniatureVQNet-Noise depends on the noise type for SNRs above 0 dB and below 10 dB. In addition, training on a dataset distorted by both noise and network impairments improves the prediction accuracy across all VoIP distortion conditions compared to training on a noise-only dataset.