Generating sound effects that humans want is an important topic. However, there are only limited studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a token-decoder, and a vocoder. The framework first uses the token-decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of the VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform. We found that the token-decoder significantly influences the generation performance; thus, we focus on designing a good token-decoder in this study. We begin with the traditional autoregressive (AR) token-decoder, which has shown state-of-the-art performance in previous sound generation works. However, the AR token-decoder always predicts the mel-spectrogram tokens one by one in order, which may introduce unidirectional bias and accumulation of errors. Moreover, with the AR token-decoder, the sound generation time increases linearly with the sound duration. To overcome the shortcomings of the AR token-decoder, we propose a non-autoregressive token-decoder based on the discrete diffusion model, named Diffsound. Specifically, the Diffsound model predicts all of the mel-spectrogram tokens in one step and then refines the predicted tokens in subsequent steps, so that the best predicted results can be obtained through iterative refinement. Our experiments show that our proposed Diffsound model not only produces better text-to-sound generation results than the AR token-decoder but also generates samples faster, i.e., MOS: 3.56 vs. 2.786, with a generation speed five times faster than that of the AR decoder. Furthermore, to automatically assess the quality of the generated samples, we define three objective evaluation metrics, i.e., Fréchet Inception Distance (FID), Kullback-Leibler (KL) divergence, and audio caption loss, which comprehensively assess the relevance and fidelity of the generated samples.
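To illustrate the non-autoregressive decoding described above, the following is a minimal sketch (not the authors' released code) of the iterative refinement loop of a discrete-diffusion token-decoder. The `denoiser` network, the text-feature tensor, and all hyperparameters (token count, codebook size, mask index, number of steps) are hypothetical placeholders used only for illustration.

```python
# Illustrative sketch of a discrete-diffusion token-decoder's reverse process:
# predict all mel-spectrogram tokens at once, then refine them over several steps.
import torch

def generate_tokens(denoiser, text_feat, num_tokens=265, codebook_size=256,
                    mask_id=256, num_steps=100, device="cpu"):
    # Start from a fully corrupted sequence (every position set to a [MASK] token).
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long, device=device)
    for t in reversed(range(num_steps)):
        t_tensor = torch.tensor([t], device=device)
        # One forward pass predicts a codebook distribution for EVERY position,
        # conditioned on the text features and the current (noisy) token estimate.
        logits = denoiser(tokens, text_feat, t_tensor)   # (1, num_tokens, codebook_size)
        probs = torch.softmax(logits, dim=-1)
        # Re-sample all positions in parallel; later steps refine earlier guesses.
        tokens = torch.multinomial(probs.view(-1, codebook_size), 1).view(1, num_tokens)
    return tokens  # discrete mel-spectrogram tokens, decoded to a mel-spectrogram by the VQ-VAE decoder
```

In contrast to an AR decoder, whose runtime grows with the number of tokens (and hence the sound duration), the cost here is fixed by the number of refinement steps, which is the source of the speed-up reported in the abstract.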