With the emergence of increasingly advanced separation networks, significant progress has been made in time-domain speech separation. These methods typically use a temporal encoder–decoder structure to encode the speech feature sequence and perform separation on that sequence. However, due to the limitations of the traditional encoder–decoder structure, separation performance drops sharply when the encoded sequence is short; when the encoded sequence is sufficiently long, separation performance improves, but computational complexity and training cost increase accordingly. Therefore, this paper compresses and reconstructs the speech feature sequence through a multi-layer convolutional structure and proposes a multi-layer encoder–decoder time-domain speech separation model (MLED). In MLED, the encoder–decoder structure compresses the speech sequence to a short length without degrading separation performance, and, combined with our multi-scale temporal attention (MSTA) separation network, MLED separates these short encoded sequences efficiently and precisely. Experiments show that, compared with previous advanced time-domain separation methods, MLED achieves competitive separation performance with a smaller model size, lower computational complexity, and lower training cost.
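To make the compression idea concrete, the following is a minimal PyTorch-style sketch of a multi-layer convolutional encoder–decoder that shortens a waveform into a compact feature sequence and reconstructs it. This is an illustrative assumption, not the paper's actual MLED architecture: the class name, layer counts, channel widths, kernel sizes, and strides are all hypothetical placeholders, and the MSTA separation network is only indicated by a comment.

```python
# Minimal sketch (assumed, not the paper's exact design): a stack of strided
# 1-D convolutions compresses the waveform into a short encoded sequence, and
# a mirrored stack of transposed convolutions reconstructs it.
import torch
import torch.nn as nn


class MultiLayerEncoderDecoder(nn.Module):
    def __init__(self, channels=64, kernel_size=8, stride=4, num_layers=3):
        super().__init__()
        enc, dec = [], []
        in_ch = 1
        for _ in range(num_layers):
            # Each encoder layer shortens the sequence by a factor of `stride`.
            enc.append(nn.Conv1d(in_ch, channels, kernel_size,
                                 stride=stride, padding=kernel_size // 2))
            enc.append(nn.ReLU())
            in_ch = channels
        for i in range(num_layers):
            last = i == num_layers - 1
            # Each decoder layer upsamples by `stride`, mirroring the encoder.
            dec.append(nn.ConvTranspose1d(channels, 1 if last else channels,
                                          kernel_size, stride=stride,
                                          padding=kernel_size // 2))
            if not last:
                dec.append(nn.ReLU())
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, wav):
        # wav: (batch, 1, time) -> encoded: (batch, channels, ~time / stride**num_layers)
        encoded = self.encoder(wav)
        # A separation network (e.g. the MSTA block described in the paper)
        # would operate on `encoded` here before reconstruction.
        return self.decoder(encoded)


if __name__ == "__main__":
    model = MultiLayerEncoderDecoder()
    wav = torch.randn(2, 1, 16000)   # 1 s of 16 kHz audio
    out = model(wav)
    print(out.shape)                 # roughly (2, 1, 16000), up to padding effects
```

With stride 4 and three layers, the encoded sequence is about 64 times shorter than the input, which is the kind of compression that lets the downstream separation network run on far fewer time steps.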