In recent years, progress in conversational AI research has been greatly propelled by large-scale pre-trained language models. In particular, task-oriented dialogue systems have attracted widespread attention owing to their immense potential for helping people accomplish diverse goals, such as booking hotels, making restaurant reservations, and purchasing train tickets. Traditionally, task-oriented dialogue systems were built as a multi-step pipeline comprising spoken language understanding, dialogue state tracking, dialogue policy learning, and natural language generation. More recently, large-scale pre-trained language models have enabled end-to-end neural pipeline task-oriented dialogue systems, which merge these steps into a single model, allowing joint optimization and preventing error propagation. However, to explicitly retrieve information from databases and thereby ensure the interpretability of the system, almost all end-to-end neural pipeline methods still require predicting the dialogue state as an intermediate result specialized for a particular domain or task, which poses significant challenges for generalization. To address this problem, we propose One-Step Task-Oriented Dialogue (OSTOD), which models task-oriented dialogue by synchronously generating activated states and retelling responses, where activated states are the slot values that contribute to database access, and retelling responses are system responses that contain the activated state information. Specifically, we first design automatic methods to build data containing activated states and retelling responses; we then propose a joint generation model that predicts activated states and retelling responses synchronously, in a single step, for task-oriented dialogue modeling. Empirical results on the MultiWOZ 2.0 and MultiWOZ 2.1 datasets show that OSTOD performs comparably to state-of-the-art baselines. Moreover, our model exhibits exceptional generalization capabilities in few-shot learning and domain transfer scenarios.
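To make the one-step formulation concrete, the sketch below shows one plausible way a joint training target could be serialized so that a sequence-to-sequence model emits the activated state and the retelling response in a single decoding pass. This is a minimal illustration under assumed conventions: the tag names (`<state>`, `<response>`), the `build_joint_target` helper, and the example slot values are hypothetical and not the paper's actual serialization format.

```python
# A minimal sketch, assuming a simple serialization scheme (not the paper's
# exact format): the model is trained to emit the activated state and the
# retelling response as one output sequence, so both are produced in a
# single decoding step rather than in separate pipeline stages.

def build_joint_target(activated_state: dict, retelling_response: str) -> str:
    """Serialize activated-state slot-value pairs and the retelling response
    into one target string for single-step generation."""
    # Flatten only the slot values that contribute to database access,
    # e.g. {"hotel": {"area": "north", "stars": "4"}}.
    state_str = " ".join(
        f"[{domain}] {slot}={value}"
        for domain, slots in activated_state.items()
        for slot, value in slots.items()
    )
    # The retelling response restates the activated-state information,
    # keeping the system's behavior interpretable to the user.
    return f"<state> {state_str} <response> {retelling_response}"

# Hypothetical example turn:
target = build_joint_target(
    {"hotel": {"area": "north", "stars": "4"}},
    "I found several 4-star hotels in the north. Would you like to book one?",
)
print(target)
# <state> [hotel] area=north [hotel] stars=4 <response> I found several ...
```

Because the state and the response share one output sequence, a single decoder pass yields both the values needed for the database query and the user-facing reply, which is what removes the separate, domain-specialized state-prediction stage.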