This paper proposes VStyclone, a novel Chinese speech cloning model comprising three stages: multi-speaker training, target speaker encoding, and target speaker synthesis. In this work, we design an efficient tone extractor that adaptively reallocates resources across the sequences of log-mel spectrogram frames obtained from multiple speakers' speech, allowing the network to learn each speaker's features differently. This enables the network to focus on the voice characteristics of the target speaker and extract the target features accurately. To cluster embeddings of the same speaker while dispersing those of different speakers, we construct an optimized softmax loss to train the model. We then develop a style synthesizer that adopts a Transformer architecture in place of recurrent neural networks, so that the model can both process text in parallel and better capture long-range contextual information. Meanwhile, we embed a style extraction module in the style synthesizer to dynamically capture style variation in an unsupervised manner. In addition, VStyclone uses a generative adversarial network as the base generation model of its vocoder to improve synthesis speed, running 1.2 times faster than real time on CPU, and the complete model achieves state-of-the-art results.
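
The abstract does not specify the exact form of the optimized softmax loss, so the PyTorch sketch below shows one common way such an objective is realized for speaker embeddings: a GE2E-style softmax over speaker centroids that pulls each utterance toward its own speaker's centroid and pushes it away from the others. All names, the fixed scale/bias values, and the omission of the usual leave-one-out centroid for the positive speaker are illustrative assumptions, not the paper's method.

```python
# Minimal sketch of a centroid-based softmax loss for speaker embeddings
# (assumption: a GE2E-style objective, not the paper's exact formulation).
import torch
import torch.nn.functional as F

def speaker_softmax_loss(emb: torch.Tensor, w: float = 10.0, b: float = -5.0):
    """emb: (num_speakers, utts_per_speaker, dim) utterance embeddings."""
    n_spk, n_utt, dim = emb.shape
    emb = F.normalize(emb, dim=-1)
    # Per-speaker centroid (full mean; GE2E would exclude the utterance
    # itself when computing its own speaker's centroid).
    centroids = F.normalize(emb.mean(dim=1), dim=-1)            # (n_spk, dim)

    # Scaled cosine similarity of every utterance to every centroid;
    # w and b are fixed here but are usually learnable parameters.
    sim = w * torch.einsum("sud,kd->suk", emb, centroids) + b   # (n_spk, n_utt, n_spk)

    # Cross-entropy over centroids: each utterance's target is its speaker,
    # which clusters same-speaker embeddings and disperses different speakers.
    labels = torch.arange(n_spk).unsqueeze(1).expand(n_spk, n_utt)
    return F.cross_entropy(sim.reshape(-1, n_spk), labels.reshape(-1))

# Usage: 4 speakers x 5 utterances x 256-dim embeddings.
emb = torch.randn(4, 5, 256, requires_grad=True)
loss = speaker_softmax_loss(emb)
loss.backward()
```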