The multi-turn text-to-image synthesis task aims to manipulate desired visual content step by step according to the user's intention, and has recently attracted considerable research interest in the language-and-vision community. Compared with traditional text-to-image synthesis, multi-turn synthesis is more challenging because 1) it must continuously recognize the user's intention from the spoken instruction and perceive visual information from the source image; and 2) it must reason about the position, appearance, and attributes of new modifications in the target image, as well as associate objects mentioned in the instruction with visual components in the source image. To address these challenges, we propose a Dual Semantic-stream Guidance Generative Adversarial Network (DSG-GAN) with global and local linguistic guidance, which reasons about and learns the user's intention from the text description and iteratively manipulates the visual content. Specifically, we design a novel dual semantic-stream discriminator that, combined with a hierarchical instruction encoder, evaluates the consistency and quality between the user's intention expressed in the linguistic instruction and the generated visual content from both global and fine-grained matching perspectives. Meanwhile, the gradient backpropagated from the discriminator is used to optimize the instruction encoder, encouraging it to distill the user's intention into global and local information that is consistent with the visual representation of the manipulation. Extensive experiments show that, even when producing high-resolution images over many iterative turns, our method performs significantly better, owing to the combination of local fine-grained linguistic information with cross-modal correlation.
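
To make the architectural idea concrete, the following is a minimal, illustrative PyTorch sketch, not the authors' implementation: a hierarchical instruction encoder produces a global sentence vector and local word vectors, a dual semantic-stream discriminator scores global sentence-image consistency and fine-grained word-region consistency, and, because the encoder sits in the same computation graph, the discriminator's backpropagated gradient also updates the encoder. All module names, dimensions, and the toy matching loss are assumptions for illustration only.

```python
# Hypothetical sketch of a dual semantic-stream discriminator with a
# hierarchical instruction encoder; names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalInstructionEncoder(nn.Module):
    """Encodes an instruction into a global sentence vector and local word vectors."""
    def __init__(self, vocab_size=5000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, T)
        word_feats, hidden = self.gru(self.embed(tokens))
        sent_feat = hidden.squeeze(0)                # (B, dim) global semantics
        return sent_feat, word_feats                 # (B, dim), (B, T, dim)


class DualSemanticStreamDiscriminator(nn.Module):
    """Scores global and fine-grained consistency between instruction and image."""
    def __init__(self, dim=256):
        super().__init__()
        self.img_enc = nn.Sequential(                # toy CNN: image -> region grid
            nn.Conv2d(3, dim, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(dim, dim, 4, 2, 1), nn.LeakyReLU(0.2),
        )
        self.global_head = nn.Linear(dim * 2, 1)

    def forward(self, image, sent_feat, word_feats):
        regions = self.img_enc(image)                            # (B, dim, H, W)
        regions = regions.flatten(2).transpose(1, 2)             # (B, H*W, dim)
        # Global stream: sentence vector vs. pooled image feature.
        img_global = regions.mean(dim=1)                         # (B, dim)
        global_score = self.global_head(torch.cat([img_global, sent_feat], dim=-1))
        # Local stream: word-region attention, then per-word cosine matching.
        attn = torch.softmax(word_feats @ regions.transpose(1, 2), dim=-1)  # (B, T, H*W)
        attended = attn @ regions                                # (B, T, dim)
        local_score = F.cosine_similarity(attended, word_feats, dim=-1).mean(dim=1, keepdim=True)
        return global_score, local_score


# Joint optimization: the encoder is in the same graph, so the discriminator's
# backpropagated gradient also updates the instruction encoder.
encoder = HierarchicalInstructionEncoder()
disc = DualSemanticStreamDiscriminator()
opt = torch.optim.Adam(list(encoder.parameters()) + list(disc.parameters()), lr=2e-4)

tokens = torch.randint(0, 5000, (4, 12))          # dummy instruction batch
image = torch.randn(4, 3, 64, 64)                 # treated here as a matched image
sent, words = encoder(tokens)
g_score, l_score = disc(image, sent, words)
# Toy matched-pair objective: push the global score toward "real/consistent"
# and the local word-region similarity upward.
loss = F.binary_cross_entropy_with_logits(g_score, torch.ones_like(g_score)) \
       + F.relu(1.0 - l_score).mean()
opt.zero_grad()
loss.backward()                                   # gradients also flow into the encoder
opt.step()
```

In this sketch the two streams play the roles described above: the global head checks whether the whole generated image matches the sentence-level intention, while the attention-based local stream checks whether each word is grounded in some image region, and the shared optimizer realizes the encoder refinement driven by the discriminator's gradient.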