Automatic medical report generation is an essential task in applying artificial intelligence to the medical domain: it can reduce doctors' workloads and promote clinical automation. State-of-the-art approaches employ Transformer-based encoder-decoder architectures to generate reports for medical images. However, they do not fully exploit the relationships among multi-modal medical data and therefore tend to produce inaccurate reports that are inconsistent with the images. To address these issues, this paper proposes a Multi-modal Memory Transformer Network (MMTN) that handles multi-modal medical data to generate reports consistent with the input images. On the one hand, MMTN reduces image-report inconsistencies with a dedicated encoder that associates and memorizes the relationships between medical images and medical terminologies. On the other hand, MMTN exploits the cross-modal complementarity of medical vision and language during word prediction, further improving the accuracy of the generated reports. Extensive experiments on three real-world datasets show that MMTN significantly outperforms state-of-the-art approaches on both automatic metrics and human evaluation.
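As a rough illustration of the memory mechanism described above, the following is a minimal, hypothetical PyTorch sketch of a memory-augmented encoder in which image features cross-attend to a bank of learnable memory slots standing in for medical-terminology embeddings. The class name MemorySlotEncoder, the slot count n_slots, and all other details are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MemorySlotEncoder(nn.Module):
    """Hypothetical sketch: image patch features cross-attend to a bank of
    learnable memory slots that stand in for terminology embeddings."""

    def __init__(self, d_model: int = 512, n_slots: int = 64, n_heads: int = 8):
        super().__init__()
        # Learnable memory bank; in MMTN these slots would be associated with
        # medical terminologies (an assumption, not the paper's exact design).
        self.memory = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, n_patches, d_model) visual features
        mem = self.memory.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        # Queries are image features; keys/values are the memory slots,
        # so each patch is enriched with terminology-like context.
        attended, _ = self.cross_attn(query=img_feats, key=mem, value=mem)
        return self.norm(img_feats + attended)

# Usage: enrich 49 patch features for a batch of 2 images.
encoder = MemorySlotEncoder()
feats = torch.randn(2, 49, 512)
print(encoder(feats).shape)  # torch.Size([2, 49, 512])
```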