Text-based person re-identification aims to retrieve a target person from a large pedestrian gallery given a natural language description. Previous works mainly focus on embedding salient textual and visual representations in a common latent space using dual-path structures or parameter-shared networks. However, they still struggle to extract fine-grained unimodal features and to fuse cross-modal data effectively, which increases the number of misaligned cases. To address these issues, we propose a text-and-image implicit learning Transformer (TILT) that eliminates textual anisotropy and enhances cross-modal alignment in both domains with bidirectional multimodal encoders. Specifically, we apply a pre-trained multimodal embedding module to overcome the unimodal anisotropy problem through contrastive learning, and map fine-grained features with dual encoders under bidirectional masking. We then design a cross-modal interaction encoder that comprehensively mines implicit cross-modal relations by reconstructing masked tokens and fuses rich multimodal knowledge in a common space. In addition, a cross-modal similarity matching module is proposed to optimize intra-domain classification and reduce inter-domain divergence. Extensive experiments on three public benchmarks, CUHK-PEDES, ICFG-PEDES, and RSTPReid, verify the effectiveness of the proposed framework, and the results show that our model outperforms state-of-the-art methods on all metrics.
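
The contrastive alignment step described above can be pictured with a minimal sketch of a CLIP-style symmetric loss over paired image and text embeddings. This is only an illustrative assumption, not the paper's exact objective; the function and parameter names (e.g. `contrastive_alignment_loss`, `temperature`) are hypothetical.

```python
# Minimal sketch (assumption, not the authors' implementation) of a symmetric
# contrastive objective that aligns image and text embeddings in a common space.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; row i of each tensor is a matched image-text pair."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both retrieval directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    img = torch.randn(8, 256)   # dummy image embeddings
    txt = torch.randn(8, 256)   # dummy text embeddings
    print(contrastive_alignment_loss(img, txt).item())
```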