Image-text matching of natural scenes has been a popular research topic in both the computer vision and natural language processing communities. Recently, fine-grained image-text matching has made significant advances in inferring high-level semantic correspondence by aggregating pairwise region-word similarities, but it remains challenging, mainly because the high-order semantic concepts and their explicit connections in one modality are insufficiently represented when matched against those in the other modality. To tackle this issue, we propose a relationship-enhanced semantic graph (ReSG) model, which improves image and text representations by learning their locally discriminative semantic concepts and then organizing their relationships in a contextual order. Specifically, two tailored graph encoders, a visual relationship-enhanced graph (VReG) and a textual relationship-enhanced graph (TReG), are exploited to encode the high-level semantic concepts of the corresponding instances and their semantic relationships. Meanwhile, the representation of each graph node is refined by aggregating semantically contextual information, which enhances the node-level semantic correspondence. Further, a hard-negative triplet ranking loss, a center hinge loss, and a positive-negative margin loss are jointly leveraged to learn the fine-grained correspondence between the ReSG representations of image and text, whereby discriminative cross-modal embeddings can be explicitly obtained to benefit various image-text matching tasks in a more interpretable way. Extensive experiments verify the advantages of the proposed fine-grained graph matching approach, which achieves state-of-the-art image-text matching results on public benchmark datasets.
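
For concreteness, the hard-negative triplet ranking loss referred to above is commonly instantiated as the max-of-hinges formulation of VSE++; the following is a sketch under that assumption, where $S(I,T)$ denotes the aggregated ReSG similarity between image $I$ and text $T$, $\alpha$ is a margin hyperparameter, and $[x]_{+}=\max(x,0)$ (the center hinge loss and positive-negative margin loss are defined in the main text and are not reproduced here):

\[
\mathcal{L}_{\mathrm{hard}}(I,T) \;=\; \big[\alpha + S(I,\hat{T}) - S(I,T)\big]_{+} \;+\; \big[\alpha + S(\hat{I},T) - S(I,T)\big]_{+},
\]
\[
\hat{T} \;=\; \operatorname*{arg\,max}_{T' \neq T} S(I,T'), \qquad
\hat{I} \;=\; \operatorname*{arg\,max}_{I' \neq I} S(I',T),
\]

i.e., for each positive pair $(I,T)$ in the mini-batch, only the hardest negative text $\hat{T}$ and the hardest negative image $\hat{I}$ contribute to the loss, which focuses the gradient on the most confusable cross-modal pairs.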