Structured video text information extraction is a crucial part of video understanding: it aims to extract structured text fields from category-specific videos, such as scores in basketball games or identities in news broadcasts. Recent natural language models and text detectors have demonstrated state-of-the-art performance in video text detection and recognition. However, understanding text in unstructured video frames remains challenging in practice because of the wide variety of video text styles and dynamic changes in text layout. Little work has focused on solutions that efficiently extract structured information from video text. In this paper, we address this task by modeling a multi-modal attention graph over the video text. Specifically, we encode both the visual and textual features of detected text regions as nodes of the graph, and model the spatial layout relationships among the text regions as its edges. Structured information extraction is then solved by iteratively propagating messages between text regions along the graph edges and reasoning about the structured category of each graph node. To strengthen the representation capacity of the graph, we further introduce a contrastive loss on the visual embeddings of the text regions in a self-supervised manner. To thoroughly evaluate the proposed method and to foster future research, we release a new dataset collected and annotated from NBA regular-season and playoff game videos. Experimental results demonstrate the superior performance of the proposed method over several state-of-the-art methods.
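
The sketch below is a minimal illustration of the idea summarized above, not the authors' implementation: each detected text region fuses a visual and a textual embedding into a graph node, edge weights are derived from pairwise box geometry, messages are propagated iteratively along the edges before per-node category classification, and an InfoNCE-style contrastive term stands in for the self-supervised loss on the visual embeddings. All module names, dimensions, and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextRegionGraph(nn.Module):
    """Illustrative multi-modal graph over detected text regions (assumed design)."""

    def __init__(self, vis_dim=256, txt_dim=128, hid_dim=256, num_classes=8, num_steps=2):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + txt_dim, hid_dim)   # node = fused [visual; textual] feature
        self.edge_mlp = nn.Sequential(                       # edge score from spatial layout
            nn.Linear(4, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1)
        )
        self.msg = nn.Linear(hid_dim, hid_dim)               # message transform
        self.update = nn.GRUCell(hid_dim, hid_dim)           # node-state update per step
        self.classifier = nn.Linear(hid_dim, num_classes)    # structured-category head
        self.num_steps = num_steps

    def forward(self, vis_feat, txt_feat, boxes):
        # vis_feat: (N, vis_dim), txt_feat: (N, txt_dim), boxes: (N, 4) as (cx, cy, w, h)
        h = torch.relu(self.fuse(torch.cat([vis_feat, txt_feat], dim=-1)))  # (N, hid)

        # Pairwise spatial relations: center offsets and log size ratios between boxes.
        rel = torch.cat([
            boxes[:, None, :2] - boxes[None, :, :2],                   # (N, N, 2)
            torch.log(boxes[:, None, 2:] / boxes[None, :, 2:] + 1e-6), # (N, N, 2)
        ], dim=-1)
        attn = F.softmax(self.edge_mlp(rel).squeeze(-1), dim=-1)       # layout-aware edge attention

        # Iterative message passing along the spatial edges.
        for _ in range(self.num_steps):
            m = attn @ self.msg(h)          # aggregate neighbor messages
            h = self.update(m, h)           # update node states

        return self.classifier(h)           # per-region category logits


def contrastive_loss(vis_a, vis_b, temperature=0.1):
    # InfoNCE-style loss on two views of the region visual embeddings,
    # standing in for the self-supervised contrastive term described above.
    za = F.normalize(vis_a, dim=-1)
    zb = F.normalize(vis_b, dim=-1)
    logits = za @ zb.t() / temperature                       # (N, N) similarities
    targets = torch.arange(za.size(0), device=za.device)     # positives on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    N = 6                                   # detected text regions in one frame
    model = TextRegionGraph()
    vis = torch.randn(N, 256)
    txt = torch.randn(N, 128)
    boxes = torch.rand(N, 4) + 0.1
    logits = model(vis, txt, boxes)         # (6, 8) structured-category scores
    loss = contrastive_loss(vis, vis + 0.01 * torch.randn_like(vis))
    print(logits.shape, loss.item())
```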