Automatically describing video content in natural language is a long-standing challenge in computer vision. Although existing methods that capture relational information among objects have made significant strides in recent years, the detailed geometric and temporal information of objects remains underexplored. To address this problem, a novel Spatio-Temporal Aware Graph is proposed to capture more elaborate visual representations that exploit the detailed spatio-temporal cues of the extracted object features. Through graph-structured aggregation, the proposed model captures not only the interactions among objects but also their detailed spatio-temporal relations. Meanwhile, a Frame Similarity Graph is constructed over frame features to learn comprehensive representations, extracting the global information that object features lack. Moreover, to capture rich video semantics from different perspectives, multiple video representations, namely appearance and motion information, are utilised to learn discriminative representations. Experiments on two prevalent benchmarks, the Microsoft Video Description Corpus (MSVD) and Microsoft Research Video to Text (MSR-VTT), demonstrate that the proposed approach achieves state-of-the-art performance on several widely used evaluation metrics: BLEU-4, METEOR, ROUGE, and CIDEr.
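
To make the graph-structured aggregation concrete, the sketch below shows one plausible instantiation of the Frame Similarity Graph: edge weights are derived from pairwise cosine similarities of frame features, and one round of GCN-style message passing refines each frame representation with global context. This is a minimal illustration, not the authors' exact formulation; the module name `FrameSimilarityGraph`, the similarity-based adjacency, and all dimensions are assumptions.

```python
# Minimal sketch of similarity-graph aggregation over frame features.
# Assumes a GCN-style update; the exact adjacency construction and
# projection used in the paper may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameSimilarityGraph(nn.Module):
    """Refines frame features by aggregating over a fully connected graph
    whose edge weights come from feature similarity (hypothetical design)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # learnable transform of node features

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, dim), one feature vector per sampled frame
        x = F.normalize(frames, dim=-1)          # unit-norm for cosine similarity
        adj = torch.softmax(x @ x.t(), dim=-1)   # (T, T) row-normalised adjacency
        return F.relu(adj @ self.proj(frames))   # one aggregation step: (T, dim)

# Usage: refine 26 sampled frame features of dimension 512.
frames = torch.randn(26, 512)
refined = FrameSimilarityGraph(512)(frames)
print(refined.shape)  # torch.Size([26, 512])
```

The same aggregation pattern could, in principle, be applied to the object-level Spatio-Temporal Aware Graph by additionally encoding bounding-box geometry and frame indices into the edge weights, which is where the detailed spatial and temporal relations described above would enter.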