Abstract

Video captioning aims to automatically generate a natural language caption describing the content of a video. However, most existing video captioning methods ignore the relationships between objects in the video and the correlations between multimodal features, and they also overlook the effect of caption length on the task. This study proposes a novel video captioning framework (ORMF) based on an object relation graph and multimodal feature fusion. ORMF uses the similarity and spatio-temporal relationships of objects in the video to construct an object relation graph and introduces a graph convolutional network (GCN) to encode the object relations. ORMF also constructs a multimodal feature fusion network that learns the relationships between features of different modalities and fuses them. Furthermore, the proposed model computes a length loss on the caption, encouraging captions that convey richer information. The experimental results on two public datasets (Microsoft video captioning corpus [MSVD] and Microsoft research video to text [MSR-VTT]) demonstrate the effectiveness of our method.
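To make the relation-encoding idea concrete, the following is a minimal PyTorch sketch of one graph-convolution step over detected object features. The construction of the adjacency matrix (cosine similarity gated by a spatio-temporal affinity, then softmax-normalized), the single-layer design, and all dimensions are illustrative assumptions, not the exact ORMF formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectRelationGCN(nn.Module):
    """Illustrative sketch: encode object relations with one GCN step.

    Nodes are per-object features; edge weights combine appearance
    similarity with a spatio-temporal affinity. Both the affinity and
    the normalization are assumptions for illustration only.
    """

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.weight = nn.Linear(feat_dim, hidden_dim)

    def forward(self, obj_feats: torch.Tensor, st_affinity: torch.Tensor) -> torch.Tensor:
        # obj_feats: (N, feat_dim) object features from a detector
        # st_affinity: (N, N) spatio-temporal affinity (e.g., box IoU / temporal overlap)
        sim = F.cosine_similarity(obj_feats.unsqueeze(1), obj_feats.unsqueeze(0), dim=-1)
        adj = F.softmax(sim * st_affinity, dim=-1)   # normalized object relation graph
        return F.relu(adj @ self.weight(obj_feats))  # aggregate neighbors, then transform

# Hypothetical usage with placeholder tensors:
gcn = ObjectRelationGCN(feat_dim=2048, hidden_dim=512)
feats = torch.randn(10, 2048)           # 10 detected objects
affinity = torch.rand(10, 10)           # placeholder spatio-temporal affinity
relation_feats = gcn(feats, affinity)   # (10, 512) relation-aware object features
```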