Understanding drug–drug interactions (DDIs) is vital for drug-based disease treatment, drug development, the prevention of medical errors, and the control of health-care costs. Extracting potential DDIs from biomedical corpora is an important complement to the DDIs recorded in existing databases. Most existing DDI extraction (DDIE) methods do not consider the topological graph and structure information of drug molecules, which can improve DDIE performance. To exploit the complementary strengths of bi-directional gated recurrent units (BiGRU), the Transformer, and attention mechanisms in DDIE tasks, a multimodal feature-fusion model combining BiGRU and Transformer (BiGGT) is constructed here for DDIE. In BiGGT, the vector embeddings of the medical corpus, the drug molecule topology graphs, and the drug structures are produced by Word2vec, Mol2vec, and a graph convolutional network (GCN), respectively. A BiGRU and multi-head self-attention (MHSA) are integrated into the Transformer to extract the local and global contextual features that are important for DDIE. Extensive experiments on the DDIExtraction 2013 shared-task dataset show that the BiGGT-based DDIE method outperforms state-of-the-art DDIE approaches, achieving a precision of 78.22%. BiGGT extends the application of multimodal deep learning to DDIE.
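To make the described architecture concrete, the following is a minimal PyTorch sketch of the fusion idea: a BiGRU sublayer for local context and an MHSA sublayer for global context stacked Transformer-style, operating on concatenated text, molecule, and structure embeddings. All dimensions, the early-fusion strategy, and the class names are illustrative assumptions, not the paper's exact configuration; the Word2vec, Mol2vec, and GCN embeddings are assumed to be precomputed.

```python
import torch
import torch.nn as nn

class BiGGTBlock(nn.Module):
    """Hypothetical BiGGT encoder block: BiGRU (local context) followed by
    multi-head self-attention (global context), with residual connections
    and layer normalization as in a Transformer encoder layer."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        # Hidden size d_model // 2 per direction keeps the BiGRU output width at d_model.
        self.bigru = nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x):
        local, _ = self.bigru(x)             # local contextual features
        x = self.norm1(x + local)
        attn, _ = self.mhsa(x, x, x)         # global contextual features
        x = self.norm2(x + attn)
        return self.norm3(x + self.ffn(x))   # position-wise feed-forward sublayer

class BiGGT(nn.Module):
    """Fuses three modalities: Word2vec token embeddings, Mol2vec drug
    embeddings, and GCN structure embeddings (all assumed precomputed)."""
    def __init__(self, d_text=200, d_mol=300, d_gcn=128, d_model=256, n_classes=5):
        super().__init__()
        self.proj_text = nn.Linear(d_text, d_model)
        self.proj_mol = nn.Linear(d_mol, d_model)
        self.proj_gcn = nn.Linear(d_gcn, d_model)
        self.encoder = BiGGTBlock(d_model)
        # 5 classes: mechanism, effect, advice, int, and negative (no interaction)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_emb, mol_emb, gcn_emb):
        # text_emb: (B, T, d_text); mol_emb: (B, d_mol); gcn_emb: (B, d_gcn)
        tokens = self.proj_text(text_emb)
        drug = self.proj_mol(mol_emb).unsqueeze(1)     # (B, 1, d_model)
        struct = self.proj_gcn(gcn_emb).unsqueeze(1)   # (B, 1, d_model)
        fused = torch.cat([drug, struct, tokens], dim=1)  # simple early fusion
        h = self.encoder(fused)
        return self.classifier(h.mean(dim=1))          # mean-pool, then classify

# Usage sketch with random stand-in embeddings:
model = BiGGT()
logits = model(torch.randn(4, 50, 200), torch.randn(4, 300), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 5])
```

Concatenating the molecule and structure embeddings as extra tokens before the sentence lets the MHSA attend jointly across modalities; other fusion choices (e.g., late fusion of pooled vectors) would be equally plausible under this sketch's assumptions.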