Accurate projections of carbon emissions are essential for climate change responses and environmental policy development. However, factors such as technology and policy have made carbon emissions increasingly complex to project, which makes accuracy harder to achieve. Many studies rely on time-series data alone and overlook the combined impact of multiple information sources on carbon emissions. To overcome these limitations, a hybrid model is proposed that integrates multimodal data with an Inception-v2 convolutional neural network (CNN-inceptionv2), a text convolutional neural network (TextCNN) optimized by an attention mechanism, and a bidirectional long short-term memory network (BiLSTM). First, the time-series feature set and the text feature set are extracted using the CNN-inceptionv2 and the TextCNN, respectively. The feature sets of the two modalities are then fused into a multimodal feature set, which is passed to the BiLSTM for prediction. The validity of the model was verified through ablation experiments and comparisons with single and combined models. The coefficient of determination (R²) was 0.986 for the validation set and 0.933 for the prediction set. Compared with the other models, the present model improved R² by 16.6% on average, indicating superior predictive performance. Point and interval predictions of carbon emissions further confirmed its prediction accuracy. The model can accurately predict carbon emissions and provides a data reference for carbon emission decision makers. It also enriches the methodological system of carbon emission prediction and aids strategic planning for future carbon emission and energy management.
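
For readers unfamiliar with the reported metric, the coefficient of determination is conventionally defined as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `r_squared` and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared error between observations and predictions.
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    # Total sum of squares: squared deviation of observations from their mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only; not data from the study.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [10.5, 11.5, 14.5, 15.5]
score = r_squared(observed, predicted)
```

Under this definition a value of 1 means the predictions match the observations exactly, and a value of 0 means the model does no better than always predicting the mean of the observations.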