Pathological images and molecular omics data are important sources of information for predicting diagnosis and prognosis. These two heterogeneous modalities contain complementary information, and fusing them effectively can better reveal the complex mechanisms of cancer. However, because the modalities require different representation learning methods, their expressive strength varies greatly across tasks, so many multimodal fusion approaches fail to achieve the best results. In this paper, MBFusion is proposed to address multiple tasks, such as prediction of diagnosis and prognosis, through multi-modal balanced fusion. The MBFusion framework uses two specially constructed graph convolutional networks to extract features from molecular omics data, and uses ResNet to extract features from pathological images while retaining important deep features through attention and clustering; this effectively improves both feature representations, making their expressive abilities balanced and comparable. The features of the two modalities are then fused through a cross-attention Transformer, and the fused features are used to learn both cancer subtype classification and survival analysis via multi-task learning. MBFusion is compared with other state-of-the-art methods on two public cancer datasets and shows an improvement of up to 10.1% across three evaluation metrics. Ablation experiments explore the contribution of each modality and each framework module to the overall performance. Furthermore, the interpretability of MBFusion is analyzed in detail to demonstrate its practical value.
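To make the fusion step concrete, the sketch below illustrates how balanced modality features could be fused with a cross-attention Transformer block of the kind described above. It is a minimal sketch, assuming both modalities have already been projected to a shared embedding dimension; the class name, dimensions, pooling choice, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): cross-attention fusion of
# pathology-image and molecular-omics feature tokens, assuming both have
# already been projected to a shared embedding dimension.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Each modality attends to the other (queries from one modality,
        # keys/values from the other).
        self.img_to_omics = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.omics_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_omics = nn.LayerNorm(dim)

    def forward(self, img_tokens, omics_tokens):
        # img_tokens:   (batch, n_img_tokens, dim)   e.g. clustered patch features
        # omics_tokens: (batch, n_omics_tokens, dim) e.g. GCN node embeddings
        img_attn, _ = self.img_to_omics(img_tokens, omics_tokens, omics_tokens)
        omics_attn, _ = self.omics_to_img(omics_tokens, img_tokens, img_tokens)
        img_fused = self.norm_img(img_tokens + img_attn)
        omics_fused = self.norm_omics(omics_tokens + omics_attn)
        # Pool and concatenate to obtain a joint representation that could feed
        # multi-task heads (subtype classification and survival analysis).
        fused = torch.cat([img_fused.mean(dim=1), omics_fused.mean(dim=1)], dim=-1)
        return fused


# Example usage with random tensors standing in for real modality embeddings.
if __name__ == "__main__":
    fusion = CrossAttentionFusion(dim=256, num_heads=4)
    img = torch.randn(2, 64, 256)    # 64 image tokens per sample
    omics = torch.randn(2, 32, 256)  # 32 omics tokens per sample
    print(fusion(img, omics).shape)  # torch.Size([2, 512])
```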