For the diagnosis and outcome prediction of gastric cancer (GC), machine learning methods based on whole slide pathological images (WSIs) have shown promising performance and reduced the cost of manual analysis. Nevertheless, accurate prediction of GC outcome may rely on multiple modalities with complementary information, particularly gene expression data. Thus, there is a need to develop multimodal learning methods to enhance prediction performance. In this paper, we collect a dataset from Ruijin Hospital and propose a multimodal learning method for GC diagnosis and outcome prediction, called GaCaMML, which features a cross-modal attention mechanism and a Per-Slide training scheme. Additionally, we perform feature attribution analysis via integrated gradients (IG) to identify important input features. The proposed method improves prediction performance over single-modal learning on three tasks, i.e., survival prediction (by 4.9% on C-index), pathological stage classification (by 11.6% on accuracy), and lymph node classification (by 12.0% on accuracy). In particular, the Per-Slide strategy addresses the issue of a high WSI-to-patient ratio and leads to much better results than the Per-Person training scheme. In the interpretability analysis, we find that although WSIs dominate the prediction for most samples, there is still a substantial portion of samples whose prediction relies heavily on gene expression information. This study demonstrates the great potential of multimodal learning in GC-related prediction tasks and investigates the respective contributions of WSIs and gene expression, which not only shows how the model makes decisions but also provides insights into the association between macroscopic pathological phenotypes and microscopic molecular features.
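To illustrate the kind of cross-modal attention fusion referred to above, the following is a minimal sketch only; the paper's actual architecture, feature dimensions, and module names are not specified here, so `CrossModalFusion`, `wsi_dim`, `gene_dim`, and `embed_dim` are assumptions, and `nn.MultiheadAttention` stands in for whatever attention variant GaCaMML uses.

```python
# Illustrative sketch (not the authors' code): a gene expression embedding
# attends over a bag of WSI patch embeddings, and the fused representation
# drives a prediction head (e.g., a survival risk score).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, wsi_dim=1024, gene_dim=2000, embed_dim=256, num_heads=4):
        super().__init__()
        self.wsi_proj = nn.Linear(wsi_dim, embed_dim)    # project WSI patch features
        self.gene_proj = nn.Linear(gene_dim, embed_dim)  # project gene expression vector
        # Cross-modal attention: gene embedding queries the WSI patch embeddings.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(2 * embed_dim, 1)          # assumed single-output head

    def forward(self, wsi_patches, gene_expr):
        # wsi_patches: (batch, num_patches, wsi_dim); gene_expr: (batch, gene_dim)
        wsi = self.wsi_proj(wsi_patches)                 # (batch, num_patches, embed_dim)
        gene = self.gene_proj(gene_expr).unsqueeze(1)    # (batch, 1, embed_dim)
        fused, _ = self.attn(query=gene, key=wsi, value=wsi)  # (batch, 1, embed_dim)
        joint = torch.cat([fused.squeeze(1), gene.squeeze(1)], dim=-1)
        return self.head(joint)                          # (batch, 1)

# Usage sketch on random tensors with hypothetical sizes.
model = CrossModalFusion()
score = model(torch.randn(2, 500, 1024), torch.randn(2, 2000))
print(score.shape)  # torch.Size([2, 1])
```

Under this kind of formulation, IG-based feature attribution can be computed with respect to both the WSI patch inputs and the gene expression vector, which is what allows the per-sample comparison of modality contributions described above.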