Identifying the time-varying control schemes that maximize storage performance is critical to the commercial deployment of geological carbon storage (GCS) projects. However, the optimization process typically demands a large number of resource-intensive simulation evaluations, which poses significant computational challenges and practical limitations. In this study, we present the multimodal latent dynamic (MLD) model, a novel deep learning framework for fast flow prediction and well control optimization in GCS operations. The MLD model implicitly characterizes the forward compositional simulation process through three components: a representation module that learns compressed latent representations of the system, a transition module that approximates the evolution of the system states in the low-dimensional latent space, and a prediction module that forecasts the flow responses for given well controls. A novel training strategy that combines a regression loss with a joint-embedding consistency loss is introduced to jointly optimize the three modules, which enhances the temporal consistency of the learned representations and ensures multi-step prediction accuracy. Unlike most existing deep learning models, which are designed for systems with specific parameters, the MLD model supports arbitrary input modalities, thereby enabling comprehensive consideration of the interactions between diverse types of data, including dynamic state variables, static spatial system parameters, rock and fluid properties, and external well settings. Because the MLD model mirrors the structure of a Markov decision process (MDP), computing state transitions and rewards (i.e., economic calculations based on the flow responses) for given states and actions, it can serve as an interactive environment for training deep reinforcement learning agents.
Specifically, the soft actor-critic (SAC) algorithm is employed to learn an optimal control policy that maximizes the net present value (NPV) from the experience gained through continuous interaction with the MLD model. The proposed approach is first compared against a commonly used simulation-based evolutionary algorithm and a surrogate-assisted evolutionary algorithm on a deterministic GCS optimization case, where it achieves the highest NPV while reducing the required computational resources by more than 60%. The framework is further applied to a generalizable GCS optimization case. The results indicate that the trained agent can harness the knowledge learned from previous scenarios to provide improved decisions for newly encountered scenarios, demonstrating promising generalization performance.
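
The MDP-style structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `LatentEnv`, the function `rollout_npv`, and the three callables `encode`, `transition`, and `predict` are hypothetical placeholders standing in for the representation, transition, and prediction modules, and the reward is assumed to be the step cash flow discounted at a fixed per-step rate with end-of-step discounting.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class LatentEnv:
    """Toy MDP-style environment built from three placeholder callables."""

    # Representation module stand-in: observation -> latent state.
    encode: Callable[[Vector], Vector]
    # Transition module stand-in: (latent, control) -> next latent state.
    transition: Callable[[Vector, Vector], Vector]
    # Prediction module stand-in: (latent, control) -> cash flow for the step.
    predict: Callable[[Vector, Vector], float]
    # Assumed per-step discount rate (illustrative value, not from the paper).
    discount_rate: float = 0.1
    latent: Vector = field(default_factory=list, init=False)
    step_index: int = field(default=0, init=False)

    def reset(self, observation: Vector) -> Vector:
        """Encode the initial observation and restart the step counter."""
        self.latent = self.encode(observation)
        self.step_index = 0
        return self.latent

    def step(self, control: Vector) -> Tuple[Vector, float]:
        """Advance one control step entirely in latent space."""
        cash_flow = self.predict(self.latent, control)
        self.step_index += 1
        # Reward is the discounted cash flow, so summed rewards give the NPV.
        reward = cash_flow / (1.0 + self.discount_rate) ** self.step_index
        self.latent = self.transition(self.latent, control)
        return self.latent, reward


def rollout_npv(
    env: LatentEnv, observation: Vector, controls: Sequence[Vector]
) -> float:
    """Sum the discounted rewards along a fixed control schedule."""
    env.reset(observation)
    return sum(env.step(control)[1] for control in controls)
```

In this sketch, a reinforcement learning agent such as SAC would call `reset` and `step` in place of the numerical simulator, so each policy evaluation costs only a few cheap function calls rather than a full compositional simulation.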