Deep neural networks are opaque black-box models that provide little to no insight into their underlying dynamics. This issue is even more pronounced in multimodal artificial intelligence (AI) systems, where explainability and interpretability are prime concerns due to the integration of heterogeneous data streams and complex inter- and intramodal interactions. Traditional explainability methods, however, are difficult to apply in the multimodal setting. To address this issue, we propose a co-learning-based solution that fosters model explainability for natural language processing (NLP)-based multimodal sentiment analysis. The proposed approach follows the co-learning principle of handling noisy and missing modalities at either training or test time, and identifies modality dominance by extracting local and global model explanations. The approach is validated with post hoc explainability methods, namely local interpretable model-agnostic explanations (LIME) and gradient-based SHapley Additive exPlanations (SHAP), to model modality contributions and interactions at the fusion level. The co-learning-based system fosters trust by providing a degree of model explainability alongside robustness to corrupted or missing modalities. The resulting explanations are multifaceted and are obtained by peering inside the black box, which makes them particularly helpful for system designers and model developers who need to understand model dynamics that are far more challenging to trace in multimodal applications.
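To make the fusion-level attribution idea concrete, the following is a minimal, hypothetical sketch: per-modality features are concatenated, a classifier predicts sentiment, and SHAP attributions over the fused vector are aggregated per modality to estimate modality dominance. The feature sizes, toy data, and classifier are illustrative assumptions, not the authors' pipeline, and the model-agnostic KernelExplainer is used here as a simplified stand-in for the LIME and gradient-based SHAP explanations described above.

```python
# Hypothetical sketch of fusion-level modality attribution (not the paper's exact setup).
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_text, n_audio = 8, 4                       # assumed per-modality feature sizes
X_text = rng.normal(size=(200, n_text))      # stand-in text embeddings
X_audio = rng.normal(size=(200, n_audio))    # stand-in audio features
X_fused = np.hstack([X_text, X_audio])       # feature-level fusion by concatenation
y = (X_fused[:, 0] + 0.5 * X_fused[:, n_text] > 0).astype(int)  # toy sentiment labels

clf = LogisticRegression(max_iter=1000).fit(X_fused, y)

# Post hoc, model-agnostic attribution over the fused representation
# (positive-class probability as the explained output).
explainer = shap.KernelExplainer(lambda x: clf.predict_proba(x)[:, 1],
                                 shap.sample(X_fused, 50))
shap_values = explainer.shap_values(X_fused[:20], nsamples=200)

# Aggregate absolute attributions per modality to gauge modality dominance.
text_share = np.abs(shap_values[:, :n_text]).sum()
audio_share = np.abs(shap_values[:, n_text:]).sum()
print(f"text contribution: {text_share:.2f}, audio contribution: {audio_share:.2f}")
```

Comparing the aggregated per-modality attribution mass in this way gives one simple, global view of which modality dominates the fused prediction; local, instance-level explanations follow directly from the per-sample SHAP values.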