Emotion recognition is a key component of human-computer interaction, affective computing, and social robotics. Conventional unimodal approaches to emotion recognition, which depend on a single data source such as facial expressions or speech signals, often fall short in capturing the complexity and context-dependent nature of emotions. Multimodal Emotion Recognition (MER), which integrates information from multiple modalities, has emerged as a promising solution to overcome these limitations. In recent years, Transformer-based approaches have garnered significant attention in natural language processing and computer vision, owing to their ability to capture long-range dependencies and rich semantic representations. These models have rapidly achieved state-of-the-art results in MER. However, existing survey papers covering MER lack a specific focus on Transformer-based techniques. To bridge this research gap, this review paper provides a comprehensive investigation of Transformer-based approaches for MER. It explores various Transformer architectures and proposes several scenarios for applying Transformers at different stages of the MER process. In addition, it examines datasets suitable for MER, discusses fusion mechanisms, and introduces novel taxonomies for both MER and Transformer technologies. The review also addresses open challenges and future research directions. Through this review, we aim to provide researchers with a thorough understanding of the current state of the art in Transformer-based approaches for MER, paving the way for further advances in this rapidly developing field.