In recent years, the field of few-shot action recognition (FSAR) has garnered significant attention. Although most methods rely primarily on single-modality data, there is a growing trend toward exploiting multi-modal data. However, existing FSAR methods often employ simplistic fusion techniques, such as concatenation or direct comparison, which may not fully exploit the potential of multi-modal information. We aim to explore multi-modal information from different views and then perform multi-view fusion. Building on the textual and visual features extracted by the CLIP backbone, we propose MDMF, a method that exploits both modalities at two levels. At the first level, we extract visual information from two views: a Local Temporal Context view and a Global Temporal Context view. Within each view, we fuse visual features with textual information through concatenation and Cross-Transformer operations, and then apply metric comparison to derive a per-view classification probability distribution under the meta-learning paradigm. At the second level, we fuse the probability distributions from the two views to make the final decision. Concurrently, during training, for each query we compute the posterior distributions of the textual and visual modalities within each view using textual and visual information distances. Based on these distributions, we group query samples according to which view is more reliable for them, and then strengthen the representation of the less reliable view of those samples through mutual distillation. By delving deeply into multi-modal data through this multi-view approach, our few-shot action recognition model demonstrates the potential for higher accuracy and improved robustness. Our code is available at https://github.com/cofly2014/CLIP-MDMF.
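To make the second-level fusion concrete, the sketch below is our own minimal illustration (not the authors' implementation): each view's metric distances to the class prototypes are mapped to a probability distribution with a softmax over negative distances, and the two distributions are combined by a weighted average. The weighting scheme and the distance-to-probability mapping are assumptions for illustration only.

```python
import numpy as np

def view_probs(distances):
    """Map per-class metric distances to a probability distribution
    via softmax over negative distances (smaller distance = higher prob)."""
    logits = -np.asarray(distances, dtype=float)
    logits -= logits.max()  # subtract max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def fuse_views(dist_local, dist_global, w_local=0.5):
    """Second-level fusion: weighted average of the probability
    distributions from the local and global temporal-context views.
    w_local is a hypothetical fusion weight, not from the paper."""
    p_local = view_probs(dist_local)
    p_global = view_probs(dist_global)
    return w_local * p_local + (1.0 - w_local) * p_global

# Example: a 5-way episode; each list holds a query's distances
# to the 5 class prototypes under one view.
p = fuse_views([2.0, 0.5, 3.0, 2.5, 4.0], [1.8, 0.9, 2.2, 3.0, 3.5])
pred = int(np.argmax(p))  # index of the predicted class
```

Because both views agree that class 1 has the smallest prototype distance, the fused distribution also peaks at class 1; when the views disagree, the fusion weight arbitrates between them.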