Multimodal Sentiment Analysis (MSA), which comprehensively utilizes data from multiple modalities to obtain more accurate sentiment predictions, has important applications in fields such as social media analysis, user experience evaluation, and healthcare. Notably, previous studies have paid little attention to the inconsistency in initial representation granularity between the verbal (textual) and nonverbal (acoustic and visual) modalities. As a result, the imbalanced emotional information between these modalities complicates the interaction process and ultimately degrades model performance. To address this problem, this paper proposes a Frame-level Nonverbal feature Enhancement Network (FNENet) that improves MSA performance by reducing the granularity gap and integrating asynchronous affective information across modalities. Specifically, Vector Quantization (VQ) is applied to the nonverbal modalities to reduce granularity differences and improve model performance. Additionally, nonverbal information is integrated into a pre-trained language model through a Sequence Fusion mechanism (SF) to enhance the textual representation, enriching word-level semantic expression with the asynchronous affective cues preserved in the unaligned frame-level nonverbal features. Extensive experiments on three benchmark datasets demonstrate that FNENet significantly outperforms baseline methods, indicating its potential for MSA applications.
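For illustration only, the sketch below shows the two ideas summarized above: quantizing frame-level nonverbal features against a learned codebook, and fusing the quantized, unaligned sequence into word-level textual states from a pre-trained language model. The codebook size, feature dimensions, and the cross-attention used for fusion are assumptions made for this sketch, not the paper's exact implementation.

```python
# Minimal sketch (assumed design, not the authors' code): VQ on frame-level
# nonverbal features, then fusion into word-level textual hidden states.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Maps each nonverbal frame to its nearest codebook entry."""

    def __init__(self, num_codes: int = 512, dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, seq_len, dim), e.g. acoustic or visual frame features
        B, T, D = frames.shape
        flat = frames.reshape(-1, D)                      # (B*T, dim)
        dist = torch.cdist(flat, self.codebook.weight)    # (B*T, num_codes)
        codes = dist.argmin(dim=-1)                       # nearest code per frame
        quantized = self.codebook(codes).view(B, T, D)
        # Straight-through estimator: a common VQ choice so gradients
        # still reach the upstream feature extractor.
        return frames + (quantized - frames).detach()


class SequenceFusion(nn.Module):
    """Cross-attends word-level text states to unaligned nonverbal frames (assumed fusion form)."""

    def __init__(self, text_dim: int = 768, nv_dim: int = 128, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(nv_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_states: torch.Tensor, nv_frames: torch.Tensor) -> torch.Tensor:
        nv = self.proj(nv_frames)
        fused, _ = self.attn(query=text_states, key=nv, value=nv)
        return self.norm(text_states + fused)             # residual enhancement of text


if __name__ == "__main__":
    vq, fuse = VectorQuantizer(), SequenceFusion()
    text = torch.randn(2, 20, 768)     # e.g. BERT hidden states: (batch, words, 768)
    audio = torch.randn(2, 120, 128)   # unaligned frame-level acoustic features
    enhanced = fuse(text, vq(audio))   # enhanced textual representation
    print(enhanced.shape)              # torch.Size([2, 20, 768])
```

Because the nonverbal frames are left unaligned with the word sequence, the cross-attention in this sketch lets each word attend to whichever frames carry relevant affective cues, which is one plausible way to realize the asynchronous integration described in the abstract.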