Multimodal sentiment analysis aims to extract and integrate information from different modalities to accurately identify the sentiment expressed in multimodal data. Two major challenges in this task are effectively capturing the relevant information within each modality and fully exploiting the complementary information across modalities. When extracting unimodal temporal features, traditional approaches fail to capture the global context of long time-series data; when modeling inter-modal interactions, they typically fuse the features from all modalities with a single method and ignore the correlations between modalities. In this paper, we first propose an Attentional Temporal Convolutional Network (ATCN) that extracts unimodal temporal features with stronger representation ability. We then introduce a Multi-layer Feature Fusion (MFF) model that improves the effectiveness of multimodal fusion by fusing features at different levels with different methods, chosen according to the correlation coefficients between the features; cross-modal multi-head attention is applied to the low-level features to fully explore their latent relationships. Experimental results on the SIMS and CMU-MOSI datasets show that the proposed model outperforms state-of-the-art baselines on sentiment analysis tasks.
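To make the cross-modal attention step concrete, the sketch below shows one common way to implement multi-head attention where one modality supplies the queries and another supplies the keys and values. This is a minimal illustration of the general mechanism rather than the paper's exact MFF module; the class name `CrossModalAttention`, the feature dimension, the head count, and the tensor shapes are all assumptions for demonstration.

```python
# Minimal sketch (assumed PyTorch implementation, not the authors' code):
# cross-modal multi-head attention in which one modality's low-level features
# act as queries and another modality's features act as keys/values.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, T_q, dim), e.g. low-level text features
        # context_feats: (batch, T_k, dim), e.g. low-level audio or visual features
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        # Residual connection plus layer norm preserves the query modality's information.
        return self.norm(query_feats + attended)

# Hypothetical usage: text features attend to audio features (shapes are assumptions).
text = torch.randn(4, 50, 128)
audio = torch.randn(4, 375, 128)
fused = CrossModalAttention()(text, audio)   # -> (4, 50, 128)
```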