To exploit the heterogeneous sentiment information carried by each modality and improve the accuracy of sentiment analysis, this paper proposes a multimodal sentiment analysis model based on Text-Centric Shared-Private Affective Semantics (TCSP). First, the Deep Canonical Time Warping (DCTW) algorithm is employed to align the temporal deviations between the audio and picture modalities. Then, a cross-modal shared mask matrix is designed, and a mutual attention mechanism is introduced to compute the audio- and picture-to-text shared affective semantic features. Next, the private affective semantic features within the audio and picture modalities are extracted using a self-attention mechanism combined with an LSTM. Finally, an improved Transformer encoder structure enables deep interaction and fusion of cross-modal emotional features, on which sentiment analysis is performed. Experiments are conducted on the IEMOCAP and MELD datasets. Compared with current state-of-the-art models, the TCSP model achieves an accuracy of 82.02%, fully validating its effectiveness. In addition, ablation experiments verify the rationality of each component's design.
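To make the shared/private pipeline described above concrete, the following is a minimal sketch of one plausible realization in PyTorch. All module names, dimensions, the number of emotion classes, and the way the shared mask is applied are illustrative assumptions, not the authors' implementation: text queries are assumed to attend over DCTW-aligned audio and picture keys/values under a shared mask, private features come from modality-specific LSTM plus self-attention, and a standard Transformer encoder layer stands in for the paper's improved encoder.

```python
# Hypothetical sketch of the text-centric shared/private architecture.
# Names, dimensions, and mask usage are assumptions for illustration only.
import torch
import torch.nn as nn


class TCSPSketch(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_classes=7):
        super().__init__()
        # Shared branch: text queries attend over audio/picture keys and values,
        # optionally restricted by a cross-modal shared mask matrix (assumption).
        self.audio_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.picture_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Private branch: LSTM followed by self-attention within each modality.
        self.audio_lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.picture_lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.audio_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.picture_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Fusion: a standard Transformer encoder layer used here as a stand-in
        # for the paper's improved encoder structure.
        self.fusion = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text, audio, picture, shared_mask=None):
        # text/audio/picture: (batch, seq_len, d_model), audio/picture assumed
        # temporally aligned beforehand (e.g. via DCTW).
        shared_a, _ = self.audio_to_text(text, audio, audio, attn_mask=shared_mask)
        shared_p, _ = self.picture_to_text(text, picture, picture, attn_mask=shared_mask)
        a_seq, _ = self.audio_lstm(audio)
        p_seq, _ = self.picture_lstm(picture)
        private_a, _ = self.audio_self_attn(a_seq, a_seq, a_seq)
        private_p, _ = self.picture_self_attn(p_seq, p_seq, p_seq)
        # Concatenate shared and private streams along the sequence axis, fuse,
        # then pool to utterance-level emotion logits.
        fused = self.fusion(torch.cat([shared_a, shared_p, private_a, private_p], dim=1))
        return self.classifier(fused.mean(dim=1))
```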