The lack of vocal emotional expression is a major deficit in social communication disorders. Current research in artificial intelligence focuses on the collaborative training of deep learning models without compromising data privacy. The primary objective of this paper is to propose a federated learning-based classification model to identify and analyze the emotional capabilities of individuals with vocal emotion deficits. The methodology develops a collaborative, privacy-preserving approach that uses federated learning to train deep learning models. The proposed methodology uses Mel-frequency cepstral coefficients (MFCC) to preprocess audio recordings. Four datasets (RAVDESS, CREMA, TESS, SAVEE) containing emotion-labeled audio recordings were collected from open sources. Each recording is 3 s long, and the combined dataset contains 668,376 audio files: happy (175,119 files), sad (172,611 files), angry (176,346 files), and normal (144,300 files). The input audio was then pre-processed to generate MFCC features. The study began by extracting features with multiple pre-trained deep learning (DL) models used as base models. The performance of the federated learning (FL) model was then evaluated on independent and identically distributed (IID) and non-IID data. Further, this paper presents a federated deep learning-based multimodal system for classifying emotions in verbal communication that uses audio datasets while satisfying data privacy requirements through DL within the FL ecosystem. According to the findings, the model trained with federated learning yields results nearly identical to those of base-model training. For IID data, the model achieved 99.71 % validation accuracy, 99.73 % precision, 99.69 % recall, and a validation loss of 0.01. The FL architecture with non-IID data surpassed these measures, with 99.97 % validation accuracy, 99.97 % precision, 99.97 % recall, and the lowest loss (0). Hence, the results support the use of models trained in a federated learning ecosystem on both identically and non-identically distributed audio features for emotion identification without loss of performance. In conclusion, the proposed techniques could be applied to identify verbal emotional deficits in individuals and could support the development of emerging technological interventions for their well-being.
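
The abstract does not disclose implementation details; the following is a minimal Python sketch of the two core steps it describes: MFCC feature extraction from 3 s clips and server-side aggregation of client model weights. The use of librosa, FedAvg-style equal-weight averaging, and n_mfcc=40 are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only; library choices and parameters are assumptions.
import numpy as np
import librosa


def extract_mfcc(path, sr=22050, duration=3.0, n_mfcc=40):
    """Load a 3 s audio clip and return its time-averaged MFCC feature vector."""
    audio, _ = librosa.load(path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # average over time frames -> shape (n_mfcc,)


def federated_average(client_weight_lists):
    """FedAvg-style aggregation: element-wise mean of per-client layer weights.

    Each element of client_weight_lists is one client's list of layer arrays;
    equal weighting assumes clients hold similarly sized data partitions.
    """
    return [
        np.mean(np.stack(layer_stack), axis=0)
        for layer_stack in zip(*client_weight_lists)
    ]
```

In a federated round, each client would train its local copy of the base model on its own (IID or non-IID) MFCC features, send only the updated weights to the server, and receive the averaged global weights back, so raw audio never leaves the client.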