In this paper, we present a new approach to speech emotion recognition (SER) that fuses deep learning-based and handcrafted audio features through an iterative feature selection and majority voting pipeline. Deep features are extracted with the wav2vec2 model and handcrafted features with the openSMILE audio processing library. Feature selection and majority voting are then applied to identify the most effective feature selection methods for the different feature sets and to combine their strengths. To ensure the robustness of the proposed method, experiments are performed on a diverse and extensive multi-corpus dataset constructed from four well-known, publicly available benchmarks: RAVDESS, SAVEE, CREMA-D, and TESS. These corpora are merged over six common emotions (sadness, happiness, fear, anger, surprise, and disgust), yielding 11,511 samples in total. The proposed method achieves results comparable to those reported in the existing literature; the pipeline yields a 3% improvement in classification accuracy, and the highest accuracy obtained on the multi-corpus dataset is 92.55%.
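To make the two feature-extraction steps concrete, the sketch below pairs a mean-pooled wav2vec2 embedding (via the Hugging Face transformers library) with openSMILE functionals (via the opensmile Python package) and concatenates them into one fused vector. The specific checkpoint (facebook/wav2vec2-base-960h) and feature set (eGeMAPSv02) are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Minimal sketch: extract deep (wav2vec2) and handcrafted (openSMILE) features
# from one utterance and concatenate them before feature selection.
# Checkpoint and feature set below are assumptions for illustration only.
import numpy as np
import torch
import opensmile
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base-960h"  # assumed checkpoint
_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
_model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
_smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # assumed config
    feature_level=opensmile.FeatureLevels.Functionals,
)


def wav2vec2_features(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Mean-pooled wav2vec2 hidden states as a fixed-length deep feature vector."""
    inputs = _extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = _model(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()


def opensmile_features(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Handcrafted utterance-level functionals extracted with openSMILE."""
    return _smile.process_signal(waveform, sr).to_numpy()[0]


def fused_features(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Concatenate both views; feature selection would operate on this vector."""
    return np.concatenate([wav2vec2_features(waveform, sr),
                           opensmile_features(waveform, sr)])
```

In this sketch the fused vector would then feed the feature selection and majority voting stages; how those stages are configured is specific to the paper and not reproduced here.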