Automatic sound classification attracts increasing research attention owing to its wide range of applications, such as robot navigation, environmental sensing, musical instrument classification, medical diagnosis, and surveillance. In this research, we propose an ensemble convolutional bidirectional Long Short-Term Memory (CBiLSTM) network with optimal hyper-parameter selection for sound classification. We first transform each audio signal into a spectrogram representation using the Short-Time Fourier Transform (STFT). A Particle Swarm Optimization (PSO) variant is then proposed to optimize the learning rate, the weight decay, and the numbers of filters and hidden units in the convolutional and BiLSTM layers, respectively, so that the network extracts effective spatial–temporal characteristics from the spectrogram inputs. To tackle stagnation during optimization, the proposed algorithm incorporates local exploitation using the secant and Newton–Raphson methods, generation of promising leaders using regular and irregular super-ellipse formulae, and three-dimensional spherical search coefficients. Moreover, it combines multiple fused elite signals with numerical-analysis-based exploitation to balance diversification and intensification. A variety of CBiLSTM networks with distinct optimized settings are devised, and an ensemble model is then constructed by combining three of the resulting networks through a majority voting scheme. Evaluated on several audio data sets, our ensemble CBiLSTM networks outperform CBiLSTM networks with default settings and with settings optimized by other search methods, as well as existing deep architectures and state-of-the-art related studies. Beyond sound classification, the proposed PSO algorithm also outperforms a number of classical and advanced search methods, with statistical significance, on diverse unimodal and multimodal benchmark functions.
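As a rough illustration of the two ends of the pipeline described above, the following minimal sketch computes an STFT magnitude spectrogram and fuses the class predictions of three models by majority voting. It is not the authors' implementation: the function names, the Hann window, the frame length and hop size, and the tie-breaking rule (lowest class label wins) are all assumptions made for this example, and the CBiLSTM networks and the PSO variant are not reproduced here.

```python
import numpy as np


def stft_spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform.

    frame_len and hop are illustrative defaults, not values from the paper.
    Returns an array of shape (frequency_bins, time_frames).
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One real-input FFT per frame; transpose so that time runs along axis 1.
    return np.abs(np.fft.rfft(frames, axis=1)).T


def majority_vote(predictions, n_classes):
    """Fuse per-model class labels by majority voting.

    predictions: integer array of shape (n_models, n_samples).
    Ties are broken in favour of the lowest class label (an assumption).
    """
    predictions = np.asarray(predictions)
    n_samples = predictions.shape[1]
    votes = np.zeros((n_classes, n_samples), dtype=int)
    for model_preds in predictions:
        # Each model casts exactly one vote per sample.
        votes[model_preds, np.arange(n_samples)] += 1
    return votes.argmax(axis=0)


if __name__ == "__main__":
    # Hypothetical input: a 1 kHz tone sampled at 8 kHz.
    t = np.arange(1024) / 8000.0
    spec = stft_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
    print(spec.shape)

    # Hypothetical label predictions from three trained networks.
    preds = [[0, 1, 2, 1], [0, 2, 2, 0], [1, 1, 0, 2]]
    print(majority_vote(preds, n_classes=3))
```

With three voters, a sample on which all three models disagree has no majority, so some tie-breaking rule is needed; the abstract does not state which one is used, and averaging class probabilities would be one alternative.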