Human-Computer Interaction (HCI) is a multidisciplinary field concerned with the design and use of computer technology, with emphasis on the interface between humans and computers. HCI aims to create systems that let users interact with computers effectively, efficiently, and pleasantly. Multiple Spoken Language Identification (SLI) for HCI (MSLI for HCI) denotes the ability of a computer system to recognize and distinguish several spoken languages, enabling more comprehensive and convenient interaction between users and technology. SLI based on deep learning (DL) uses artificial neural networks (ANNs) to automatically detect and recognize the language spoken in an audio signal. DL techniques, particularly neural networks (NNs), have succeeded in various pattern recognition tasks, including speech and language processing. This paper develops a novel Coot Optimizer Algorithm with DL-Driven Multiple SLI and Detection (COADL-MSLID) technique for HCI applications. The COADL-MSLID approach aims to detect multiple spoken languages from input audio regardless of the speaker's gender, speaking style, and age. As a first step, the COADL-MSLID technique transforms the audio files into spectrogram images. It then employs the SqueezeNet model to produce feature vectors, with the Coot Optimizer Algorithm (COA) used to tune the SqueezeNet hyperparameters. A convolutional autoencoder (CAE) model performs the SLID process. To demonstrate the value of the COADL-MSLID technique, a series of experiments was conducted on a benchmark dataset. The experimental validation shows that the COADL-MSLID technique achieves an accuracy of 98.33%, higher than that of the other techniques compared.
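
The first step named above, turning audio into a spectrogram, can be illustrated with a minimal sketch. It is not the authors' implementation: the frame length, hop size, Hann window, log scaling, and synthetic test tone are all assumptions chosen for illustration, and a naive DFT built from the Python standard library stands in for the optimized FFT that a real pipeline would likely use.

```python
import cmath
import math


def hann(n):
    """Return a Hann window of length n (reduces spectral leakage)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(signal, frame_len=64, hop=32):
    """Log-magnitude spectrogram: one row per frame, one column per frequency bin.

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    # Precompute the DFT basis once so each frame is a set of dot products.
    basis = [
        [cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len)]
        for k in range(n_bins)
    ]
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = [abs(sum(x * b for x, b in zip(frame, basis[k]))) for k in range(n_bins)]
        # Small offset avoids log(0) for silent bins.
        rows.append([20 * math.log10(m + 1e-10) for m in mags])
    return rows


# Hypothetical input: a 1 kHz sine tone sampled at 8 kHz, standing in for speech.
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(1024)]

spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64  # bin index -> frequency for frame_len=64
print(len(spec), len(spec[0]), peak_hz)
```

The resulting two-dimensional array of frames by frequency bins is the kind of time-frequency representation that can be rendered as an image and passed to a convolutional feature extractor such as SqueezeNet.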