The widespread availability of advanced computing technologies has highlighted the relevance of artificial intelligence (AI) applications in almost all sectors of the economy. With voice control processing built into many Internet of Things (IoT) devices, many of these devices can be operated using spoken commands. A speech-controlled environment may include several devices, each used for a separate activity, yet all of them may capture and process the same command at the same time, for instance when the devices can communicate with one another. Devices that are not equipped to handle a command can ignore it, so only the device designed to process the command carries out the corresponding activity. However, when every voice-controlled device captures the command through its microphone, there is a greater chance that the command mixes with sounds from other sources, such as the television, music systems, and other activities taking place in the household. During command recognition, any sound mixed with the primary command is regarded as noise and has to be removed. The proposed approach gives primary consideration to the direction of arrival (DOA) of the sound waves, and the performance of the proposed system is evaluated.
Based on the angle-of-arrival estimate, a specific room impulse response (RIR) is selected from a set of predefined RIRs as the room acoustic characteristic, and source separation is carried out using independent component analysis (ICA). Features are then extracted from the separated command speech signals using Mel-frequency cepstral coefficients (MFCC). A support vector machine (SVM) classifier is applied to classify these features into multiple classes. The performance of the SVM classifier is compared with that of several other classifiers commonly used in machine learning applications, including decision trees (DT). The multiclass SVM classifier is found to achieve an accuracy of 91%. Classification accuracy can be improved further with a classifier based on a probabilistic neural network (PNN). This classifier is made up of three layers: a gated recurrent unit (GRU) layer, a long short-term memory (LSTM) layer, and a layer that combines the two. It is reported to reach an accuracy of 94.5%, which is higher than that of the multiclass SVM classifier.
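The DOA-based selection of an RIR described above can be illustrated with a minimal sketch. The abstract does not state how the angle is estimated, so everything below is an assumption made for illustration: a two-microphone array, a far-field source, a delay found by brute-force cross-correlation, and a hypothetical `rir_bank` dictionary that maps measured angles (in degrees) to stored RIRs. The function names are likewise illustrative and do not come from the paper.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) of `other` relative to `ref`
    that maximizes their cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                score += x * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def doa_degrees(lag, sample_rate, mic_spacing):
    """Far-field angle from broadside for a two-microphone array:
    sin(theta) = c * tau / d, with tau the inter-microphone delay."""
    tau = lag / sample_rate
    ratio = SPEED_OF_SOUND * tau / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding outside [-1, 1]
    return math.degrees(math.asin(ratio))


def nearest_rir(angle, rir_bank):
    """Return the angle key in the (hypothetical) RIR bank closest to `angle`."""
    return min(rir_bank, key=lambda a: abs(a - angle))


# Toy signals: the same impulse reaches the second microphone 2 samples later.
mic_a = [0, 0, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 1, 0, 0, 0]

lag = estimate_delay(mic_a, mic_b, max_lag=4)
angle = doa_degrees(lag, sample_rate=16000, mic_spacing=0.1)

# Placeholder RIRs indexed by the angle at which each was measured.
rir_bank = {-30: [1.0, 0.3], 0: [1.0, 0.1], 30: [1.0, 0.4], 60: [1.0, 0.6]}
selected_angle = nearest_rir(angle, rir_bank)
```

In this toy case the estimated angle falls between the 0 and 30 degree entries and the 30 degree RIR is selected. A real system would more likely use a robust delay estimator and a larger microphone array, but the lookup step would remain a nearest-angle match against the predefined RIR set.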