Deep Neural Spectral Subtraction Centroid Mel Frequency Powered Secure Multimodal Biometric Authentication

Abstract

Objectives: To propose a Deep Neural Spectral Subtraction Centroid Mel Frequency Spatial-Temporal (DN-SCMFST) framework for robust multimodal biometric authentication (face and voice).

Method: The system uses the Speaking Faces dataset (142 subjects, 13,000 samples). Preprocessing is performed with a Kushner–Stratonovich filter and spectral subtraction. Features are extracted via spectral centroid and spatio-temporal descriptors, followed by score-level fusion.

Findings: DN-SCMFST achieved accuracy improvements of 5–18%, PSNR gains of 18–28%, FAR reduction of 20–37%, FRR reduction of 18–28%, and recognition time reduction of 15–27% compared with CNN+RNN and DL-MBA models.

Novelty: Integrates deep neural spectral subtraction with centroid-based mel frequency and temporal-spatial fusion, offering resilience against noise and improved real-time recognition efficiency.

Keywords: Multimodal biometric authentication, Deep neural network, Spectral subtraction, Spectral centroid, Mel frequency, Score-level fusion, Kushner–Stratonovich
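The abstract names spectral subtraction and the spectral centroid without defining them. The following is a minimal NumPy sketch of the classical, textbook forms of both, not the DN-SCMFST method itself. The test tone, noise level, frame length, and spectral floor are all hypothetical values chosen for illustration.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.01):
    """Classical magnitude spectral subtraction on a single frame.

    noise_mag is an estimate of the noise magnitude spectrum; the floor
    (an assumed value) keeps magnitudes from dropping to zero or below.
    """
    spectrum = np.fft.rfft(frame)
    mag = np.abs(spectrum)
    phase = np.angle(spectrum)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    # Reuse the noisy phase, as is usual for basic spectral subtraction.
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of one frame, in Hz."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = mag.sum()
    return float((freqs * mag).sum() / total) if total > 0 else 0.0

# Hypothetical demo: a 500 Hz tone buried in white noise.
sample_rate = 8000
n = 1024
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 500 * np.arange(n) / sample_rate)
noisy = tone + 0.3 * rng.standard_normal(n)

# Noise estimate averaged over separate noise-only frames (assumed available).
noise_frames = 0.3 * rng.standard_normal((20, n))
noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)

cleaned = spectral_subtract(noisy, noise_mag)
centroid_noisy = spectral_centroid(noisy, sample_rate)
centroid_clean = spectral_centroid(cleaned, sample_rate)
print(centroid_noisy, centroid_clean)
```

In this toy setup the broadband noise pulls the centroid well above the tone frequency, and subtraction moves it back toward 500 Hz. A real pipeline would process overlapping windowed frames rather than one block.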

Similar Papers
  • Research Article
  • Citations: 14
  • 10.5664/jcsm.7366
Effects of Increased Pharyngeal Tissue Mass Due to Fluid Accumulation in the Neck on the Acoustic Features of Snoring Sounds in Men.
  • Oct 15, 2018
  • Journal of Clinical Sleep Medicine
  • Shumit Saha + 4 more

Snoring sounds are generated by the vibration of pharyngeal tissue due to upper airway narrowing. When recorded by a microphone placed over the neck, snoring sounds pass through the pharyngeal tissue surrounding the upper airway. Thus, changes in the pharyngeal tissue content may change the acoustic properties of the snoring sounds. Rostral fluid shift and the consequent increases in neck fluid volume (NFV) and neck circumference (NC) can increase pharyngeal tissue mass. Therefore, the goal of this study was to investigate the relationship between increases in pharyngeal tissue mass, as assessed by increased NFV and NC, and snoring sound features. We obtained data from a previous study in which 20 non-obese men underwent daytime polysomnography and had their NC and NFV measured before and after sleep. During sleep, snoring sounds were recorded with a microphone placed over the neck. The spectral centroid of the snoring sounds was estimated. Then, the first five snoring segments were selected from the first and last 30 minutes of stage N2 sleep. We found a significant decrease in the snoring spectral centroid from the beginning to the end of sleep. We also found that the spectral centroid from the end of sleep in frequency ranges below 200 Hz was inversely correlated with the increases in NFV and NC from before to after sleep. These results suggest that the snoring spectral centroid can be used as a noninvasive and convenient method to assess variations in pharyngeal tissue mass.

  • Research Article
  • Citations: 18
  • 10.1080/03772063.2017.1369369
Detection and Characterization of Bearing Faults from the Frequency Domain Features of Vibration
  • Sep 21, 2017
  • IETE Journal of Research
  • P Arun + 2 more

Abstract: Vibration characteristics are widely used for the non-intrusive inspection and health monitoring of bearings. However, automated methods intended for predicting the health status of bearings depend greatly on the features extracted from the vibration signal. In this paper, the ability of frequency domain features such as spectral roll-off (SR), median frequency (MF), spectral centroid (SC), dominant frequency (DF), and spectral flux (SF) of the bearing vibration data corresponding to healthy, inner race failure (IRF), roller element defect (RED), and outer race failure (ORF) conditions to identify the state of the bearing is analyzed. The SF, DF, and SC are identified directly from the vibration spectra. The MF and SR are computed from the power spectral density estimate using an analytical method. Before computing the spectrum, the vibration signal is preconditioned with offset elimination and normalization. The normalized data is windowed with a Hanning window to suppress the ripples induced in the spectrum during the computation of the fast Fourier transform. It has been observed that, among the features, MF and SC characterize the status of the bearing and the type of fault better than the other features. MF is useful to distinguish a healthy bearing from IRF and IRF from RED. SC is useful to distinguish IRF from RED and IRF from ORF. The SR, MF, SC, DF, and SF corresponding to the vibrations acquired from normal and faulty bearings differ with p-values of 2.22045 × 10^−16, ≈ 0, 1.11022 × 10^−16, 0.0008, and 2.35957 × 10^−8, respectively, at a significance level of 0.05. SR, MF, and SC are statistically more significant than DF and SF.
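The preconditioning and feature steps described in this abstract can be sketched as follows. This is an illustrative reading, not the authors' implementation: the abstract does not give their formulas, so the definitions below are common textbook ones, and the 85% roll-off fraction and the 120 Hz test signal are assumptions.

```python
import numpy as np

def precondition(signal):
    """Offset elimination (mean removal) followed by peak normalization."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def magnitude_spectrum(signal, sample_rate):
    """Hann-windowed one-sided magnitude spectrum and its frequency axis."""
    x = precondition(signal)
    windowed = x * np.hanning(len(x))
    mag = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return freqs, mag

def frequency_features(signal, sample_rate, rolloff=0.85):
    """SC and DF from the magnitude spectrum; MF and SR from cumulative power."""
    freqs, mag = magnitude_spectrum(signal, sample_rate)
    cumulative = np.cumsum(mag ** 2)
    total = cumulative[-1]
    return {
        "spectral_centroid": float((freqs * mag).sum() / mag.sum()),
        "dominant_frequency": float(freqs[np.argmax(mag)]),
        # Frequency below which half of the spectral power lies.
        "median_frequency": float(freqs[np.searchsorted(cumulative, 0.5 * total)]),
        # Frequency below which the assumed roll-off fraction of power lies.
        "spectral_rolloff": float(freqs[np.searchsorted(cumulative, rolloff * total)]),
    }

def spectral_flux(prev_mag, mag):
    """Squared change in the magnitude spectrum between two frames."""
    return float(np.sum((mag - prev_mag) ** 2))

# Hypothetical vibration frame: 120 Hz component with a DC offset.
sample_rate = 1000
t = np.arange(1000) / sample_rate
frame = 2.5 + np.sin(2 * np.pi * 120 * t)
features = frequency_features(frame, sample_rate)
_, frame_mag = magnitude_spectrum(frame, sample_rate)
flux_same = spectral_flux(frame_mag, frame_mag)
print(features, flux_same)
```

Spectral flux needs two consecutive frames, which is why it is a separate function here; comparing a frame with itself gives zero.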

  • Conference Article
  • Citations: 42
  • 10.1109/icaecc50550.2020.9339502
Spectral Features for Emotional Speaker Recognition
  • Dec 11, 2020
  • P Sandhya + 3 more

Speaker recognition in an emotive environment is a challenging task because of the influence of emotions on speech. A speaker can be identified from speech by analyzing the features of the speech signal. Under normal conditions, identifying a speaker is not a tedious task. Identifying the speaker in an emotional environment (happy, sad, anger, surprise, sarcastic, fear, etc.) is much harder, since speech is altered by emotions and noise. The spectral features of a speech signal include Mel Frequency Cepstral Coefficients (MFCC), Shifted Delta Cepstral Coefficients (SDCC), spectral centroid, spectral roll-off, spectral flatness, spectral contrast, spectral bandwidth, chroma-stft, zero crossing rate, root mean square energy, Linear Prediction Cepstral Coefficients (LPCC), spectral subband centroid, Teager energy based MFCC, line spectral frequencies, single frequency cepstral coefficients, formant frequencies, Power Normalized Cepstral Coefficients (PNCC), etc. The features extracted from the speech signal are classified using classifiers: Support Vector Machine (SVM), Gaussian Mixture Model, Gaussian Naive Bayes, K-Nearest Neighbour, Random Forest, and a simple neural network built with Keras. Important applications include security systems in which a person is identified by voice biometrics. The work aims to identify the speaker in an emotional environment using spectral features, to classify using any of these classification techniques, and to achieve a high speaker recognition rate. Feature combinations can also be used to improve accuracy. The proposed model performed better than most of the state-of-the-art methods.

  • Research Article
  • Citations: 20
  • 10.1016/j.csl.2023.101549
Speech enhancement approach for body-conducted unvoiced speech based on Taylor–Boltzmann machines trained DNN
  • Jul 10, 2023
  • Computer Speech & Language
  • C Karthikeyan + 5 more

  • Conference Article
  • Citations: 21
  • 10.1109/wispnet.2017.8299943
Speaker recognition and verification using artificial neural network
  • Mar 1, 2017
  • Neha Chauhan + 1 more

Speaker recognition is a biometric technique that uses individual voice samples for recognition. Speaker recognition is mainly divided into speaker identification and speaker verification. In this paper, a comparative study is made between various combinations of features for speaker identification. Mel Frequency Cepstral Coefficient (MFCC) features are combined with spectral centroid and spectral subtraction and tested for improvement in efficiency. A feed-forward artificial neural network is used as the classifier. The system was tested on 30 speakers. For speaker identification, an average identification rate of 65.3% is achieved when MFCC is combined with centroid features, and an identification rate of 60% is achieved when MFCC is combined with spectral subtraction. For speaker verification, an average verification rate of 65.7% is achieved when MFCC is combined with spectral subtraction, and a verification rate of 75.3% is achieved when MFCC is used along with centroid.

  • Conference Article
  • Citations: 8
  • 10.1109/icacdot.2016.7877702
Score level fusion based Multimodal biometric identification using Thepade's Sorted Ternary Block Truncation coding with variod proportion of Iris, Palmprint, Left Fingerprint & Right Fingerprint with asorted similarity measures & different Colorspaces
  • Sep 1, 2016
  • Manisha S Madane + 1 more

Biometrics automates the measurement and recording of an individual's physical characteristics for use in subsequent personal identification. A multimodal biometric authentication system uses two or more biometric traits, which provides more immunity against spoofing. The proposed technique is developed to reduce feature vector size and to improve the genuine acceptance rate. In this paper, fusion of left fingerprint, right fingerprint, iris, and palmprint is tested. Multimodal biometrics supports fusion at different levels, such as score level, feature level, and decision level; this paper considers score level fusion, which the authors describe as carrying richer and more useful information. Different score proportions are tested, and performance is measured using the genuine acceptance ratio (GAR). The true acceptance rate of recognition increases because multiple biometric traits are used. In the proposed technique, features are extracted using Thepade's sorted ternary block truncation coding (TSTBTC). Using TSTBTC with the matching score proportion Iris : Palmprint : Left Fingerprint : Right Fingerprint of 40∶2∶1∶1 gives better performance, as indicated by a higher GAR of 71.86%.
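A proportion such as 40∶2∶1∶1 can be read as weights in a weighted-sum fusion rule. The sketch below shows that reading only; the abstract does not state the exact fusion formula, so the weighted mean is an assumption, and the per-trait match scores are hypothetical values on a 0 to 1 scale.

```python
def fuse_scores(scores_by_trait, proportions):
    """Weighted-mean score-level fusion; proportions act as relative weights."""
    total_weight = sum(proportions[trait] for trait in scores_by_trait)
    weighted_sum = sum(
        proportions[trait] * score for trait, score in scores_by_trait.items()
    )
    return weighted_sum / total_weight

# Proportion reported in the abstract (iris : palmprint : left : right fingerprint).
PROPORTIONS = {
    "iris": 40,
    "palmprint": 2,
    "left_fingerprint": 1,
    "right_fingerprint": 1,
}

# Hypothetical normalized match scores for one authentication attempt.
scores = {
    "iris": 0.9,
    "palmprint": 0.5,
    "left_fingerprint": 0.4,
    "right_fingerprint": 0.6,
}

fused = fuse_scores(scores, PROPORTIONS)
print(round(fused, 4))
```

With these weights the iris score dominates: it contributes 40 of the 44 weight units, so the fused score stays close to the iris score regardless of the other three traits.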

  • Research Article
  • 10.24012/dumf.1675408
EEG-Based Comparative Study of Brain Activity during Imagined Natural and Induced Water and Saliva Swallowing
  • Sep 30, 2025
  • Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi
  • Sevgi Gökçe Aslan

Dysphagia often makes eating and drinking painful, stressful, and socially isolating, potentially leading to malnutrition, dehydration, weight loss, and respiratory infections. In this study, the relationship between swallowing and brain signals was examined to contribute to the electrophysiological understanding of the imagination of swallowing and the rehabilitation of dysphagia patients. To examine the swallowing event, three different experiments were conducted. The experiments included (i) natural water swallowing, (ii) swallowing saliva in an induced manner, and (iii) swallowing a sip of water in an induced manner. Visual cues on a computer monitor were used to induce the perception of swallowing and imagination. EEG data from 16 channels obtained during 15 trials of these experimental paradigms from 30 subjects (15 men) were subjected to different processes such as noise removal, selection of signal segments corresponding to the imagination of swallowing, extraction of frequency domain features, and statistical analysis. Eleven features such as spectral centroid, mean and median frequency, delta, theta, alpha and beta band powers, and relative band powers obtained from 16 channels (a total of 176 features) were first subjected to the Shapiro-Wilk normality test individually. Based on this test, the statistical analyses were carried out with a repeated measures one-way ANOVA test for the features with normal distribution (spectral centroid from 11 channels), and the Friedman test for the features with non-normal distribution (spectral centroid from the remaining 5 channels and all other features from 16 channels). These tests show that 76.7% of all features yield statistically significant differences between the 3 different swallowing approaches.
We suggest that identifying discriminative EEG-based features could significantly contribute to the development of novel brain-machine interface applications for dysphagia rehabilitation.

  • Research Article
  • Citations: 107
  • 10.1016/j.ijleo.2014.07.027
Multimodal biometric authentication based on score level fusion of finger biometrics
  • Oct 25, 2014
  • Optik
  • Jialiang Peng + 3 more

  • Research Article
  • Citations: 80
  • 10.1016/j.specom.2011.01.005
Investigation of spectral centroid features for cognitive load classification
  • Jan 18, 2011
  • Speech Communication
  • Phu Ngoc Le + 4 more

  • Research Article
  • Citations: 6
  • 10.1002/cpe.5065
Improvised emotion and genre detection for songs through signal processing and genetic algorithm
  • Dec 3, 2018
  • Concurrency and Computation: Practice and Experience
  • R. Geetha Ramani + 1 more

Summary: Musical tunes are bundles of chords that represent emotion and span diverse genres. A large body of past research addresses emotion and genre classification, and the field continues to advance rapidly. Music has various emotional forms such as happy, sad, anger, and fear. Its genre forms include Classical, Country, Disco, Hip‐hop, Jazz, and Rock. These emotions and genres can be separated by identifying the frequency of chord notes (swarams in Tamil music). This paper deals with identifying emotion and genre for classical music, both Western and South Indian classical (Carnatic) music. The music was clipped and segmented to determine the frequency of notes using the Short-Time Fourier Transform (STFT). Music features such as mel frequency, pitch beat, zero crossing rate, and spectral centroid were derived from the obtained frequency. Based on the audio features, emotion and genre were identified for the given data set with a genetic algorithm as the classification technique. The MIREX‐Mood classification dataset was considered for listing out emotions. Songs from the Million Song Dataset and an emotion classification repository were considered as ground truth for Western classical music, and a group of Illayaraja Tamil film songs was considered to identify Carnatic music emotions. Mel frequency, pitch, and zero crossing rate were considered as individual representations to get the best fit ratio, which was found to give an accuracy of 99.03%.

  • Research Article
  • Citations: 44
  • 10.1121/1.4743257
A cross-linguistic acoustic study of fricatives
  • Nov 1, 2000
  • The Journal of the Acoustical Society of America
  • Matthew K Gordon + 2 more

This work presents results of an acoustic study of fricatives in 7 languages (Aleut, Chickasaw, Hupa, Montana Salish, Scottish Gaelic, Toda, and Western Apache), all of which contrast fricatives made at several places of articulation. Measurements of the frequency of spectral peaks and centroid frequencies indicate many similarities between the languages in the acoustic properties defining the fricatives. Some of the principal findings are the following. Alveolar sibilants typically have the highest spectral peak and centroid frequency. Lateral and palatoalveolar fricatives have spectral peaks and centroids intermediate in frequency between alveolar sibilants and backer fricatives. Among the back fricatives, peaks and centroids of uvulars are characteristically lower than those of velars. Rounding of back fricatives induces further lowering of peaks and/or centroids. Contrasts in backness and rounding among the back fricatives are also associated with differences in F2 of the following vowels: F2 values are lower following uvulars than velars, and lower following rounded than unrounded fricatives. Labiodental fricatives typically have flat spectra with poorly defined spectral peaks. Finally, the contrast between lateral fricatives and palatoalveolar sibilants is variably realized, depending on language and speaker, as a difference in the location of spectral peaks and/or centroid frequency.

  • Research Article
  • Citations: 9
  • 10.22266/ijies2016.0930.03
Two Levels Fusion Based Multimodal Biometric Authentication Using Iris and Fingerprint Modalities
  • Sep 30, 2016
  • International Journal of Intelligent Engineering and Systems
  • Vedururu Sireesha + 1 more

In this paper, we present a technique for multimodal biometric authentication. First, the input image is preprocessed and passed to feature extraction, where a modified local binary pattern is used. The extracted features are then passed to feature level and score level fusion. In feature level fusion, the extracted features are passed to the GSO, which shortlists the optimal features, and these are fed to an optimized neural network that detects the iris and fingerprint image. In score level fusion, features extracted from the iris image are passed to the PSO and a naive Bayes classifier, yielding one score value. Features extracted from the fingerprint image are then passed to the AGFS, yielding a second score value. Finally, both score values are combined. The evaluation uses precision, FAR, and FRR. The proposed method is implemented in MATLAB.
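FAR (false acceptance rate) and FRR (false rejection rate) are computed from match scores at a decision threshold. The sketch below uses the standard definitions and written in Python rather than the authors' MATLAB; the scores and the threshold are hypothetical.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when scores at or above the threshold are accepted.

    FAR: fraction of impostor attempts wrongly accepted.
    FRR: fraction of genuine attempts wrongly rejected.
    """
    false_accepts = sum(score >= threshold for score in impostor_scores)
    false_rejects = sum(score < threshold for score in genuine_scores)
    return (
        false_accepts / len(impostor_scores),
        false_rejects / len(genuine_scores),
    )

# Hypothetical fused match scores, higher meaning a better match.
genuine = [0.91, 0.84, 0.77, 0.62, 0.95]
impostor = [0.12, 0.35, 0.58, 0.71, 0.20]

far, frr = far_frr(genuine, impostor, threshold=0.7)
print(far, frr)
```

Raising the threshold lowers FAR and raises FRR, so the two are usually reported together or summarized at the point where they are equal.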

  • Conference Article
  • Citations: 101
  • 10.21437/interspeech.2013-203
Speech activity detection on youtube using deep neural networks
  • Aug 25, 2013
  • Neville Ryant + 2 more

Speech activity detection (SAD) is an important first step in speech processing. Commonly used methods (e.g., frame-level classification using Gaussian mixture models (GMMs)) work well under stationary noise conditions, but do not generalize well to domains such as YouTube, where videos may exhibit a diverse range of environmental conditions. One solution is to augment the conventional cepstral features with additional, hand-engineered features (e.g., spectral flux, spectral centroid, multiband spectral entropies) which are robust to changes in environment and recording condition. An alternative approach, explored here, is to learn robust features during the course of training using an appropriate architecture such as deep neural networks (DNNs). In this paper we demonstrate that a DNN with input consisting of multiple frames of mel frequency cepstral coefficients (MFCCs) yields drastically lower frame-wise error rates (19.6%) on YouTube videos compared to a conventional GMM-based system (40%).
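A DNN input "consisting of multiple frames of MFCCs" is typically built by splicing each frame with its neighbours. The sketch below shows one common way to do that; the context width, the edge padding, and the 13-coefficient placeholder matrix are assumptions, since the abstract does not give these details.

```python
import numpy as np

def stack_frames(features, left=5, right=5):
    """Concatenate each frame with its left/right context frames.

    features has shape (n_frames, n_coeffs); the result has shape
    (n_frames, (left + right + 1) * n_coeffs). Edge frames are repeated
    so every frame gets a full context window.
    """
    n_frames = features.shape[0]
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    window = left + right + 1
    return np.stack([padded[i:i + window].reshape(-1) for i in range(n_frames)])

# Placeholder matrix standing in for 20 frames of 13 MFCCs each.
mfcc = np.arange(20 * 13, dtype=float).reshape(20, 13)
stacked = stack_frames(mfcc, left=2, right=2)
print(stacked.shape)
```

Each row of the result is one training example for a frame-level classifier, with the current frame's coefficients in the middle of the vector.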

  • Research Article
  • Citations: 64
  • 10.2478/s11772-008-0054-8
Multimodal biometric authentication based on score level fusion using support vector machine
  • Nov 18, 2008
  • Opto-Electronics Review
  • F Wang + 1 more

  • Research Article
  • Citations: 1
  • 10.17654/0974165825009
MULTIMODAL BIOMETRIC ENROLMENT AND AUTHENTICATION SYSTEM (MBEAS) WITH MODIFIED SCORE-LEVEL FUSION AND TRIBLENDNN-BASED TEMPLATE MATCHING
  • Nov 30, 2024
  • Advances and Applications in Discrete Mathematics
  • Abdulgader Zaid Almaymuni + 3 more

This article proposes a multimodal biometric enrolment and authentication system (MBEAS) with a modified score-level fusion and TriBlendNN-based template matching approach. It consists of four phases: (i) enrolment phase, (ii) security phase, (iii) storage phase, and (iv) authentication phase. Initially, the raw data of the iris, face, hand, speech, signature, handwriting, fingerprint, and keystroking are collected from the BiosecurID database, and then the raw images are pre-processed via resizing and cropping. The raw speech signal is pre-processed via wavelet denoising and spectral subtraction. The raw keystroking data is pre-processed via Z-score normalization. Then the pre-processed images of iris, face, signature, hand, handwriting, and fingerprints are segmented via optimized watershed segmentation. The pre-processed speech signal is segmented via VAD. From the segmented data, the optimal features are extracted for iris (LBP), face (IncepV3), signature and handwriting (shape features and GLCM), speech (MFCC), fingerprints (minutiae extraction), hand (palm print features), and keystroking. From the feature-extracted data, the feature fusion is then performed via modified score-level fusion. After the enrolment phase, the feature-fused data is secured using watermarking. Then the watermarked data is stored in cloud storage. The final stage is the authentication stage, wherein the template matching is processed via the newly proposed TriBlendNN model. The proposed TriBlendNN model is a combination of the CNN, RNN, and Bi-LSTM. The final outcome comes from template matching. The proposed model is implemented in Python, and its accuracies at learning rates of 70% and 80% are 96.89% and 97.76%, respectively.
