Enhanced Multimodal Speech Processing for Healthcare Applications: A Deep Fusion Approach
- Conference Article
5
- 10.1117/12.2262295
- May 1, 2017
Person re-identification is the task of correctly matching visual appearances of the same person in image or video data while distinguishing appearances of different persons. The traditional setup for re-identification is a network of fixed cameras. However, in recent years mobile aerial cameras mounted on unmanned aerial vehicles (UAV) have become increasingly useful for security and surveillance tasks. Aerial data has many characteristics different from typical camera network data. Thus, re-identification approaches designed for a camera network scenario can be expected to suffer a drop in accuracy when applied to aerial data. In this work, we investigate the suitability of features, which were shown to give robust results for re-identification in camera networks, for the task of re-identifying persons between a camera network and a mobile aerial camera. Specifically, we apply hand-crafted region covariance features and features extracted by convolutional neural networks which were learned on separate data. We evaluate their suitability for this new and as yet unexplored scenario. We investigate common fusion methods to combine the hand-crafted and learned features and propose our own deep fusion approach which is already applied during training of the deep network. We evaluate features and fusion methods on our own dataset. The dataset consists of fourteen people moving through a scene recorded by four fixed ground-based cameras and one mobile camera mounted on a small UAV. We discuss strengths and weaknesses of the features in the new scenario and show that our fusion approach successfully leverages the strengths of each feature and outperforms all single features significantly.
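As an illustrative sketch (not the authors' implementation), the simplest of the fusion baselines this abstract compares against — feature-level fusion of hand-crafted and learned descriptors — can look as follows. All names and dimensions here are hypothetical, and random vectors stand in for real region-covariance and CNN features.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each feature vector to unit length so no modality dominates."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse_features(cov_feat, cnn_feat):
    """Feature-level fusion: normalize each descriptor, then concatenate."""
    return np.concatenate([l2_normalize(cov_feat), l2_normalize(cnn_feat)], axis=-1)

def match_gallery(query, gallery):
    """Rank gallery identities by Euclidean distance to the query descriptor."""
    dists = np.linalg.norm(gallery - query, axis=1)
    return np.argsort(dists)

# toy example: 3 gallery persons, fused 8-D descriptors (4 hand-crafted + 4 learned)
rng = np.random.default_rng(0)
gallery = fuse_features(rng.normal(size=(3, 4)), rng.normal(size=(3, 4)))
query = gallery[1] + 0.01 * rng.normal(size=8)   # slightly perturbed view of person 1
ranking = match_gallery(query, gallery)          # person 1 should rank first
```

The deep fusion the paper proposes differs in that the combination is learned during network training rather than applied post hoc as above.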
- Research Article
7
- 10.1061/(asce)he.1943-5584.0001694
- Jul 18, 2018
- Journal of Hydrologic Engineering
Reservoir inflow forecast plays a crucial part in programming, development, operation, and management of water resource systems. To better reveal the complex properties of daily reservoir inflow, a clustered deep fusion (CDF) approach is proposed in this paper. First, variational mode decomposition (VMD) is used to decompose the daily reservoir inflow series into multiple modes, which are clustered into different sets by fuzzy c-means according to the Xie-Beni index in view of the attribute domain. In each cluster, a deep autoencoder model (DAE) is developed for deep representations of the attributes in the deep domain. DAE outputs are finally fused at the synthesis domain into the forecasting results using random forest (RF). In this way, the inflow time series may be successively observed in the attribute domain, deep domain, and synthesis domain, which results in a clearer understanding of the reservoir inflow trend. The present approach is modeled and evaluated using historical data collected from the Three Gorges Reservoir, China. For comparison, two kinds of learning patterns—deep learning (VMD-DAE-RF and DAE) and shallow learning (feed-forward neural network, least-squares support vector regression, and RF)—are applied to the same case. The results indicate that the proposed CDF model outperforms all comparison models in terms of mean absolute percentage error (6.174%), root mean-square error (1,077.428 m³/s), and correlation coefficient criteria (0.987). Thus, it is concluded that deep learning in the cluster fusion architecture is more promising.
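The decompose-then-predict-then-recombine pattern behind CDF can be sketched in miniature. This is not the paper's VMD/fuzzy-c-means/DAE/RF pipeline: a crude two-band moving-average split stands in for VMD, one least-squares autoregressor per mode stands in for the per-cluster DAEs, and simple summation stands in for the RF synthesis step. All numbers are synthetic.

```python
import numpy as np

def two_band_decompose(x, window=5):
    """Stand-in for VMD: split the series into a smooth trend and a residual mode."""
    kernel = np.ones(window) / window
    trend = np.convolve(x, kernel, mode="same")
    return np.stack([trend, x - trend])          # shape: (n_modes, n_samples)

def make_lagged(mode, n_lags=3):
    """Build (lag-features, target) pairs from one decomposed mode."""
    X = np.stack([mode[i:len(mode) - n_lags + i] for i in range(n_lags)], axis=1)
    return X, mode[n_lags:]

def fit_linear(X, y):
    """Least-squares stand-in for the per-cluster deep predictor."""
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
    return w

def predict_linear(w, X):
    return np.c_[X, np.ones(len(X))] @ w

# synthetic daily-inflow-like series: slow seasonal cycle plus noise
rng = np.random.default_rng(1)
t = np.arange(200)
inflow = 1000 + 300 * np.sin(2 * np.pi * t / 50) + 20 * rng.normal(size=200)

modes = two_band_decompose(inflow)
per_mode_preds = []
for mode in modes:                               # one predictor per mode
    X, y = make_lagged(mode)
    per_mode_preds.append(predict_linear(fit_linear(X, y), X))
forecast = np.sum(per_mode_preds, axis=0)        # synthesis step: recombine modes
target = inflow[3:]
mape = np.mean(np.abs((forecast - target) / target)) * 100
```

Even this toy version shows why decomposition helps: the smooth mode is highly predictable from its own lags, so most of the forecast error is confined to the noisy residual mode.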
- Conference Article
- 10.1109/bigdata47090.2019.9006395
- Dec 1, 2019
Fusion has been widely used in the machine learning community, especially for problems dealing with multiple input sources and classifiers. The general strategy for information fusion in deep neural networks is to directly concatenate the embedding features of the input sources in the latent space. However, it is very hard to capture the relative importance of the fused sources. It is also impossible to learn the correlations among fused multimodal inputs, e.g., intra-class and inter-class similarities. Besides, most existing deep learning fusion approaches use a universal fusion-weights strategy, which cannot fully exploit the relative importance of different inputs. In order to address these problems, in this work we propose an Adaptive Weighted Deep Fusion scheme (AWDF) to capture potential relationships among various input sources. It integrates feature-level and decision-level fusion in one framework. Furthermore, in order to address the limitations of existing fusion models with fixed weights, we propose a new scheme named the Cross Decision Weights Method (CDWM). It can dynamically learn the weight for each input branch during the fusion process instead of utilizing pre-defined weights. To evaluate the performance of AWDF, we conduct experiments on three different real-world datasets: the Wild Business Terms (WBT) Dataset, the Iceberg Detection Dataset, and the CareerCon Dataset. Our experimental results demonstrate the superiority of AWDF over other fusion approaches.
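The core idea of learned (rather than pre-defined) fusion weights can be illustrated with a minimal sketch. This is not CDWM itself: it only shows a forward pass in which per-branch decisions are combined with softmax-normalised weights, the quantity a scheme like CDWM would learn during training. All values are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_decision_fusion(branch_probs, weight_logits):
    """
    branch_probs:  (n_branches, n_classes) class probabilities from each input branch
    weight_logits: (n_branches,) learnable scores, softmax-normalised to fusion weights
    """
    weights = softmax(weight_logits)
    fused = (weights[:, None] * branch_probs).sum(axis=0)  # weight each branch's vote
    return fused, weights

# toy: 3 branches voting over 2 classes; the reliable branch has the largest logit
branch_probs = np.array([[0.9, 0.1],
                         [0.4, 0.6],
                         [0.3, 0.7]])
weight_logits = np.array([2.0, 0.0, 0.0])   # e.g. learned during training, not fixed
fused, weights = weighted_decision_fusion(branch_probs, weight_logits)
```

Because the weights are logits passed through a softmax, gradient descent can move them freely while the fused output remains a valid probability distribution.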
- Research Article
33
- 10.1007/s00034-019-01094-1
- Mar 21, 2019
- Circuits, Systems, and Signal Processing
Recently, neural network-based deep learning methods have been popularly applied to computer vision, speech signal processing and other pattern recognition areas. Remarkable success has been demonstrated by using the deep learning approaches. The purpose of this article is to provide a comprehensive survey for the neural network-based deep learning approaches on acoustic event detection. Different deep learning-based acoustic event detection approaches are investigated with an emphasis on both strongly labeled and weakly labeled acoustic event detection systems. This paper also discusses how deep learning methods benefit the acoustic event detection task and the potential issues that need to be addressed for prospective real-world scenarios.
- Research Article
1
- 10.53759/0088/jbsha202101016
- Jul 5, 2021
- Journal of Biomedical and Sustainable Healthcare Applications
Digital image fusion has advanced significantly in governments and civil domains since its introduction in the late 1980s, certainly image fusion of infrared light, materials characterization, remote sensing data fusion, visions segmentation techniques, and brain tumor detection fusion. In medical diagnostics, imaging technology is critical. Because single medical pictures cannot match the demands of diagnostic techniques, which necessitate a huge quantity of data, image fusion study has become a hot subject. Single-mode integration and multi - modal fusion is the two types of medical image processing. Due to the limitations of single-modal fusion's data, many scientists are investigating multidimensional fusion. Brain tumor detection fusion represents the operations of integrating multiple images from imaging modality to formulate fused images with larger volume of data, allowing medical images to be more clinically useful. In this article, we focus on providing a survey of multi-modal image fusion approaches with central focus on novel developments in the domain based on the present fusion approaches, incorporating deep learning fusion approaches. Lastly, this concludes that contemporary multi-modal image fusion study findings are significantly fundamental, and the development trends is on the increase, however there are several hurdles in the study area.
- Conference Article
5
- 10.1109/iros47612.2022.9981835
- Oct 23, 2022
Motion estimation approaches typically employ sensor fusion techniques, such as the Kalman Filter, to handle individual sensor failures. More recently, deep learning-based fusion approaches have been proposed, increasing the performance and requiring less model-specific implementation. However, current deep fusion approaches often assume that sensors are synchronised, which is not always practical, especially for low-cost hardware. To address this limitation, in this work, we propose AFT-VO, a novel transformer-based sensor fusion architecture to estimate visual odometry (VO) from multiple sensors. Our framework combines predictions from asynchronous multi-view cameras and accounts for the time discrepancies of measurements coming from different sources. Our approach first employs a Mixture Density Network (MDN) to estimate the probability distributions of the 6-DoF poses for every camera in the system. Then, a novel transformer-based fusion module, AFT-VO, is introduced, which combines these asynchronous pose estimations along with their confidences. More specifically, we introduce Discretiser and Source Encoding techniques which enable the fusion of multi-source asynchronous signals. We evaluate our approach on the popular nuScenes and KITTI datasets. Our experiments demonstrate that multi-view fusion for VO estimation provides robust and accurate trajectories, outperforming the state of the art in both challenging weather and lighting conditions.
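A minimal sketch of what Discretiser and Source Encoding steps might look like, under the assumption (not taken from the paper) that each measurement becomes a token carrying its features, a one-hot source identifier, and an integer time-bin index. Feature values, bin width, and dimensions are all invented for illustration.

```python
import numpy as np

def discretise(timestamps, t0, bin_width):
    """Map continuous timestamps onto integer time bins (the 'Discretiser' idea)."""
    return ((timestamps - t0) / bin_width).astype(int)

def encode_tokens(features, timestamps, source_id, n_sources, t0, bin_width):
    """
    Turn one sensor's asynchronous measurements into fusion tokens:
    [feature ... | one-hot source id | time-bin index].
    """
    bins = discretise(timestamps, t0, bin_width)
    one_hot = np.eye(n_sources)[np.full(len(features), source_id)]
    return np.hstack([features, one_hot, bins[:, None].astype(float)])

# two cameras with unsynchronised clocks, 2-D pose features each (illustrative only)
cam0_feat = np.array([[0.1, 0.2], [0.3, 0.4]])
cam0_time = np.array([0.00, 0.11])
cam1_feat = np.array([[0.5, 0.6]])
cam1_time = np.array([0.07])

tokens = np.vstack([
    encode_tokens(cam0_feat, cam0_time, source_id=0, n_sources=2, t0=0.0, bin_width=0.05),
    encode_tokens(cam1_feat, cam1_time, source_id=1, n_sources=2, t0=0.0, bin_width=0.05),
])
```

Once every measurement is a token with explicit time and source information, a transformer can attend across all of them without any synchronisation assumption.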
- Conference Article
3
- 10.1109/irc.2020.00067
- Nov 1, 2020
Understanding and interpreting a scene is a key task of environment perception for autonomous driving, which is why autonomous vehicles are equipped with a wide range of different sensors. Semantic segmentation of sensor data provides valuable information for this task and is often seen as key enabler. In this paper, we are presenting a deep learning approach for 3D semantic segmentation of lidar point clouds. The proposed architecture uses a range view representation of 3D point clouds and additionally exploits camera features to increase accuracy and robustness. In contrast to other approaches, which fuse lidar and camera feature maps once, we fuse them iteratively and at different scales inside our network architecture. We demonstrate the benefits of the presented iterative deep fusion approach over single fusion approaches on a large benchmark dataset. Our evaluation shows considerable improvements, resulting from the additional use of camera features. Furthermore, our fusion strategy outperforms the current state-of-the-art strategy by a considerable margin. Despite the use of camera features, the presented approach is also trainable solely with point cloud labels.
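The difference between fusing camera features once and fusing them iteratively at multiple scales can be sketched with toy feature maps. This is not the paper's architecture: plain average pooling stands in for the encoder's downsampling, and channel concatenation stands in for whatever learned fusion operation the network uses. All shapes are invented.

```python
import numpy as np

def avg_pool2d(x, k=2):
    """Downsample a (C, H, W) feature map by average pooling with stride k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def fuse(lidar_feat, camera_feat):
    """One fusion step: concatenate camera features along the channel axis."""
    return np.concatenate([lidar_feat, camera_feat], axis=0)

# range-view lidar and camera feature maps at full resolution (channels, H, W)
lidar = np.ones((4, 8, 16))
camera = np.ones((2, 8, 16))

# iterative fusion: re-inject camera features at every scale of the encoder
feat = fuse(lidar, camera)             # scale 1: (6, 8, 16)
feat = avg_pool2d(feat)                # encoder downsamples to (6, 4, 8)
feat = fuse(feat, avg_pool2d(camera))  # scale 2: camera features fused again
```

The point of the repetition is that camera information stays available to deeper layers instead of being diluted after a single early concatenation.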
- Research Article
1
- 10.1002/ima.22925
- May 29, 2023
- International Journal of Imaging Systems and Technology
Fighting against COVID-19: Innovations and applications
- Research Article
3
- 10.1016/j.iintel.2023.100061
- Oct 2, 2023
- Journal of Infrastructure Intelligence and Resilience
Bridge condition rating is a challenging task, as it largely depends on the experience level of the manual inspection and is therefore prone to human error. The inspection report often consists of a collection of images and sequences of sentences (text) explaining the condition of the considered bridge. In a routine manual bridge inspection, an inspector collects a set of images and textual descriptions of bridge components and assigns an overall condition rating (ranging between 0 and 9) based on the collected information. Unfortunately, this method of bridge inspection has been shown to yield inconsistent condition ratings that correlate with inspector experience. To improve the consistency among image-text inspection data and further predict the accordant condition ratings, this study first provides a collective image-text dataset, extracted from the collection of bridge inspection reports from the Virginia Department of Transportation. Using this dataset, we have developed novel deep learning-based methods for automatic bridge condition rating prediction based on data fusion between the textual and visual data from the collected report sets. Our proposed multi-modal deep fusion approach constructs visual and textual representations for images and sentences separately using appropriate encoding functions, and then fuses the representations of images and text to enhance the multi-modal prediction performance of the assigned condition ratings. Moreover, we study interpretations of the deployed deep models using saliency maps to identify parts of the image-text inputs that are essential in condition rating predictions. The findings of this study point to potential improvements by leveraging consistent image-text inspection data collection as well as leveraging the proposed deep fusion model to improve bridge condition rating prediction from both visual and textual reports.
- Research Article
84
- 10.1097/aud.0000000000000537
- Jul 1, 2018
- Ear & Hearing
We investigate the clinical effectiveness of a novel deep learning-based noise reduction (NR) approach under noisy conditions with challenging noise types at low signal to noise ratio (SNR) levels for Mandarin-speaking cochlear implant (CI) recipients. The deep learning-based NR approach used in this study consists of two modules: noise classifier (NC) and deep denoising autoencoder (DDAE), thus termed (NC + DDAE). In a series of comprehensive experiments, we conduct qualitative and quantitative analyses on the NC module and the overall NC + DDAE approach. Moreover, we evaluate the speech recognition performance of the NC + DDAE NR and classical single-microphone NR approaches for Mandarin-speaking CI recipients under different noisy conditions. The testing set contains Mandarin sentences corrupted by two types of maskers, two-talker babble noise, and a construction jackhammer noise, at 0 and 5 dB SNR levels. Two conventional NR techniques and the proposed deep learning-based approach are used to process the noisy utterances. We qualitatively compare the NR approaches by the amplitude envelope and spectrogram plots of the processed utterances. Quantitative objective measures include (1) normalized covariance measure to test the intelligibility of the utterances processed by each of the NR approaches; and (2) speech recognition tests conducted by nine Mandarin-speaking CI recipients. These nine CI recipients use their own clinical speech processors during testing. The experimental results of objective evaluation and listening test indicate that under challenging listening conditions, the proposed NC + DDAE NR approach yields higher intelligibility scores than the two compared classical NR techniques, under both matched and mismatched training-testing conditions. 
When compared to the two well-known conventional NR techniques under challenging listening conditions, the proposed NC + DDAE NR approach has superior noise suppression capabilities and introduces less distortion to the key speech envelope information, thus improving speech recognition more effectively for Mandarin-speaking CI recipients. The results suggest that the proposed deep learning-based NR approach can potentially be integrated into existing CI signal processors to overcome the degradation of speech perception caused by noise.
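The two-module structure described above — a noise classifier routing each input to a noise-type-specific denoiser — can be sketched at a toy scale. This is not the actual NC or DDAE: template matching stands in for the classifier, a per-type scalar gain stands in for the trained autoencoder, and all numbers are invented.

```python
import numpy as np

def noise_classifier(frame):
    """Stand-in NC module: pick the noise class whose template best matches the frame."""
    templates = {"babble": np.array([1.0, 0.0]), "jackhammer": np.array([0.0, 1.0])}
    scores = {name: frame @ t for name, t in templates.items()}
    return max(scores, key=scores.get)

def denoise(frame, noise_type):
    """Stand-in DDAE: apply the denoiser 'trained' for the detected noise type."""
    gains = {"babble": 0.8, "jackhammer": 0.5}   # illustrative per-type suppression
    return gains[noise_type] * frame

frame = np.array([0.2, 0.9])        # a 'jackhammer-like' spectral frame (toy data)
label = noise_classifier(frame)     # route to the matching denoiser
clean = denoise(frame, label)
```

The design choice the abstract highlights is exactly this routing: a denoiser specialised per noise type can suppress more aggressively than a single universal model.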
- Research Article
- 10.65521/ijacect.v12i2.139
- Apr 15, 2025
- International Journal on Advanced Computer Engineering and Communication Technology
Deep learning approaches have revolutionized the field of speech recognition and synthesis, enabling significant advancements in natural language processing (NLP) technologies. This abstract explores the application of deep learning techniques in speech recognition and synthesis and highlights their impact on various domains, including human-computer interaction, virtual assistants, and accessibility tools. Deep learning models, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer architectures, have demonstrated remarkable performance in speech recognition tasks by effectively capturing temporal and spatial dependencies in audio data. These models leverage large-scale datasets and sophisticated training techniques, such as transfer learning and data augmentation, to achieve state-of-the-art accuracy and robustness in speech recognition. In addition to speech recognition, deep learning-based approaches have also been instrumental in advancing speech synthesis technologies, commonly known as text-to-speech (TTS) systems. By leveraging neural network architectures, such as WaveNet and Tacotron, these systems can generate natural-sounding speech from text input with human-like intonation and prosody. Furthermore, deep learning techniques have facilitated the development of multilingual and speaker-adaptive speech recognition and synthesis systems, enabling broader accessibility and personalized user experiences across diverse linguistic and demographic backgrounds. These advancements have paved the way for the integration of speech-based interfaces into various applications, including smart speakers, navigation systems, and assistive technologies for individuals with disabilities. Despite the remarkable progress achieved with deep learning approaches, challenges such as data scarcity, domain adaptation, and model interpretability remain areas of active research in the field of speech recognition and synthesis. 
Future efforts are focused on addressing these challenges and further improving the accuracy, efficiency, and naturalness of speech-based interactions through continued advancements in deep learning methodologies. Overall, deep learning approaches have significantly advanced speech recognition and synthesis capabilities, enabling more natural and intuitive human-machine interactions across a wide range of applications and domains. By leveraging deep learning techniques, researchers and practitioners continue to push the boundaries of what is possible in the realm of speech processing, opening up new opportunities for innovation and impact in the field of NLP.
- Research Article
- 10.53106/199115992024123506008
- Dec 1, 2024
- Journal of Computers (電腦學刊)
Traditional image fusion algorithms often struggle with slow processing speeds and suboptimal results, particularly when handling non-planar images. In this paper, we present a novel deep learning-based approach for panoramic image fusion. We begin by detailing our dataset construction and preprocessing techniques. To enhance the model's capability with non-planar images, we apply the Thin Plate Spline (TPS) deformation algorithm, allowing effective panoramic fusion across complex image structures. The model architecture is based on a convolutional neural network (CNN) framework, integrated with up- and down-sampling modules to accurately and efficiently capture image features, resulting in higher-quality fusion outcomes. Experimental results demonstrate that this deep learning approach achieves faster fusion speeds and higher quality compared to traditional methods.
- Conference Article
64
- 10.1109/ismar52148.2021.00017
- Oct 1, 2021
Cybersickness prediction is one of the significant research challenges for real-time cybersickness reduction. Researchers have proposed different approaches for predicting cybersickness from bio-physiological data (e.g., heart rate, breathing rate, electroencephalogram). However, collecting bio-physiological data often requires external sensors, limiting locomotion and 3D-object manipulation during the virtual reality (VR) experience. Limited research has been done to predict cybersickness from the data readily available from the integrated sensors in head-mounted displays (HMDs) (e.g., head-tracking, eye-tracking, motion features), allowing free locomotion and 3D-object manipulation. This research proposes a novel deep fusion network to predict cybersickness severity from heterogeneous data readily available from the integrated HMD sensors. We extracted 1755 stereoscopic videos, eye-tracking, and head-tracking data along with the corresponding self-reported cybersickness severity collected from 30 participants during their VR gameplay. We applied several deep fusion approaches with the heterogeneous data collected from the participants. Our results suggest that cybersickness can be predicted with an accuracy of 87.77% and a root-mean-square error of 0.51 when using only eye-tracking and head-tracking data. We concluded that eye-tracking and head-tracking data are well suited for a standalone cybersickness prediction framework.
- Conference Article
19
- 10.23919/icif.2018.8455321
- Jul 1, 2018
US Department of Defense (DoD) big data is extensively multimodal and multiple intelligence (multi-INT), where structured sensor and unstructured audio, video, and textual ISR (Intelligence, Surveillance, and Reconnaissance) data are generated by numerous air, ground, and space borne sensors along with human intelligence. Data fusion at all levels "remains a challenging task." While there are algorithmic stove-piped systems that work well on individual modalities, there is no system to date that is mission- and source-agnostic and can seamlessly integrate and correlate multi-INT data that includes textual, hyperspectral, and video content. The considerable volume and velocity aspects of big data only compound the aforementioned challenges encountered in fusion. We have developed the concept of "deep fusion" based on deep learning models adapted to process multiple modalities of big data. Rather than reducing each modality independently and fusing at a higher-level model (feature-level fusion), the deep fusion approach generates a set of multimodal features, thereby maintaining the core properties of the dissimilar signals and resulting in fused models of higher accuracy. We have initiated two deep fusion experiments: one is to automatically generate the caption of an image to help analysts tag and caption large volumes of images gathered from collection platforms, and the other is an audio-visual speech classification with potential applications to lip-reading and enhanced object tracking. This paper presents the proof-of-concept demonstration for caption generation. The generative model is based on a deep recurrent architecture combined with the pre-trained image-to-vector model Inception V3 via a Convolutional Neural Network (CNN) and the word-to-vectors model word2vec via a skip-gram model. We make use of the Flickr8K dataset extended with some military-specific images to make the demonstration more relevant to the DoD domain.
The detailed results from the image captioning experiment are presented here. The captions generated from test images are subjectively evaluated, and the compared BLEU (bilingual evaluation understudy) scores show substantial improvements.
- Research Article
14
- 10.1093/bioinformatics/btac532
- Jul 22, 2022
- Bioinformatics
5-Methylcytosine (m5C) is a crucial post-transcriptional modification. With the development of technology, it is widely found in various RNAs. Numerous studies have indicated that m5C plays an essential role in various activities of organisms, such as tRNA recognition, stabilization of RNA structure, RNA metabolism, and so on. Traditional identification by wet biological experiments is costly and time-consuming. Therefore, computational models are commonly used to identify m5C sites. Due to the vast computing advantages of deep learning, it is feasible to construct the predictive model through deep learning algorithms. In this study, we construct a model to identify m5C based on a deep fusion approach with an improved residual network. First, sequence features are extracted from the RNA sequences using Kmer, K-tuple nucleotide frequency component (KNFC), Pseudo dinucleotide composition (PseDNC), and Physical and chemical property (PCP). Kmer and KNFC extract information from a statistical point of view. PseDNC and PCP extract information from the physicochemical properties of RNA sequences. Then, the two parts of information are fused into new features using bidirectional long short-term memory and attention mechanisms, respectively. Immediately after, the fused features are fed into the improved residual network for classification. Finally, 10-fold cross-validation and independent set testing are used to verify the credibility of the model. The results show that the accuracy reaches 91.87%, 95.55%, 92.27% and 95.60% on the training sets and independent test sets of Arabidopsis thaliana and M. musculus, respectively. This is a considerable improvement compared to previous studies and demonstrates the robust performance of our model. The data and code related to the study are available at https://github.com/alivelxj/m5c-DFRESG.
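Of the four encodings listed, Kmer is the simplest to make concrete. A minimal sketch of k-mer frequency extraction for an RNA sequence (this is standard practice, not code from the cited repository):

```python
from collections import Counter
from itertools import product

def kmer_features(seq, k=2, alphabet="ACGU"):
    """Normalised k-mer frequency vector for an RNA sequence (the Kmer encoding)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return [counts[km] / total for km in kmers]

# "ACGUAC" contains 5 overlapping dinucleotides, two of which are "AC"
feats = kmer_features("ACGUAC", k=2)   # 16-D vector over all dinucleotides
```

Each sequence thus becomes a fixed-length statistical descriptor, which is what allows sequences of different lengths to be fed into a common downstream network.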