MVIB-Lip: Multi-View Information Bottleneck for Visual Speech Recognition via Time Series Modeling

Abstract

Lipreading, or visual speech recognition, is the task of interpreting utterances solely from visual cues of lip movements. While early approaches relied on Hidden Markov Models (HMMs) and handcrafted spatiotemporal descriptors, recent advances in deep learning have enabled end-to-end recognition using large-scale datasets. However, such methods often require millions of labeled or pretraining samples and struggle to generalize under low-resource or speaker-independent conditions. In this work, we revisit lipreading from a multi-view learning perspective. We introduce MVIB-Lip, a framework that integrates two complementary representations of lip movements: (i) raw landmark trajectories modeled as multivariate time series, and (ii) recurrence plot (RP) images that encode structural dynamics in textural form. A Transformer encoder processes the temporal sequences, while a ResNet-18 extracts features from RPs; the two views are fused via a product-of-experts posterior regularized by the multi-view information bottleneck. Experiments on the OuluVS dataset and a self-collected dataset demonstrate that MVIB-Lip consistently outperforms handcrafted baselines and improves generalization to speaker-independent recognition. Our results suggest that recurrence plots, when coupled with deep multi-view learning, offer a principled and data-efficient path forward for robust visual speech recognition.
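
The recurrence-plot view described in the abstract can be sketched in a few lines. The following is a minimal illustration of how a landmark trajectory might be turned into an RP image; the thresholding heuristic and function name are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def recurrence_plot(x, eps=None):
    """Binary recurrence plot of a (T, d) multivariate time series.

    R[i, j] = 1 if ||x_i - x_j|| <= eps, else 0. When eps is None,
    a data-driven threshold (10% of the maximum pairwise distance)
    is used, a common heuristic.
    """
    x = np.asarray(x, dtype=float)
    # Pairwise Euclidean distances between all time points.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    if eps is None:
        eps = 0.1 * dist.max()
    return (dist <= eps).astype(np.uint8)

# Toy example: 100 frames of a 2-D (quasi-periodic) landmark trajectory.
t = np.linspace(0, 4 * np.pi, 100)
traj = np.stack([np.cos(t), np.sin(t)], axis=1)
rp = recurrence_plot(traj)
print(rp.shape)  # (100, 100)
```

The resulting T×T binary image could then be fed to an image backbone such as the ResNet-18 mentioned in the abstract.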

Similar Papers
  • Research Article
  • Citations: 222
  • 10.1016/j.neucom.2021.03.090
Deep multi-view learning methods: A review
  • Mar 30, 2021
  • Neurocomputing
  • Xiaoqiang Yan + 4 more

  • Research Article
  • Citations: 33
  • 10.1609/aaai.v36i7.20724
Trusted Multi-View Deep Learning with Opinion Aggregation
  • Jun 28, 2022
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Wei Liu + 3 more

Multi-view deep learning performs deep fusion of data from multiple sources, i.e., data with multiple views. However, owing to property differences and inconsistencies among the data sources, learning results based on the fusion of multi-view data may be uncertain and unreliable. It is therefore necessary to reduce the uncertainty in data fusion and implement trusted multi-view deep learning. To address this problem, we revisit multi-view learning from the perspective of opinion aggregation and devise a trusted multi-view deep learning method. Within this method, we adopt evidence theory to formulate the uncertainty of opinions (the learning results from individual data sources) and measure the uncertainty of opinion aggregation (the multi-view learning result) through evidence accumulation. We prove that accumulating evidence from multiple data views decreases the uncertainty in multi-view deep learning and helps achieve trusted learning results. Experiments on various kinds of multi-view datasets verify the reliability and robustness of the proposed method.

  • Conference Article
  • Citations: 63
  • 10.1109/fg47880.2020.00134
Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition
  • Nov 1, 2020
  • Yuanhang Zhang + 4 more

Recent advances in deep learning have heightened interest among researchers in the field of visual speech recognition (VSR). Currently, most existing methods equate VSR with automatic lip reading, which attempts to recognise speech by analysing lip motion. However, human experience and psychological studies suggest that we do not always fix our gaze at each other’s lips during a face-to-face conversation, but rather scan the whole face repetitively. This inspires us to revisit a fundamental yet somewhat overlooked problem: can VSR models benefit from reading extraoral facial regions, i.e. beyond the lips? In this paper, we perform a comprehensive study evaluating the effects of different facial regions with state-of-the-art VSR models, including the mouth, the whole face, the upper face, and even the cheeks. Experiments are conducted on both word-level and sentence-level benchmarks with different characteristics. We find that despite the complex variations of the data, incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance. Furthermore, we introduce a simple yet effective method based on Cutout to learn more discriminative features for face-based VSR, hoping to maximise the utility of information encoded in different facial regions. Our experiments show clear improvements over existing state-of-the-art methods that use only the lip region as input, a result we believe can provide the VSR community with new and exciting insights.

  • Research Article
  • Citations: 6
  • 10.3390/e26030235
Deep Learning for 3D Reconstruction, Augmentation, and Registration: A Review Paper.
  • Mar 7, 2024
  • Entropy (Basel, Switzerland)
  • Prasoon Kumar Vinodkumar + 4 more

The research groups in computer vision, graphics, and machine learning have dedicated a substantial amount of attention to the areas of 3D object reconstruction, augmentation, and registration. Deep learning is the predominant method used in artificial intelligence for addressing computer vision challenges. However, deep learning on three-dimensional data presents distinct obstacles and is now in its nascent phase. There have been significant advancements in deep learning specifically for three-dimensional data, offering a range of ways to address these issues. This study offers a comprehensive examination of the latest advancements in deep learning methodologies. We examine many benchmark models for the tasks of 3D object registration, augmentation, and reconstruction. We thoroughly analyse their architectures, advantages, and constraints. In summary, this report provides a comprehensive overview of recent advancements in three-dimensional deep learning and highlights unresolved research areas that will need to be addressed in the future.

  • Research Article
  • Citations: 47
  • 10.1016/j.inffus.2023.102217
A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges
  • Dec 30, 2023
  • Information Fusion
  • Khaled Bayoudh

  • Conference Article
  • Citations: 12
  • 10.1109/isspit.2005.1577194
Hyper column model vs. fast DCT for feature extraction in visual Arabic speech recognition
  • Dec 1, 2005
  • A Sagheer + 3 more

Recently, the multimedia signal processing community has shown increasing interest in the visual speech recognition domain. In this paper we present a novel visual speech recognition approach based on our hyper column model (HCM), which is used for feature extraction. The extracted features are modeled by Gaussian distributions using a hidden Markov model (HMM). The proposed system, combining HCM and HMM, can be used for any visual recognition task; here we use it to build a complete lip-reading system and evaluate its performance on an Arabic database. To the best of our knowledge, this is the first time visual speech recognition has been applied to the Arabic language. For a fair evaluation, we compare our accuracy results with those of the fast discrete cosine transform (FDCT) approach in a separate experiment, using the same data set and conditions as the HCM experiment. The comparison shows that HCM achieves higher recognition accuracy than FDCT for Arabic sentences and words. Moreover, HCM is not only more accurate but also capable of shift-invariant recognition, whereas FDCT is not.

  • Research Article
  • Citations: 2
  • 10.1016/j.patrec.2024.09.002
Visual speech recognition using compact hypercomplex neural networks
  • Sep 3, 2024
  • Pattern Recognition Letters
  • Iason Ioannis Panagos + 2 more

  • Research Article
  • Citations: 3
  • 10.32604/csse.2023.037113
Visual Lip-Reading for Quranic Arabic Alphabets and Words Using Deep Learning
  • Jan 1, 2023
  • Computer Systems Science and Engineering
  • Nada Faisal Aljohani + 1 more

The continuing advances in deep learning have paved the way for several challenging ideas. One such idea is visual lip-reading, which has recently drawn much research interest. Lip-reading, often referred to as visual speech recognition, is the ability to understand and predict spoken speech based solely on lip movements, without using sound. Due to the lack of research on visual speech recognition for the Arabic language in general, and its absence in Quranic research, this work aims to fill that gap. This paper introduces a new publicly available Arabic lip-reading dataset containing 10,490 videos captured from multiple viewpoints, comprising data samples at the letter level (single letters and Quranic disjoined letters) and at the word level, based on the content and context of the book Al-Qaida Al-Noorania. This research uses visual speech recognition to recognize spoken Arabic letters, Quranic disjoined letters, and Quranic words, mainly phonetically as they are recited in the Holy Quran according to the Quranic study aid Al-Qaida Al-Noorania. This study could further validate the correctness of pronunciation and, subsequently, assist people in correctly reciting the Quran. Furthermore, a detailed description of the created dataset and its construction methodology is provided. The new dataset is used to train an effective pre-trained deep learning CNN model via transfer learning for lip-reading, achieving accuracies of 83.3%, 80.5%, and 77.5% on words, disjoined letters, and single letters, respectively; an extended analysis of the results is provided. Finally, the experimental outcomes, different research aspects, and dataset collection consistency and challenges are discussed, concluding with several promising directions for future work.

  • Conference Article
  • Citations: 335
  • 10.17863/cam.11070
Deep Bayesian active learning with image data
  • Nov 27, 2017
  • Yarin Gal + 2 more

Even though active learning forms an important pillar of machine learning, deep learning tools are not prevalent within it. Deep learning poses several difficulties when used in an active learning setting. First, active learning (AL) methods generally rely on being able to learn and update models from small amounts of data. Recent advances in deep learning, on the other hand, are notorious for their dependence on large amounts of data. Second, many AL acquisition functions rely on model uncertainty, yet deep learning methods rarely represent such model uncertainty. In this paper we combine recent advances in Bayesian deep learning into the active learning framework in a practical way. We develop an active learning framework for high dimensional data, a task which has been extremely challenging so far, with very sparse existing literature. Taking advantage of specialised models such as Bayesian convolutional neural networks, we demonstrate our active learning techniques with image data, obtaining a significant improvement on existing active learning approaches. We demonstrate this on both the MNIST dataset, as well as for skin cancer diagnosis from lesion images (ISIC2016 task).
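
The uncertainty-based acquisition step this abstract describes can be sketched compactly. Below is a minimal illustration of the BALD criterion computed from Monte Carlo dropout samples, one of the Bayesian acquisition functions used in this line of work; the array shapes and helper name are assumptions for illustration, not the paper's code.

```python
import numpy as np

def bald_scores(mc_probs):
    """BALD acquisition from Monte Carlo dropout samples.

    mc_probs: (S, N, C) array of class probabilities from S stochastic
    forward passes over N pool points. Returns, per point, the mutual
    information between the prediction and the model parameters:
    I = H[mean prediction] - E[H[per-sample prediction]].
    """
    eps = 1e-12
    mean = mc_probs.mean(axis=0)                                    # (N, C)
    h_mean = -(mean * np.log(mean + eps)).sum(axis=-1)              # predictive entropy
    h_samples = -(mc_probs * np.log(mc_probs + eps)).sum(axis=-1).mean(axis=0)
    return h_mean - h_samples
```

Points where the stochastic passes disagree score high and would be sent for labeling; points where all passes agree score near zero.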

  • Research Article
  • Citations: 88
  • 10.1145/3377876
Exploring Deep Learning for View-Based 3D Model Retrieval
  • Feb 17, 2020
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Zan Gao + 2 more

In recent years, view-based 3D model retrieval has become one of the research focuses in the field of computer vision and machine learning. In fact, a 3D model retrieval algorithm consists of feature extraction and similarity measurement, and robust features play a decisive role in the similarity measurement. Although deep learning has achieved comprehensive success in the field of computer vision, deep learning features are used for 3D model retrieval in only a small number of works. To the best of our knowledge, there is no benchmark to evaluate these deep learning features. To tackle this problem, in this work we systematically evaluate the performance of deep learning features in view-based 3D model retrieval on four popular datasets (ETH, NTU60, PSB, and MVRED) with different kinds of similarity measures. In detail, the performance of hand-crafted features and deep learning features is compared, and the robustness of deep learning features is assessed. Finally, the difference between single-view and multi-view deep learning features is also evaluated. Quantitative analysis across the datasets shows that deep learning features consistently outperform all of the hand-crafted features, and they are also more robust than the hand-crafted features when different degrees of noise are added to the image. The exploration of latent relationships among different views in multi-view deep learning network architectures shows that multi-view deep learning features outperform single-view deep learning features with low computational complexity.

  • Conference Article
  • Citations: 3
  • 10.1109/icitri56423.2022.9970205
Comparison of Single-View and Multi-View Deep Learning for Android Malware Detection
  • Nov 10, 2022
  • Fika Dwi Rahmawati + 2 more

Android is the most rapidly developing smartphone operating system because its open-source nature makes it easier for developers to modify Android applications and functionality. This openness has also contributed to a rise in the development of malicious software, known as malware. Malware typically infects applications to damage the system and steal data, leading to substantial losses for Android users. Therefore, it is essential to take preventative steps to detect malware, and deep learning is one such method. This study compares single-view and multi-view deep learning architectures for identifying Android malware using system calls and permissions. The malware analysis method employed is a hybrid method that combines static and dynamic analysis: Genymotion is used to collect system-call features, whereas Androguard is used to extract permissions. The deep learning base model is created using two distinct architectures: an LSTM (long short-term memory) network for processing system calls and an MLP (multi-layer perceptron) for processing permissions. In the single-view architecture, each feature is handled by a separate model; in the multi-view architecture, the features are processed by a combined model built with the concatenate function. According to the evaluation, the multi-view model using the Adam optimizer and a learning rate of 0.005 achieves an accuracy of 83% and an F1 score of 81%, a 2% gain in accuracy over the single-view model with the same hyperparameters.
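
The late-fusion design this abstract describes (one encoder per view, features concatenated before classification) can be sketched in a few lines. In the paper the encoders are an LSTM over system-call sequences and an MLP over permission vectors; here each is stubbed as a random projection purely for illustration, and all shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in per-view encoders: random linear maps into a shared 16-d space.
W_calls = rng.normal(size=(32, 16))   # 32-d system-call features -> 16-d
W_perms = rng.normal(size=(20, 16))   # 20-d permission vector    -> 16-d

def encode(view, W):
    # A one-layer nonlinear projection standing in for the view encoder.
    return np.tanh(view @ W)

calls = rng.normal(size=(4, 32))      # batch of 4 samples, system-call view
perms = rng.normal(size=(4, 20))      # same 4 samples, permission view

# Multi-view fusion by concatenation; a classifier head would follow.
fused = np.concatenate([encode(calls, W_calls), encode(perms, W_perms)], axis=1)
print(fused.shape)  # (4, 32)
```

The single-view baseline would instead train a separate classifier on each `encode(...)` output alone.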

  • Research Article
  • Citations: 9
  • 10.3390/s22093597
End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC
  • May 9, 2022
  • Sensors
  • Sanghun Jeon + 1 more

Concomitant with recent advances in deep learning, automatic speech recognition and visual speech recognition (VSR) have received considerable attention. However, although VSR systems must identify speech from both frontal and profile faces in real-world scenarios, most VSR studies have focused solely on frontal face images. To address this issue, we propose an end-to-end sentence-level multi-view VSR architecture for faces captured from four different perspectives (frontal, 30°, 45°, and 60°). The encoder uses multiple convolutional neural networks with a spatial attention module to detect minor changes in the mouth patterns of similarly pronounced words, and the decoder uses cascaded local self-attention connectionist temporal classification to capture local contextual information in the immediate vicinity, resulting in a substantial performance boost and speedy convergence. In experiments on the OuluVS2 dataset, divided into the four perspectives, the performance improvements over the existing state of the art were 3.31% (0°), 4.79% (30°), 5.51% (45°), and 6.18% (60°), with a mean of 4.95%, and the average performance improved by 9.1% over the baseline. Thus, the suggested design enhances the performance of multi-view VSR and boosts its usefulness in real-world applications.

  • Research Article
  • Citations: 1
  • 10.1109/embc48229.2022.9871492
Deep Metric Representation Learning for Clinical Resting State fMRI.
  • Jul 11, 2022
  • Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
  • Arunesh Mittal + 2 more

With growing size of resting state fMRI datasets and advances in deep learning methods, there are ever increasing opportunities to leverage progress in deep learning to solve challenging tasks in neuroimaging. In this work, we build upon recent advances in deep metric learning, to learn embeddings of rs-fMRI data, which can then be potentially used for several downstream tasks. We propose an efficient training method for our model and compare our method with other widely used models. Our experimental results indicate that deep metric learning can be used as an additional refinement step to learn representations of fMRI data, that significantly improves performance on downstream modeling tasks.

  • Research Article
  • Citations: 33
  • 10.1177/1748006x21994446
Temporal signals to images: Monitoring the condition of industrial assets with deep learning image processing algorithms
  • Feb 21, 2021
  • Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability
  • Gabriel Rodriguez Garcia + 4 more

The ability to detect anomalies in time series is considered highly valuable in numerous application domains. The sequential nature of time series objects is responsible for an additional feature complexity, ultimately requiring specialized approaches in order to solve the task. Essential characteristics of time series, situated outside the time domain, are often difficult to capture with state-of-the-art anomaly detection methods when no transformations have been applied to the time series. Inspired by the success of deep learning methods in computer vision, several studies have proposed transforming time series into image-like representations, used as inputs for deep learning models, and have led to very promising results in classification tasks. In this paper, we first review the signal to image encoding approaches found in the literature. Second, we propose modifications to some of their original formulations to make them more robust to the variability in large datasets. Third, we compare them on the basis of a common unsupervised task to demonstrate how the choice of the encoding can impact the results when used in the same deep learning architecture. We thus provide a comparison between six encoding algorithms with and without the proposed modifications. The selected encoding methods are Gramian Angular Field, Markov Transition Field, recurrence plot, grey scale encoding, spectrogram, and scalogram. We also compare the results achieved with the raw signal used as input for another deep learning model. We demonstrate that some encodings have a competitive advantage and might be worth considering within a deep learning framework. The comparison is performed on a dataset collected and released by Airbus SAS, containing highly complex vibration measurements from real helicopter flight tests. The different encodings provide competitive results for anomaly detection.
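
Among the signal-to-image encodings compared above, the Gramian Angular Field admits a particularly compact sketch. The following is a minimal numpy version of the summation variant; the helper name and min-max rescaling heuristic are illustrative assumptions, not the paper's code.

```python
import numpy as np

def gramian_angular_field(x):
    """Gramian Angular Summation Field of a 1-D signal.

    The series is rescaled to [-1, 1], mapped to polar angles
    phi = arccos(x), and the image is G[i, j] = cos(phi_i + phi_j).
    """
    x = np.asarray(x, dtype=float)
    # Min-max rescale to [-1, 1]; guard against a constant signal.
    rng = x.max() - x.min()
    x = 2 * (x - x.min()) / rng - 1 if rng > 0 else np.zeros_like(x)
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    return np.cos(phi[:, None] + phi[None, :])

sig = np.sin(np.linspace(0, 2 * np.pi, 64))
gaf = gramian_angular_field(sig)
print(gaf.shape)  # (64, 64)
```

The resulting T×T image can be fed to the same convolutional architecture as the other encodings (recurrence plot, Markov transition field, spectrogram, etc.), which is what makes the comparison in the paper possible.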

  • Research Article
  • Citations: 11
  • 10.1016/j.asoc.2024.111906
AI-based visual speech recognition towards realistic avatars and lip-reading applications in the metaverse
  • Jun 28, 2024
  • Applied Soft Computing
  • Ying Li + 5 more
