MVIB-Lip: Multi-View Information Bottleneck for Visual Speech Recognition via Time Series Modeling
Lipreading, or visual speech recognition, is the task of interpreting utterances solely from visual cues of lip movements. While early approaches relied on Hidden Markov Models (HMMs) and handcrafted spatiotemporal descriptors, recent advances in deep learning have enabled end-to-end recognition on large-scale datasets. However, such methods often require millions of labeled or pretraining samples and struggle to generalize under low-resource or speaker-independent conditions. In this work, we revisit lipreading from a multi-view learning perspective. We introduce MVIB-Lip, a framework that integrates two complementary representations of lip movements: (i) raw landmark trajectories modeled as multivariate time series, and (ii) recurrence plot (RP) images that encode structural dynamics as textures. A Transformer encoder processes the temporal sequences, while a ResNet-18 extracts features from the RPs; the two views are fused via a product-of-experts posterior regularized by the multi-view information bottleneck. Experiments on the OuluVS benchmark and a self-collected dataset demonstrate that MVIB-Lip consistently outperforms handcrafted baselines and improves generalization to speaker-independent recognition. Our results suggest that recurrence plots, when coupled with deep multi-view learning, offer a principled and data-efficient path toward robust visual speech recognition.
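To make the described architecture concrete, below is a minimal PyTorch sketch of the two-view pipeline: a Transformer branch over landmark trajectories, a ResNet-18 branch over recurrence-plot images, and a product-of-experts fusion of the two Gaussian posteriors. All layer sizes, module names, and the unthresholded recurrence plot are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a two-view MVIB-style lipreading model.
# Sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn
import torchvision


def recurrence_plot(traj, eps=None):
    """traj: (T, d) landmark trajectory -> (T, T) recurrence image.
    Unthresholded pairwise distances by default; eps yields a binary plot."""
    dist = torch.cdist(traj, traj)  # ||x_i - x_j|| for all frame pairs
    return (dist < eps).float() if eps is not None else dist


class GaussianHead(nn.Module):
    """Maps view features to mean/log-variance of a diagonal Gaussian q(z|view)."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, h):
        return self.mu(h), self.logvar(h)


def product_of_experts(mus, logvars):
    """Precision-weighted fusion of diagonal Gaussian experts,
    with a standard-normal prior expert included."""
    mus = torch.stack([torch.zeros_like(mus[0])] + mus)
    logvars = torch.stack([torch.zeros_like(logvars[0])] + logvars)
    precision = torch.exp(-logvars)
    mu = (mus * precision).sum(0) / precision.sum(0)
    var = 1.0 / precision.sum(0)
    return mu, var.log()


class MVIBLip(nn.Module):
    def __init__(self, n_landmarks=40, z_dim=128, n_classes=10):
        super().__init__()
        d_model = 2 * n_landmarks  # (x, y) coordinates per landmark
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.seq_enc = nn.TransformerEncoder(layer, num_layers=4)
        self.seq_head = GaussianHead(d_model, z_dim)
        self.rp_enc = torchvision.models.resnet18(num_classes=512)
        self.rp_head = GaussianHead(512, z_dim)
        self.classifier = nn.Linear(z_dim, n_classes)

    def forward(self, traj, rp_img):
        # traj: (B, T, 2*n_landmarks); rp_img: (B, 3, T, T), the single-channel
        # recurrence plot repeated across three channels for ResNet-18.
        mu_s, lv_s = self.seq_head(self.seq_enc(traj).mean(dim=1))
        mu_r, lv_r = self.rp_head(self.rp_enc(rp_img))
        mu, logvar = product_of_experts([mu_s, mu_r], [lv_s, lv_r])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.classifier(z), mu, logvar
```

Training such a model would presumably combine a cross-entropy term on the logits with the information-bottleneck regularizer, i.e., a KL term pulling the fused posterior toward the prior and, in the multi-view IB formulation, a term aligning the two view posteriors; the loss weights are a tuning choice not specified here.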