Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks.

Mohammad Alsharid, Lior Drukker, J Alison Noble, Harshita Sharma, Aris T Papageorghiou, Yifan Cai

doi:10.1016/j.media.2022.102630

Abstract

In this work, we present a novel gaze-assisted natural language processing (NLP)-based video captioning model to describe routine second-trimester fetal ultrasound scan videos in a vocabulary of spoken sonography. The primary novelty of our multi-modal approach is that the learned video captioning model is built using a combination of ultrasound video, tracked gaze and textual transcriptions from speech recordings. The textual captions that describe the spatio-temporal scan video content are learnt from sonographer speech recordings. The generation of captions is assisted by sonographer gaze-tracking information, which reflects their visual attention while performing live imaging and interpreting a frozen image. To evaluate the effect of adding, or withholding, different forms of gaze on the video model, we compare spatio-temporal deep networks trained using three multi-modal configurations, namely: (1) a gaze-less neural network with only text and video as input, (2) a neural network additionally using real sonographer gaze in the form of attention maps, and (3) a neural network using automatically predicted gaze in the form of saliency maps instead. We assess algorithm performance through established general text-based metrics (BLEU, ROUGE-L, F1 score), a domain-specific metric (ARS), and metrics that consider the richness and efficiency of the generated captions with respect to the scan video. Results show that the proposed gaze-assisted models can generate richer and more diverse captions for clinical fetal ultrasound scan videos than those without gaze, at the expense of the perceived sentence structure. The results also show that the generated captions are similar to sonographer speech in terms of discussing the visual content and the scanning actions performed.
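As an illustration of one of the general text-based metrics named above, the following is a minimal sketch of a sentence-level ROUGE-L score, computed from the longest common subsequence (LCS) of tokens shared by a generated caption and a reference caption. It is not the authors' evaluation code: the function names, the whitespace tokenisation, the lowercasing, and the choice of `beta = 1.0` are all assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-measure (hypothetical helper).

    Assumes whitespace tokenisation and lowercasing; beta weights
    recall against precision (beta = 1.0 weights them equally).
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    # Hypothetical captions, not taken from the paper's data.
    generated = "this is the fetal head"
    spoken = "here is the fetal head"
    print(round(rouge_l(generated, spoken), 3))
```

Because the score is based on a subsequence rather than contiguous n-grams, it rewards captions that keep the reference word order even when extra words are interleaved.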

Journal: Medical Image Analysis	Publication Date: Nov 1, 2022
Citations: 12	License type: cc-by

Similar Papers

FetalNet: Multi-task Deep Learning Framework for Fetal Ultrasound Biometric Measurements
Szymon Płotka ... Michał Lipa
01 Jan 2020

Identification of abnormal ventricular asymmetry in 2nd trimester fetal ultrasound using artificial intelligence
M Levy ... V Thorey
European Heart Journal | VOL. 44
09 Nov 2023

A Machine Learning Method for Automated Description and Workflow Analysis of First Trimester Ultrasound Scans.
Robail Yasrab ... Harshita Sharma
IEEE Transactions on Medical Imaging | VOL. PP
01 May 2023

Knowledge representation and learning of operator clinical workflow from full-length routine fetal ultrasound scan videos.
Harshita Sharma ... J Alison Noble
Medical Image Analysis | VOL. 69
23 Jan 2021
