Inter-Sentence Segmentation of YouTube Subtitles Using Long-Short Term Memory (LSTM)

Hye-Jeong Song, Chan-Young Park, Yu-Seop Kim, Hong-Ki Kim, Jong-Dae Kim

doi:10.3390/app9071504

Open Access

Journal: Applied Sciences
Publication Date: Apr 11, 2019
Citations: 9
License type: CC BY 4.0

Affiliation: Hallym University

Abstract

Recently, with the development of Speech to Text, which converts voice to text, and machine translation, technologies for simultaneously translating the captions of video into other languages have been developed. Using this, YouTube, a video-sharing site, provides captions in many languages. Currently, the automatic caption system extracts voice data when uploading a video and provides a subtitle file converted into text. This method creates subtitles suitable for the running time. However, when extracting subtitles from video using Speech to Text, it is impossible to accurately translate the sentence because all sentences are generated without periods. Since the generated subtitles are separated by time units rather than sentence units, and are translated, it is very difficult to understand the translation result as a whole. In this paper, we propose a method to divide text into sentences and generate period marks to improve the accuracy of automatic translation of English subtitles. For this study, we use the 27,826 sentence subtitles provided by Stanford University’s courses as data. Since this lecture video provides complete sentence caption data, it can be used as training data by transforming the subtitles into general YouTube-like caption data. We build a model with the training data using the LSTM-RNN (Long-Short Term Memory – Recurrent Neural Networks) and predict the position of the period mark, resulting in prediction accuracy of 70.84%. Our research will provide people with more accurate translations of subtitles. In addition, we expect that language barriers in online education will be more easily broken by achieving more accurate translations of numerous video lectures in English.
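The data preparation step described in the abstract (turning complete-sentence captions into YouTube-like text without periods, then predicting where the periods belong) can be illustrated with a minimal Python sketch. This is not the authors' preprocessing code: the function names `make_training_pairs` and `restore_periods`, the lowercase tokenization rule, and the 0/1 labeling scheme are all assumptions made for illustration. In the paper, the labels would be predicted by the LSTM-RNN model; here they are simply passed back in to show the round trip.

```python
import re


def make_training_pairs(sentences):
    """Turn punctuated sentences into (token, label) pairs.

    Assumed labeling scheme: label 1 means a period follows the token,
    label 0 means it does not. Concatenating the tokens gives an
    unpunctuated stream similar to automatic captions.
    """
    pairs = []
    for sentence in sentences:
        # Lowercase and drop punctuation, keeping apostrophes inside words.
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        for i, token in enumerate(tokens):
            is_last = i == len(tokens) - 1
            pairs.append((token, 1 if is_last else 0))
    return pairs


def restore_periods(tokens, labels):
    """Rebuild sentence-segmented text from tokens and 0/1 period labels."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            sentences.append(" ".join(current).capitalize() + ".")
            current = []
    if current:
        # Trailing words with no predicted period are kept unterminated.
        sentences.append(" ".join(current).capitalize())
    return " ".join(sentences)


if __name__ == "__main__":
    pairs = make_training_pairs(["This is a test.", "It works well."])
    tokens = [t for t, _ in pairs]
    labels = [l for _, l in pairs]
    print(" ".join(tokens))
    print(restore_periods(tokens, labels))
```

Framing the task as one binary label per word keeps the training target simple: a sequence model only has to decide, for each word, whether a sentence ends there.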

Highlights

Speech to Text (STT) [1,2] is a process in which a computer interprets a person’s speech and converts the contents into text
Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another
Neural Machine Translation (NMT) [9] has dramatically improved MT performance, and there are a lot of translation apps, such as iTranslate and Google Translate, competing in the market

Summary

Introduction

Speech to Text (STT) [1,2] is a process in which a computer interprets a person’s speech and converts the contents into text. One widely used approach is the HMM (Hidden Markov Model) [3], which constructs an acoustic model by statistically modeling voices spoken by various speakers [4] and constructs a language model using a corpus [5]. Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another. MT has been approached by rules [6], examples [7], and statistics [8]. Neural Machine Translation (NMT) [9] has dramatically improved MT performance, and many translation apps, such as iTranslate (https://www.itranslate.com/) and Google Translate (https://translate.google.com/), now compete in the market.

Methods

Results

Conclusion