Indonesian Dataset Expansion of Microsoft Research Video Description Corpus and Its Similarity Analysis

Faisal Rahutomo, Ahmad Hafidh Ayatullah

doi:10.22219/kinetik.v3i4.680

Abstract

This paper describes the academic basis of an open Indonesian dataset published in Mendeley Data with DOI: 10.17632/d7vx5cc92y.1 [1]. The dataset is an Indonesian-language expansion of the Microsoft Research Video Description Corpus, an open dataset that contains about 120 thousand sentences. The corpus is a useful resource because its sentences form a set of roughly parallel descriptions of more than 2,000 video snippets in 35 languages. Both paraphrase and bilingual relations are available, but Indonesian descriptions are not included in the original corpus. This paper therefore describes the research effort to expand the corpus with Indonesian. The research collected 43,753 description texts for 1,959 short videos, parallel with Microsoft’s dataset. To add further value to the dataset, similarity metrics were calculated over the texts. The metrics were Cosine, Jaccard, Euclidean, and Manhattan, with average results of 0.22, 0.33, 2.38, and 6.08, respectively.
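To make the four metrics concrete, here is a minimal sketch of how they might be computed for one pair of descriptions. The abstract does not say how the texts were tokenized or weighted, so the whitespace tokenizer, the raw term-frequency vectors, the function names, and the two example sentences are all assumptions made for illustration. They are not the authors' pipeline and will not reproduce the reported averages.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def similarity_metrics(text_a, text_b):
    """Compute cosine, Jaccard, Euclidean, and Manhattan values for two texts.

    Cosine, Euclidean, and Manhattan use raw term-frequency vectors over the
    joint vocabulary; Jaccard uses the sets of distinct tokens. The weighting
    scheme is an assumption, not taken from the paper.
    """
    freq_a, freq_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    vocab = sorted(set(freq_a) | set(freq_b))
    vec_a = [freq_a[word] for word in vocab]
    vec_b = [freq_b[word] for word in vocab]

    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Jaccard similarity: shared distinct tokens over all distinct tokens.
    set_a, set_b = set(freq_a), set(freq_b)
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 0.0

    # Euclidean (L2) and Manhattan (L1) distances between the two vectors.
    euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    manhattan = sum(abs(a - b) for a, b in zip(vec_a, vec_b))

    return {
        "cosine": cosine,
        "jaccard": jaccard,
        "euclidean": euclidean,
        "manhattan": manhattan,
    }


if __name__ == "__main__":
    # Two made-up Indonesian descriptions of the same hypothetical clip.
    result = similarity_metrics(
        "seorang pria bermain gitar",
        "seorang pria memainkan gitar",
    )
    for name, value in result.items():
        print(f"{name}: {value:.2f}")
```

In this sketch, cosine and Jaccard are similarities (higher means more alike), while Euclidean and Manhattan are distances (lower means more alike). Averaging each value over all description pairs of a video would give corpus-level figures of the kind the abstract reports, although the pairing scheme used there is not stated.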

Journal: Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control
Publication Date: Oct 15, 2018
Citations: 1
License type: CC BY-NC 4.0
