Abstract

In visual communication, the ability of a short piece of text to catch someone’s eye in a single glance or from a distance is of paramount importance. In our approach to the SemEval-2020 task “Emphasis Selection For Written Text in Visual Media”, we use contextualized word representations from a pretrained model of the state-of-the-art BERT architecture together with a stacked bidirectional GRU network to predict token-level emphasis probabilities. To tackle the low inter-annotator agreement in the dataset, we model multiple annotators jointly by introducing initialization with agreement-dependent noise to a crowd layer architecture. We found this approach both to perform substantially better than initialization with identity matrices and to outperform a baseline trained with token-level majority voting. Using a three-model ensemble, our submission system reaches a substantially higher Match_m score than the task baseline on the development set (0.779), but only slightly outperforms the baseline on the test set (0.754).
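The following is a minimal sketch of the described architecture, not the authors’ released code: contextualized BERT representations are fed to a stacked bidirectional GRU whose output is projected to a per-token emphasis probability. The checkpoint name, layer count, and hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class EmphasisSelector(nn.Module):
    """BERT encoder followed by a stacked BiGRU and a sigmoid token head."""

    def __init__(self, num_gru_layers: int = 2, hidden_size: int = 256):
        super().__init__()
        # "bert-base-uncased" is an assumed checkpoint, not confirmed by the paper.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.gru = nn.GRU(
            input_size=self.bert.config.hidden_size,
            hidden_size=hidden_size,
            num_layers=num_gru_layers,  # stacked bidirectional GRU
            bidirectional=True,
            batch_first=True,
        )
        self.head = nn.Linear(2 * hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Contextualized subword representations from the pretrained encoder.
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        out, _ = self.gru(hidden)
        # Sigmoid maps each token to an emphasis probability in [0, 1].
        return torch.sigmoid(self.head(out)).squeeze(-1)
```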

Highlights

  • Emphasis selection is the task of choosing individual words or phrases of short written texts to emphasize

  • While emphasis selection has been used in previous work for more natural articulation in text-to-speech systems (Mass et al., 2018), in the SemEval shared task emphasis annotations are intended for use in visual communication such as in posters or advertisements (Shirani et al., 2020)

  • The model using our version of the Crowd Layer with agreement-dependent initialization (σ = 0.77) and without the attention mechanism substantially outperformed both the task baseline and our majority voting model on the development set, by a margin of ∼0.01 (a sketch of this layer follows below)
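As a rough illustration of the idea, here is a hedged sketch of a Crowd Layer (Rodrigues and Pereira, 2018) with agreement-dependent initialization: one linear map per annotator is applied to the shared per-token logits, and each annotator matrix is initialized as an identity matrix plus Gaussian noise of scale σ. Only σ = 0.77 comes from the highlight above; the number of classes and all other details are assumptions.

```python
import torch
import torch.nn as nn

class CrowdLayer(nn.Module):
    """Per-annotator linear maps over shared logits, noisily initialized."""

    def __init__(self, num_annotators: int = 9, num_classes: int = 3,
                 sigma: float = 0.77):
        super().__init__()
        # Identity initialization is the standard Crowd Layer choice;
        # adding Gaussian noise of scale sigma is the variant described above.
        eye = torch.eye(num_classes).expand(num_annotators, -1, -1)
        self.weights = nn.Parameter(
            eye + sigma * torch.randn(num_annotators, num_classes, num_classes)
        )

    def forward(self, shared_logits):
        # shared_logits: (batch, tokens, classes)
        # returns per-annotator logits: (annotators, batch, tokens, classes)
        return torch.einsum("btc,acd->abtd", shared_logits, self.weights)
```

During training, each annotator head is compared against that annotator’s own labels, so the shared network learns a consensus representation while the per-annotator matrices absorb individual biases.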

Summary

Introduction

Emphasis selection is the task of choosing individual words or phrases of short written texts to emphasize. Both datasets consist of tokenized sentences with annotations by nine different annotators for each token, using an inside–outside–beginning (IOB) tagging scheme. The first dataset, from Adobe Spark, contains 960 short text instances from flyers, posters, advertisements, or motivational memes on social media. It comprises 5,940 tokens, with an average of 6.18 tokens per instance, ranging from 1 to 27 tokens. The kappa agreement between the four professional labelers in the speech emphasis dataset used by Mass et al. (2018) for an expressive text-to-speech system is only slightly higher, at 0.35. This suggests that emphasis annotations are highly subjective and that low agreement is a common issue across domains, which is important to tackle in order to achieve good system performance.
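To make the annotation format concrete, the following toy example (an assumed data layout, not from the task code) converts nine per-token IOB annotations into token-level emphasis probabilities and into the majority-voting labels used by the baseline mentioned in the abstract.

```python
def emphasis_probs(iob_annotations):
    """iob_annotations: one list of IOB tags per annotator, all equal length."""
    n = len(iob_annotations)
    probs = []
    for tags in zip(*iob_annotations):  # iterate over token positions
        emphasized = sum(tag in ("B", "I") for tag in tags)
        probs.append(emphasized / n)
    return probs

# Toy instance with 3 tokens and 9 annotators (invented data).
annotations = [
    ["B", "I", "O"],  # annotator 1 emphasizes tokens 1-2
    ["B", "I", "O"],  # annotator 2 agrees
    ["O", "B", "O"],  # annotator 3 emphasizes token 2 only
] + [["O", "B", "O"]] * 6  # six more annotators like annotator 3

probs = emphasis_probs(annotations)   # [0.222..., 1.0, 0.0]
majority = [p > 0.5 for p in probs]   # token-level majority vote: [False, True, False]
```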

System
Crowd Layer
Evaluation
Results
Conclusion