Cross-domain analysis of discourse markers in European Portuguese

Vera Cabarrão,Ana Isabel Mata,Fernando Batista,Jaime Ferreira,Helena Moniz,Isabel Trancoso

doi:10.5087/dad.2018.103

Abstract

This paper presents an analysis of discourse markers in two spontaneous speech corpora for European Portuguese (university lectures and map-task dialogues) and in a collection of tweets, aiming to contribute to their categorization, which is scarce for European Portuguese. Our results show that the selection of discourse markers is domain and speaker dependent. We also found that the most frequent discourse markers are similar in all three corpora, although the tweets contain discourse markers not found in the other two. In this multidisciplinary study, which combines a linguistic perspective with a computational approach, discourse markers are also automatically discriminated from other structural metadata events, namely sentence-like units (SUs) and disfluencies. Our results show that discourse markers and disfluencies tend to co-occur in the dialogue corpus, but have a complementary distribution in the university lectures. We used three acoustic-prosodic feature sets and machine learning to automatically distinguish between discourse markers, disfluencies and SUs. Our in-domain experiments achieved an accuracy of about 87% in university lectures and 84% in dialogues, in line with our previous results. The eGeMAPS features, commonly used for other paralinguistic tasks, performed well on our data, especially considering the small size of the feature set. Our results suggest that turn-initial discourse markers are usually easier to classify than disfluencies, a result also previously reported in the literature. We conducted a cross-domain evaluation to assess the robustness of the models across domains. The results achieved are about 11%-12% lower, but we conclude that data from one domain can still be used to classify the same events in the other. Overall, despite the complexity of this task, these are very encouraging state-of-the-art results.
Ultimately, using exclusively acoustic-prosodic cues, discourse markers can be discriminated fairly well from disfluencies and SUs. To better understand the contribution of each feature, we have also reported the impact of the features in both the dialogues and the university lectures. Pitch features, namely pitch slopes, are the most relevant ones for the distinction between discourse markers and disfluencies. These features are in line with the wide pitch range of discourse markers, in a continuum from a very compressed pitch range to a very wide one, expressed by totally deaccented material or H+L* L* contours, with upstep H tones.
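To make the notion of a pitch slope concrete, the following is a minimal sketch of one common way to compute it: a least-squares line fitted to the F0 contour of a segment, expressed in semitones per second. The function names, the 100 Hz reference and the convention that unvoiced frames are marked with 0 Hz are assumptions made for this illustration; the abstract does not state how the slopes were extracted.

```python
import math

def hz_to_semitones(f0_hz, ref_hz=100.0):
    # Convert a frequency in Hz to semitones relative to ref_hz (assumed reference).
    return 12.0 * math.log2(f0_hz / ref_hz)

def pitch_slope(times_s, f0_hz, ref_hz=100.0):
    """Least-squares slope of the pitch contour, in semitones per second.

    Frames with f0 <= 0 are treated as unvoiced and skipped (an assumption
    of this sketch, not something stated in the paper).
    """
    points = [(t, hz_to_semitones(f, ref_hz))
              for t, f in zip(times_s, f0_hz) if f > 0]
    if len(points) < 2:
        return 0.0  # not enough voiced frames to fit a line
    mean_t = sum(t for t, _ in points) / len(points)
    mean_p = sum(p for _, p in points) / len(points)
    num = sum((t - mean_t) * (p - mean_p) for t, p in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den if den else 0.0

# Illustrative contours (invented values, not corpus data).
rising = pitch_slope([0.0, 0.1, 0.2], [100.0, 0.0, 200.0])
falling = pitch_slope([0.0, 0.1, 0.2], [200.0, 150.0, 100.0])
flat = pitch_slope([0.0, 0.1, 0.2], [120.0, 120.0, 120.0])
```

A positive value indicates a rising contour and a negative value a falling one; working in semitones rather than Hz makes slopes more comparable across speakers with different pitch registers.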

Highlights

The goals of this paper are twofold: (i) a linguistically oriented goal, to describe the acoustic-prosodic properties of discourse markers in European Portuguese (EP), such as portanto (‘so’), pronto (‘ok’) or bom (‘well’); and (ii) a machine learning goal, to classify and discriminate between metadata events, i.e., discourse markers, disfluencies, and sentence-like units.
Our results suggest that turn-initial discourse markers are usually easier to classify than disfluencies, a result previously reported in the literature.
Our results showed that the selection of discourse markers is domain and speaker dependent.

Summary

Introduction

The goals of this paper are twofold: (i) a linguistically oriented goal, to describe the acoustic-prosodic properties of discourse markers in European Portuguese (EP), such as portanto (‘so’), pronto (‘ok’) or bom (‘well’); and (ii) a machine learning goal, to classify and discriminate between metadata events, i.e., discourse markers, disfluencies (such as lexicalized filled pauses, like aam or mm, deletions, substitutions), and sentence-like units (SUs). Tweets can be seen as a scenario closer to speech, and a worthwhile one for studying the influence of speech on written modalities in general and the frequency and selection of specific discourse markers in particular. To accomplish these goals, we used a data-driven approach to identify the discourse markers present in the three corpora, followed by an automatic classification task, using acoustic-prosodic features, to differentiate discourse markers and disfluencies (Shriberg, 1994; Liu et al., 2006; Ostendorf et al., 2008; Moniz, 2013) from each other and from SUs. The acoustic-prosodic discrimination between structural metadata events still poses several challenges, due mostly to the prosodic distribution of such events and to the corresponding acoustic-prosodic features of such prosodic contexts.
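The difference between in-domain and cross-domain evaluation can be sketched as follows: a model is fitted on one domain and then scored both on held-out data from that domain and on data from the other domain. This is only an illustration under loud assumptions. The nearest-centroid classifier is a stand-in (the text here does not name the learning algorithm), the two features (a pitch slope and a duration) are hypothetical, and every number in the toy data is invented, not taken from the corpora.

```python
from collections import defaultdict

def fit_centroids(samples):
    """Fit a nearest-centroid model. samples: list of (feature_vector, label)."""
    sums, counts = {}, defaultdict(int)
    for x, y in samples:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    # Pick the label whose centroid is closest in squared Euclidean distance.
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def accuracy(centroids, samples):
    hits = sum(predict(centroids, x) == y for x, y in samples)
    return hits / len(samples)

# Invented toy data: (pitch slope in st/s, duration in s), label.
# DM = discourse marker, DISF = disfluency, SU = sentence-like unit.
lectures_train = [
    ((8.0, 0.3), "DM"), ((10.0, 0.4), "DM"),
    ((-1.0, 0.2), "DISF"), ((1.0, 0.3), "DISF"),
    ((-8.0, 1.5), "SU"), ((-10.0, 1.8), "SU"),
]
lectures_test = [((9.0, 0.35), "DM"), ((0.5, 0.2), "DISF"), ((-9.0, 1.6), "SU")]
dialogues_test = [
    ((6.0, 0.3), "DM"), ((4.0, 0.3), "DM"),
    ((1.0, 0.2), "DISF"), ((-7.0, 1.2), "SU"),
]

model = fit_centroids(lectures_train)
in_domain_acc = accuracy(model, lectures_test)      # same domain as training
cross_domain_acc = accuracy(model, dialogues_test)  # model reused on the other domain
```

In this toy setup the cross-domain score is lower because one dialogue discourse marker has a shallower pitch slope than the lecture examples and falls closer to the disfluency centroid; this mirrors the kind of domain shift that cross-domain evaluation is meant to expose, without reproducing the reported figures.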

Related work

Data description and selection of discourse markers

Manual selection of discourse markers

Intra-domain classification

Findings

Conclusions and future work


Journal: Dialogue & Discourse	Publication Date: Jun 8, 2018
Citations: 1	License type: cc-by
