Abstract
This work describes the discourse markers present in two European Portuguese corpora from different domains (university lectures and map-task dialogues). We also perform a multiclass automatic classification task based on prosodic features to determine, in both corpora, which words are discourse markers, which are disfluencies, and which are sentence-like units (SUs). Results show that the selection of discourse markers varies across domains and between speakers. In the classification task, discourse markers are classified better in the lecture corpus (87%) than in the dialogue corpus (84%). However, cross‑domain experiments showed that models trained on the dialogue corpus better predict the events in the lecture corpus, since this domain includes more speakers and therefore more complex patterns. In both corpora, markers are more easily classified as SUs than as disfluencies.
Highlights
This work describes the discourse markers present in two corpora for European Portuguese, in different domains
In automatic speech processing, punctuation marks (which delimit sentence-like units, SUs), disfluencies, and discourse markers belong to a set of events known in English as structural metadata events
The aim is to automatically recover punctuation and capitalization at sentence boundaries, and to annotate and filter disfluencies and discourse markers
Summary
Vera Cabarrão[1, 2], Helena Moniz[1, 2], Jaime Ferreira[1], Fernando Batista[1, 3], Isabel Trancoso[1, 4], Ana Isabel Mata[2], Sérgio Curto[1]