Abstract
This paper reports on an experiment implementing a data-intensive approach to discourse organisation. Its focus is on enumerative structures envisaged as a type of textual pattern in a sequentiality-oriented approach to discourse. On the basis of a large-scale annotation exercise calling upon automatic feature mark-up alongside manual annotation, we explore a method to identify complex discourse markers seen as configurations of cues. The presentation of the background to what is termed "multi-level annotation" is organised around four issues: linearity; the complexity of discourse markers; top-down processing; and granularity and the multi-level nature of discourse structures. In this context, enumerative structures seem to deserve scrutiny for a number of reasons: they are frequent structures appearing at different granularity levels, they are signalled by a variety of devices appearing to work together in complex ways, and they combine a textual role (discourse organisation) with an ideational role (categorisation). We describe the annotation procedure and experimental framework which resulted in nearly 1,000 enumerative structures being annotated in a diversified corpus of over 600,000 words. The results of two approaches to the rich data produced are then presented: firstly, a descriptive survey highlights considerable variation in length and composition, while showing enumerative structure to be a basic strategy resorted to in all three sub-corpora, and leads to a granularity-based typology of the annotated structures; secondly, recurrent cue configurations (our "complex markers") are identified by the application of data mining methods. The paper ends with perspectives for further exploitation of the data, in particular with respect to the semantic characterisation of enumerative structures.
Highlights
Texts can be seen as the result of squeezing complex hierarchical structures into a largely linear format
While in terms of methodology the study belongs in corpus linguistics and natural language processing, its theoretical foundations are to be found in functional linguistics, in psycholinguistics and in research on the visual dimension of texts
After Luc and Virbel (2001), we describe enumerative structures as textual objects resulting from a textual act whereby text is arranged so that the reader becomes aware of this textual arrangement
Summary
Texts can be seen as the result of squeezing complex hierarchical structures into a largely linear format. We chose to start from what may be seen as the most basic among the notions called upon to account for text/discourse organisation: linearisation, continuity vs discontinuity (the fundamental question behind discourse segmentation), and discourse patterns. The arguments for this “back to basics” approach are given, organised around four issues: the linearity constraint; the non-discrete nature of discourse markers; the importance of top-down processing; and granularity and the multi-level nature of discourse structures. These constitute the foundation for the choice of enumerative structures for annotation, the rationale for which is given, followed in Section 4 by the annotation model and method, from corpus preparation procedures to the manual annotation of structures and cues.