A corpus-driven approach to discourse organisation: from cues to complex markers

This paper reports on an experiment implementing a data-intensive approach to discourse organisation. Its focus is on enumerative structures envisaged as a type of textual pattern in a sequentiality-oriented approach to discourse. On the basis of a large-scale annotation exercise calling upon automatic feature mark-up alongside manual annotation, we explore a method to identify complex discourse markers seen as configurations of cues. The presentation of the background to what is termed “multi-level annotation” is organised around four issues: linearity, complexity of discourse markers, top-down processing, and granularity and the multi-level nature of discourse structures. In this context, enumerative structures seem to deserve scrutiny for a number of reasons: they are frequent structures appearing at different granularity levels, they are signalled by a variety of devices appearing to work together in complex ways.


Introduction
Texts can be seen as the result of squeezing complex hierarchical structures into a largely linear format. Understanding a text entails constructing a representation of the underlying structures. A major challenge in the study of written discourse is to identify the signals which guide readers in the process of constructing this representation. Depending on one's theoretical underpinning and focus, signals may be seen as discourse construction devices, as metadiscourse, as reading or processing instructions, as traces of the writer's cognitive processes, or as cues revealing the author's intentions. The study presented in this paper sets up a data-intensive methodology whereby signals "emerge" from the systematic analysis of a large set of annotated structures. Its aim is the empirical characterisation of configurations of cues signalling a particular discourse pattern: enumerative structures. As these structures can concern text spans of any size, the perspective is described as multi-level. The study relies on the systematic annotation of structures in a corpus of French-language texts, and on the application of data mining methods to detect emergent complex discourse markers. While in terms of methodology it belongs in corpus linguistics and natural language processing, its theoretical foundations are to be found in functional linguistics, in psycholinguistics and in research on the visual dimension of texts. We chose to start from what may be seen as the most basic among the notions called upon to account for text/discourse organisation: linearisation, continuity vs. discontinuity (the fundamental question behind discourse segmentation), and discourse patterns.
The arguments for this "back to basics" approach are given in the next section, organised around four issues: the linearity constraint, the non-discrete nature of discourse markers, the importance of top-down processing, and granularity and the multi-level nature of discourse structures. These constitute the foundation for the choice of enumerative structures for annotation, the rationale for which is given in Section 3, followed in Section 4 by the annotation model and method, from corpus preparation procedures to the manual annotation of structures and cues. In Section 5, a descriptive survey of nearly 1,000 annotated structures leads to a proposal for a granularity-based typology, and to an analysis of genre-related variations. Finally, Section 6 presents the recurrent cue configurations made apparent by the application of data mining techniques to the rich annotated data.

Multi-level annotation in the ANNODIS project: preliminaries
The research presented here started with the ANNODIS annotation project, which can be described as a large-scale discourse-level annotation experiment calling upon different discourse models and different genres of written French-language texts. The project comprised two distinct approaches, respectively labelled bottom-up and multi-level. Bottom-up and multi-level annotations were applied to different corpora, for reasons which are explained in Section 4.2.2, but a set of texts was annotated in both frameworks to allow a direct comparison of the approaches (ANNODIS duo, see Section 5.3). The bottom-up annotation, conducted according to Segmented Discourse Representation Theory (Asher, 1993), focused on the identification of rhetorical relations. The multi-level annotation--the main focus of this paper--took a less well-charted path, which the present section aims to describe and justify in the form of basic propositions underlying the choice of objects to annotate, the annotation method, as well as the questions asked of the annotated corpus.

Linearisation is a problem
Language is linear, while mental representations are not (or not necessarily). This, as many authors have pointed out (Levelt, 1981; Gernsbacher, 1995, 1997; Heurley, 1997, inter alia), can be seen as problematic insofar as "in text, a multidimensional discourse model is squeezed into a linear form. Linearity requires the writer to produce each textual unit in turn, and processing constraints demand short units of meaning. Yet the mental representation on which the discourse is based is not a succession of facts or ideas which can each be expressed in one sentence. This is where discourse organisation comes in..." (Ho-Dac and Péry-Woodley (2009), echoing Levelt (1981)). Few approaches to expository discourse, however, focus on linearisation and its inescapable consequence--sequentiality, i.e. the segmentation of discourse into successive text spans.
Goutsos (1996, 1997) is one author who argues for a theory of sequential relations, in which he sees "an autonomous source of text connectivity" (ibid. 502). He describes most approaches as favouring a what-perspective--"what is taking place in discourse" ("propositional or semantic content")--over a how-perspective--"which would focus on the structuring rather than the individual units of text" (Goutsos, 1996, p. 503). The macrostructure or story grammar approach to discourse coherence (van Dijk, 1980) is an example of the what-perspective, as are, largely, models relying on the notion of rhetorical relations (Rhetorical Structure Theory (Mann and Thompson, 1988), Segmented Discourse Representation Theory (Asher, 1993), inter alia). Goutsos' how-perspective has its roots in functional linguistics, in the notion of "information packaging" (Chafe, 1976, 1994), in Halliday's textual metafunction (Halliday, 1977/1983), and in the notion of textual strategy (Enkvist, 1985; Virtanen, 1992). A number of researchers in the field of automatic text generation, especially in the RST "sphere", have developed models based on a distinction related to the one Goutsos proposes: in particular Virbel, via his Text Architecture Model based on the notion of "textual object" (Virbel, 1989, 2015; Lemarié et al., 2008), and Power et al., who argue that what they call "abstract document structure" is a separate descriptive level in the analysis and generation of written texts (Power et al., 2003). A distinction does exist within RST between subject-matter and presentational relations, a distinction which Taboada and Mann (2006) associate with Goutsos' proposals (p. 443). But Power et al. go further by questioning the ambiguity in RST between text spans and the meanings of these spans in the attribution of relations, and call for a clear distinction between document structure and rhetorical structure, which they claim are as distinct as syntax from semantics (Power et al., 2003, p. 245).
These authors have in common a central concern with text segmentation, and how it is signalled (i.e. with what signalling devices). They do not primarily focus on the nature of relations between text segments--the central concern for models of discourse organisation based on discourse relations. In Goutsos' model, the two fundamental relations between text spans are simply continuity and discontinuity (or shift), text being seen as a "periodic alternation of transition and continuation spans" (Goutsos, 1996, p. 501). Continuity applies by default, and therefore can be implicit, whilst discontinuity requires some form of signalling, i.e. the presence of linguistic devices that "function as cues to the reader", "help[ing] the reader assign the utterance in which they occur to a continuation or a transition space" (Goutsos, 1996, p. 517). The next section sketches out the conception of discourse organisation signals which is embodied in the present study.

Signalling discourse organisation is "a struggle between different forces"
At any given time, linguistic choices in text production are influenced by several principles concurrently at play. These choices are, in Enkvist's words, "the outcome of a conspiracy or a struggle between the different forces that affect the linearization of discourse" (Enkvist, 1985, p. 321). We shall use an example to explain why Enkvist's image seems relevant:
Example 1 from a policy-oriented text published by IFRI (French Institute for International Relations)

New budgetary cuts [section heading]
Let us now look at the effect of the crisis as things stand at present. [...]
In the United Kingdom, the defence budget, which amounted to 44.5 billion in addition to spending relating to external operations in Afghanistan and Iraq, is to be cut by [...]
In Germany, debate is raging over whether or not to abolish national service, which would reduce troop numbers from 250,000 to [...]
In Austria, a question mark is hanging over the military service and most of the country's tanks have been withdrawn from service. [...]
In Greece, the defence budget will be amputated by [...]

On reading Example 1, one is immediately aware of the presence of a number of paragraph-initial adverbials (In the United Kingdom, In Germany, etc.). Time and space adverbials are amongst potential sequentiality cues which have also been considered good segmentation markers by researchers working in a what-perspective: they can be associated with topic shifts (Piérard and Bestgen, 2006), and they introduce a new interpretation criterion projecting forward (Charolles et al., 2005). Adopting a how-perspective, we would argue that what is significant about these adverbials is that there are four of them in relatively short succession, exhibiting strong parallelism: paragraph-initial prepositional phrases (In + name of country) followed by a comma. Together, they form an identifiable pattern, and recognising this pattern is in our view very much part of understanding what the text is about. Now, a series of paragraph-initial sequencers (Firstly, Secondly,...) instead of adverbials, or adverbials of time instead of space, would create a functionally similar pattern. A sequence of four non-initial, non-detached adverbials, on the other hand, would definitely not realise the same text strategy (Virtanen, 2004), and would not be perceived in the same way. In this perspective, features that are often overlooked in discourse organisation research must be fully taken into account, in particular layout and punctuation: the paragraph breaks, as well as the commas following each prepositional phrase, are clearly determining features. The pattern at once delineates the items as discontinuous and brings them together--because of their parallelism--into a higher level span (see Figure 2 below).
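The parallelism described for Example 1 can be made concrete with a minimal sketch. It is an illustration only, not the detection procedure used in the study: the regular expression, the function name `find_parallel_series` and the `min_items` threshold are all assumptions introduced here, and the pattern covers only the "In + capitalised name + comma" shape seen in Example 1.

```python
import re

# Hypothetical pattern (an assumption, not the project's grammar):
# paragraph-initial "In (the) + capitalised name(s)" followed by a comma.
INITIAL_PLACE_ADVERBIAL = re.compile(
    r"^In (?:the )?[A-Z][\w-]*(?: [A-Z][\w-]*)*\s?,"
)

def find_parallel_series(paragraphs, min_items=2):
    """Return (start, end) index pairs of runs of consecutive paragraphs
    that all open with a detached 'In + place name,' adverbial."""
    series, start = [], None
    for i, para in enumerate(paragraphs):
        if INITIAL_PLACE_ADVERBIAL.match(para.strip()):
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_items:
                series.append((start, i - 1))
            start = None
    # Close a run that reaches the end of the text.
    if start is not None and len(paragraphs) - start >= min_items:
        series.append((start, len(paragraphs) - 1))
    return series

paragraphs = [
    "Let us now look at the effect of the crisis as things stand at present.",
    "In the United Kingdom, the defence budget is to be cut.",
    "In Germany, debate is raging over national service.",
    "In Austria, a question mark is hanging over the military service.",
    "In Greece, the defence budget will be amputated.",
]
print(find_parallel_series(paragraphs))
```

The sketch deliberately treats position (paragraph-initial), punctuation (the comma) and seriality (a run of at least `min_items` paragraphs) as jointly required, which mirrors the point that no single feature marks the pattern on its own.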
Viewed from a what-perspective, the four adverbials in Example 1 introduce spatial criteria which are essential for the interpretation of subsequent text: everything said after adverbial n and before adverbial n + 1 only applies in (is only true for) the spatial area designated by the adverbial. We find unconvincing the dichotomy sometimes found in the literature on textual metadiscourse between propositional and non-propositional textual material (or primary and secondary discourse) (see Ho-Dac et al., 2012). The adverbials in Example 1 are both at once: they have both an ideational and a textual role, and they are noteworthy for both the what- and the how-perspectives. This duality takes us back to Enkvist's remark about "a struggle between different forces": given that at any one time several processes are concurrently going on in text (producing content, organising text), signalling devices must be expected to be largely multifunctional, since they may be shared by several processes.
The perspective sketched out here challenges the view of discourse organisation as signalled in text primarily via specialised (lexical) discourse markers. Along the lines of authors such as Marcu (2000, 2006), we would argue that discourse markers are likely to be more eclectic and less discrete, i.e. to come in the form of bundles of cues in which most of the time no single element is either necessary or sufficient. If, as suggested in the analysis of Example 1, and in line with a number of authors (Virtanen, 1992, 2004; Hasselgård, 2010), adverbials only function as sequentiality markers when associated with other features (e.g. positional or punctuational features) and/or occurring in a series (Ho-Dac and Péry-Woodley, 2009), the search for discourse markers is redefined as a search for recurrent cue configurations. A similar approach is adopted in recent research on the signalling of implicit discourse relations (Taboada and Das, 2013). In their study, Taboada and Das stress that "we need to move beyond the signalling by discourse markers (...) in order to understand how relations are processed, and in order to extract them automatically" (idem, p. 250). The authors propose annotating a wide range of cues in order to discover combined and multiple signals of discourse relations (idem, p. 250), an objective close to our own search for complex discourse markers.

Top-down processing influences discourse interpretation
Discourse markers are usually seen as bearers of instructions. For instance, connectives carry the instruction to link two text segments (or the propositional meanings they convey) via a particular relation. This conception is rooted in a bottom-up view of discourse interpretation as a unit by unit construction. We focus here on movement in the opposite direction, proposing that in text processing--particularly in the case of expository text--there is also an immediate perception of high-level signals, a Gestalt-like grasp of large-scale textual patterns, which in turn influences the step by step interpretation process (Asher et al., to appear). This interest in top-down processing relates to research in neighbouring fields: cognitive psychologists and psycholinguists have studied the effect of headings and sub-headings, and of layout features such as paragraph breaks, on reading comprehension and recall (Lorch and Lorch, 1996; Lemarié et al., 2012, 2008; Heurley, 1997); computational approaches to document generation and understanding, in their concern with linearisation (cf. Section 2.1), have sought to define principles governing the physical presentation of text (Power et al., 2003; Bateman et al., 2001; Virbel, 1989, 2015). We use the term "textual pattern" to suggest that this top-down processing may have more to do with pattern recognition than with a compositional meaning-construction process. Signalling is part and parcel of the definition of textual patterns, since they are characterised by their ability to be readily perceived by readers. There can be no such thing as an implicit textual pattern.

Discourse structures are multi-level
In our attempt to draw attention to top-down processing, we referred to high-level signals and large-scale textual patterns. More precisely, the signals and textual patterns in question are typically multi-level and apply recursively, and these are properties which are of great interest to us. The textual pattern which is the focus of this paper, enumerative structures, will be seen to range from a few lines to whole sections of text, and to allow several levels of embedding. This multi-level property is clearly related to how the text spans delimited by discourse structures interact with document structure segmentation (sections and sub-sections, paragraphs). Example 1 illustrates such an interaction between two modes of organisation, where place adverbials and paragraph segmentation may be seen as signalling the items of an enumeration. The approach therefore implies attention to document structure (Power et al., 2003), and its signals. Visual signals of document structure are considered as fully-fledged discourse features with the potential to combine with other cues to form what we call complex discourse markers.
The previous section has allowed us to clarify our general objective in the light of the basic tenets of our approach. We can now turn to the method specifically set up to identify these complex discourse markers, which involves automatic feature-tagging (Section 4.1.1), manual annotation of textual patterns (Section 4.1.2) and data mining techniques to identify correlations (Section 6). The method is designed for long expository texts which differ noticeably from newspaper articles in terms of discourse organisation (cf. Section 4.2.2). As described in Section 6.3, this method has made it possible to identify complex discourse markers made up of cues appearing in series or in specific patterns. We also insist on the role of genre in determining what cues are used, or in shifting the balance of interpretation of particular sets of cues. The first stage in the description of the method is to present the context--the ANNODIS multi-level annotation experiment.

An annotation experiment to implement a multi-level approach to discourse
In order to observe diverse structuring modes, including at high levels of organisation, we devised an annotation experiment to be carried out on lengthy non-narrative texts, organised into three distinct sub-corpora so as to allow potential genre-related variation to emerge. In accordance with the approach presented above, the annotation is not based on predefined markers: the identification of cue configurations functioning as discourse markers is an expected outcome of the analysis of the annotated data. However, our manual annotation relies on extensive pre-processing, in particular the systematic pre-marking of selected features, in an approach inspired by Biber (Biber, 1988; Biber et al., 2007). Figure 1 gives an overview of the methodology developed for this experiment. The association of exhaustive NLP-generated linguistic information (pre-marked features) with human intuitions (manual annotation of structures and cues) produces rich data, opening the way for new investigations using corpus-linguistics or data-driven methods. Two multi-level structures have been annotated according to this methodology within the ANNODIS project--topical chains and enumerative structures--but the present paper deals solely with the latter.
In line with the approach outlined in Section 2, the annotation project described here differs in several major ways from previous discourse annotation initiatives, such as the Penn Discourse Treebank (PDTB; Prasad et al., 2008), the RST (Rhetorical Structure Theory) Treebank (Carlson et al., 2003), or the Discourse Graphbank (Wolf et al., 2004). The PDTB's focus is low-level discourse structure (elementary predicate-argument relations) and it is grounded in a lexicalised approach to discourse (role of discourse connectives as predicates). Though based on different models of discourse relations, with varying views on the role of lexical connectives, the RST Discourse Treebank and the Discourse Graphbank share similar objectives, in line with a what-perspective more than a how-perspective, to return to the distinction introduced in Section 2.1. The fact that all these annotation projects use only news material, mostly from the Wall Street Journal, is also revealing of how they differ from the experiment presented here, as will be made clear in the description of our experimental framework in Section 4.2. Having set out the context in which the annotation experiment was designed, we will now focus on one of the two multi-level structures selected for annotation: enumerative structures beyond the sentence level, as illustrated by Example 2:
Example 2 from Wikipedia (English): "Global warming" (Retrieved 2014-10-09)
Examples of impacts include:
• Food: Crop production will probably be negatively affected in low latitude countries, while effects at northern latitudes may be positive or negative. Global warming of around 4.6 °C relative to pre-industrial levels could pose a large risk to global and regional food security.
• Health: Generally impacts will be more negative than positive. Impacts include: the effects of extreme weather, leading to injury and loss of life; and indirect effects, such as undernutrition brought on by crop failures.
Our interest in enumerative structures as textual patterns deployed at different discourse organisation levels is initially rooted in systemic functional linguistics: Halliday's description of text as "the unit of the semantic process" (Halliday, 1977/1983, p. 63) encourages the formulation of hypotheses on how perception of high-level structures may influence text interpretation at a more local level. In this context, we are developing an approach to "texture" which takes into account visual aspects of text construction, aspects considered by Power et al. (2003) as part of "document structure". Along the lines defined by these authors--pursuing Nunberg's reflection on "text grammar" (Nunberg, 1990)--and by researchers inspired by Virbel's model for "text architecture" (Luc et al., 2000; Luc and Virbel, 2001), we argue for a linguistic status for what Power et al. (2003) call "the graphical component", on a par with lexico-syntactic cues. After Luc and Virbel (2001), we describe enumerative structures as textual objects resulting from a textual act whereby text is arranged (visually or through other devices) so that the reader becomes aware of this textual arrangement. The associated semantics is that the reader is led to interpret the enumerated elements (i.e. the items) as similar in some respect, and therefore as constituting a segment homogeneous in terms of a "co-enumerability criterion". The co-enumerability criterion may be lexically expressed, as in Example 2 (Examples of impacts), or realised more indirectly. Two peripheral elements may contribute to this textual arrangement: a trigger which announces the enumeration and/or a closure. Enumerating appears thus as a very basic way of organising text, and a generic one in the sense that it can be resorted to for a wide range of semantic or rhetorical functions.
Despite this basic character, enumerating as a text construction strategy has not elicited much interest among discourse linguists: the "relative neglect" noted by Schiffrin in 1994, and described by her as "a surprising oversight" (Schiffrin, 1994, p. 378) still seems to apply. There are, on the other hand, quite a few studies focusing on specific linguistic elements playing a role in enumerating, in particular lexical item introducers, which have been variously named "linear integration markers" (Turco and Coltier, 1988; Jackiewicz, 2005), "sequencers" (Hempel and Degand, 2008) and "serial markers" (Bras and Schnedecker, 2013). Mostly concerned with the semantic description and classification of such markers (numerical: firstly, etc.; temporal: subsequently, finally, etc.; spatial: in the first place, etc.), these studies tend to leave out non-lexical cues such as visual devices and document structure. Research on "enumerable" nouns (Tadros, 1985, p. 6) or "shell nouns" (Francis, 1994; Schmid, 2000), so-called because of their underspecified meaning, also constitutes a relevant related field of investigation. Such nouns are seen as announcing ("predicting" to use Tadros' term) subsequent specification in the following text, and, in the case of enumerative structures, naming the co-enumerability criterion which provides the rationale for enumerating.
In contrast with these two groups of studies which mostly take specific markers as their starting point, our interest is in the text-structuring role of enumerating, and in the diverse ways in which these structures are signalled (Ho-Dac et al., 2010). Seen from this angle, the "markers" selected in the studies mentioned above are cues amongst others, playing a role in multiple-cue signalling devices--complex discourse markers. The existing research, however, as well as providing numerous insights, raises a number of issues of interest to our project, issues which underlie the questions we are going to ask of our annotated data.
From our text organisation perspective, as distinct from the discourse markers perspective in which studies of item introducers were carried out, there is no reason to give special treatment to lexical markers or to distinguish them from the various other ways in which enumerating may be signalled. As mentioned earlier, we take full account of document structure and include among potentially relevant text features the visual devices which organise the text on the page, delimiting different spans of text: typographical variations, layout (indentation, line spacing, paragraph breaks, bullets, headings). As visual devices can be seen as pulling enumerative structures towards the textual component while lexico-syntactic cues seem better able to contribute to the ideational component, a central objective of this study is to examine how these various cues interact, and what variables may have an impact on these interactions.
Another underlying question concerns the nature of the relation between the enumeration proper, i.e. the items, and the peripheral elements: trigger and closure. Viewing the relationship between the co-enumerability criterion expressed in the trigger (or closure) and each of the items in the enumeration in terms of taxonomy-based hypernym-hyponym relations is clearly too narrow: expressions of co-enumerability may be used to gather together world entities, but also textual objects (section, chapter), rhetorical functions (examples, as in Example 2), or other forms of textual organisation such as steps in a chronology, stops in an itinerary, etc. Within a larger goal of exploring how the textual and the ideational components are enmeshed, enumerative structures deserve scrutiny for their ability to organise text and categorise content at one and the same time. Following Luc et al. (2000), enumerating is described as a textual act which asserts the co-enumerability of the listed "entities" by transposing it textually. This "textual transposition" can take many forms, the most obvious being when items are separate paragraphs with bullet points. The nature of the co-enumerability may be made explicit, as in Example 3 below, where the two text segments are presented as similar in that they are both "criticisms" (A moralistic criticism [Une critique moraliste] and A deterministic criticism [Une critique déterministe]). The co-enumerability criterion may be expressed in the trigger, in a prospective element (two types of criticisms [deux types de critiques]), and/or in the closure, in an encapsulation (These two criticisms [Ces deux critiques]) (cf. Conte, 1996; Sinclair, 1983). The enumeration is the only necessary element in this structure.
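The components just described (an obligatory enumeration of items, an optional trigger, an optional closure, and a possibly explicit co-enumerability criterion) can be pictured as a small data structure. The sketch below is purely illustrative: the class names `Span` and `EnumerativeStructure`, the character-offset representation and the rule that at least one item is required are assumptions made here, not the ANNODIS annotation format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    """A stretch of text; offsets are hypothetical character positions."""
    start: int  # inclusive
    end: int    # exclusive
    text: str

@dataclass
class EnumerativeStructure:
    items: List[Span]                 # the enumeration proper (obligatory)
    trigger: Optional[Span] = None    # announces the enumeration
    closure: Optional[Span] = None    # closes the enumeration
    criterion: Optional[str] = None   # co-enumerability criterion, if explicit

    def __post_init__(self):
        # Assumption: a structure without any item is not an enumeration.
        if not self.items:
            raise ValueError("an enumerative structure needs at least one item")

    def components(self) -> List[str]:
        """Component labels in textual order: trigger, items, closure."""
        labels = []
        if self.trigger is not None:
            labels.append("trigger")
        labels.extend("item" for _ in self.items)
        if self.closure is not None:
            labels.append("closure")
        return labels

es = EnumerativeStructure(
    items=[Span(30, 60, "A moralistic criticism ..."),
           Span(61, 95, "A deterministic criticism ...")],
    trigger=Span(0, 29, "... two types of criticisms"),
    closure=Span(96, 120, "These two criticisms ..."),
    criterion="criticisms",
)
print(es.components())
```

Keeping trigger and closure optional while validating only the items reflects the statement that the enumeration is the only necessary element.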

An annotation model for enumerative structures
Example 3 from WIKI sub-corpus (wik2 liberteSE coder3 1254325598390)
From Example 3 onward, the formatting of examples obeys the following conventions: the two right-hand columns delimit each annotated enumerative structure (ES) and its components; horizontal lines in the left-hand column indicate paragraph breaks in the original, i.e. each boxed segment corresponds to a complete paragraph. In Example 3 for instance, the trigger is in a paragraph, the items consist of two paragraphs each, and the closure starts a sixth paragraph. Where excessively long paragraphs were cut, this is signalled by [...] (items 1 and 2). When a component covers only part of a paragraph, this is indicated as in Example 3 for the closure.
The reference associated with each example (e.g. wik2 liberteSE coder3 1254325598390) is its identifier in the ANNODIS resource. It can be searched for in the resource using the ANNODIS browser.

Annotating enumerative structures: annotation procedure and experimental framework
Biber et al. (2007) propose a step-by-step method for corpus-based studies of discourse, providing a detailed account of seven steps seen as necessary in order to arrive at generalisable descriptions of discourse structure in corpora. These steps may be carried out in two possible orders: either top-down (a priori communicative/functional categories provide the basis for manual text segmentation) or bottom-up (starting with automatic segmentation based on lexical cohesion). In both cases, the segmentation stage leads to a linguistic characterisation based on the analysis of the distribution of textual features, according to the methodology initially set up by Biber to produce an emergent text typology (Biber, 1988). In the bottom-up approach, the communicative/functional categories are derived from the linguistic characterisation (identification of clusters, which are then given a functional interpretation). Our approach is fundamentally grounded in Biber's use of systematic feature-marking and analysis, but can be seen as proposing a third way with respect to Biber et al. (2007). In accordance with the top-down approach, the functional units under study are determined and defined a priori. This is the model which is embodied in the annotation manual, and sketched in Section 3.2 above. But in a perspective akin to Biber et al.'s bottom-up approach, the human annotation process is guided by the pre-marking of features which emerge from previous studies as potentially relevant for the identification and description of enumerative structures.
The next section outlines these two major steps in the annotation procedure. Section 4.2 then fills in the detail of how they were carried out in practice: it describes the annotation interface, the corpus and the annotation model.

AUTOMATIC PRE-MARKING OF FEATURES
Prior to manual annotation, a systematic pre-marking of potentially relevant features was automatically carried out on the POS-tagged and syntactically parsed texts, relying on local grammars and making use of specifically designed lexicons. The selection of features calls upon previous research (see Section 3.1) and covers a wide variety of linguistic phenomena, both visual (punctuation, layout) and lexico-syntactic. The set of pre-marked features is organised in Table 1 below into seven types, which constitute the basis for the analyses which will be presented in Section 6.

Table 1: Pre-marked features (columns: Feature type, Feature)

The inclusion of sentence-initial circumstance adverbials in this set of features is based on Charolles' framing hypothesis (Charolles et al., 2005): such adverbials have the potential to project forward an interpretation criterion, and thus define the initial boundary of a discourse frame, i.e. a text segment clustering around a specific interpretation criterion. They were pre-marked as potential item introducers, as were sentence-initial connectives, which are potential sequencers. The automatic detection of such sentence-initial cues proceeded in two steps: the first was to detect all syntactically detached elements occurring before the grammatical subject; the second to attribute to each detached element a syntactico-semantic function (circumstantial adverbial, sequencer, other connective).
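The two-step detection of sentence-initial cues can be illustrated with a toy sketch. It assumes that a parser has already split each sentence into constituents and flagged detachment and the grammatical subject; the `Constituent` class, the small English word lists and the fallback label "unclassified" are all hypothetical stand-ins for the local grammars and lexicons mentioned above, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Constituent:
    text: str
    detached: bool            # syntactically detached (e.g. set off by a comma)
    is_subject: bool = False  # grammatical subject of the sentence

# Toy lexicons (assumptions for illustration only).
SEQUENCERS = {"firstly", "secondly", "finally"}
OTHER_CONNECTIVES = {"however", "therefore", "moreover"}
CIRCUMSTANCE_PREPS = ("in ", "at ", "during ", "after ", "before ")

def detached_before_subject(sentence: List[Constituent]) -> List[Constituent]:
    """Step 1: collect detached elements occurring before the subject."""
    found = []
    for constituent in sentence:
        if constituent.is_subject:
            break
        if constituent.detached:
            found.append(constituent)
    return found

def classify(constituent: Constituent) -> str:
    """Step 2: attribute a syntactico-semantic function by lexicon lookup."""
    key = constituent.text.lower().rstrip(",").strip()
    if key in SEQUENCERS:
        return "sequencer"
    if key in OTHER_CONNECTIVES:
        return "other connective"
    if key.startswith(CIRCUMSTANCE_PREPS):
        return "circumstantial adverbial"
    return "unclassified"  # fallback label added for this sketch

def premark_initial_cues(sentence: List[Constituent]) -> List[Tuple[str, str]]:
    return [(c.text, classify(c)) for c in detached_before_subject(sentence)]

sentence = [
    Constituent("In Germany,", detached=True),
    Constituent("debate", detached=False, is_subject=True),
    Constituent("is raging", detached=False),
]
print(premark_initial_cues(sentence))
```

Separating detection from classification keeps the positional criterion (detached, before the subject) independent of the lexical one, so either step could be replaced without touching the other.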
Prospective elements consist of simple and fairly unambiguous cataphoric patterns:
XXX as follows (.:)
PREP the following NUMBER XXXs
In addition to these cataphoric patterns, prospective elements also include plural noun phrases where a classifier, or "shell-noun" (cf. Section 3.1), occurs. As for encapsulations, two patterns associated with shell-nouns were used (Conte, 1996; Schmid, 2000): plural demonstrative noun phrases and noun phrases introduced by the semi-determiner such (tel(le)s).
Example On remarque que dans cette conception philosophique de la liberté, les limites ne sont pas des limites contraignant la liberté de la volonté humaine ; ces limites définissent en réalité un domaine d'action où la liberté peut exister, ce qui est tout autre chose.

MANUAL ANNOTATION OF STRUCTURES AND CUES
Pre-marked features were meant to act as flags to guide annotators in the identification of sporadic discourse units, leading them away from linear reading towards a more global view of text. The manual annotation consisted of two main tasks: first, delimiting and labelling the components of ESs which were detected (trigger, co-items, closure and the co-enumerability criterion if explicitly stated); second, marking up features considered as relevant cues, by either validating pre-marked features or annotating and labelling complementary cues. In (3) for example, after delimiting the components and identifying the co-enumerability criterion (criticisms [critiques] expressed in trigger and closure), the annotators validated all the pre-marked features and went on to annotate as an extra cue the parallelism between the two NPs (a moralistic criticism / a deterministic criticism) introducing the items. Once identified, additional cues were labelled according to the categories defined in the annotation guidelines, i.e. the categories used for pre-marking with the addition of syntactic parallelism.
If no predefined label fitted, the annotators were invited to create descriptive labels. As a consequence, a proportion of annotator-added cues form a heterogeneous set of non-categorised features (e.g. coreferential expression, trigger repetition, apposition, named entity...).
Each feature marked up as relevant (whether or not it was pre-marked) becomes what we call an "ES-cue", i.e. a linguistic feature which, in combination with others, participates in the signalling of enumerative structures, and hence of discourse organisation. It is through the identification of recurring configurations of ES-cues that we propose to define complex markers (cf. Section 6).

THE ANNOTATION INTERFACE
With annotators having to annotate textual zones of varying sizes, as well as deal with discontinuity and possible overlaps with previously delimited zones, the annotation task was highly complex and required an efficient purpose-built annotation interface.
The design of the GLOZZ interface (Widlöcher and Mathet, 2012) reflects two major requirements concerning text visualisation and the annotation procedure itself. The text visualisation interface has to take into account the output of the pre-marking procedure, including XML encoding of the text layout and formatting. The annotation interface must offer a panel of user-friendly editing tools for delimiting and characterising units; it must also facilitate navigation in the text being annotated. The solution adopted is to offer two views of the text: a main view for annotation and a global view to get a large-scale vision of the text (cf. Glozz pre-marked documents in Figure 3). Through these two modes of access to text, the interface encourages the annotator to combine a top-down and a bottom-up approach during reading, so as to be able to see local cues as well as global structures.
In order to ensure that the grasp of texts by annotators is as ecological as possible, and to reduce the inevitable processing difference between annotating and reading, the main view must present texts as real documents with major aspects of layout preserved.
All these requirements were taken on board in the design of the GLOZZ annotation platform.

THE ANNODIS CORPUS
Our approach to discourse imposes constraints on the selection of texts for the corpus. Contrary to previous discourse annotation programmes (cf. Section 2), we opted for lengthy expository texts, first because they tend not to be structured around a major referent (as is often the case in narratives), second because they favour complex organisation and are therefore more likely to contain different structuring modes (including complex document structure). Another consideration was that corpus linguistics methods require fairly large volumes of texts and sufficient numbers of annotated structures if some generalisation of observations is to be possible (cf. Piérard and Bestgen, 2006). Finally, because we consider genre as a feature to be taken into account in the definition of complex discourse markers (Ho-Dac and Péry-Woodley, 2009; Taboada and Das, 2013), we compiled a diversified corpus enabling contrastive analyses. Considering all these criteria, the texts selected for inclusion in the ANNODIS corpus combine three genres of lengthy expository texts: web-encyclopaedia articles (nearly 200,000 words from the French Wikipedia in its version of June 18, 2008), scientific papers (proceedings of Congrès Mondial de Linguistique Française 2008, about 135,000 words) and reports in the field of international relations (from the French Institute for International Relations, over 180,000 words). These three sub-corpora are respectively named WIKI, LING and GEOP. The total number of texts (83) was set in accordance with the time constraints on the annotation programme.

Given our objectives, special care was taken in the preparation of the corpus: not only were all the texts XML encoded in conformity with the TEI-P5 norm, but it was imperative that the visual appearance of the texts be preserved, which is also a departure from previous experiments. Semi-automatic procedures were set up to annotate and encode the specific layouts signalling textual objects: title, headings with their level, paragraphs, lists and citations. Figure 3 gives a schematic view of the corpus preparation process.

Figure 3: ANNODIS corpus preparation: from original text to pre-marked document ready for annotation

THE ANNOTATION EXERCISE
The manual annotation was programmed in two stages, beginning with an exploratory phase dedicated to the evaluation of the task's feasibility, which led to a series of improvements in the procedure: clarification of the protocol, simplification of the annotation model, changes in the visualisation parameters, and correction of the annotation guide. Then came the annotation task itself. Three undergraduate linguistics students were selected as neutral (non-specialist) annotators. The 83 texts of our corpus were split into 3 sets. There was a training phase during which four texts were jointly annotated and the three annotators were encouraged to compare and share their annotations. This training phase led to an improved, stable version of the annotation guide (Colléter et al., 2012). After training, six new texts were annotated by the three coders, and these annotations were used for measuring inter-annotator agreement.
Measuring inter-annotator agreement in this case means checking whether two annotators have identified the same ESs in the target text. We considered that there was agreement on a given ES when annotators A and B had selected the same text span, and identified exactly the same items within this span. If two annotated ESs differed only in terms of trigger and/or closure (these units being optional) while respecting the previous conditions, they were considered identical.
Overall agreement between two coders for each text was measured via the F-score. The nature of the task ruled out traditional agreement measures (such as Cohen's Kappa) because ES marking is not a categorisation task. In a task such as ours, as Hripcsak and Rothschild (2005) explain, there is no access to a negative count, i.e. we cannot take into account the fact that both annotators agreed that there are no ESs in a particular span of text. For the evaluation of a marking task, the F-score is the measure which is most commonly used (see e.g. Brants (2000) for syntactic annotation). In our case the F-score is based on the number of ESs identified by both annotators and the overall number of ESs identified by each, as formulated below:

\[ F = \frac{2 \times N_{AB}}{N_A + N_B} \]

where $N_{AB}$ is the number of ESs identified by both annotators A and B, and $N_A$ and $N_B$ are the overall numbers of ESs identified by A and by B respectively.

This score was measured for every pair of annotators over the 6 texts (2 from each sub-corpus), each having been annotated by three different coders. The overall average F-score is 0.67 (sd 0.21), meaning that over two ESs out of three marked up by one coder were also marked up by the other coder. This value was considered sufficient for the final annotation phase to be launched, whereby the remaining texts were distributed among the three annotators, each text being dealt with only once.
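The agreement criterion and the pairwise F-score described above can be sketched as follows. This is only an illustration under stated assumptions: the `ES` record, its character-offset fields and the function names are hypothetical and are not taken from the ANNODIS tools.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ES:
    """One annotated enumerative structure (hypothetical representation)."""
    span: tuple   # (start, end) offsets of the whole ES
    items: tuple  # one (start, end) pair per item


def agreement_key(es):
    # Trigger and closure are ignored on purpose: two ESs count as the
    # same when they share the same span and exactly the same items.
    return (es.span, tuple(sorted(es.items)))


def f_score(ess_a, ess_b):
    """F-score between two annotators' ES lists for one text."""
    keys_a = {agreement_key(es) for es in ess_a}
    keys_b = {agreement_key(es) for es in ess_b}
    total = len(keys_a) + len(keys_b)
    if total == 0:
        return 1.0  # assumption: two empty annotations count as agreement
    return 2 * len(keys_a & keys_b) / total


def pairwise_f_scores(annotations):
    """annotations: dict mapping annotator name -> list of ES."""
    return {
        (a, b): f_score(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }
```

With three annotators, `pairwise_f_scores` yields one score per pair of coders for a given text; averaging over pairs and texts would give an overall figure comparable in kind to the one reported above.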
In a final stage, disagreements were post-annotated, and adjudicated versions of the ten multi-annotated texts from the training and evaluation phases were produced. As observed in Colléter et al. (2012), disagreements mostly concern small and/or isolated ESs, as well as structures which may be considered as contrasts or chronologies rather than ESs.
The data collected at each stage is available on-line (original documents, texts prepared for annotation, pre-adjudicated versions, etc.), together with a technical report which includes the annotation manual together with coders' testimonies and adjudication details (Colléter et al., 2012). The exploitation of the annotations has so far been carried out in two ways: manually, by means of an exploration interface, and automatically, via data mining techniques.

Analysing the annotated corpus: enumerative structures (ESs) as a basic strategy
The rich annotated data resulting from the annotation exercise just described can now be examined for answers to the issues and questions raised in Sections 1 and 2. We start with a descriptive survey of the frequency, length and distribution of ESs in the corpus, which provides the basis for a quantitative assessment of their importance as a text construction strategy (Section 5.1), and a structural characterisation in terms of cardinality (number of items) and composition (presence/absence of a trigger and a closure) (Section 5.2). In Section 5.3 we compare bottom-up and multi-level approaches, taking advantage of the annotation of ESs in terms of discourse relations in a sub-corpus. We finally delve deeper into characteristics which are directly relevant to two major discourse organisation issues: enumerative structures are multi-level structures, capable of organising textual material at any level of granularity from entire sections to the sub-sentential level (see the typology in Section 5.4); enumerative structures are text-segmenting patterns as well as content-structuring categorisation devices, highlighting the interweaving of the textual and ideational components (Section 6).

Frequency, length and distribution of annotated ESs
Table 2 summarises the results of a first survey of the annotations, showing ESs to be a basic strategy frequently resorted to by writers in different genres of expository texts. There is an average of 12 ESs per text in our corpus (range: 2 to 34), and ESs have an average length of 429 words, with considerable variation (from 8 to 8,666 words). The "text coverage" value is the proportion of a given text appearing in at least one ES: on average, 44.6% of a text's words are contained in ESs; in some cases, text coverage is over 90%. ESs are present in all three sub-corpora with significant variations which will be presented in the next section together with variations regarding composition. The next sub-sections aim to flesh out this initial picture via analyses of the composition of ESs and of their interaction with discourse relations and document structure.

Although Table 3 gives the average number of items per ES as 3.4, it is worth noting that 42% of ESs contain only two items, whilst rare extreme cases may comprise up to 48 items. Cardinality (number of items) and length (number of words) are positively correlated, though at a marginal level (r=0.14). Closures are rare (13.2%), whereas most ESs start with a trigger (74.6%). Given that either trigger or closure can express the co-enumerability criterion, trigger-less ESs could have been thought more likely to have a closure, but cross-tabulation of the data does not confirm this hypothesis: only 3% of trigger-less ESs have a closure. To end this general picture, complete ESs are fairly rare (around 10%) while over 22% of ESs are minimalist, i.e. composed only of items. No significant correlation has been established between completeness and length or cardinality.
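Because ESs can be nested or can overlap, the "text coverage" value defined above has to count each word once, however many ESs contain it. A minimal sketch, assuming ESs are given as word-index intervals (a hypothetical representation, not the project's actual data format):

```python
def text_coverage(es_spans, n_words):
    """Share of a text's words that fall inside at least one ES.

    es_spans: list of (start, end) word indices, end exclusive.
    n_words:  total number of words in the text.
    """
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    covered = 0
    current_end = 0  # first word index not yet counted
    for start, end in sorted(es_spans):
        start = max(start, current_end)  # skip words already counted
        end = min(end, n_words)          # ignore anything past the text
        if end > start:
            covered += end - start
            current_end = end
    return covered / n_words
```

Sorting the intervals and tracking the furthest word already counted is one simple way to take the union of the spans; a nested ES then adds nothing to the total.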

Composition of annotated ESs
Looking at Tables 2 and 3, interesting variations across our sub-corpora begin to emerge. The largest ESs, both in length and cardinality, are found in encyclopaedia articles (WIKI), where they also cover a larger part of the text than in other sub-corpora (455 words/ES; 55.5% of total text). At the other end of the spectrum, international relations reports (GEOP) contain fewer and shorter ESs (369 words/ES) which cover a much smaller proportion of the text's surface (32.8% of total text). These variations across sub-corpora are all statistically significant (p < 0.001, Kruskal-Wallis test), but a clearer understanding of the text structuring role of ESs is needed to assess their linguistic significance. This is what the next two sections work towards by bringing into the picture two distinct sets of annotation, discourse relations and document structure, in order to arrive at a better characterisation of ESs' discourse function.

Interaction between discourse relations and ESs
As mentioned in Section 2, the ANNODIS project also involved a bottom-up annotation of discourse relations, and a part of the ANNODIS resource, labelled "ANNODIS duo", was annotated with both ESs and discourse relations. The model and method for the annotation of discourse relations originate in Segmented Discourse Representation Theory (SDRT): coders started by segmenting texts into Elementary Discourse Units (EDUs) and, after reaching mutual agreement, associated them with discourse relations, building up Complex Discourse Units (CDUs) until they arrived at a complete hierarchical representation of the text. Sixteen discourse relations were annotated, a selection which represents a compromise between informativeness and reliability of the annotation process. The selection constitutes a consensual set of relations which are shared by most discourse models, or correspond to well-defined subgroups in fine-grained theories (Hovy, 1990), as well as to the level of grain adopted for the Penn Discourse Tree Bank (Prasad et al., 2008).

ESs annotated in the ANNODIS duo resource contain an average of 24.5 EDUs per ES (between 3 and 65). Because CDUs are recursively nested, the raw number of CDUs per ES is not relevant without a more qualitative analysis. Looking at the discourse relations annotated on the borders of triggers and items, i.e. relations associated with EDUs starting or ending triggers and items, the following associations may be observed:
• 75% of triggers end with an EDU linked forward to another segment via ELABORATION* and/or FRAME relations,
• 94% of items start with an EDU attached to another segment via an ELABORATION* relation associated in 35% of cases with a simultaneous CONTINUITY relation,
• when considering only initial items, 92% of ESs have an initial item where the starting EDU is associated with an ELABORATION* relation.
The fact that most ESs can be described in terms of just two discourse relations, i.e. ELABORATION* and CONTINUITY, confirms that the structure can legitimately be regarded as a functional unit, regardless of the diverse forms in which it occurs. Moreover, each of these relations seems to have a specific role in the structure: ELABORATION* between the trigger and items, and CONTINUITY between items, as Example 5 illustrates. These observations also support the SDRT model developed in Bras et al. (2008). According to its authors, the ENUMERATION relation was introduced so that analysts would be able "to juggle between constituents describing semantic content and constituents describing discourse packaging while ENTITY-ELABORATION would not have allowed so" (Vergez-Couret et al., 2011). The authors clearly sense that two different types of text-building process are at play, which they want to account for while keeping them apart, hence the "juggling". This term calls to mind the doubts expressed in Goutsos (1996, p. 257):

"Ideational analyses of texts have identified relations of Joint, List or Sequence (Hoey, 1979; Mann and Thompson, 1988), whose status is clearly not so prominently ideational as textual. More generally, it is doubtful whether essentially presentational relationships like enumeration or listing can be couched in ideational terms at all. [...] the insistence on recognising a semantic relation between every single text segment comes into conflict with the occurrence of purely descriptive, propositionally loosely or arbitrarily related chunks of text."

A similar question was raised in an earlier study on enumerating by Luc et al. (1999), who argued that their initial representation within Virbel's Text Architecture Model (cf. Section 2.1) needed complementing by a Rhetorical Structure Theory representation, while pointing out the inadequacy of tree-like structures to represent ESs and stressing the importance of visual clues, largely overlooked by researchers working within RST. Regarding the latter argument, Virbel et al. (2005, p. 234) denounce linguistics' blindness to visual cues: "Linguistics, just as--to a lesser extent--information science, has long been 'blind' to the role of visual properties of written language, while other research fields (anthropology, history of texts, cognitive and experimental psychology) did point to the fundamental importance of these properties from the viewpoint of cognition."
Among the visual properties overlooked by linguists, including discourse linguists, are titles and headings, whose role in text processing has been the object of much study in cognitive psychology (Lorch and Lorch, 1996; Lemarié et al., 2008, 2012), but which are difficult to integrate within a discourse relations approach. The present study was designed from the outset to deal with the corpus not just as texts but as documents, whose layout structure is meaningful. The next section focuses on an annotation layer dedicated to the documents' layout structure, which appears to be particularly well-suited to characterising the annotated ESs in their diversity.

Enumerative structures are multi-level: a granularity-based typology
As mentioned in Section 4.2.2, the layout structure of documents was annotated according to TEI-P5 encoding. The textual objects considered here are sections, headings, lists and paragraphs. They are used as features in order to account for the variety of annotated ESs in terms of length and composition. Because they enter into hierarchical relationships, these layout units also provide a scale for describing ESs' granularity level (cf. Section 2.4).
We observed earlier that in terms of completeness, ESs show variations that are not explained either statistically by length or cardinality (cf. Sections 5.1 and 5.2) or by distinct discourse relations (cf. Section 5.3). In contrast, granularity level appeared as the most informative variable for the classification of these structures (see Ho-Dac et al., 2010). The interaction between ESs and the document's layout structure, presented in Table 4, provides the basis for a granularity-based typology of ESs. This granularity-based typology emerged as the optimal way of clustering the annotated ESs according to quantifiable variations in their form and composition. Moreover, it gives us a way of organising the data by distinguishing classes of objects likely to make use of different signalling modes, as described in Section 6.
Each type is now defined more precisely in terms of its typographical and layout features.
• Type 1 corresponds to multi-section ESs, in which items are sections with a visible heading, as in Example 6 below.
• Type 2 ESs are prototypical formatted lists where each item is signalled by a bullet or number, as in Examples 2 and 3 above.
• ESs which extend over more than one paragraph but do not belong to either of the previous types are Type 3. These Type 3 ESs contain at least one paragraph break which may occur between two components (e.g. between trigger and first item or, as in Example 8, between final item and closure as well as between items), with no specific constraints on the position and/or number of paragraph breaks.
• Type 4 ESs are contained within a paragraph.

There is considerable variation in the distribution of types across sub-corpora, as shown in Figure 5. WIKI ESs are the most strongly associated with visual layout: in 19% of cases, the items are headed sections (Type 1) and in 36.4% they are formatted lists (Type 2). Such emphasis on visual properties is to be expected in texts designed to be read on screen. It is in marked contrast with linguistics papers and international relations reports, where ESs aligned on visual layout are fairly rare (fewer than 10% of Type 1 ESs) and Type 4 ESs, i.e. low-level structures without visual properties, are the most frequent (45.5% in LING and 61.4% in GEOP against 22.4% in WIKI). Type 3 ESs, multi-level structures without headings or bullets as item introducers, are the most stable across corpora (between 20% and 23%). These variations support our assumption that genre must be considered a relevant feature for the characterisation of ESs and also for the definition of complex discourse markers. For example, (sub)headings may be considered as ES-cues only in specific genres or text types.

The relation between a section heading and its sub-headings could arguably be seen as an inclusion relation similar to that of co-items in an enumerative structure. Yet it is not the case that all headed sections including headed sub-sections can be classed as ESs, and indeed most were not identified as ESs by the annotators. What marks out the annotated Type 1 ESs is the presence of a semantic criterion linking the items, in other words the fact that they function on both the ideational level and the textual level: in Example 6, the first level heading, Caesar's amorous conquests, provides this semantic criterion which unites under the category Caesar's amorous conquests upper-class Roman women (Les femmes de la haute société romaine) and queens (Les reines).

TYPE 2: BULLETED LISTS
Type 2 ESs are characterised by the presence of bullets or numbers signalling each item. They have the highest cardinality (4.1 items/ES), but are significantly shorter (184 words/ES, p < 0.001), their constituent items being generally restricted to short phrases, as in Example 7 below. There are exceptions, however, such as Example 3 above, where some items cover several paragraphs. Triggers are almost systematically present: 95% in WIKI, 97% in GEOP, and 100% in LING. The corollary is a tiny percentage of minimalist ESs.
ES2 TRIGGER le premier cas, celui où -i- est faux, mais où -ii- subsiste, les relations de -ii- sont à porter au crédit de la construction [...]

Precise alignment of elements (trigger, items, closure) with paragraphs is not mandatory for this type. Example 9 shows a complete enumerative structure where the trigger and the first two items share one paragraph, while the last item and the closure appear in a separate one.

ITEM 3
De fait, la première option semblait être la seule permettant de poursuivre l'effort de 1991, en poussant [...]
CLOSURE

Type 3 ESs are average in length, with slightly above average cardinality. Whilst the frequency of triggers is markedly low (61%), closures are much more frequent than elsewhere, particularly in LING (27%) and GEOP (32%). Despite this comparatively high frequency of closures, Type 3 includes the highest proportion of minimalist ESs: over a third have neither trigger nor closure. These minimalist Type 3 ESs are characterised by the presence of series of ES-cues in paragraph-initial position (see Section 5.2.2 above). It may also be noted that only in Types 3 and 4 do we find ESs which have a closure and no trigger, as ES1 in Example 8.

TYPE 4: INTRAPARAGRAPH ESS
Type 4 ESs, which are contained within a paragraph, are the most frequent. The ES in (10) below and the nested structure (ES2) in (8) are examples of this type. Unsurprisingly, Type 4 ESs have the smallest mean length (120 words/ES); they also have significantly fewer items than other types: over half are 2-item ESs (against 29% for Type 2, 34% for Type 1 and 41% for Type 3). The presence of triggers and closures is slightly below average. As a consequence, complete ESs are fairly rare (7%), in contrast with minimalist ESs (29%), illustrated in Example 10.
To summarise, the major variations accounted for by the granularity-based typology are as follows:
• Type 1 ESs are significantly longer;
• Type 2 ESs, with higher cardinality and shorter length, have a trigger most of the time and as a consequence are rarely minimalist ESs;
• Type 3 ESs significantly more often have a closure, but are also more often minimalist than the others;
• Type 4 ESs are the shortest in length and cardinality with a high proportion of minimalist ESs.
This typology will be used in the next section to organise the data by distinguishing classes of objects likely to make use of different signalling modes.
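The granularity-based typology can be read as a simple decision rule over layout properties. The sketch below is an illustration only: the `ESLayout` record and its field names are hypothetical, and the order of the tests reflects the definitions given above (Type 3 being whatever spans paragraphs without falling under Types 1 or 2).

```python
from dataclasses import dataclass


@dataclass
class ESLayout:
    """Layout facts about one annotated ES (hypothetical field names)."""
    items_are_headed_sections: bool  # items are sections with a visible heading
    items_are_bulleted: bool         # each item is signalled by a bullet or number
    paragraph_breaks: int            # number of paragraph breaks inside the ES


def es_type(layout):
    """Return the granularity type (1 to 4) of an ES."""
    if layout.items_are_headed_sections:
        return 1  # multi-section ES
    if layout.items_are_bulleted:
        return 2  # formatted list
    if layout.paragraph_breaks >= 1:
        return 3  # spans several paragraphs, no headings or bullets
    return 4      # contained within a paragraph
```

For instance, `es_type(ESLayout(False, False, 2))` returns 3, whereas the same ES without any paragraph break would be classed as Type 4.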

Mining the annotated corpus for configurations of ES-cues
In this final section, we move on to the search for recurring configurations of ES-cues as a way of identifying the complex markers signalling ESs. Prior to this phase of the analysis, ES-cues (i.e. validated features) had to be organised into relevant categories. We added syntactic parallelism, encountered in various forms in Examples 1, 4 and 5, as a frequent annotator-added cue which had been identified from the outset as a potential ES-cue but could not be pre-marked for technical reasons. We now describe this re-classification, which accounts for the differences between Table 1 (Section 3) and Table 6 below.
In addition to the main annotation task, identifying ESs and their constitutive elements, the annotators were asked to mark up the cues which they identified as signalling these structures (cf. 3.1.2). The resulting corpus contains 4,052 individual annotated cues which were explicitly identified as ES-cues either through pre-marked feature validation or through manual addition; to these must be added 500 headings, systematically counted as ES-cues when occurring in a trigger or when item-initial.
It should be stressed that identifying cues is considerably more difficult than identifying ESs, and at this stage we have no inter-annotator agreement measure on this task. A number of problems were encountered, some of which originate in the pre-marking procedure (any text processing program inevitably generates both noise and silence), others in the level of linguistic competence required of the annotators, or in semantic difficulties inherent in some of the cues. Due to these limitations, our analysis will be restricted to the identification of the global behaviour of ES-cues.
The goal of this section is twofold:
1. to examine frequencies and distributions for the different kinds of ES-cues;
2. to identify recurrent cue configurations as a first step towards the definition of ES markers (cf. Section 3.1).

Annotator-added cues, except syntactic parallelisms, were counted as "TriggerLex.", "ClosureLex." or "ItemOthers" according to their host component. We are aware that these categories are excessively broad. We will in particular need to isolate prospective elements and encapsulations in order to investigate the expression of the co-enumerability criterion. A semantic characterisation of the expression of the co-enumerability criterion is required for a finer functional classification of ESs.

Description of cues in ES components
Tables 7 and 8 provide the detail of the distribution of annotated cues for each component. Distributions are given both globally and according to ES types. All values are percentages, and are relative to the frequency of the corresponding element: out of the 131 ESs with a closure (i.e. out of 13.2% of ESs, cf. Table 5) 78.6% have a lexical cue. Percentages do not add up to 100 for triggers and items, as they each can have between zero and several cues (of different kinds).

TRIGGER AND CLOSURE CUES
Trigger and closure are almost systematically signalled by a cue: over 75% of these components were associated by the annotators with at least one ES-cue.
Two categories vary considerably in frequency across types: explicit lexical elements, which potentially announce the co-enumerability criterion (TriggerLex.) [...]

Closures are characterised by the strong presence of lexical cues (ClosureLex.). It must however be kept in mind that this component is rare in our annotated data (cf. Table 3), with the consequence that these percentages correspond to very few cases and the results cannot be extrapolated.

ITEM CUES
Table 8 shows that all predefined categories of item cues were indeed found in the ANNODIS resource, and that conversely very few item cues are found in the "ItemOthers" miscellaneous category. Over 15% of ESs contain at least one of the most common lexical cues, i.e. a sequencer, an adverbial or a parallelism. Among these lexical cues, only parallelisms are distributed more or less equally in all ES types. As a consequence, parallelism is the cue which combines most frequently with the visual cues inherent to Type 1 and 2 ESs (ItemHead. or Bullets). The other lexical cues are on the contrary extremely rare in Type 1 and Type 2. This lack of variety in the signalling of items creates a stark contrast between Types 1 and 2 on the one hand, and Types 3 and 4 on the other, the latter displaying a greater complexity of organisation associated with a wide range of cues.
The distribution of item cues in Types 3 and 4 presents an interesting contrast: Type 3 ESs favour circumstance adverbials over sequencers (47.8% and 26.3% respectively), whereas in Type 4 sequencers (34.4%) prevail over adverbials (19.3%). An explanation for this difference may be found in the organising role of circumstance adverbials (cf. 1.2): in order to function as discourse segmentation markers, these must be paragraph-initial, as in Example 12 below where 3 place adverbials occupy paragraph-initial position.

Cue associations
The examples make it clear that most ESs are signalled concurrently by several kinds of ES-cues. The previous section gave an insight into the frequencies of individual ES-cues without taking into account their co-occurrence. Yet our hunch, as stressed in Section 2, is that textual patterns are not just signalled by discrete clearly identifiable dedicated markers, but by configurations of ES-cues functioning as complex discourse markers. In such a perspective, a discourse function should not be attributed to a particular lexical expression (on the basis of a specific semantic or pragmatic value) but rather to this expression when it occurs in a particular context or configuration (cf. the pattern formed by the series of paragraph-initial adverbials in Example 1). The ANNODIS resource now provides us with data to investigate this hypothesis, and this is what we attempt below using the notion of cueset. This is not an easy task, however, as we want to allow for flexibility while hoping to catch recurring patterns. The identification of textual patterns when reading can be conceptualised in terms of pattern recognition: a threshold is reached when there are enough converging cues to push interpretation towards the identification of the pattern in question. A more satisfactory approach to cuesets would involve attributing weights to individual ES-cues in order to account for the fact that several weak cues may do the same work as one strong cue. This is our horizon, with the work presented here as a first exploration of the data in this direction.
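The idea that several weak cues may do the same work as one strong cue can be pictured as a weighted threshold. The sketch below is purely illustrative: the study assigns no weights, so every weight, the threshold and the labels "Sequencer", "Adverbial" and "Parallelism" are hypothetical placeholders.

```python
# Hypothetical integer weights (3 = strong, 1 = weak); the study assigns none.
CUE_WEIGHTS = {
    "Bullets": 3,
    "ItemHead.": 3,
    "TriggerLex.": 2,
    "TriggerPunct.": 1,
    "Sequencer": 2,    # hypothetical label
    "Adverbial": 1,    # hypothetical label
    "Parallelism": 1,  # hypothetical label
}
THRESHOLD = 3  # hypothetical recognition threshold


def cue_score(cues, weights=CUE_WEIGHTS):
    """Sum the weights of the distinct cue categories present."""
    return sum(weights.get(cue, 0) for cue in set(cues))


def pattern_recognised(cues, weights=CUE_WEIGHTS, threshold=THRESHOLD):
    """True when converging cues reach the recognition threshold."""
    return cue_score(cues, weights) >= threshold
```

Under these placeholder values, a single strong cue such as `"Bullets"` reaches the threshold on its own, and so does the combination of three weak cues, whereas one isolated adverbial does not.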
A cueset is the set of cue categories occurring in an ES. As the purpose of these sets is to help identify frequent cue associations, we apply the following simplifications:
1. for item cues, a single occurrence suffices for the cue to be included in the set: there is no need for the cue to appear in every item;
2. cue frequency within an ES is not taken into account, and is reduced to a simple binary value of presence/absence.
The main reasons for these simplifications are the potential incompleteness of item marking (e.g.firstly in the first item not followed by other sequencers), and the inherent difficulty of the cueannotation task.
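The two simplifications can be illustrated with a minimal sketch. It assumes a hypothetical in-memory representation in which each ES comes with one list of cue labels per item, plus a list of trigger and closure cues. The function name, the data layout and the label "Seq." are ours, not those of the ANNODIS annotation format.

```python
from typing import FrozenSet, Iterable


def build_cueset(
    item_cues: Iterable[Iterable[str]],
    other_cues: Iterable[str] = (),
) -> FrozenSet[str]:
    """Collapse the cue annotations of one ES into a set of cue categories."""
    categories = set(other_cues)      # trigger and closure cues, if any
    for cues in item_cues:
        categories.update(cues)       # simplification 1: one marked item is enough
    return frozenset(categories)      # simplification 2: presence/absence only


# Toy ES (invented): a colon in the trigger, a sequencer in the first item only.
example = build_cueset(
    item_cues=[["Seq.", "Seq."], [], []],
    other_cues=["TriggerPunct."],
)
print(sorted(example))  # ['Seq.', 'TriggerPunct.']
```

Because the result is a frozen set, a sequencer marked once and a sequencer repeated in every item yield the same cueset, which is the behaviour described above.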
The number of different associations was calculated for the whole collection of ESs, and for each ES type studied independently. Of all theoretically possible configurations23, over half were actually observed: 113 distinct cuesets were identified for all 991 ESs. This result is interpreted as meaning, on the one hand, that ESs are signalled by a variety of cue configurations (for example a lexical cue in the trigger followed by a series of sequencers, or a combination of sequencers and adverbials); and on the other hand, that certain specific cue associations recur, while others are not found.
In order to identify the most frequent cuesets, we focus here on the 14 cuesets occurring at least 20 times across types and corpora, which represent 63% of all ESs. Among the most frequent cuesets, we find clusters made of cues which have not been the focus of much attention in studies of enumeration, i.e. bullets, punctuational patterns and, more interestingly, lexical cues in the trigger (cf. Example 7). In almost all frequent cuesets there is at least one trigger cue (a punctuational or a lexical one) associated with all possible item cues (i.e. headings, bullets, sequencers, adverbials, parallelisms).
The most frequent cueset is the combination TriggerPunct. + TriggerLex. + Bullets, which occurs 83 times, i.e. in 8.4% of ESs. The fairly similar cueset TriggerPunct. + Bullets recurs only 40 times, which means that visual cues are usually combined with lexical ones. The same kind of combination is found with the cueset TriggerPunct. + TriggerLex. + ItemPunct. This finding supports our view of signalling as a struggle between different forces (cf. Section 2.2): visual devices can be seen as pulling enumerative structures towards the textual component while lexico-syntactic cues seem better able to contribute to the ideational component.
Cuesets including the much-studied sequencers are also very frequent. Two types were observed with approximately equal frequency: cuesets composed of sequencers only, as in Example 8 (73 cuesets, 7.4% of ESs); and cuesets which combine sequencers with other cues such as a lexical cue in the trigger (as in Example 9), parallelisms or adverbials (87 cuesets, 8.8% of ESs). Cuesets made up purely of adverbials (Examples 9 and 11) are as frequent as those made up purely of sequencers (74 cuesets, 7.5% of ESs). But cuesets mixing adverbials with other kinds of cues are fairly rare (only 23 with a lexical cue in the trigger and 20 with sequencers).
All the cuesets described are fairly stable across sub-corpora except for two: those made up of ItemHead. and Paral., which only recur in scientific papers, and those made up of adverbials, which occur primarily in encyclopaedia articles and never in scientific papers.
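The counts and proportions discussed above (cuesets kept above a frequency threshold, and the share of ESs they cover) can be sketched as follows. This is only an illustration of the kind of computation involved: the function name is ours and the toy data are invented, not taken from ANNODIS.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def frequent_cuesets(
    cuesets: List[FrozenSet[str]], min_count: int = 20
) -> Tuple[Dict[FrozenSet[str], int], float]:
    """Return the cuesets seen at least `min_count` times and their coverage."""
    counts = Counter(cuesets)  # one cueset per annotated ES
    frequent = {cs: n for cs, n in counts.items() if n >= min_count}
    # Coverage: proportion of all ESs whose cueset is among the frequent ones.
    coverage = sum(frequent.values()) / len(cuesets) if cuesets else 0.0
    return frequent, coverage


# Invented toy collection of six ESs (not corpus figures).
toy = (
    [frozenset({"TriggerPunct.", "TriggerLex.", "Bullets"})] * 3
    + [frozenset({"TriggerPunct.", "Bullets"})] * 2
    + [frozenset({"Paral."})]
)
frequent, coverage = frequent_cuesets(toy, min_count=2)
for cueset, n in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(" + ".join(sorted(cueset)), n)
print(f"coverage: {coverage:.0%}")
```

Applied to the full set of annotated ESs, the default `min_count=20` corresponds to the threshold of 20 occurrences used in this section.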
A number of specific configurations have been identified, which, when correlated with ES types, can be summarised as follows:
• Type 1 ESs are typically signalled by a sequence of same-level headings, with an upper-level heading acting as a trigger, and occur in documents where layout and visual formatting play a prominent role.
• Type 2 ESs are typically signalled by a sequence of bulleted items, almost systematically introduced by a punctuational cue (final colon in the preceding paragraph), and/or a lexical cue in the trigger which may carry semantic information on the co-enumerability criterion.
• Type 3 ESs are typically signalled by a contiguous series of paragraphs with circumstance adverbials (or, less likely, sequencers) in initial position, which have both a textual and an ideational role. Such structures seem to be highly genre-sensitive.
• Type 4 ESs can be described as a single paragraph containing a series of sequencers, with a high probability of a colon marking the end of the trigger, or a prospective element indicating the co-enumerability criterion. In contrast with Type 2, Type 4 ESs reflect the ideational dimension of discourse organisation more than the textual dimension.

Towards complex discourse organisation markers identification
In order to validate and formalise the cuesets observed above, we used a common data mining technique for identifying recurrent associations between pairs of cues by extracting the association rules, i.e. the logical implication rules between cues (Agrawal et al., 1993). This method ensures that all possible cases are systematically examined. The association rules are of the form:
Rules 3 and 4 merely confirm that ESs of Types 1 and 2 have a high proportion of triggers (Table 5), and that most of these triggers contain a lexical cue (Table 7). Such a finding can be interpreted as showing that even in apparently purely visual, i.e. textual, ESs, a lexical cue somewhere will ensure the presence of the ideational dimension.
Rule 5 links the existence of an encapsulation to that of a prospective element. Again, this must be interpreted with the knowledge that both these cues are quite systematic in triggers and closures. Rule 5 says that ESs with a closure generally have a trigger. In other words, closures are not used to compensate for the absence of a trigger. By lowering the tolerance of the association rules system, more rules can be made to emerge, although they are known to be much less reliable and systematic. One interesting point is that, even with a low threshold, no rule involving circumstance adverbials emerges. This negative result confirms that this cue category is much less likely to work in association with others.
Most of these results were predicted and explained in previous sections, which suggests that no other obvious specific cue associations can be identified as a result of our annotation exercise.
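To make the notion of association rule between pairs of cues more concrete, the following minimal sketch computes, for every ordered pair of cues (A, B), the support and confidence of the rule "A implies B" in the sense of Agrawal et al. (1993). It is not the system used in the study: the function name, the threshold values and the toy cue labels ("Trigger", "Closure") are assumptions made for the illustration, and the thresholds actually applied to the ANNODIS data are not reproduced here.

```python
from itertools import permutations
from typing import FrozenSet, List, Tuple

Rule = Tuple[str, str, float, float]  # (antecedent, consequent, support, confidence)


def pairwise_rules(
    cuesets: List[FrozenSet[str]],
    min_support: float = 0.1,      # hypothetical threshold
    min_confidence: float = 0.8,   # hypothetical threshold
) -> List[Rule]:
    """Extract rules 'a -> b' between single cues from a list of cuesets."""
    n = len(cuesets)
    cues = sorted(set().union(*cuesets)) if cuesets else []
    rules = []
    for a, b in permutations(cues, 2):
        n_a = sum(1 for cs in cuesets if a in cs)
        n_ab = sum(1 for cs in cuesets if a in cs and b in cs)
        support = n_ab / n        # share of ESs containing both cues
        confidence = n_ab / n_a   # share of ESs with a that also have b (n_a > 0)
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Invented toy data: every ES with a closure also has a trigger.
toy = [
    frozenset({"Closure", "Trigger"}),
    frozenset({"Closure", "Trigger"}),
    frozenset({"Trigger"}),
    frozenset({"Trigger", "Bullets"}),
]
for a, b, s, c in pairwise_rules(toy, min_support=0.5, min_confidence=0.8):
    print(f"{a} -> {b}  support={s:.2f}  confidence={c:.2f}")
```

Lowering `min_support` or `min_confidence` lets more rules through, at the cost of reliability, which is the trade-off described above.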

Conclusion
Enumerative structures were selected as the focus of this study as a way of throwing new light on linearisation and segmentation, discourse phenomena which are particularly difficult to analyse empirically. We described enumerative structures in Sections 2 and 3 as a generic multi-level device for organising text. According to our broad functional definition, they are textual patterns assembling text spans which are made to appear as similar in a given respect, thereby forming a higher-level segment homogeneous in this particular respect. They arrange into linear format text segments which are ideationally discontinuous but functionally equivalent and interchangeable. Their signalling calls upon a great diversity of cues working together. As such, enumerative structures constitute good handles for analysing how writers cope with "The Unbearable Linearity of Texts".
Our objective of proposing a data-intensive methodology for the study of linearisation and segmentation imposed certain requirements in terms of ease of detection and annotation. As the function of enumerative structures depends on their being readily detectable, they constitute a good object for annotation. The sizeable annotated resource described here, which has been made available to the research community, is characterised by a number of original features:
• it is composed of highly-structured long expository texts in three different genres (as opposed to short news material);
• its mark-up combines NLP-based exhaustive techniques and human intuitions;
• the visual characteristics of the texts have been encoded so as to provide a presentation respectful of the original layout;
• the annotation guidelines were designed so as to bring together under a functional umbrella objects that are linguistically more diverse than in previous studies: Type 4 (intraparagraph ESs) is shown to be just one realisation, accounting for no more than 40% of annotated ESs.
The paper summarises the results of the first analyses of the annotated data. The evidence-based and quantified typology we propose encapsulates the major results of our analyses so far: it provides a broad picture of a device realising a basic textual strategy. The preliminary analyses presented here only scratch the surface of what the annotated corpus allows. Where cues have been counted together, e.g. in the analysis of triggers and closures, finer analyses are needed to take into account the specific contribution of each type of cue, in particular in the case of expressions of the co-enumerability criterion. Qualitative studies are necessary for the analysis of the rhetorical and semantic functions of enumerative structures in text, opening the way for the study of correlations between such functions and cue configurations, and for the exploration of the differences between sub-corpora.
One important issue concerns the nature of the relation between the items and the "classifier" which introduces and links them. Does an enumerative structure reveal a pre-existing categorisation or can it "discursively create" such knowledge, as suggested by Schiffrin (1994, p. 396) or Luc et al. (2000)? The latter hypothesis, a constructivist one more in keeping with our textual approach, makes the expression of the co-enumerability criterion worth studying as potentially revealing not just of pre-existing knowledge structures, but of a writer's discourse strategy. A systematic study of expressions of the co-enumerability criterion is under way (Rebeyrolle and Péry-Woodley, 2014). It suggests that only in very few cases (< 10%) are these expressions linked to the enumerated items by a hypernym-hyponym relation, i.e. a taxonomic relation which is discourse-independent. First results show that the expressions of the co-enumerability criterion would generally be better described as "text-bound" labels (Francis, 1994), in terms of shell nouns or signalling nouns (Flowerdew, 2003; Flowerdew and Forest, 2015). The study develops a model associating semantic and textual properties of linguistic expressions of co-enumerability, proposing that semantic characteristics situate enumerative structures containing these expressions on a cline from mainly textual (metadiscursive) to mainly ideational (stable, discourse-independent categorisation). On this basis, a classification of ESs in terms of their discourse function will be put forward, to be compared with existing taxonomies of relevant discourse relations and analyses based on them (e.g. Joint, List, Sequence in RST).
Cue configurations should also be examined further in relation to layout (ES type) and composition: for example, minimalist ESs (neither trigger nor closure) are markedly more numerous in Types 3 and 4, which is also where adverbials are most frequent as item markers, and may compensate for the absence of expression of the co-enumerability criterion in a prospective or encapsulating element (Rebeyrolle and Péry-Woodley, 2014). We wish to look further into these trade-offs as examples of how ideational and textual metafunctions are interwoven in these structures. ESs should also be examined in context, within the linearity of text: interactions between ESs (nested and in sequence), interactions between ESs and other textual structures (including annotated topical chains). Finally, the markers identified should be tested and refined for the automatic detection of ESs, with potential applications in automatic text synthesis and document navigation.
Example 10 from GEOP sub-corpus (geop 11SE coder1 1254301361468)
[...] Between 1949 and 1970, [...]; [...] the share of demand covered by imported oil rose from 10% to 23%. Between 1978 and 1985, imports went down sharply, in absolute terms (-3.8 MB/d) as well as in relative terms (-16 points of market share). Two factors explain this phenomenon: the development of the giant oil field at Prudhoe Bay in Alaska, and the fall in oil demand linked to the second "oil crisis" in 1979 and to the economic recession. Since 1985, the share of imported oil in covering demand has continuously increased up to today.
Example 11 from GEOP sub-corpus (geop 27SE coder2 1282829750411)
[...] Terrorist events proliferate at the crossroads between four great circulations: the circulation of words and images (which makes it possible to cobble together solidarities between very different social groups), the circulation of capital (which allows the setting up of efficient logistics), the circulation of weapons (which keeps opening up the prospects for future dangers), and the circulation of men. [...]
Example 12 from WIKI sub-corpus (wik2 attentats11septSE coder1 1254125810843)
In the US, the only person until now to have been judged for direct implication in the 9/11 attacks is the Frenchman Zacarias Moussaoui. Arrested less than a month before the attacks, he has been accused by the American federal authorities of having had knowledge of the forthcoming attacks and of not having communicated his information. On May 3rd 2006, after a two-month trial, he was found guilty by the jury of the federal tribunal of Alexandria in Virginia on six charges of conspiracy linked to the terrorist attacks of September 11th and sentenced to life imprisonment without the possibility of parole.
In Germany, the Moroccan Mounir al-Motassadeq, arrested on November 28th 2001, is sentenced first to 15 years in prison in 2003 for complicity in these attacks. Freed in February 2006 after his conviction was quashed, he saw his initial sentence confirmed by the tribunal of Hamburg on January 8th 2007.
In Spain, the Syrian Imad Eddin Barakat Yarkas, chief of the local Al-Qaida cell, is arrested on November 13th 2001, charged with conspiring towards the September 2001 attacks. On September 26th 2005 he receives a twenty-seven year prison sentence.
Example 13 from GEOP sub-corpus (geop 16SE coder1 1255425907703)
[...] But the discussions are obscured by ideological positions: "subsidies are intrinsically bad", "you can't touch the CAP", "developing countries will always be the victims of an unfair system" ... positions which are not borne out by practices. All countries, even the most virtuous, use subsidies in a manner which may distort trade; the CAP is constantly under revision, and its cost is not high (0.5% of European GNP); finally, it is not true that developing countries stand to gain from a total end to subsidies, given the massive comparative advantage of the biggest agricultural producers, which are not developing countries.

Figure 1: A data-intensive method for the study of discourse organisation and discourse signalling
Our annotation model was designed to allow a moderately open-ended annotation task, aiming to leave some leeway for annotators' possible off-model intuitions. According to this model, an enumerative structure (ES) extends beyond a single sentence 3 and is made up of three segments: (1) the trigger, an optional introductory segment; (2) the enumeration, defined as a list of at least two co-items; (3) the closure, an optional closing segment. Figure 2 gives a schematic representation of this definition of ESs. It shows how the elements entering into this linear arrangement are in fact different kinds of nested text spans linked via a number of possible short and long distance discourse relations 4.

Figure 2: Enumerative structure representation
un tel abandon ne relève-t-il pas d'une forme déguisée de déterminisme ? Nous serions alors victimes d'une illusion de libre arbitre : [...] ITEM 2 Nietzsche reprendra cette critique : Aussi longtemps que nous ne nous sentons pas dépendre de quoi que ce soit, [...] Ces deux critiques mettent en lumière plusieurs points importants. CLOSURE
[...] D'autre part, de façon tout à fait inverse aux tentatives de monosémisation, on fait proliférer les sens [...] ITEM 2 Les positions -a- et -b- ne sont qu'apparemment paradoxales : il n'y a aucune contradiction, d'un côté, à [...] CLOSURE
The trigger is clearly marked: it consists of the heading (2.2. Two ways of denying polysemy [Deux manières de nier la polysémie]) and the sentence following it. The prospective element in the heading, made obvious by a numeral determiner, two ways [deux manières], is reiterated in the first sentence. The items are then signalled in four complementary ways: each one makes up a paragraph, they are introduced by a dash, sequentially labelled with letters, and start with a correlative adverbial which stresses the parallelism between the two assertions (On the one hand - On the other hand [D'une part - D'autre part]). The closure ends the enumeration with an encapsulating noun phrase (Positions -a- and -b- [Les positions -a- et -b-]). This example shows how different ES-cues can reinforce one another, giving the ES high visibility.
Co-items (π_b, π_c) together introduce a complex constituent (π) which is attached to the trigger (π_a) by the ENUMERATION relation. A coordinating relation (by default CONTINUATION) is inferred between the co-items.

Table 2: Frequency and coverage of annotated ESs in ANNODIS and all three sub-corpora

Table 3, an overall view of the composition of ESs in the corpus, shows that only a small proportion is complete with respect to the canonical three-part model (trigger, items, closure)16.
Example 5 from LING sub-corpus (ling kleiberSE coder3 1254143156093)
2.2. Deux manières de nier la polysémie ESTRIGGER Une réponse possible est [...] la polysémie en tant qu'association de plusieurs sens à une même forme lexicale se trouve niée de deux manières apparemment paradoxales : -a- D'une part, les vocables donnés comme polysémiques se voient en quelque sorte "monosémisés" par [...]
[...] (Muller et al., 2012) of dealing with essentially textual relations in ideational terms:
Like the annotation guide for multi-level structures, the guide produced for the annotation of discourse relations is freely available (Muller et al., 2012). It provides an intuitive introduction to discourse segments, including the question of embedding of discourse segments to form CDUs; a list of detailed instructions describing how to handle segmentation; and a semantic definition of each discourse relation with examples and potential markers.
20. 26 ESs: 15 in WIKI, 5 in LING and 6 in GEOP.
21. ENTITY-ELABORATION and ELABORATION relations are merged for this analysis under the label ELABORATION*.

• Finally, Type 4 stands for ESs contained within a single paragraph (see Examples 10 and 11). It is the most frequent type across the whole corpus.

Table 4: Granularity-based typology of ESs

Table 5, followed by descriptions of the four ES Types.
[...] least frequent in all three sub-corpora, they are, as can be expected, the longest ESs in our corpus, averaging 1,858 words (with an enormous range: 252 to 8,666 words), yet their cardinality is close to the average (3.4 items per ES). The few Type 1 ESs which have a closure (4%) are all complete ESs, i.e. they also have a trigger. Indeed, most Type 1 ESs have a trigger (84%), which is generally a heading of the next level up and announces the enumeration both visually (via document structure) and semantically. Example 6 shows a Type 1 ES with a trigger and 2 items:
César séduit de nombreuses femmes tout au long de sa vie et plus particulièrement celles issues de la haute société romaine. Il aurait ainsi séduit Postumia, la femme de Servius Sulpicius, Lollia, [...] César entretient des relations particulières avec Servilia Caepionis, [...] Le penchant de César pour les plaisirs de l'amour semble également attesté par [...] 6.2. Les reines ITEM 2 César a des relations amoureuses avec Eunoé, femme de Bogud, roi de Mauritanie. Cependant, sa relation avec Cléopâtre VII est restée plus célèbre. [...]
[...] stretch over at least two paragraphs, with no headings or bullets, as illustrated in Example 8. This example is a case of nesting: ES2 is embedded in ES1. The larger ES (ES1) is Type 3, with two paragraph breaks: one between the two items, the other before the closure. This example illustrates the role of paragraph-initial position in the signalling of ESs: each item-paragraph starts with a sequencer (A first observation [Une première observation]; A second observation [Une deuxième observation]). These sequencers are echoed by the two item-introducing expressions in the embedded Type 4 ES (ES2) (In the first case [Dans le premier cas]; The second position [La seconde position]).

Table 6 below lists the categories of ES-cues taken into account. The abbreviations in bold are used throughout the remainder of this section.
Table 6: Categories of ES-cues for analysis (after re-classification)
Ex. 3: Ces deux critiques
Ex. 9: To sum up / En somme

Table 7: Distribution of trigger and closure cues
[...], and punctuation marks (a final colon, TriggerPunct.), which have a purely textual role. Characteristic trigger punctuation is most frequent in Type 2 ESs, part of a well-established pattern for introducing lists, seen in Examples 3, 5 and 7. Punctuation cues are also fairly frequent in Type 4 ESs: Example 11 illustrates how punctuation is instrumental in signalling such intraparagraph ESs, with a colon as a trigger cue, and final commas reinforcing the parallelism between items.

Table 8: Distribution of item cues (cue category per ES type)
[...] In the United States [Aux États-Unis], In Germany [En Allemagne], In Spain [En Espagne].
Aux États-Unis, la seule personne à avoir été jugée jusqu'à présent pour son implication directe avec les attentats du 11 Septembre est le Français Zacarias Moussaoui. Arrêté moins d'un mois avant les attaques, il a été accusé par les autorités fédérales américaines d'avoir eu connaissance des attentats à venir mais de n'avoir pas communiqué ses informations. Le 3 mai 2006, au terme de deux mois de procès, il a été reconnu coupable par le jury du tribunal fédéral d'Alexandria en Virginie de six chefs d'accusation de complot en liaison avec les attentats terroristes du 11 Septembre et condamné à la prison à perpétuité, sans possibilité de remise de peine.
En Allemagne, le marocain Mounir al-Motassadeq arrêté le 28 novembre 2001, est condamné une première fois à quinze ans de prison en 2003 pour complicité dans ces attaques. Remis en liberté en février 2006 après que sa condamnation a été cassée, il voit sa première peine confirmée par le tribunal de Hambourg le 8 janvier 2007.
En Espagne, le Syrien Imad Eddin Barakat Yarkas, chef de la cellule locale d'Al-Qaida est arrêté le 13 novembre 2001, inculpé de conspiration en vue des attentats de septembre 2001. Il est condamné le 26 septembre 2005 à vingt-sept ans de prison.
This positional constraint is incompatible, by definition, with intraparagraph Type 4 ESs. Sequencers on the other hand are specialised in the signalling of ESs, and seem therefore to be more independent from positional constraints, which could explain why they are particularly suited to these low-level ESs, as illustrated by Example 8. In addition, punctuation item cues are fairly frequent in Type 4, as are punctuation trigger cues. Example 13 illustrates such a combination of punctuation and lexical cues in a Type 4 ES where items are separated by a semicolon and the last one introduced by the connective enfin / finally.
Mais les discussions sont occultées par des positions idéologiques : 'la subvention est intrinsèquement néfaste', 'la PAC est intouchable', 'les PED sont quoi qu'il arrive victimes d'un système injuste' ... positions contredites par les pratiques.