Discourse Relations and Connectives in Higher Text Structure

The present article investigates the possibilities and limits of local (shallow) analysis of discourse coherence with respect to the phenomena of global coherence and higher composition of texts. We study corpora annotated with local discourse relations in Czech and partly in English to find clues in the local annotation that indicate a higher discourse structure. First, we classify patterns of subsequent or overlapping pairs of local relations, and hierarchies formed by nested local relations. Special attention is then given to relations crossing paragraph boundaries and their semantic types, and to paragraph-initial discourse connectives. In the third part, we examine situations in which annotators tend to mark a large argument (larger than one sentence) of a discourse relation even with a minimality principle annotation rule in place. Our analyses bring (i) new linguistic insights regarding coherence signals in local and higher contexts, e.g. detection and description of hierarchies of local discourse relations up to 5 levels in Czech and English, description of distribution differences in semantic types in cross-paragraph and other settings, identification of Czech connectives typical only for higher structures, and the detection of a prevalence of large left-sided arguments in locally annotated data; and (ii) as another type of contribution, some new reflections on the methodologies of the approaches under scrutiny.


Introduction
In coherence-oriented discourse studies, the recognition of the distinction between local and global coherence dates back to the 1980s, and one of the most compelling explanations of it is that "the main purpose of global coherence relations is to help eliminate locally coherent nonsense texts" (Unger, 2006; Samet and Schank, 1984). According to these authors, global coherence is the connectivity between the main events of the text (scripts, plans and goals), and the global relations hold independently of the local coherence relations between discourse segments. In automatic discourse processing, the local and the global coherence models are also known as shallow and deep discourse analyses/parsing, respectively (Prasad et al., 2010). Both types of approaches deal with the determination and description of semantic and pragmatic relations between individual text units, but they differ in the focus of the analyses and in the methodology.
Local coherence models proceed bottom-up, from the smallest discourse units within a single sentence (clauses, even nominalized events, states etc.) and across the sentence boundary. Emphasis is put on the description of semantico-pragmatic relations between every two consecutive units and on the identification of lexical cues anchoring these relations (mostly discourse connectives; where there is no such surface cue, the relation is called implicit). The coherence analysis thus describes how every discourse unit is related to the previous one; no hierarchies are postulated and no claims are made about the shape of the overall structure of the analyzed documents. The advantage of local discourse analyses is their easy applicability in annotation and shallow parsing (looking for surface cues, connectives and other markers), given their usability across different languages and relatively high reliability (e.g. via inter-annotator agreement). As far as language resources aiming at a formalized linguistic description of discourse coherence are concerned, there is, internationally, a range of corpora annotated for local coherence relations in different languages. They mostly follow the Penn Discourse Treebank (PDTB) annotation style (Prasad et al., 2019), compare 2.2. There even exists a multilingual corpus with local discourse annotations in the PDTB style of parallel texts in six languages (Zeyrek et al., 2019). 1 For Czech, there is a publicly available large corpus of local discourse annotation, the Prague Discourse Treebank 2.0 (PDiT 2.0; Rysová et al., 2016), see Section 3. This resource has been developed by the Prague discourse group since 2009 and thus naturally represents the point of departure for this study.
Global coherence models, on the contrary, proceed top-down, from the text as a communicative whole, and postulate a hierarchically interconnected structure of smaller and larger units. These models usually capture each document as a single continuous graph representation with specific properties and constraints on the relations (e.g. tree-like graphs). The strength of this concept lies in its ability to demonstrate the composition of smaller blocks of the text, as well as to identify more general and more important text contents and the relations between them. For global annotations, there are far fewer projects, compare Section 2, and so far, there is no such project for Czech data. However, global coherence modeling (often in addition to local coherence models), apart from its straightforward application in automatic coherence evaluation (Feng et al., 2014; Lin et al., 2011), can significantly contribute to other NLP tasks, such as summarization (Zhang, 2011), topic identification (Pons-Porrata et al., 2007), text generation (Kiddon et al., 2016), textual entailment (Hagege and Jacquet, 2014) and others.
In this study, we examine the distributions, properties and mutual settings of local discourse relations in order to reveal possible patterns of higher discourse structuring and signs of global discourse coherence. Our research question is: Is global coherence signaled by the same types of language devices as the local one or can we reveal some differences by studying various phenomena in locally annotated data?
The different types of local discourse relations and their mutual settings that we analyze in this study are represented in Example 1 and Figure 1, which outlines discourse annotation in the Czech PDiT 2.0. The example demonstrates an intra-sentential relation, an inter-sentential relation within a paragraph and a cross-paragraph relation. Also, a hierarchical structure (a nested relation, i.e. a relation fully included within an argument of another relation) is shown by this example text.
[Figure 1 labels: sentences S1, S2, S3; condition (connective: If), reason-result (connective: So), concession (connective: Yet)]
Figure 1: An annotated example of an intra-sentential relation (condition), inter-sentential relation within a paragraph (reason-result) and a cross-paragraph relation (concession) from the Prague Discourse Treebank 2.0. Nodes represent individual clauses, root nodes refer to individual sentences, semantic types of discourse relations are highlighted in orange, connectives in green.
(1) (S1) If the new limits could be prepared sooner, they will probably only be issued as another amendment to the old law directive. (S2) So it seems that the processors are in no hurry with new limits.
[Paragraph boundary] (S3) Yet there is hunger for new limits.
Within the sentence (S1), there is an intra-sentential relation of condition signaled by the connective pokud [if]. Its two arguments are the two clauses in this sentence. The whole sentence (S1) is the left-sided argument of an inter-sentential reason-result relation anchored by a tak [so]. The if-relation is thus fully included in an argument of the so-relation, and the two relations form a hierarchical structure. These two sentences belong to a single paragraph and the two mentioned discourse relations are local, intra-paragraph relations. Sentence (S3) starts a new paragraph and is attached to the preceding sentence (S2) by a cross-paragraph inter-sentential concession relation with the connective přitom [yet]. In this case, the cross-paragraph link is local, between adjacent sentences, but in other possible contexts, the left-sided argument of the relation may span several sentences and/or be non-adjacent.
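The three relations of Example 1 can be encoded in a few lines to make the notions of scope and nesting concrete. The following is only a minimal sketch under our own assumptions: the unit indices, the `Span` and `Relation` classes and the two lookup tables are hypothetical illustrations, not the PDiT data format or the tooling used in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A contiguous run of discourse units (inclusive unit indices)."""
    start: int
    end: int

    def contains(self, other: "Span") -> bool:
        return self.start <= other.start and other.end <= self.end

    def units(self) -> range:
        return range(self.start, self.end + 1)


@dataclass(frozen=True)
class Relation:
    sense: str
    connective: str
    left: Span
    right: Span

    def extent(self) -> Span:
        """Smallest span covering both arguments of the relation."""
        return Span(min(self.left.start, self.right.start),
                    max(self.left.end, self.right.end))


# Hypothetical segmentation of Example 1: units 0 and 1 are the two clauses
# of S1, unit 2 stands for S2, unit 3 for S3 (which opens a new paragraph).
SENTENCE_OF_UNIT = {0: 1, 1: 1, 2: 2, 3: 3}
PARAGRAPH_OF_SENTENCE = {1: 1, 2: 1, 3: 2}


def is_nested_in(inner: Relation, outer: Relation) -> bool:
    """True if the whole inner relation lies within one argument of outer."""
    if inner == outer:
        return False
    ext = inner.extent()
    return outer.left.contains(ext) or outer.right.contains(ext)


def scope(rel: Relation) -> str:
    """Classify a relation by the sentences and paragraphs its arguments touch."""
    left = {SENTENCE_OF_UNIT[u] for u in rel.left.units()}
    right = {SENTENCE_OF_UNIT[u] for u in rel.right.units()}
    if left == right and len(left) == 1:
        return "intra-sentential"
    paragraphs = {PARAGRAPH_OF_SENTENCE[s] for s in left | right}
    return "cross-paragraph" if len(paragraphs) > 1 else "inter-sentential"


condition = Relation("condition", "pokud", Span(0, 0), Span(1, 1))
reason_result = Relation("reason-result", "a tak", Span(0, 1), Span(2, 2))
concession = Relation("concession", "přitom", Span(2, 2), Span(3, 3))
```

Under this encoding, `is_nested_in(condition, reason_result)` holds because both clauses of S1 fall inside the left argument of the reason-result relation, whereas the concession relation merely shares S2 with it and is not nested.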

Goals
The main goal of the present study is to systematically analyze existing local discourse annotations of Czech, and partly English, for possible signs of higher/global discourse structure. The analysis shall serve as a springboard for a planned global coherence annotation of Czech, much as the deep-syntactic layer was informative for the analysis of local coherence relations (Mírovský et al., 2012). An underlying assumption is that even the local annotation already displays such features of global text structure. This assumption is based on our observations from locally annotated data so far: we detected patterns like hierarchical organization of smaller and larger discourse relations, connectives and other discourse cues operating between larger blocks of texts, long-distance relations, genre-related patterns and so on. Also, we hypothesize that in a well-formed, coherent text, the paragraph structure, as a main formal structuring device, must be mirrored in discourse relations and their semantics in some way.
Corpus analyses of possible features of higher/global text structure in this study concern three separate topics:
• The issues of structure, or "the shape" of a text: we investigate mutual configurations of discourse relations (pairwise) and their complexity within locally annotated texts of Prague Discourse Treebank 2.0. We identify and quantify the embeddings, overlappings and crossings of relations, according to a similar study conducted by Lee et al. in 2006 on complexity of discourse structure in the Penn Discourse TreeBank 2.0. Moreover, we look into hierarchies local discourse relations can form. This research was previously published in Poláková and Mírovský (2020). In this new version, we reflect on some feedback by our readers, explain our previous findings in more detail and by more examples, give original Czech annotated examples where they were originally left out due to space limitations, and, most importantly, extend our analysis of hierarchies of nested relations by an analogous one for the English annotations in the PDTB 3.
• The analysis of paragraph-initial connectives, relations they anchor and their respective discourse types (senses). This analysis is intended to reveal any possible differences between (the majority of) relations within individual paragraphs and those cross-paragraph relations in our data that have an explicit marking by a connective.
• The analysis of large arguments, more precisely, of relations with one or both arguments larger than a single sentence, and their properties.
The article is structured as follows. In the second part of the Introduction, we specify the extent of our study and define types of discourse relations with respect to their scope in the text structure. Related research is mentioned in Section 2. Section 3 describes the data format and application framework used in our study and presents the main data resources. Section 4, with four subsections, represents the core of the research and offers results of our study in four directions: (i) configurations of pairs of discourse relations, (ii) hierarchies of nested discourse relations, (iii) paragraph-initial senses and connectives, and (iv) relations with large arguments. Section 5 summarizes our findings and offers some perspectives. Examples of texts with deep hierarchies of discourse relations can be found in two Appendices.

Theoretical Aspects and Definitions
For the descriptive purposes of this study, we need to terminologically distinguish several types of coherence. Global coherence of a text refers to the coherence of a given document as a whole, including all inner structure of its coherence relations of smaller and greater units. This inner structure is assumed to form hierarchies of smaller and larger relations. On the contrary, local coherence refers to the "flat", chain-like coherence of two minimal subsequent text units, as defined most elaborately in the PDTB 2 approach mentioned above. Further, we distinguish coherence on a higher level, or the so-called higher text structure, which describes coherence relations between/among larger text blocks, but not necessarily in a whole document. Typically, a distinctive higher structure in a text is formed by (multi-sentential) lists and enumerations, and it often relates to genre rules or practices (e.g. in legal texts, instructions, recipes). Independently, we also distinguish inter- or cross-paragraph coherence, that is, coherence relations between individual paragraphs as integral units but also between smaller units belonging to two different paragraphs. Coherence relations within individual paragraphs are then again referred to as local relations, and they can be either intra- or inter-sentential, or belong to a higher structure, if made up of larger units (more sentences).
Obviously, these categories are not rigorous and mutually exclusive, as the very purpose of this study, looking for global features in locally annotated data, already suggests. This gives rise to questions such as how to approach the relation of the last sentence of one paragraph to the immediately following first sentence of the next paragraph, or how to treat intra-sentential lists spanning several paragraphs. We will address these and further related issues in Section 4.
A note should be made to the use of the term hierarchy in this study. Whereas we explore the hierarchical organization of discourse relations as wholes, there also is another understanding of hierarchy in discourse, in terms of the ways discourse units organize themselves based on criteria such as their relative importance (nuclearity in Rhetorical Structure Theory). We do not address the latter here.

Related Research
From the wide range of coherence theories which address a text/document as a structured whole and offer some type of a formalized representation of this whole (e.g. Grosz and Sidner, 1986; Hobbs, 1979; Asher and Lascarides, 2003), 2 our approach is methodologically closest to the Penn Discourse Treebank and Rhetorical Structure Theory (RST), which are introduced in more detail in the following sections.

Rhetorical Structure Theory
The Rhetorical Structure Theory (RST, Mann and Thompson, 1988) is one of the most influential frameworks among the global coherence models. It was originally developed with the intention to model text coherence in order to study computer-based text generation. The main principle of RST is the assumption that coherent texts consist of minimal units, which are linked to each other, recursively, through rhetorical relations and that coherent texts do not exhibit gaps or non-sequiturs (Taboada and Mann, 2006). The RST represents the whole text document as a single (projective) tree structure. Basic features of these structures are the rhetorical relations between two textual units (smaller or larger blocks that are in the vast majority of cases adjacent) and the notion of nuclearity. For the classification of RST rhetorical relations, a set of labels was developed, which originally contained 24 relations, but the authors themselves add that it is an open set "susceptible to extension and modification for the purposes of particular genres and cultural styles" (Mann and Thompson 1988, p. 250). The type of a rhetorical relation is defined with respect to the author's intended effect on the reader together with the application of the principles of nuclearity. The RST has gained great attention: it was further developed and tested, and language corpora were built with RST-like discourse annotation, such as, for English, the RST Discourse Treebank (RST-DT, Carlson et al., 2003) and its extension RST-Signalling Corpus (Das et al., 2015), and corpora for German (Stede and Neumann, 2014), Spanish (Da Cunha et al., 2011), Basque (Iruskieta et al., 2013) and also for some other languages.
2. For a detailed classification of discourse approaches within the local/global space of coherence relations see Bateman and Rondhuis (1997).
On the other hand, the framework was repeatedly criticized with regard to some of its theoretical claims, above all concerning the adequacy/sufficiency of representing a discourse structure as a tree graph. 3 Linguistically, the strong constraints of a tree (no crossing edges, one root, all the units interconnected etc.) gave rise to a search for counter-examples in real-world texts. It was shown that not only adjacent text units exhibit coherence links and that there are even cue phrases that connect non-adjacent units and thus support the claim that a tree graph is too restricted a structure for an adequate discourse representation. Therefore, more complex graphs with crossings and overlaps should be adequate, which resulted in the creation of the Discourse GraphBank, a resource of the same texts annotated with the main diverging principle of relaxing the tree-ness constraint.
On these grounds and in the same way as Lee et al. (2006), we try to demonstrate where the local and global analytic perspectives meet and interact. Our analysis of Czech data thus contributes new empirical material to the scientific debate on whether projective trees are descriptively adequate to represent the structure of texts (e.g. Marcu, 2000; Egg and Redeker, 2010). Additionally, there have been some RST-based studies investigating the distributions of rhetorical relations on different levels of text, e.g. Williams and Reiter (2003), Liu and Zhang (2016). Their research questions are similar to ours in the section on paragraph-initial relations and cues (4.3). The latter study introduces an RST-tree conversion to dependency trees with different levels of granularity, with a node representing a clause, a sentence or a paragraph. They point out that most rhetorical relations in the RST-DT treebank occur on all levels but with different percentages, and that the two upper levels of discourse processes are more alike in many aspects; these observations also follow from our local data analyses (4.3).

Penn Discourse Treebank
The Penn Discourse Treebank (PDTB) represents a lexically based, local model of discourse. Its analysis of discourse relations consists primarily in finding and analyzing lexical cues of discourse coherence as "anchors" of discourse relations. Such a cue, a discourse connective, is defined as a discourse-level predicate opening positions for two discourse arguments. Discourse connectives include coordinating conjunctions (e.g. and, but), subordinating conjunctions (because, if) and discourse adverbials (nevertheless). Apart from connectives, the two discourse arguments of a discourse relation (and their extent) and the semantic type (sense) of a discourse annotation were annotated. In 2004, the first version of Penn Discourse Treebank was released (Miltsakaki et al., 2004). The second release of the PDTB four years later includes annotation of the ca. 49,000 sentences of the Wall Street Journal part of the Penn Treebank. Apart from explicit connectives, other phenomena have been annotated in this version, mainly implicit relations and attribution (attributing beliefs and assertions to agents making them), alternative lexicalization of a connective (AltLex, e.g. that is why), entity based relations (EntRel) and places with weak coherence (NoRel). In its latest version, PDTB 3.0 (Prasad et al., 2018; Prasad et al., 2019), the annotations were enriched by many relations mostly in intra-sentential contexts, the sense taxonomy was revised and the existing annotations enhanced in many aspects.
The PDTB-style connective/argument analysis has become very popular, also because such an analysis requires less interpretation and pragmatic inference than the RST analysis. The PDTB authors also claim that their approach is theory-neutral, independent from any syntactic theory, and as such can be transferred to other languages.
In the first part of our analysis, on configurations of positionally close relation pairs (4.1), we relate our findings to similar research conducted on the second version of the English Penn Discourse Treebank in Lee et al. (2006). The aim of their study was to describe and quantify the configurations of discourse relations as typical or less typical in terms of discourse complexity. The complexity of discourse configurations is compared there to the complexity of relations in syntax, but the study also refers to the principles of the global Rhetorical Structure Theory, in particular the representation of any text document as a single tree-like structure with strong constraints. The study in fact tries to answer the question whether their empirical locally annotated data would fulfill or violate these strong constraints. Therefore, Lee et al. (2006) studied various types of overlaps of discourse relations. They encountered a variety of patterns between pairs of discourse relations, including nested (hierarchical), crossed and other non-tree-like configurations. Nevertheless, they conclude that the types of discourse dependencies are highly restricted since the more complex cases like crossings and partial overlaps can be factored out by appealing to discourse notions like anaphora (non-structural, possibly long-distance tie) and attribution (attribution span as a known mismatch between the syntactic and discourse structure), and argue that pure crossing dependencies, partially overlapping arguments and a subset of structures containing a properly contained argument should not be considered part of the discourse structure. The authors challenge Czech discourse researchers to introduce a similar study (footnote 1 in their paper, ibid.) in order to observe and compare the complexity of discourse and syntax dependencies in two typologically different languages. 4

Data and Tools
Our study is based primarily on two data resources: for Czech, the Prague Discourse Treebank 2.0 (PDiT; Rysová et al., 2016; Zikánová et al., 2015), 5 and, for English, the Penn Discourse Treebank ver. 3.0 (PDTB; Prasad et al., 2019). These two corpora are comparable in size (approx. 50 thousand sentences each) and genre (journalistic texts), and they have both been manually annotated for discourse relations in comparable approaches (PDTB and the PDTB-like modification in the PDiT). Namely, for explicit relations, all connectives in a given text were identified, their two arguments were detected and a semantic/pragmatic relation 6 between them was assigned to the relation. Discourse connectives in PDiT 2.0 are of two types: primary and secondary, according to Rysová and Rysová (2014). Primary connectives are defined as grammaticalized expressions such as because or therefore, whereas secondary connectives are not (yet) fully grammaticalized structures such as except for this, the reason was or for this reason. The notion of secondary connectives in the Prague treebank roughly corresponds to the categories of AltLex and AltLexC (alternative lexicalization tokens and alternative lexicalization constructions) in PDTB 3.
4. From this viewpoint, we do not assume substantial differences between the two languages in discourse structure itself. However, there are surely larger differences on "lower" levels of linguistic description, in our case most visibly connective repertoires and syntactic properties of the two languages, which can in the outcome influence some motivations in the annotation design and thus affect the resulting discourse structure, compare Mírovský et al. (in prep.).
5. Please note that the discourse annotation in PDiT 2.0 is the same as in PDT 3.5 (Hajič et al., 2018) and the upcoming PDT-C 1.0, which are newer versions of the PDiT 2.0 data.
Table 1: Semantic types of discourse relations in the PDiT 2.0
CONTRAST: confrontation, opposition, restrictive opposition, pragmatic contrast, concession, correction, gradation
EXPANSION: conjunction, conjunctive alternative, disjunctive alternative, instantiation, specification, equivalence, generalization
CONTINGENCY: reason-result, pragmatic reason-result, explication, condition, pragmatic condition, purpose
TEMPORAL: synchrony, precedence-succession
The PDTB and PDiT use similar sets of discourse relation types: the set of discourse types in PDiT is inspired by the Penn Discourse Treebank 2.0 sense hierarchy, see the complete list in Table 1. The taxonomy is two-level, but in the annotation only the second-level relations can be assigned. Confrontation in the Czech taxonomy corresponds to the original juxtaposition in the PDTB 2 (mostly represented by connectives like while, whereas), restrictive opposition also includes exception (only, except that...), correction corresponds to substitution (not A, but B, instead), and explication is giving evidence by other means than reason-result. 7 Pragmatic labels are used in accordance with the PDTB 2 pragmatic labels.
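The two-level organization of Table 1 can be written down as a simple lookup structure. The sketch below is an illustration under our own assumptions: the grouping of types under the four classes follows our reading of Table 1, and the names `PDIT_TAXONOMY`, `CLASS_OF_TYPE` and `top_level_class` are hypothetical, not part of any PDiT tool.

```python
# First-level classes and their second-level discourse types (after Table 1).
PDIT_TAXONOMY = {
    "CONTRAST": [
        "confrontation", "opposition", "restrictive opposition",
        "pragmatic contrast", "concession", "correction", "gradation",
    ],
    "EXPANSION": [
        "conjunction", "conjunctive alternative", "disjunctive alternative",
        "instantiation", "specification", "equivalence", "generalization",
    ],
    "CONTINGENCY": [
        "reason-result", "pragmatic reason-result", "explication",
        "condition", "pragmatic condition", "purpose",
    ],
    "TEMPORAL": ["synchrony", "precedence-succession"],
}

# Reverse index: second-level type -> first-level class.
CLASS_OF_TYPE = {
    discourse_type: top_class
    for top_class, types in PDIT_TAXONOMY.items()
    for discourse_type in types
}


def top_level_class(discourse_type: str) -> str:
    """Return the first-level class of an assignable (second-level) type.

    First-level labels such as "CONTRAST" are rejected, mirroring the rule
    that only second-level relations can be assigned in the annotation.
    """
    if discourse_type not in CLASS_OF_TYPE:
        raise ValueError(f"not an assignable discourse type: {discourse_type!r}")
    return CLASS_OF_TYPE[discourse_type]
```

For the relations of Example 1, `top_level_class("concession")` yields "CONTRAST", while `top_level_class("condition")` and `top_level_class("reason-result")` both yield "CONTINGENCY".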
PDiT 2.0 contains 21 223 explicit discourse relations (i.e. relations signalled by explicit discourse connectives, both primary and secondary), 8 while the PDTB 3 contains 25 696 such relations (we take into account relations of type Explicit, (AltLex) and AltLexC). 9 Both PDiT and the PDTB are available in the Prague Markup Language format (PML; Pajas and Štěpánek, 2008), 10 an XML-based format and application framework designed for multilayer linguistic annotations with available tools allowing for complex linguistic studies over PML data: btred for scripting in Perl and Prague Markup Language - Tree Query (PML-TQ; Pajas and Štěpánek, 2009) as a graphically oriented querying system.
6. "sense" in the PDTB terminology, "discourse type" in PDiT 2.0
7. strongly represented by a typical Czech connective totiž (actually, since, I mean, as a matter of fact)

Analysis
This section is divided into four subsections according to the research directions in which we address different aspects of a higher/global text structure traceable in locally annotated corpora: (4.1) configurations of pairs of discourse relations, a study carried out on the basis of similar research by the PDTB group and, like that study, contributing to the debate on acceptable "shapes" of an interconnected coherence representation of a text; (4.2) hierarchies of nested discourse relations, a study that reveals whether local discourse relations, without the underlying assumption of hierarchy, also exhibit some type of hierarchical structure; (4.3) paragraph-initial senses and connectives, a survey of cross-paragraph coherence in comparison to intra-paragraph settings; and (4.4) relations with large arguments, as these are rather unexpected/unusual in local settings and can denote possible non-elementary units of higher text structuring. 11

Configurations of Close Relation Pairs
For this part of the analysis, and also for the subsequent one described in 4.2, we take advantage of the fact that the corpora used in the PDTB and our studies are comparable in many aspects (as mentioned earlier) and also contain a similar number of annotated discourse relations (approx. 20 thousand). However, it is important to notice that Lee et al. (2006) included also implicit relations in their study and conducted the research on the older version of the PDTB (2.0). The Lee et al. (2006) study defines six basic types of patterns of relation pairs: independent relations, full embeddings, shared arguments, properly contained arguments, pure crossings and partially overlapping arguments. In our analysis, we decided to make a more fine-grained classification of the patterns to cover all possible settings and to get more detailed insight into the configurations, and ended up with 17 categories, compare Table 2. Sometimes we also use different pattern naming (e.g. independent relations vs. adjacency), but for those patterns accounted for in both studies, we state how the names map.
In our data, we have collected and classified types of relative positions of neighbouring discourse relations in the corpus, using a scripting tool btred as a part of an application framework for PDiT-types of corpora (Pajas and Štěpánek, 2008). We define a close relation pair as a pair of discourse relations that are either adjacent (the left argument of one relation immediately follows the right argument of the other relation), or they overlap in one of many possible patterns. We first analyze the detected patterns in PDiT and, next, a specific subsection is devoted to the description of the detected hierarchical structures (4.2). We were able to detect 17 628 close relation pairs in PDiT, and for each such pair, we investigated its pattern, the mutual arrangement of the two relations. In Table 2, these patterns are also graphically illustrated. The table shows figures for all explicit relation pairs and, in brackets, for inter-sentential relation pairs only.
[Table 2 body not recoverable from the source; columns: Line, Frequency in PDiT, Pattern, Visualization]
Table 2: Patterns of adjacent, overlapping or embedded pairs of explicit discourse relations in PDiT 2.0. The column visualization shows relative position of the arguments read from left to right; arguments of one relation are marked with '-', the other one with '='. The total number of such close relations in PDiT was 17 628; 109 of them (0.6%) did not fit the listed patterns. Numbers in brackets mean frequencies if only inter-sentential discourse relations (arguments in different sentences) were taken into account; in total, there were 2 984 such close relations in PDiT, 85 of them (2.8%) did not fit the listed patterns.
8. We exclude the 361 list relations from this study, i.e. relations between subsequent members of enumerative structures. They represent a special text structure on their own, and as such, in the Prague approach they stand aside the binary discourse relations and discourse type taxonomy.
9. Implicit discourse relations have been annotated on a part of the PDiT 2.0, see Zikánová et al. (2019); this annotation was not taken into account in the present analysis, as the size of the implicit datasets is different at present.
10. The PML is the primary publication format for PDiT. The English PDTB was first transformed to the PML format, for the purposes of studying discourse annotations in a unified format.
11. The first and the second subsections/topics are closely related, as hierarchies are a subset of possible relation configurations (nested relations pattern), but we keep them separate in the study.
The most common settings for close relations in PDiT are pure adjacency (succession) of two relations (line 1: 6 572 cases in PDiT), and "full embeddings", in other words two-level hierarchies, in total 7 134 in PDiT (lines 12 and 13). These configurations (lines 1, 12 and 13) represent together slightly more than 3/4 of all detected patterns in PDiT. They are also referred to as very "normal" structural relationships in Lee et al. (2006, p. 82).
The next-largest group is progress (line 2), a shared argument in the PDTB terminology, with 1 923 instances or 10.9% of all patterns. Lee et al. report 7.5% of this type, which is fairly comparable.
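The proportions quoted above can be re-derived from the raw counts. The following short sketch only repeats the arithmetic with the figures given in the text; the variable and function names are our own.

```python
TOTAL_CLOSE_PAIRS = 17628  # all close relation pairs detected in PDiT

ADJACENCY = 6572        # line 1 of Table 2
FULL_EMBEDDINGS = 7134  # lines 12 and 13 of Table 2
PROGRESS = 1923         # line 2 of Table 2 (shared argument)


def share(count: int, total: int = TOTAL_CLOSE_PAIRS) -> float:
    """Percentage of `count` in `total`."""
    return 100.0 * count / total


# Adjacency plus full embeddings: "slightly more than 3/4" of all patterns.
most_common = share(ADJACENCY + FULL_EMBEDDINGS)

# Progress: reported as 10.9% of all patterns.
progress_share = share(PROGRESS)

print(f"adjacency + full embeddings: {most_common:.1f}%")
print(f"progress: {progress_share:.1f}%")
```

The first value comes out just under 78%, consistent with the statement that adjacency and full embeddings together make up slightly more than three quarters of all detected patterns.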
Total overlap (line 3) is caused by the possibility to annotate two different relations between the same segments for co-occurring connectives, as in because for example or but later. It occurs 51 times (0.3%) in PDiT. This pattern may vary a lot in different annotation schemes, as there may be different approaches to handling co-occurring connectives, or even the possibility of a co-occurrence of an explicit and an implicit relation between the same segments, which is the approach taken by some newer corpora, e.g. by the PDTB in its version 3.
The envelopment (line 15) concerns in the vast majority of cases a non-adjacent (long-distance) relation and another relation placed in the text between the arguments of the non-adjacent relation. The enveloped relation is often a sentence with an inner syntactic structure annotated. The same is often true for hierarchical patterns (lines 12 and 13) and it explains the big difference observed in both corpora between all detected envelopment and hierarchical patterns and only the inter-sentential ones. Linguistically, some of the envelopment cases are sentences headed by two attribution spans (verbs of saying) and some structure in the reported content in between, also cases of two linked reporter's questions in an interview and the inner structuring of the interviewee's answer, but also texts with no striking structural reasons for such an arrangement. In PDiT, envelopment represents ca. 5% of all settings and is certainly worth further investigation.
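The patterns discussed so far (adjacency, progress, total overlap, full embedding and envelopment) can be approximated by comparing argument spans. The sketch below is a simplification under our own assumptions: arguments are modelled as inclusive token spans, the left argument is taken to precede the right one, and the tests only paraphrase the verbal descriptions given here. It is not the btred script used for Table 2 and does not cover the remaining patterns.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index of the last token (inclusive)


class Rel(NamedTuple):
    left: Span
    right: Span


def contains(outer: Span, inner: Span) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end


def extent(rel: Rel) -> Span:
    """Smallest span covering both arguments of a relation."""
    return Span(min(rel.left.start, rel.right.start),
                max(rel.left.end, rel.right.end))


def classify_pair(a: Rel, b: Rel) -> str:
    """Assign one of five coarse patterns to a pair of relations."""
    if a == b:
        # Two relations annotated between the very same segments.
        return "total overlap"
    pairs = ((a, b), (b, a))
    for first, second in pairs:
        # The right argument of one relation is the left argument of the other.
        if first.right == second.left:
            return "progress"
    for outer, inner in pairs:
        # The whole inner relation lies inside one argument of the outer one.
        if contains(outer.left, extent(inner)) or contains(outer.right, extent(inner)):
            return "full embedding"
    for outer, inner in pairs:
        # The inner relation sits between the arguments of a non-adjacent relation.
        ext = extent(inner)
        if outer.left.end < ext.start and ext.end < outer.right.start:
            return "envelopment"
    for first, second in pairs:
        # The left argument of one relation immediately follows the right
        # argument of the other.
        if first.right.end + 1 == second.left.start:
            return "adjacency"
    return "other"
```

For instance, `classify_pair(Rel(Span(0, 9), Span(10, 14)), Rel(Span(0, 4), Span(5, 9)))` returns "full embedding", which corresponds to the if-relation nested in the so-relation of Example 1.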
Patterns with properly contained arguments, either one of the arguments (lines 5, 7, 8 and 9) or both (10 and 11), very often involve "skipping one level" in the syntactic tree of a sentence, see Example 2 12 of the type 8 (containment I). That is, a governing clause (mostly an attribution span) is excluded from the argument, which makes its syntactically dependent clause (mostly a "reported-content argument") a subset of an argument of the other relation, represented (mostly) by a whole sentence. In Example 2, the text span a young researcher works with enthusiasm for science, regardless of salary is a left-sided argument of the but-relation (in bold) and, at the same time, a subset, properly contained, of a larger right-sided argument of the therefore-relation (in italics). The governing clause It is therefore appropriate... the fact that is thus the mismatch: it represents the difference between proper containment and the much more frequent shared argument pattern (line 2).
(2) The gap in the standard of living that appears between the qualified scientific elite and the business sphere, right now, at the beginning of the transformation of the society, will leave traces. It is therefore appropriate to pamper young researchers and not misuse the fact that a young researcher works with enthusiasm for science, regardless of salary. But a person who begins to find his mission in research also starts a family, wants to live at a good place and live with dignity.
Besides the discussed attribution (introductory statements) that has been excluded from the argument, this mismatch in the shared argument extent also arises in the annotation of some secondary connectives realized by whole clauses (like This means that..., Example 3, type 9, containment II). These verb phrases are not treated as parts of any of the arguments they relate to and pose a methodological issue. In the representation in Example 3, the clause This means that is not in italics: it is not considered to be a part of any argument of the left relation, i.e. the relation it anchors.
(3) This brief overview essentially exhausts the areas of notarial activities within the framework of free competition between notaries. This means that in these notarial agendas, the client has the option of unlimited choice of notary at his own discretion, as the notary is not bound to the place of his work when providing these services.
A third setting concerns multi-sentence arguments, where the contained argument is typically a single sentence. Patterns with properly contained arguments (lines 5, 7-11) represent in total 3.6% (638) of all patterns in PDiT.
A (pure) crossing is a setting where the left-sided argument of the right relation comes in between the two arguments of the left relation, compare line 14 in Table 2. Pure crossings violate the RST constraints most visibly, with crossing edges, so the debate on tree adequacy often circles around the acceptability of crossings in discourse analysis. Lee et al. (2006) identify 24 cases (0.12%) of crossings in the PDTB2. We detected only 10 such cases in PDiT, which is a very small proportion. Manual inspection of the cases of crossing revealed several different scenarios: clearly incorrect annotation, cases allowing more than one interpretation, cases with attribution spans in between, and a few, in our opinion, perfectly sound analyses, as exemplified by Example 4 from PDiT. If we accepted the possibility that not only (b), but a larger (b+c) unit relates to (d) in the also-relation, which would be a completely fine interpretation in the Prague annotation, the relation of (e), the neither-relation, in our view still cannot accept just (d) as its left-sided argument. We also think this case cannot be factored out due to anaphora. There is, for sure, room for different interpretations within different theories; we just offer our data, state our view and admit that crossing structures are extremely rare even in our empirical data.
Partial overlap is a type of structure that violates the RST tree constraint, too. Lee et al. (2006) could only find 4 such cases. We detected 11 cases in PDiT (lines 16 and 17 of Table 2). They often include large arguments of untypical range (2.5 sentences etc.), which can be questioned. Some of the relations also include secondary connectives with strong anaphoric links (in this respect, given the fact that etc.). These relations can be factored out, yet, again, even among the small number of cases in PDiT there were linguistically acceptable ones, compare Example 5 (partial overlap II).
(5) The responsibility of the future tenant of this 103,000-square-metre area will be to care for all properties, including their maintenance and repairs. The tenant will also have to resolve the parking conditions for market visitors and to meet the conditions of the Prague Heritage Institute during construction changes due to the fact that the complex is a cultural monument. The capital city at the same time envisages preserving the character of the Holešovice market.
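The pairwise configurations discussed in this section can be illustrated with a small sketch. It is not the script used for the study: the span representation (inclusive unit indices), the names Relation, inside, extent and classify, and the exact tests are assumptions made for illustration, and the definitions are simplified with respect to Table 2 (containment and partial overlap are not distinguished and fall under "other").

```python
from typing import NamedTuple, Tuple

Span = Tuple[int, int]  # inclusive (first_unit, last_unit); hypothetical unit indices


class Relation(NamedTuple):
    arg1: Span  # left-sided argument
    arg2: Span  # right-sided argument (assumed to follow arg1 without overlap)


def inside(inner: Span, outer: Span) -> bool:
    """True if the span `inner` lies within the span `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def extent(rel: Relation) -> Span:
    """Whole stretch of text from the start of arg1 to the end of arg2."""
    return (rel.arg1[0], rel.arg2[1])


def classify(a: Relation, b: Relation) -> str:
    """Simplified classification of a pair of relations (assumed definitions)."""
    if a.arg1 == b.arg1 and a.arg2 == b.arg2:
        return "total overlap"
    # The relation starting earlier (or, on a tie, reaching further) is "left".
    left, right = sorted((a, b), key=lambda r: (r.arg1[0], -r.arg2[1]))
    if left.arg2 == right.arg1:
        return "shared argument"
    if inside(extent(right), left.arg1) or inside(extent(right), left.arg2):
        return "full embedding"
    # Text between the two arguments of a non-adjacent left relation.
    gap = (left.arg1[1] + 1, left.arg2[0] - 1)
    if gap[0] <= gap[1]:
        if inside(extent(right), gap):
            return "envelopment"
        if inside(right.arg1, gap) and right.arg2[0] > left.arg2[1]:
            return "crossing"
    if extent(right)[0] == extent(left)[1] + 1:
        return "adjacency"
    return "other"
```

For instance, under these assumptions, a relation between units 1 and 3 and a relation between units 2 and 4 are classified as a crossing, because the left-sided argument of the right relation lies between the two arguments of the left relation.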

Hierarchies
The purpose of looking for hierarchical structures in the locally annotated data is to discover to what extent such an annotation shows signs of some higher (global) structure, too. We do not claim that the trees detected by us are the trees a global analysis like RST would discover, but we demonstrate the existence of some hierarchical text structure in local annotation. Some of it could perhaps partially match RST-formed subtrees (and definitely there would be an intersection of separate relations, compare e.g. intersections in Wall Street Journal local and global annotations in Poláková et al., 2017), but this is yet to be investigated. We are also aware, as pointed out in Egg and Redeker (2010), that minimal, local annotations do not normally form a connected graph. 14
13. The "also-not" connective is originally in Czech ani, in the meaning of neither. Lit. translation: "Neither here is_concerned a small portion..."

HIERARCHIES IN THE PDIT 2.0
For the study of hierarchical structures, we used pairs of nested relations where one of the relations is as a whole included in one argument of the other relation (patterns corresponding to lines 12 and 13 from Table 2) to recursively construct tree structures out of pairs of the nested relations. Based on the quantitative results, we inspected selected samples of the detected patterns manually, in order to check the script outcome and to provide a linguistic description and comparison. The results on the whole PDiT data are displayed in Table 3, arranged according to the scheme of such hierarchy trees (identical structures are summed and represented by the hierarchy scheme). We only mention cases where there are at least three levels in the tree, as two-level hierarchies are part of Table 2. For explanation: The scheme "A ( B )" means that the whole relation B is included in one of the arguments of the relation A (this is, of course, only a two-level tree). The scheme "A ( B C ( D ))" means that relations B and C are all included in the individual arguments of relation A (without specifying in which argument they are, so they can be both in one argument or each in a different argument) and the relation D is completely included in one of the arguments of relation C. It is a three-level hierarchy. Generally, we count the depth (number of levels) of a hierarchy tree as a number of nodes in the longest path from the root to a leaf.
14. And the more so, as we do not include implicit and entity-based relations into our study.
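The construction just described can be sketched as follows. This is only a minimal illustration, not the script used for the study: the function names (build_forest, scheme, depth) and the input format (a list of (outer, inner) relation labels) are assumptions, the pairs are assumed to be either direct nestings or their transitive closure, and children are ordered alphabetically here rather than by their position in the text.

```python
from collections import defaultdict


def build_forest(nested_pairs):
    """Build hierarchy trees from (outer, inner) pairs of nested relations."""
    ancestors = defaultdict(set)
    nodes = set()
    for outer, inner in nested_pairs:
        ancestors[inner].add(outer)
        nodes.update((outer, inner))
    children = defaultdict(list)
    roots = []
    for node in sorted(nodes):
        if not ancestors[node]:
            roots.append(node)  # not nested in any other relation
        else:
            # Direct parent: the enclosing relation that is itself nested deepest.
            parent = max(ancestors[node], key=lambda a: len(ancestors[a]))
            children[parent].append(node)
    return roots, children


def scheme(node, children):
    """Render a tree in the bracket notation, e.g. 'A ( B C ( D ))'."""
    kids = children.get(node, [])
    if not kids:
        return node
    inner = " ".join(scheme(kid, children) for kid in kids)
    closing = ")" if inner.endswith(")") else " )"
    return f"{node} ( {inner}{closing}"


def depth(node, children):
    """Number of nodes on the longest path from `node` down to a leaf."""
    kids = children.get(node, [])
    return 1 + max((depth(kid, children) for kid in kids), default=0)


pairs = [("A", "B"), ("A", "C"), ("C", "D"), ("A", "D")]
roots, children = build_forest(pairs)
schemes = [(scheme(root, children), depth(root, children)) for root in roots]
```

With the toy pairs above, the only root is A, and the pair (A, D) is recognized as an indirect nesting, so D is attached under C. Because only roots are rendered, every hierarchy is reported in its largest form and its sub-hierarchies are not listed separately.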
There are many sub-hierarchies in a large/deep hierarchy, for example "B ( C ( D ))" is a sub-hierarchy of "A ( B ( C ( D )) E )"; such sub-hierarchies, however, are not counted in Table 3, i.e., each hierarchy is only counted in the table in its largest and deepest form as it appeared in the PDiT 2.0 data. 15 In the PDiT data, local discourse relations form hierarchies up to five levels. We have identified 5 patterns of 5-level hierarchies (5-LH), with the total of 6 instances, see Table 3. There is also a number of 3- and 4-level hierarchies. An analysis of random samples (and of all the deepest ones) revealed, surprisingly, that there may be a 4-level hierarchy spanning 11 sentences, but also a 5-LH spanning only two sentences, of which one is typically a more complex compound sentence. The "longest" of the 5-LHs also includes 11 sentences (line 14) and it also exhibits branching (D, G and K as leaves, where the G-path is the deepest). One of the 5-LHs should in fact be one level flatter (line 12), as the lowest two relations are three coordinated clauses with two and-connectives: "the troops protected them and fed them and gave them the impression that they were invulnerable...". Such structures are notoriously hard to interpret for any framework, yet in the Prague annotation, the annotation is incorrectly hierarchical where it should have been flat.
For a better illustration of the hierarchical configuration of the relations, a text sample with a 4-LH is analyzed in Appendix 1 to this study. It covers one whole paragraph and a part of a preceding one (11 sentences). Its pattern is A ( B ( C ( D ) E ) F G ). 16 To find out how much structure is involved only within individual sentences, i.e. how much of sentential syntax forms the hierarchies, in a second phase we filtered out all intra-sentential relations. The numbers in brackets give counts for patterns of hierarchies, if only inter-sentential relations (i.e. arguments in different sentences) are accounted for. 17 The hierarchies of this type are much less frequent and their maximum depth is just 3, which implies that beyond the sentence boundary, local annotation of explicit connectives does not represent hierarchical text structuring very often in the PDiT 2.0.

HIERARCHIES IN THE PENN DISCOURSE TREEBANK 3.0
Having at our disposal the Penn Discourse Treebank 3.0 annotations converted into the Prague PML format (Poláková et al., 2017), we can use the same procedure to search for hierarchical structures also in these English locally annotated data. We take into account relations of the type Explicit, AltLex and AltLexC. The results are summed up in Table 4. Similarly as in the experiment with the Czech data, we have detected hierarchies of relations up to 5 depth levels. Overall, their counts are slightly lower, e.g. the simplest 3-LH pattern A ( B ( C )) is 232 in comparison to 381 in the Czech data. The maximum depth is also 5 but there is only one such detected hierarchy, compare the last line of Table 4. The source text with all the relations annotated within this hierarchy is presented in detail in the Appendix 2 to this study. The hierarchy spans across 2 paragraphs (7 sentences), and its pattern is A ( B C D ( E ( F ( G )))), which implies that the relations B and C do not take part in the deepest branch of the hierarchy, they are only included in the highest relation A. Further, it can be observed that the two lowest relations, G (the deepest one) and F, are intra-sentential, while the hierarchy crosses the sentence boundary with its relation E, which is the lowest inter-sentential relation. So, only the relations A and D, the arguments of which span across two or more sentences 18 in our understanding contribute in a way to a higher discourse structure. The number of hierarchies formed by inter-sentential relations only is even lower in the PDTB 3, only 7 (compared to 14 in the PDiT 2.0), with the same maximum depth of 3 levels. A hypothesis for the relatively small number of hierarchies built by only inter-sentential relations in both corpora is that only some of the connectives operating at higher discourse levels were identified and annotated as such, some of them were assigned local coherence links due to the minimality principle. 19
(2019) with focus on paragraph-initial connectives and we dedicate the following subsection to this topic.
Table 4: Selected schemes of hierarchies of explicit discourse relations in the PDTB 3.0. Numbers in brackets mean frequencies of hierarchy schemes if only inter-sentential discourse relations were taken into account (no other inter-sentential at least 3-level hierarchies were encountered in the data).
15. This also explains the zeros in Table 3. The empty line in the table suggests that there are more different patterns of hierarchies of the given depth, the same holds also for Table 4.
16. We do not present a 5-LH here, as the largest one is too complex and the smaller ones include two sentences only, so the main structure is syntactic in nature.
17. Please note that hierarchies counted in brackets (without intra-sentential relations, column 3 in Tables 3 and 4) are not a subset of the respective hierarchies from the same table row that also include intra-sentential relations: adding intra-sentential relations into the hierarchies means that a particular hierarchy pattern may change (be enlarged). This happens most clearly in lines 6 and 7 of Table 3.
18. More precisely, one argument in each of them spans two or more sentences.
19. The minimality principle instructs the annotators to mark as an argument as many clauses and/or sentences as are minimally required and sufficient for the interpretation of the relation. It was applied both in PDTB and in PDiT

Paragraph-initial Semantic Types and Connectives
A possibly different scope of connectives in paragraph-initial positions was previously indicated in a study analyzing discourse connectives with anaphoric properties and their ability to relate to a distant, non-adjacent left-sided argument (Poláková and Mírovský, 2019). A special set of cases was defined where the connective actually does not relate to a short non-adjacent segment on its left, but to an adjacent, considerably larger segment of text and is interpretable as a means of higher discourse structuring. From another perspective on paragraph boundary, cross-paragraph discourse relations were recently investigated in Prasad et al. (2017) for implicit relations, leading to the observation that a first sentence in a given paragraph semantically relates to the immediately preceding last sentence of the previous paragraph in only 52% of their sample, with 48% having links to non-adjacent left contexts. Our analysis of the features of cross-paragraph coherence only takes explicit relations into account (including secondary connectives), although we acknowledge that implicit connecting or other signalling is common between paragraphs. We focus on the following issues: (i) Is there a difference between the semantics of discourse relations that ensure continuity between paragraphs on one side, and relations within an individual paragraph on the other? Are certain relations more typical for cross-paragraph coherence, while others are typically local? (ii) Are discourse connectives in the cross-paragraph usage different from locally used connectives?
Because of limitations of the current version of the conversion of the PDTB data to the PML format, in this subsection and also in the study of paragraph-initial connectives in the subsequent subsection we only examined texts of the Prague Discourse Treebank 2.0 (as characterized in Table 5), via PML-TQ queries and using information about paragraph numbers available at roots of the deep-syntactic trees. 20 In terms of the higher structure and local relations in a text, three sets of relations can be distinguished. The first are the relations between paragraphs, or cross-paragraph relations, as the top set (1). The rest, i.e. intra-paragraph relations, can be divided into inter-sentential (2) and intra-sentential (3); the differences between these groups are mainly due to the syntactic structure. In our analysis, we want to focus specifically on the relations between paragraphs (1). These relations do not reflect syntactic structure that much, which is why we compare them mainly to the separate group of intra-paragraph inter-sentential relations (2). However, in order to keep the picture of discourse relations complete, we also include a comparison with (intra-paragraph) intra-sentential relations (3). Thus, we focused on three datasets. For the cross-paragraph relations (dataset 1), we analyzed explicit discourse relations which connect the first sentence of the paragraph to any previous text. As for local relations (2), the set in question concerns explicit inter-sentential discourse relations that do not go beyond the scope of one paragraph. The relations in the set (3) connect arguments within one syntactic tree (sentence), i.e. they cross neither the sentence boundary, nor the paragraph boundary. 22 The distribution of semantic types of discourse relations across all the datasets differs significantly, see Table 6 and the graph in Figure 2; the significance was verified using the χ² test.
19 (continued). ... annotations. The PDTB moreover annotates supplementary material to an argument, where needed.
20. Information about paragraph numbers in a direct form of attribute para_no is only available in the PDiT data accessible via the PML-TQ search engine and is taken from identifiers (attribute id) of t-roots, i.e. roots of the tectogrammatical trees.
21. The average length of a paragraph is counted after the exclusion of headings, captions and metatext, i.e. 44 979 / 11 643.
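The division into the three datasets can be made concrete with a small sketch. It is not the PML-TQ query used for the study: the Sentence record, its field names and the function dataset are hypothetical, and each argument is reduced to the single sentence in which it is anchored.

```python
from typing import NamedTuple, Optional


class Sentence(NamedTuple):
    sent_id: int        # running index of the sentence in the document (assumed)
    para_no: int        # paragraph number, in the spirit of the para_no attribute
    first_in_par: bool  # True if the sentence opens its paragraph


def dataset(left: Sentence, right: Sentence) -> Optional[int]:
    """Assign an explicit relation to dataset 1, 2 or 3, or to none of them."""
    if left.sent_id == right.sent_id:
        return 3  # intra-sentential
    if left.para_no == right.para_no:
        return 2  # intra-paragraph, inter-sentential
    if right.first_in_par:
        return 1  # paragraph-initial sentence linked to any previous text
    return None   # crosses a paragraph boundary, but not from a first sentence
```

Under this reading, a relation whose right-sided argument is not in a paragraph-initial sentence but whose left-sided argument lies in an earlier paragraph belongs to none of the three datasets.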
The vast majority of instances (63-68% in all three datasets) belong to three discourse relations, namely conjunction, opposition, and reason-result. In cross-paragraph relations, conjunction comes first, whereas in intra-paragraph inter-sentential relations, opposition is most common. Within intra-sentential relations, conjunction clearly predominates with 42% of occurrences, and the very frequent relation of condition comes third, compare details below.
In the top ten of the cross-paragraph relations, some relations typically occur which are not in the top ten of the intra-paragraph relations, namely specification, generalization, and instantiation. Often, the following paragraphs in our data expand the content of the previous text in this way. On the other hand, we cannot consider the relation of specification as only typical in the higher text structure, since it also occurs in the top 10 intra-sentential relations. We can assume that some structural features of specification in the datasets (1) and (3) differ; these features will be the subject of further research. In the top ten of the intra-paragraph inter-sentential relations (dataset 2), on the other hand, gradation, explication and correction are present, representing probably another way of text progression typical for smaller units. The datasets (1-3) differ not only in proportions of frequent semantic types, but also in those which are the least represented in the respective datasets. We have looked into semantic types of discourse relations which are represented by less than 1% of occurrences in each of the analyzed datasets, see Table 7.
22. Theoretically, there could be cross-paragraph intra-sentential discourse relations, too, such as lists of dependent clauses printed each in a different paragraph, cf. The request will be accepted if the applicant comes from the EU, and if he/she encloses two letters of recommendation. We do not deal with such sentences in this paper since their occurrence is marginal.
Table 6 showing percentages of the semantic types for the three datasets, i.e. for cross-paragraph (first column), intra-paragraph inter-sentential (second column), and intra-sentential discourse relations (third column). In each dataset, all its occurrences sum to 100%.
As can be seen from Table 7, some relations are rare in our data in general, independently of paragraph boundaries and syntactic structure. This is the case of all the pragmatic relations, equivalence and conjunctive alternative. Another group of relations has a low representation in datasets (1) and (2), but they are quite typical for intra-sentential relations (dataset 3). This applies to condition and purpose, see Table 6 and Table 7. We can consider these relations as typically syntactic and local. The relation of correction is usually not used across paragraphs. However, it is typical for both types of intra-paragraph relations. Finally, we can find typical relations of the higher structure in our data (dataset 1), too, which occur in the intra-sentential structure (dataset 3), namely instantiation and generalization.
Let us sum up the observations concerning semantics of discourse relations in the datasets (1-3). Distributions of semantic types in inter-sentential relations (datasets 1 and 2) are quite close to each other, but they differ from intra-sentential relations distinctly. This finding was confirmed by the result of χ² tests, too. In other words, inter/intra-sententiality is a more important feature for the semantics of discourse relations than the presence or absence of paragraph boundary. Nevertheless, cross-paragraph relations (dataset 1) have specific features distinguishing them from intra-paragraph relations (datasets 2 and 3), too. Besides the most frequent relations the distributions of which are very similar for all the three datasets (conjunction, opposition, reason-result), cross-paragraph relations typically express different meanings of expansion (generalization, specification, instantiation).
On the other hand, they rarely express the semantics of correction, which is typical for intra-paragraph relations (datasets 2 and 3); moreover, some typical intra-sentential relations, such as condition, purpose, disjunctive alternative and synchrony, hardly ever occur across paragraphs in our data.
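A comparison of distributions with a χ² test, as mentioned above, can be sketched as follows. The exact test variant and software used for the study are not stated, so this is only an assumed, textbook version (Pearson's χ² statistic on a contingency table of semantic types by datasets); the function name chi_square is hypothetical and the counts in the usage note are invented toy numbers, not values from Table 6.

```python
def chi_square(table):
    """Pearson's chi-square statistic and degrees of freedom.

    `table` is a list of rows (e.g. semantic types), each row a list of
    observed counts per column (e.g. datasets). All row and column totals
    are assumed to be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the distributions did not differ across columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Invented toy counts: two semantic types in two datasets.
toy_stat, toy_dof = chi_square([[10, 20], [20, 10]])
```

For the toy table, every expected count is 15 and the statistic is 20/3 with one degree of freedom, which exceeds the usual 0.05 critical value of about 3.84; on real data the statistic would be compared with the critical value for the corresponding number of degrees of freedom.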

DISCOURSE CONNECTIVES IN PARAGRAPH-INITIAL POSITIONS
We also addressed the question of the differences in the use of discourse connectives across paragraphs and in local relations. We focused on the three most common relations, conjunction, opposition and reason-result, and the connectives by which these relations are expressed (see Table 8).
The analysis showed that the relation of conjunction is often expressed by specific discourse connectives in the cross-paragraph context, namely with the adverb dále [further, next] and the particle také [too, also]. The basic and most frequent discourse connective for conjunction a [and] is much more frequently used in the datasets (2) and (3), i.e. in the intra-paragraph structure. Further, in the intra-paragraph coherence, discourse connectives based etymologically on a prepositional phrase with an anaphoric element are common, cf. přitom [at the same time, lit. by that], and its intra-sentential relative variant přičemž [and/while, lit. by which]. Discourse connectives based on relatives are not used in the inter-sentential structures. Discourse connectives for opposition are almost identical for cross-paragraph and intra-paragraph inter-sentential relations (datasets 1 and 2). A typical discourse connective in both these groups is však [however], whereas intra-sententially (dataset 3), opposition is predominantly expressed by the conjunction ale [but]. In dataset (3), multi-part discourse connectives are used (with a non-autonomous part sice [sure/true]) which are not used inter-sententially.
Similarly, discourse connectives used for reason-result are very close in the datasets (1) and (2). Again, a specific set of discourse connectives is used in the intra-sentential context (dataset 3). Moreover, these two larger groups (inter-sentential datasets 1 and 2, and the intra-sentential set 3)
Generally, there is a certain difference among the observed datasets in the usage of secondary connectives. Whereas in intra-sentential relations (dataset 3) secondary connectives do not occur in high positions of the table, they are fairly frequent in dataset 2 (dodal [he added]) and even more so in dataset 1 (dodal [he added], v tomto směru [in this sense]).
To sum up, discourse connectives in the cross-paragraph and local coherence overlap to a large extent; nevertheless, paragraph-initial discourse connectives still have some special features. Some of them are based on the inter-sentential character of cross-paragraph relations and they are common for datasets (1) and (2). Thus, discourse connectives in these groups do not include relative elements or typical intra-sentential expressions (sice [true/sure]). Further, in the inter-sentential context, discourse connectives expressing result are used more often than connectives of reason. The latter are, on the other hand, more frequent in the intra-sentential relations (dataset 3).
Within the group of inter-sentential discourse connectives (datasets 1 and 2), paragraph-initial connectives differ from intra-paragraph connectives in certain aspects. For the relation of conjunction, the specific discourse connective dále [further, next] is frequently used. The proportion of secondary discourse connectives is higher in this group than in datasets (2) and (3), too.
Table 9: Overview of argument sizes of inter-sentential discourse relations without lists in PDiT 2.0.

Large arguments
Although the minimality principle was taken into account when annotating argument extents in the PDiT, the annotators could also mark large argument spans and non-adjacent arguments, if justified (compare 4.2). In a way, the minimality principle, if applied thoroughly, can prevent the detection of natural "whole paragraph" to "whole paragraph" relations and the like. In this part of the analysis, we look into cases where marking a large argument took precedence over the minimality requirement; we quantify them and try to find explanations for them. For this study, relations with large arguments are specified as relations where either one or both arguments are larger than one sentence. For the analysis, we only take into account explicit inter-sentential relations and exclude list structures. We also exclude those intra-sentential relations that cross the sentence boundary with a part of one of their arguments. They represent a very specific group (25 instances in the corpus).
In the data of the PDiT 2.0, extents of discourse arguments are given by the tree representation of a sentence (a discourse relation is marked between two tree nodes, roots of two subtrees that in most cases represent the arguments) and further specified by two range attributes, start_range and target_range, which help define more complex cases. It is important to keep in mind that for symmetric relations (i.e. where both arguments are of the same nature and the linear order of the arguments is always the same, as in conjunction, opposition or synchrony), the target_range attribute value always defines the extent of the left-sided argument and the start_range attribute value defines the extent of the right-sided argument. For asymmetric relations like reason-result, in which the arguments can switch the order, start_range and target_range are assigned to each relation individually via semantic definitions of Arg1 and Arg2. Arg1 always has the target_range attribute, regardless of its location. This property of the PDiT annotation needs to be taken into account in the phase of the analysis below where the left or right position of the arguments matters. 23 The proportions of inter-sentential relations with different sizes of arguments are summarized in Table 9. There are 5 966 relations in which both arguments are single sentences in the PDiT 2.0, in other words relations with "small" arguments, and they represent 89% of all inter-sentential relations.
Table 10: Large arguments in PDiT 2.0, measured both for the left-right positions and start-range target-range directions (= semantics of the arguments); "large" means that the given argument spans more than one sentence while the other one (in the same column) only spans one sentence (or its subset); symmetric relations are marked with the light gray background.
The remaining 729 relations, which have at least one large argument (two or more sentences), form 11% of all inter-sentential relations. There are 617 relations with a large left argument and a single-sentence right argument (large - 1), compared to only 92 relations with a large right argument and a single-sentence left argument (1 - large). There are only 20 relations in which both arguments are large (large - large), which is, given the figures for hierarchies, in our opinion a surprisingly low number, negligible in terms of percentage.
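The size classes used here, and the need to separate the left-right view from the start-target view, can be illustrated by a short sketch. The Rel record and the representation of an argument as a tuple of sentence indices are assumptions made for illustration (the actual values of the start_range and target_range attributes are not reproduced); only the four counts are taken from the text above.

```python
from typing import NamedTuple, Tuple


class Rel(NamedTuple):
    sense: str
    target: Tuple[int, ...]  # sentence indices of the target_range argument (assumed)
    start: Tuple[int, ...]   # sentence indices of the start_range argument (assumed)


def left_right(rel):
    """Order the two arguments by their position in the text."""
    left, right = sorted((rel.target, rel.start), key=min)
    return left, right


def size_label(arg):
    """'large' for an argument spanning more than one sentence, else '1'."""
    return "large" if len(set(arg)) > 1 else "1"


def size_class(rel):
    """Size class in the left-right view, e.g. 'large - 1'."""
    left, right = left_right(rel)
    return f"{size_label(left)} - {size_label(right)}"


# Counts reported in the text for inter-sentential relations in PDiT 2.0.
reported = {"1 - 1": 5966, "large - 1": 617, "1 - large": 92, "large - large": 20}
total = sum(reported.values())
with_large = total - reported["1 - 1"]
large_share = round(100 * with_large / total)
```

The reported counts are mutually consistent: 617 + 92 + 20 gives the 729 relations with at least one large argument, which is about 11% of the 6 695 inter-sentential relations.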
The most distinctive finding for large arguments is the huge disproportion of left-sided and right-sided large arguments, or in other words, the great predominance of large left-sided arguments. In a detailed view, the figures in Table 10, the "left - right" section, show that this is the case across almost all discourse types (senses). 24 However, the left-right positions of the arguments are only informative for symmetric relations (with grey background in Table 10); we will elaborate on them further in this section. 25 For asymmetric relations, argument semantics plays a crucial role. There are three exceptions to the tendency of a larger left-sided argument in Table 10: instantiation, explication and specification, where large right-sided arguments are more common. But, if we look at the "start - target" part of the table, which is more informative for these asymmetric relations, we can see that in all these cases, argument semantics goes hand in hand with argument size: these are precisely the relations where the right contexts are represented by those arguments which are easily conceivable as more elaborated, expandable: the argument providing an explanation (large start arg = 15), the argument giving an example (13), the specifying argument (3). 26 For comparison, an analogous, and in terms of numbers much stronger, disproportion is visible in generalization, also an asymmetric relation, with 76 large left contexts which are also the 76 more specific arguments. The figures here fully comply with the intuitive notions of how the semantics of these relations works, e.g. a generalizing, summarizing statement should be generalizing over a large previous text segment.
What could be less intuitive are the sizes of individual arguments in the reason-result relation. For this relatively frequent relation 27 (176 instances with large arguments in total), 155 (88%) have a large "reason" argument but only 20 (11%) have a large "result" argument. The large arguments stand mostly on the left. Thus, semantically, the arguments are in line with the similar explication relation, that means the dominance of large reason/explication arguments, but in terms of location, these arguments stand elsewhere, on the left for reason-result and on the right for explication. A possible explanation can be the different importance of the arguments, in the terms of RST nuclearity, and/or the role of secondary connectives and connective phrases.
For symmetric relations, we do not have a straightforward explanation for the predominance of large left-sided arguments. Only the following assumptions can be suggested: we might evaluate this as an effect of the annotation strategy, given that the connective is mostly a part of the right-sided argument, and so it may seem unnatural to go beyond the strong right boundary of the connective-containing sentence. We may also ask if the minimality principle is applied the same way to left and right contexts. And/or, this phenomenon might be inherent to language. In a similar manner in which anaphora in the language occurs much more often than cataphora, the connectives (some of which are indeed anaphoric, compare Webber et al. (2003), Stede and Grishina (2016) or Poláková and Mírovský (2019)) relate to small or larger previous semantic contents. 28 From a cognitive perspective, this disproportion of argument sizes may be connected with the linear way of text production and also the gradual growth of information received by the reader, and the perspective of the annotator, who may proceed incrementally, like a reader, not knowing about the sizes of any right context. This issue needs further insight, since it may be very important for the understanding of the difference of analytic perspectives in local and global annotation approaches.
25. This also implies that proportions of numbers for symmetric relations in the left-right section and the start-target section of Table 10 are identical; they only reflect the annotation convention that the target argument is always on the left, in other words, the discourse arrow always leads to the left for the symmetric relations.
26. Counts for these three relations are too low to draw any hard conclusion but even the small numbers here support the intuitive claims.
27. Even when its intra-sentential instances are filtered out here, there are no protože [because]-relations included.
28. In our experience, cataphoric connectives are mostly secondary, with a demonstrative element that mostly introduces a dependent clause (e.g. thanks to the fact that...) and so their scope is very narrow. This might be, however, language-specific.

Conclusion
The aim of this study was to determine, using corpus methods, to what extent local annotations of discourse relations make it possible to abstract and describe phenomena of global coherence or higher text structuring. We used the 50 thousand sentences of the Prague Discourse Treebank 2.0 for Czech and the equally sized Penn Discourse Treebank 3 for English. 29 The analysis focused on three main aspects: 1. the "shape" of the text in terms of mutual configuration of close discourse relations (pairwise); 2. cross-paragraph relations, their semantics and the properties of connectives in these relations as opposed to intra-paragraph settings; and 3. the size of arguments (text units) connected by discourse relations. Regarding 1., even the discourse relations annotated in local annotation settings are assumed to form specific patterns, which are inherent in global coherence models like the Rhetorical Structure Theory (RST), but not postulated for local coherence models. This includes recursive hierarchical structuring of smaller and larger relations. Also, the RST model applies strong constraints on the overall document structure, defining it as a (constituency) tree with no crossings and overlaps.
Our analysis of relation configurations further contributes to theoretical coherence-oriented research, in particular by bringing empirical data to the discussion of whether RST tree graphs are adequate (and sufficient) to represent discourse structure: the Czech data available to us support this claim; exceptions are very rare. We have described patterns that are typical (adjacency, progress, hierarchy, etc.), less typical (argument containment patterns, envelopment) and quite rare (total overlap, crossing, partial overlaps, etc.) in our data, and analyzed them linguistically. Further, we have compared our findings to those of a similar study conducted on the English PDTB, version 2 (Lee et al., 2006), learning that the proportions of occurrence of individual patterns roughly correspond in both corpora, although our study distinguishes some more subtle configurations. Frequent patterns in our data comply with the RST tree structure rules. Less frequent patterns in the PDiT mostly deal with inclusion or exclusion of attribution spans, but also with the annotation strategies for secondary connectives in cases where they form a whole clause (It means that...) or they are anaphoric (in this respect). In some rare patterns, where, in our opinion, there is a violation of the tree structure in the sense of RST, we have found a small number of linguistically defensible interpretations that are not to be factored out due to discourse anaphora or attribution, as Lee et al. (2006) suggest.
Next, we have investigated hierarchies built by nested local relations in both Czech and English data. In all properties investigated in this respect, the two corpora are very similar. In both of them, we have even detected 5-level hierarchies, although they are quite rare. In a more detailed perspective, however, much of the structure proved to be intra-sentential: beyond the sentence boundary, local annotation of explicit connectives and AltLexes does not expose hierarchical text structuring very often. Detected hierarchies of inter-sentential relations reached at most 3 levels of depth in both corpora. We do not claim that the trees detected by us are the trees a global analysis like RST would discover, but we demonstrate the existence of some hierarchical text structure in local annotation.
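To make the notion of hierarchy depth concrete, the following minimal Python sketch reads a bracketed hierarchy pattern of the kind given in Appendix 2, A ( B C D ( E ( F ( G )))), and reports the level of each relation and the deepest branch. It is an illustration only, not the procedure used in this study: the function names parse_hierarchy and branch are hypothetical, and the sketch assumes that an opening bracket always belongs to the relation label immediately preceding it.

```python
def parse_hierarchy(pattern: str) -> dict:
    """Map each relation label to its parent label (None for the top relation).

    Assumption: an opening bracket belongs to the label right before it.
    """
    parents = {}
    open_labels = []   # labels whose bracket is currently open
    last_label = None
    for token in pattern.replace("(", " ( ").replace(")", " ) ").split():
        if token == "(":
            open_labels.append(last_label)
        elif token == ")":
            open_labels.pop()
        else:
            parents[token] = open_labels[-1] if open_labels else None
            last_label = token
    return parents


def branch(label: str, parents: dict) -> list:
    """Return the path from the top relation down to the given label."""
    path = [label]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path[::-1]


# Pattern string taken from Appendix 2 (PDTB 3.0, wsj_2431).
PATTERN = "A ( B C D ( E ( F ( G ))))"
parents = parse_hierarchy(PATTERN)
depths = {label: len(branch(label, parents)) for label in parents}
deepest = max(depths, key=depths.get)

print(depths)
print(" > ".join(branch(deepest, parents)))
```

Under these assumptions, relation G sits at level 5 and the deepest branch is A > D > E > F > G, while B and C remain at level 2 directly under A.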
2. In the part of the analysis concerning cross-paragraph phenomena, the distributions of semantic types of discourse relations reveal dominance of three elaborative meanings, namely specification, generalization, and instantiation in relations crossing the paragraph boundary. These relations are not in the top ten most frequent of all other intra-paragraph relations, 30 and our findings only confirm their intuitively perceived large role in text composition and lesser role in syntax. Nevertheless, it was observed that the feature of inter/intra-sententiality is more important for the semantics of discourse relations than the presence or absence of the paragraph boundary. Typical intra-sentential relations, such as condition, purpose, disjunctive alternative and synchrony, almost never occur in cross-paragraph relations in our data.
29. The analysis of English locally annotated data is so far only complementary to the analyses of Czech data but we plan to extend it also to other subtopics in this study in the future.
As for connectives, for the relation of conjunction, there seems to exist a specific discourse connective of cross-paragraph links, the connective dále [further, next]. Distributions of other connectives in frequent relations between and within paragraphs do not vary much, and, counter-intuitively, coordinating conjunctions (and, but) are also fairly well represented in cross-paragraph relations. The proportion of secondary discourse connectives is higher in these contexts.
3. The analysis of argument size examined the hypothesis that in local coherence models, the existence of large arguments (more than one sentence) should be limited by the annotation principle to annotate minimal units. It was discovered that relations with one or both large arguments are indeed not very frequent in PDiT 2.0 (11% of all inter-sentential relations) and that the large argument is on the left in almost 85% of cases, which might be an annotator's bias when proceeding from left (known context) to right (unknown context), and/or it may be an inherent property of texts. It would be interesting to compare this observation to the ways of tree branching in the RST-Discourse Treebank global annotations.
In the future, we plan to further extend our analysis to the PDTB 3 data and we would like to include also implicit relations, entity-based relations and hypophora (relations of question and answer) in both languages. Implicitness is an important feature of inter-sentential discourse relations; therefore, inclusion of implicit relations in the research will enhance the general picture of the distribution of semantic types. Results of this kind can then be used, e.g., for the prediction of the meaning of inter-sentential discourse relations. Furthermore, the role of the typically implicit semantic types, such as instantiation or specification, can be described in detail in this way.
The outcome of the study will be reflected in a future RST-like annotation of Czech. First, the results will be confronted with a real pilot RST analysis on a sample of locally annotated Czech texts with detected hierarchical organization, in order to assess the degree of equivalence of the hierarchical structures. The findings about hierarchies and the large arguments from this study can be further cross-checked with notions like nuclearity and canonical order of discourse arguments to find out more about possible bridging between global and local frameworks. Also, the distribution of semantic labels given the size of text arguments in local annotations can be related to the use of RST labels (e.g. to the division into subject matter and presentational rhetorical relations) and their correspondence in lower and higher structuring can be discussed.
The fact that local discourse annotation in both PDiT and PDTB also displays hierarchical structure (up to 5 levels of depth), while at least the two lowest levels are usually intra-sentential, implies a large role of syntax in discourse complexity. Syntactic hypotaxis/parataxis, but also the (a)symmetry of local discourse labels when related to nuclearity, can be of advantage in a possible automatic RST pre-annotation or in rhetorical parsing.
Last but not least, methodologically, our experiments seem to reveal a lot about annotation strategies and biases: the minimality principle seems to affect the left-sided and right-sided argument sizes very differently, and its consistent application may also hinder the ability of local models to accurately assess coherence of larger blocks. Our results also open space for the hypothesis that the annotation procedure itself, e.g. the order in which individual segments are connected to other segments, may influence the segment size and the hierarchical structure formed.
30. With the sole exception of specification in intra-sentential use, which may be connected to the very frequent nominal right-sided arguments (and governing verb ellipsis) in the PDiT annotation.
Appendix 2: A five-level hierarchy in the Penn Discourse Treebank 3.0
The PDTB 3.0 text segment (wsj_2431, 2 paragraphs, 7 sentences) with a detected 5-level hierarchy of discourse relations. The pattern of the hierarchy is A ( B C D ( E ( F ( G )))).
(26) Then (E), to buttress his credibility with the left, he enticed some smaller leftist parties to stand for election under the PASOK banner.
(27) Next (D), he continued to court the communists - many of whom feel betrayed by the left-right coalition's birth - by bringing into PASOK a well-respected Communist Party candidate.
(28) For balance, and in hopes of gaining some disaffected centrist votes, he managed to attract a former New Democracy Party representative and known political enemy of Mr. Mitsotakis.
(29) Thus (A) PASOK heads for the polls not only with diminished scandal-stench, but also with "seals of approval" from representatives of its harshest accusers.
33. Note that the relations B and C do not take part in the deepest branch of the hierarchy; they are only included in the higher relation A.
Relation A ('thus', Contingency.Cause.Result):
leftarg: (23-28) So it seems that Mr. Mitsotakis and his communist chums may have unwittingly served Mr. Papandreou a moral victory on a platter: PASOK, whether guilty or not, can now traipse the countryside condemning the whole affair as a witch hunt at Mr. Papandreou's expense. But while verbal high jinks alone won't help PASOK regain power, Mr. Papandreou should never be underestimated. First came his predictable fusillade: He charged the Coalition of the Left and Progress had sold out its leftist tenets by collaborating in a right-wing plot aimed at ousting PASOK and thwarting the course of socialism in Greece. Then, to buttress his credibility with the left, he enticed some smaller leftist parties to stand for election under the PASOK banner. Next, he continued to court the communists - many of whom feel betrayed by the left-right coalition's birth - by bringing into PASOK a well-respected Communist Party candidate. For balance, and in hopes of gaining some disaffected centrist votes, he managed to attract a former New Democracy Party representative and known political enemy of Mr. Mitsotakis.
rightarg: (29) PASOK heads for the polls not only with diminished scandal-stench, but also with "seals of approval" from representatives of its harshest accusers

Relation B ('but', Comparison.Concession.Arg2-as-denier):
leftarg: (sub23) PASOK, whether guilty or not, can now traipse the countryside condemning the whole affair as a witch hunt at Mr. Papandreou's expense
rightarg: (24) while verbal high jinks alone won't help PASOK regain power, Mr. Papandreou should never be underestimated

leftarg: (25-26) First came his predictable fusillade: He charged the Coalition of the Left and Progress had sold out its leftist tenets by collaborating in a right-wing plot aimed at ousting PASOK and thwarting the course of socialism in Greece. Then, to buttress his credibility with the left, he enticed some smaller leftist parties to stand for election under the PASOK banner
rightarg: (sub27) he continued to court the communists ... by bringing into PASOK a well-respected Communist Party candidate

Relation E ('then', Temporal.Asynchronous.Precedence):
leftarg: (sub25) He charged the Coalition of the Left and Progress had sold out its leftist tenets by collaborating in a right-wing plot aimed at ousting PASOK and thwarting the course of socialism in Greece
rightarg: (sub26) he enticed some smaller leftist parties to stand for election under the PASOK banner

Relation F ('by', Contingency.Cause.Reason):
leftarg: (sub25) the Coalition of the Left and Progress had sold out its leftist tenets