Evaluation in Discourse: a Corpus-Based Study

This paper describes the CASOAR corpus, the first manually annotated corpus that explores the impact of discourse structure on sentiment analysis with a study of movie reviews in French and in English as well as letters to the editor in French. While annotating opinions at the expression, the sentence or the document level is a well-established and relatively straightforward task, discourse annotation remains difficult, especially for non-experts. Therefore, combining both annotations poses several methodological problems that we address here. We propose a multi-layered annotation scheme that includes: the complete discourse structure according to the Segmented Discourse Representation Theory, the opinion orientation of elementary discourse units and opinion expressions, and their associated features. We detail each layer, explore the interactions between them and discuss our results. In particular, we examine the correlation between discourse and semantic category of opinion expressions, the impact of discourse relations on both subjectivity and polarity analysis and the impact of discourse on the determination of the overall opinion of a document. Our results demonstrate that discourse is an important cue for sentiment analysis, at least for the corpus genres we have studied.


Introduction
Sentiment analysis has been one of the most popular applications of natural language processing for over a decade, both in academic research institutions and in industry. In this domain, researchers analyze how people express their sentiments, opinions and points of view from natural language data such as customer reviews, blogs, fora and newspapers. Opinions concern evaluations expressed by a holder (a speaker or a writer) towards a topic (an object or a person). An evaluation is characterized by a polarity (positive, negative or neutral) and a strength that indicates the degree of positivity or negativity of the opinion. Example (1), extracted from our corpus of movie reviews, illustrates these phenomena 1. In this review, the author expresses three opinions: the first two are explicitly lexicalized opinion expressions (underlined in the example) whereas the last one (in italic) is an implicit positive opinion since it contains no subjective lexical cues.
(1) What a great animated movie. I was so thrilled by seeing it that I didn't move a single second from my seat.
From a computational perspective, most current research examines the expression and extraction of opinion at two main levels of granularity: the document and the sentence 2. At the document level, the standard task is to categorize documents globally as being positive or negative towards a given topic (Turney (2002); Pang et al. (2002); Mullen and Nigel (2004); Blitzer et al. (2007)). In this classification problem, all opinions in a document are supposed to be related to only one topic 3. Overall document opinion is generally computed on the basis of aggregation functions (such as the average or the majority) that take as input the set of explicit opinion scores of a document and output either a polarity rating or an overall multi-scale rating (Pang and Lee (2005); Lizhen et al. (2010); Leung et al. (2011)). At the sentence level, on the other hand, the task is to determine the subjective orientation and then the opinion orientation of sequences of words in the sentence that are determined to be subjective or to express an opinion (Yu and Vasileios (2003); Riloff et al. (2003); Wiebe and Riloff (2005); Taboada et al. (2011)). This second level also assumes that a sentence usually contains a single opinion. To better compute the contextual polarity of opinion expressions, some researchers have used subjectivity word sense disambiguation to identify whether a given word has a subjective or an objective sense (Akkaya et al. (2009)). Other approaches identify valence shifters (viz. negations, modalities and intensifiers) that strengthen, weaken or reverse the prior polarity of a word or an expression (Polanyi and Zaenen (2006); Shaikh et al. (2007); Moilanen and Pulman (2007); Choi and Cardie (2008)). The contextual polarity of individual expressions is then used for sentence as well as document classification (Kennedy and Inkpen (2006); Li et al. (2010)).
We believe that viewing opinions in a text as a simple aggregation of opinion expressions identified locally is not appropriate. In this paper, we argue that discourse structure provides a crucial link between the local and document levels and is needed for a better understanding of the opinions expressed in texts. To illustrate this claim, consider example (2), extracted from our corpus of French movie reviews. (2) contains four opinions: the first three are strongly negative while the last one (introduced by the conjunction but in the last sentence) is positive. A bag-of-words approach would classify this review as negative, which is contrary to intuition for this example.
(2) The characters are unpleasant. The scenario is totally absurd. The decoration seems to be made of cardboard. But, all these elements make the charm of this TV series.
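The failure of purely local aggregation on example (2) can be made concrete with a minimal sketch. The scores below are hypothetical (they are not taken from the corpus annotations), and the two functions only illustrate the "majority" and "average" aggregation strategies mentioned above, not any particular published system.

```python
from statistics import mean

# Hypothetical polarity scores for the four opinions of example (2):
# three negative segments followed by one positive segment.
scores = [-1, -1, -1, +1]

def majority_polarity(scores):
    """Return the polarity held by most scores (ties count as neutral)."""
    positives = sum(1 for s in scores if s > 0)
    negatives = sum(1 for s in scores if s < 0)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"

def average_polarity(scores):
    """Return the polarity given by the sign of the mean score."""
    m = mean(scores)
    if m > 0:
        return "positive"
    if m < 0:
        return "negative"
    return "neutral"

print(majority_polarity(scores))
print(average_polarity(scores))
```

Both functions label the review as negative, because neither has access to the Contrast introduced by "But" in the last sentence.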
Discourse structure can be a good indicator of the subjectivity and/or the polarity orientation of a sentence. In particular, general types of discourse relations that link clauses together like Parallel, Contrast, Result and so on from theories like Rhetorical Structure Theory (RST) (Mann and Thompson (1988)) or Segmented Discourse Representation Theory (SDRT) (Asher and Lascarides (2003)) furnish important clues for recognizing implicit opinions and assessing the overall stance of texts. For instance 4 , sentences related by the discourse relations Parallel or Continuation often share the same subjective orientation like in Mary liked the movie. Her husband too. Here, Parallel (triggered by the discourse marker too) holds between the two sentences and allows us to detect the implicit opinion conveyed by the second sentence. Polarity is often reversed in case of Contrast and usually preserved in case of Parallel and Continuation. Result on the other hand does not have a strong effect on subjectivity and polarity is not always preserved. For instance, in Your life is miserable. You don't have a girlfriend. So, go see this movie, the positive polarity of the recommendation follows the negative opinions expressed in the first two sentences. In case of Elaboration, subjectivity may not be preserved, in contrast to polarity (it would be difficult to say The movie was excellent. The actors were bad). Finally, Attribution plays a role only when its second argument is subjective, as in I suppose that the employment policy will be a disaster. In this case, depending on the reported speech act used to introduce the opinion, Attribution affects the degree of commitment of the author and the holder (Asher (1993); Prasad et al. (2006)).
2. There is also a third level of granularity not detailed here which is the aspect or feature level where opinions are extracted according to the target domain features (Liu (2012)).
3. Of course, this assumption is debatable. For instance in forums, blogs and news, opinions are related to several topics.
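The tendencies just described (polarity usually preserved under Parallel and Continuation, often reversed under Contrast, no reliable default for Result) can be sketched as a small rule table. This is only an illustration under simplifying assumptions: the table and the function name `infer_polarity` are hypothetical, polarity is reduced to +1 or -1, and the rules encode tendencies rather than constraints.

```python
# Illustrative rule table: tendencies, not hard constraints.
POLARITY_EFFECT = {
    "Parallel": "preserve",
    "Continuation": "preserve",
    "Contrast": "reverse",
}

def infer_polarity(relation, first_polarity):
    """Guess the polarity of the second argument of a relation
    from the polarity (+1 or -1) of its first argument."""
    effect = POLARITY_EFFECT.get(relation)
    if effect == "preserve":
        return first_polarity
    if effect == "reverse":
        return -first_polarity
    return None  # e.g. Result: no reliable default

# "Mary liked the movie. Her husband too." with Parallel(1, 2):
# the second sentence inherits the positive polarity of the first.
print(infer_polarity("Parallel", +1))
```

Relations such as Elaboration and Attribution are deliberately left out of the table, since their effect depends on subjectivity and commitment rather than on a simple sign change.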
Discourse-based opinion analysis is an emerging research area (Asher et al. (2008); Taboada et al. (2008, 2009); Somasundaran (2010); Zhou et al. (2011); Heerschop et al. (2011); Zirn et al. (2011); Polanyi and van den Berg (2011); Trnavac and Taboada (2010); Mukherjee and Bhattacharyya (2012); Lazaridou et al. (2013); Trivedi and Eisenstein (2013); Wang and Wu (2013); Hogenboom et al. (2015); Bhatia et al. (2015)). Studying opinion within discourse gives rise to new challenges: What is the role of discourse relations in subjectivity analysis? What is the impact of the discourse structure in determining the overall opinion conveyed by a document? Does a discourse-based approach really bring additional value compared to a classical bag-of-words approach? Does this additional value depend on corpus genre? The CASOAR project (a two-year DGA-RAPID project (2010-2012) involving Toulouse University and the NLP company Synapse Développement) aimed to address these questions by gathering and analyzing a corpus of movie reviews in French and in English as well as letters to the editor in French. It extended our earlier work where Segmented Discourse Representation Theory (SDRT) (Asher and Lascarides (2003)) was used to study opinion within discourse (Asher et al. (2008, 2009)).
Before moving to real scenarios that rely on automatic discourse annotations, we first wanted to measure the impact of discourse structure on opinion analysis in manually annotated data. While annotating opinions at the expression, sentence or document level achieves relatively good inter-annotator agreement, at least for explicit opinion recognition and opinion polarity (Toprak et al. (2010)), annotation of complete discourse structure is a more difficult task, especially for non-experts (Carlson et al. (2003)). Combining opinion and discourse annotations poses several methodological problems: the choice of the corpus in terms of genre and document length, the definition of the annotation model, and the description of the annotation guide so as to minimize errors. A second point was more challenging: what is the most appropriate level to annotate opinion in discourse? Should we annotate opinion texts using a small set of discourse relations? Or should we use a larger set? Should annotators simply be asked to follow their intuitions after having been given a gloss of the discourse relations to be used, or should we provide them with a precise description of the structural constraints of the underlying discourse theory?
We developed a multi-layered annotation scheme that includes: the complete discourse structure according to SDRT, opinion orientation of elementary discourse units and opinion expressions, and their associated features. In this paper, we detail each layer, explore the interactions between them and discuss our results. In particular, we examine: the correlation between discourse and semantic category of opinion expressions focusing on the role of evaluation to identify discourse relations, the impact of discourse relations on both subjectivity and polarity analysis, and the impact of discourse on the determination of the overall opinion of a document. Our results demonstrate that discourse is an important cue for sentiment analysis, at least for the corpus genres we have studied.
The paper is organized as follows. Section 2 gives some background on annotating sentiment and discourse, and provides a brief introduction to SDRT, our theoretical framework. Section 3 presents our corpus. Section 4 details the annotation scheme, annotation campaign, and reliability of the scheme. Section 5 gives our results. We end the paper with a discussion where we highlight the main conclusions of our corpus-based study and discuss the portability and applicability of the annotation scheme.

Existing corpora annotated with sentiment
There are several existing annotated resources for sentiment analysis. Each resource can be characterized in terms of the corpus used, the basic annotation unit and the annotation levels. In this section, we review the main existing resources according to these three criteria.
Compared to English, few resources have been developed for other languages. In French, the Blogoscopy corpus (Daille et al. (2011)) is composed of 200 annotated posts and 612 associated comments. There is also Bestgen et al. (2004)'s dataset composed of 702 sentences extracted from a newspaper 6 . In Spanish, the TASS corpus 7 is composed of 70,000 tweets annotated with global polarity as well as an indication of the level of agreement or disagreement of the expressed sentiment within the content. In German, the MLSA 8 (Clematide et al. (2012)) is a publicly available corpus composed of 270 sentences manually annotated for objectivity and subjectivity. Finally, for Italian, the Senti-TUT corpus 9 includes sentiment annotations of irony in tweets (Bosco et al. (2013)). Multilingual sentiment annotation has also been explored: the EmotiBlog corpus consists of labeled blog posts in Spanish, Italian and English (Boldrini et al. (2012)), and Mihalcea et al. (2007) and Banea et al. (2010) automatically annotated English, Arabic, French, German, Romanian, and Spanish news documents.
In this paper, we aim to annotate opinion in discourse in multi-genre documents (movie reviews and news reactions) in French and movie reviews in English. To our knowledge, no one has conducted a corpus-based study across genres and languages that analyzes how opinion and discourse interact at different levels of granularity (expression, discourse unit and the whole document). Thus, there is almost no extant work against which we can compare our own. Even though several annotation schemes already exist for the expression/phrase level (MPQA, JDPA-corpus, Darmstadt-corpus, MLSA), the descriptive analysis investigating the interaction between sentiment and discourse is novel.

Basic annotation unit
State-of-the-art opinion annotation campaigns take the expression (a set of tokens), sentence or document as their basic annotation unit. However, annotating opinion in discourse requires starting from the elementary discourse unit (EDU), an intermediate level between the sentence and the document. Indeed, the sentence level is not appropriate for analyzing opinions in discourse, since, in addition to objective clauses, a single sentence may contain several opinion clauses that can be connected by rhetorical relations. Moving to the clause level is also not appropriate, since several opinion expressions can be discursively related as in The movie is great but too long where we have a Contrast relation introduced by the marker but. Therefore, we need to move to a fine-grained and semantically motivated level, the EDU.
Annotating EDUs not quite corresponding to either sentences or clauses has been standard in discourse annotation efforts for many years (see Section 2.2 for an overview). However, annotating sentiment within EDUs is still marginal. Among the few annotated sentiment corpora at the EDU level, we cite Asher et al. (2009), who analyzed explicit opinion expressions within EDUs. Somasundaran et al. (2007) used a similar level in order to detect the presence of sentiment and arguing in dialogues. Zirn et al. (2011) performed subjectivity analysis at the segment level. They used a corpus of product reviews segmented using the HILDA tool 11 , an RST discourse parser. Lazaridou et al. (2013) used the SLSeg software package 12 to segment their corpus into EDUs following RST. The corpus was then used to train a joint model for unsupervised induction of sentiment, aspect and discourse information.
In this paper, documents are segmented according to SDRT principles.

Annotation levels
Our annotation scheme is multi-layered and includes: the complete discourse structure, segment opinion orientation, and opinion expressions. At the document level, we propose to annotate the overall opinion of the document as well as its full discourse structure following the SDRT framework. Global opinion annotation resembles previous document-level annotation (Pang and Lee (2004)). To the best of our knowledge, this is the first sentiment dataset that incorporates discourse structure annotation. At the segment level, we propose to associate with each EDU a subjectivity type (among four main types: explicit evaluative, subjective non evaluative, implicit, and objective) as well as polarity and strength. Segment opinion type mainly follows Toprak et al. (2010) and Liu (2012). An expression-level annotation scheme that distinguishes between explicit mentions of private states, speech events expressing private states, and expressive subjective elements has already been proposed. Toprak et al. (2010) distinguished in their annotation scheme (consumer reviews) between explicit opinions and facts that imply opinions. Finally, Liu (2012) has also observed that subjective sentences and opinionated sentences (which are objective or subjective sentences that express implicit positive or negative opinions) are not the same, even though opinionated sentences are often a subset of subjective sentences. In this work, we additionally study the correlations between segment opinion types and the overall opinion on the one hand (cf. Section 5.3.4), and between segment types and rhetorical relations on the other (cf. Sections 5.3.2 and 5.3.3).
The opinion expression is the lowest level and focuses on annotating all the elements associated with an opinion within a segment: (1) the opinion span, excluding operators (negation, modality, intensifier, and restrictor), (2) opinion polarity and strength, (3) opinion semantic category, (4) topic span, (5) holder span, and (6) operator span. Our annotation at this level is very similar to state-of-the-art annotation schemes at the expression level (e.g. MPQA, JDPA-corpus, Darmstadt-corpus, MLSA corpus). However, in addition, we explore the link between discourse and opinion semantic category of subjective segments (cf. Section 5.3.1).
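The six elements annotated at the expression level can be pictured as a single record. The sketch below is purely illustrative: the class name, the field names, the use of character offsets for spans, the integer strength value and the category label are all assumptions, not the storage format actually used for the corpus.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

@dataclass
class OpinionExpression:
    opinion_span: Span                 # (1) opinion words, operators excluded
    polarity: str                      # (2) polarity of the expression
    strength: int                      # (2) hypothetical integer scale
    semantic_category: str             # (3) placeholder label
    topic_span: Optional[Span] = None  # (4) None when the topic is implicit
    holder_span: Optional[Span] = None # (5) None when the holder is implicit
    operator_spans: List[Span] = field(default_factory=list)  # (6)

def span_of(text, phrase):
    """Locate the first occurrence of `phrase` in `text` as a Span."""
    start = text.index(phrase)
    return (start, start + len(phrase))

text = "The movie is really great"
annotation = OpinionExpression(
    opinion_span=span_of(text, "great"),
    polarity="positive",
    strength=2,
    semantic_category="judgment",
    topic_span=span_of(text, "The movie"),
    operator_spans=[span_of(text, "really")],  # intensifier kept out of the opinion span
)

start, end = annotation.opinion_span
print(text[start:end])
```

Keeping operators in their own field mirrors the scheme's decision to exclude negations, modalities, intensifiers and restrictors from the opinion span itself.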

Existing corpora annotated with discourse
The annotation of discourse relations in language can be broadly characterized as falling under two main approaches: the lexically grounded approach and an approach that aims at complete discourse coverage. Perhaps the best example of the first approach is the Penn Discourse Treebank (Prasad et al. (2008)). The annotation starts with specific lexical items, most of them conjunctions, and includes two arguments for each conjunction. This leads to partial discourse coverage, as there is no guarantee that the entire text is annotated, since parts of the text not related through a conjunction are excluded. On the positive side, such annotations tend to be reliable. PDTB-style annotations have been carried out in a variety of languages (Arabic, Chinese, Czech, Danish, Dutch, French, Hindi and Turkish).
Complete discourse coverage requires annotation of the entire text, with most, if not all, of the propositions in the text integrated in a structure. It includes work from two main theoretical perspectives, either intentionally or semantically driven. The first perspective has been investigated within Rhetorical Structure Theory, RST (Mann and Thompson (1988)), whereas the second includes Segmented Discourse Representation Theory, SDRT (Asher and Lascarides (2003)), and the Discourse Graph Bank model (Wolf and Gibson (2006)). RST annotated resources exist in Basque, Dutch, German, English, Portuguese and Spanish. Corpora following SDRT exist in Arabic, French and English.
To get a complete structure for a text, three decisions need to be made:
• what are the elementary discourse units (EDUs)?
• how do elementary units combine to form larger units and attach to other units?
• how are the links between discourse units labelled with discourse relations?
Many theories such as RST take full sentences or at least tensed clauses as the mark of an EDU. SDRT, as developed in (Asher and Lascarides (2003)), was largely mute on the subject of EDU segmentation, but in general also followed this policy. Concerning attachment, most discourse theories define hierarchical structures by constructing complex segments (CDUs) from EDUs in recursive fashion. RST proposes a tree-based representation, with relations between adjacent segments, and emphasizes a differential status for discourse components (the nucleus vs. satellite distinction). Captured in a graph-based representation, with long-distance attachments, SDRT proposes relations between abstract objects using a relatively small set of relations. Identifying these relations is a crucial step in discourse analysis. Given two discourse units that are deemed to be related, this step labels the attachment between the two discourse units with discourse relations such as Elaboration, Explanation, Conditional, etc. For example in [This is the best book] 1 [that I have read in a long time.] 2 we have Elaboration(1, 2). Their triggering conditions rely on the propositional contents of the clauses (a proposition, a fact, an event, a situation), the so-called abstract objects (Asher (1993)), or on the speech acts expressed in one unit and the semantic content of another unit that performs it. Some instances of these relations are explicitly marked, i.e., they have cues that help identify them such as but, although, as a consequence. Others are implicit, i.e., they do not have clear indicators, as in I didn't go to the beach. It was raining. In this last example, to infer the intuitive Explanation relation between the clauses, we need detailed lexical knowledge and probably domain knowledge as well.
In this paper, we aim to annotate the full discourse structure of opinion documents following a semantically driven approach, as done in SDRT.

Overview of the Segmented Discourse Representation Theory (SDRT)
SDRT is a theory of discourse interpretation that extends Kamp's Discourse Representation Theory (DRT) (Kamp and Reyle (1993)) to represent the rhetorical relations holding between EDUs, which are mainly clauses, and also between larger units recursively built up from EDUs and the relations connecting them. SDRT aims at building a complete discourse structure for a text or a dialogue, in which every constituent is linked to some other constituent. We detail below the three steps needed to build this structure, namely: EDU determination, attachment, and relation labelling.

EDU determination
We follow the principles defined in the Annodis project 13. In Annodis, an EDU is mainly a sentence or a clause in a complex sentence that typically corresponds to verbal clauses, as in [I loved this movie] a [because the actors were great] b where the clause introduced by the marker because indicates a cutting point. We have here the relation Explanation(a, b). An EDU can also correspond to other syntactic units describing eventualities, such as prepositional and noun phrases, as in [After several minutes,] a [we found the keys on the table] b where we have two EDUs related by Frame(a, b). In addition, a detailed examination of the semantic behavior of appositives, non-restrictive relative clauses and other parenthetical material in our corpora revealed that such syntactic structures also contributed EDUs 14. Such constructions provide semantic contents that do not fall within the scope of discourse relations or operators between the constituents in which they occur. In Example (3), we see that the apposition in italic font does not, or at least need not, fall within the scope of the conditional relation on a defensible interpretation of the text. Such "nested" EDUs are a useful feature in sentiment analysis as EDUs conveying opinions may be isolated from surrounding "objective" material, as in the movie review in (4). Finally, concerning attributions, we segment cases like "I say that I am happy" into two EDUs: "I say" and "that I am happy".
(3) If the former President of the United States, who has been all but absent from political discussions since the 2008 election, were to weigh in on the costs of the economic shutdown, the radical Republicans might be persuaded to vote to lift the debt ceiling.
(4) [The film [that distressed me the most] is CRY OF FREEDOM].
In addition to this definition, we observed in our corpora that several opinion expressions (often conjoined NP or AP clauses) could be linked by discourse relations. We thus resegment such EDUs into separate units. Annodis segmentation principles were then refined in order to take into account the particularities of opinion texts. For example, the following sentence: [the movie is long, boring but amazing] is segmented as follows: [the movie is long,] 1 [boring] 2 [but amazing] 3 with Continuation(1,2) and Contrast([1,2],3), [1,2] being a complex discourse unit. Even if segments 2 and 3 do not follow the standard EDU definition (they are neither sentences nor clauses), we believe that such fine-grained segmentation will facilitate polarity analysis at the sentence level.
During the annotation of EDUs, we consider that argument naming generally follows the linear order in the text. In case of embedding, the main clause is annotated first. For instance in (4), we have: [The film [that distressed me the most] 2 is CRY OF FREEDOM] 1 .
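The segmentation of "the movie is long, boring but amazing" into three EDUs, with Continuation(1,2) and Contrast([1,2],3), can be written down as a small data structure. This is a minimal sketch under stated assumptions: EDUs are identified by integers, a complex discourse unit is a tuple of its members, and the helper names `members` and `text_of` are hypothetical.

```python
# EDUs indexed in linear order, as in the segmented example.
edus = {1: "the movie is long,", 2: "boring", 3: "but amazing"}

# A complex discourse unit (CDU) is represented here as a tuple of members.
cdu_12 = (1, 2)

# Each relation is (name, first argument, second argument).
relations = [
    ("Continuation", 1, 2),
    ("Contrast", cdu_12, 3),
]

def members(unit):
    """Return the EDU ids covered by an EDU id or a (possibly nested) CDU."""
    if isinstance(unit, int):
        return [unit]
    result = []
    for part in unit:
        result.extend(members(part))
    return result

def text_of(unit):
    """Concatenate the text of all EDUs covered by a unit."""
    return " ".join(edus[i] for i in members(unit))

for name, left, right in relations:
    print(f"{name}: [{text_of(left)}] / [{text_of(right)}]")
```

Making the CDU an explicit argument of Contrast is what lets the relation take scope over both negative segments at once, rather than over "boring" alone.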

Attachment decision
In SDRT, a discourse representation for a text T is a structure in which every EDU of T is linked to some (other) discourse unit, where discourse units include EDUs of T and complex discourse units (CDUs) built up from EDUs of T connected by discourse relations in recursive fashion. Proper SDRSs form a rooted acyclic graph with two sorts of edges: edges labeled by discourse relations that serve to indicate rhetorical functions of discourse units, and unlabeled edges that show which constituents are elements of larger CDUs. SDRT allows attachment between non-adjacent discourse units and multiple attachments to a given discourse unit 15 , which means that the structures created are not always trees but rather directed acyclic graphs. These graphs are constrained by the right frontier principle that postulates that each new EDU should attach either to the last discourse unit or to one that is super-ordinate to it via a series of subordinate relations and complex segments.
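The right frontier principle can be read as a reachability computation: start from the last discourse unit and climb through subordinating relations and CDU membership. The sketch below follows that reading; the graph, the unit names and the encoding of relations as (kind, source, target) triples are hypothetical and chosen only for illustration.

```python
def right_frontier(last, relations, cdu_parent):
    """Return the units open for attachment: the last unit, plus every unit
    reachable from it by going up subordinating relations or CDU membership."""
    frontier = set()
    stack = [last]
    while stack:
        node = stack.pop()
        if node in frontier:
            continue
        frontier.add(node)
        # Units that dominate `node` through a subordinating relation.
        for kind, source, target in relations:
            if kind == "subord" and target == node:
                stack.append(source)
        # The complex segment that contains `node`, if any.
        if node in cdu_parent:
            stack.append(cdu_parent[node])
    return frontier

# Hypothetical graph with five EDUs and two CDUs, A = [3, 4] and B = [2, 5].
relations = [
    ("subord", "1", "B"),  # e.g. Elaboration(1, B)
    ("subord", "2", "A"),  # e.g. Elaboration(2, A)
    ("coord", "3", "4"),   # e.g. Narration(3, 4)
    ("coord", "2", "5"),   # e.g. Narration(2, 5)
]
cdu_parent = {"3": "A", "4": "A", "2": "B", "5": "B"}

print(sorted(right_frontier("5", relations, cdu_parent)))
print(sorted(right_frontier("4", relations, cdu_parent)))
```

In this toy graph, once unit 5 has been attached, units 2, 3 and 4 are no longer available, while unit 1 and the enclosing CDU B remain open. Coordinating relations never extend the frontier, which is why unit 3 is closed as soon as unit 4 is attached to it.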
One of the most important features that makes SDRT an attractive choice for studying the effects of discourse structure on opinion analysis is the scope of relations. For instance, if an opinion is within the scope of an attribution that spans several EDUs, then knowing the scope of the attribution will enable us to determine who is in fact expressing the opinion. Similarly, if there is a contrast that has scope over several EDUs in its left argument, this can be important to determine the overall contribution of the opinions expressed in the arguments of the contrast. To get this kind of information, we need to have discourse annotations in which the scopes of discourse relations are clear and determined for an entire discourse graph. Example (5), taken from the Annodis corpus, illustrates what are called long-distance attachments 16 . A causal relation like Result, or at least a temporal Narration, holds between 3 and 6, but it should not scope over 4 and 5 if one does not wish to make Sequin's admission to the hospital a month ago and her turning 79 a consequence of her death last Saturday.

Relation labelling
SDRT models the semantics/pragmatics interface using discourse relations that describe the rhetorical roles played by utterances in context, on the basis of their truth conditional effects on interpretation. Relations are constrained by: semantic content, pragmatic heuristics, world knowledge and intentional knowledge. They are grouped into coordinating relations that link arguments of equal importance and subordinating relations linking an important argument to a less important one. This semantic characterization of discourse relations has two advantages for our study: first, the semantics of discourse relations makes it more straightforward to study their interactions with the semantics of subjective expressions, and secondly the semantic classification in SDRT leads to a smaller taxonomy of discourse relations than that given in RST, enabling an initial study of the interaction of discourse structure and opinion to find generalisations. Additionally, the fact that in SDRT multiple relations may relate one discourse unit to other discourse units allows us to study more complex interactions than would be possible in other theories. Figure 1 gives the discourse structure of example (6), familiar from Asher and Lascarides (2003). In this figure, circles are EDUs, rectangles are complex segments, horizontal links are coordinating relations while vertical links represent subordinating relations.

The CASOAR corpus
We selected data according to four criteria: document genre, the number of documents per topic, document length and the type of opinion conveyed in the document. To better capture the dependencies between discourse structure and corpus genre, the annotation campaign should be conducted on different types of online corpora, each with a distinctive style and audience. For each corpus, topics (a movie, a product, an article, etc.) have to be selected according to their related number of documents or reviews. Our hypothesis was that the more attractive a topic is (i.e., it aroused a great number of reactions), the more opinionated the reviews are. In addition, the number of positive and negative documents has to be balanced. Given that discourse annotation is time consuming and error prone, especially for long texts where long-distance attachments are frequent, documents should not be too long. On the other hand, documents should have an informative discourse structure and hence should not be too short either. Finally, the data should contain explicit opinion expressions as well as implicit opinions. One of our aims was to measure how these kinds of opinions are assessed in discourse. Given these criteria, we chose to build our own corpus and not to rely on existing opinion datasets. Indeed, in French, the only existing and freely available opinion dataset (the Blogoscopy corpus (Daille et al. (2011)) 17 ) was not available when we began our annotation campaign. In English, there are several freely available corpora already annotated with opinion information. Among them, we have studied four resources: the well-known MPQA corpus 18 , the Sentiment Polarity DataSet and the Subjectivity DataSet 19 (Pang and Lee (2004)), and the Customer Reviews Dataset 20 (Hu and Liu (2004)). We chose not to build our discourse-based opinion annotation on top of MPQA for two reasons. First, text anchors which correspond to opinion in MPQA are not well defined since each annotator is free to identify expression boundaries. This is problematic if we want to integrate rhetorical structures into the opinion identification task. Secondly, MPQA often groups discourse indicators (but, because, etc.) with opinion expressions, so there is no guarantee that text anchors will correspond to a well-formed discourse unit.
16. For a discussion of long-distance discourse relations in RST, see (Marcu (2000)).
The Sentiment Polarity DataSet consists of 1,000 positive and 1,000 negative processed reviews annotated at the document level. However, it was not appropriate for our purposes because the documents in this corpus are very long (more than 30 sentences per document), which would have made the annotation of the discourse structure too hard. On the other hand, the Subjectivity DataSet contains 5,000 subjective and 5,000 objective processed sentences. Only sentences or snippets containing at least 10 tokens were included along with their automatic labelling decision (objective vs. subjective), as shown in (7). Since sentences are short (at most 3 discourse units), this corpus also did not meet our criteria. Finally, the Customer Reviews Dataset consists of annotated reviews of five products (digital camera, cellular phone, mp3 player and dvd player), extracted from Amazon.com. This corpus provides only target and polarity annotations at the sentence or the snippet level focusing on explicit opinion sentences (cf. (8) and (9) where [u] indicates that the target is not lexicalized (implicit)).
17. http://www.lina.univ-nantes.fr/?Blogoscopie,762.html
18. http://mpqa.cs.pitt.edu
19. www.cs.cornell.edu/people/pabo/movie-review-data
20. http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
(7) nicely combines the enigmatic features of 'memento' with the hallucinatory drug culture of 'requiem for a dream . '
To conclude, none of the above pre-existing annotated corpora fits our objectives. We thus built the following corpora, summarized in Table 1:
• The French data are composed of two corpora: French movie/product reviews (FMR) and French news reactions (FNR). The movie reviews were taken from AlloCine.fr, book and video game reviews from Amazon.fr, and restaurant reviews from Qype.fr. The news reactions, extracted from lemonde.fr, are reactions to articles from the politics and economy sections of the "Le Monde" newspaper. We selected those topics (movies, products, articles) that are associated with more than 10 reviews/reactions. In order to guarantee that the discourse structure is informative enough, we also filtered out documents containing fewer than three sentences. In addition, for FMR, we balanced the number of positive and negative reviews according to their corresponding general evaluation (i.e., stars 21 ). For FNR, reactions that are responses to other reactions were removed.
• The English data are movie reviews (EMR) from MetaCritic 22 . The choice of movie reviews is motivated firstly by the fact that this genre is widely used in the field and secondly by our aim to compare how opinions are expressed in discourse in different languages (movie reviews were also selected for the French annotation campaign). The selection procedure (number of reviews per movie, number of sentences per review) was the same as the one used for the French data.
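The selection criteria listed above can be illustrated with a short sketch. It is only an approximation under stated assumptions: the `Review` record, its field names and the order of the filters are hypothetical, sentence splitting is assumed to have been done beforehand, and the balancing of positive and negative reviews is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Review:
    """Hypothetical record for one review or reaction."""
    topic: str              # movie, product, or article the text is about
    sentences: list         # sentences, assumed to be split upstream
    is_reply: bool = False  # True when the text answers another reaction


def select_documents(reviews, more_than=10, min_sentences=3, drop_replies=True):
    """Keep topics with more than `more_than` texts, then drop short texts and replies."""
    by_topic = defaultdict(list)
    for review in reviews:
        by_topic[review.topic].append(review)

    kept = []
    for items in by_topic.values():
        if len(items) <= more_than:
            continue  # topic not associated with enough reviews/reactions
        for review in items:
            if len(review.sentences) < min_sentences:
                continue  # discourse structure would not be informative enough
            if drop_replies and review.is_reply:
                continue  # responses to other reactions are removed (FNR case)
            kept.append(review)
    return kept
```

Whether the topic threshold is applied before or after removing short documents is not stated in the text; the sketch applies it first.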

Annotation scheme
The annotation scheme is multi-layered, and includes: (1) the complete discourse structure according to SDRT, (2) opinion orientation of EDUs, and (3) opinion expressions, and their associated features. Each level has its own annotation manual and annotation guide, as described in the next sections.
In the remainder of this paper, all the examples are extracted from our corpora. Examples from EMR are given in English while examples from F MR and FNR are given in French along with their direct English translation (when possible). Note however that there are substantial semantic differences between the two languages.

The document level
At this level, annotators were asked to give the document's overall opinion towards the main topic using a five-level scale, where 0 indicates a very bad (negative) opinion and 4 a very good (positive) one. Then, annotators had to build the discourse structure of the document following the SDRT principles.
Our discourse annotation scheme was inspired by an already existing manual elaborated during the Annodis project, a French corpus where each document was annotated according to the principles of SDRT. This manual gives a complete description of the semantics of each discourse relation along with a listing of possible discourse markers that could trigger any particular relation. However, the manual did not provide any details concerning the structural postulates of the underlying theory. This was justified, since one of the objectives of the Annodis project was to test the intuitions of the naive annotators relevant to these issues. In CASOAR however, we aimed at testing the intuitions of naive annotators on how discourse interacts with opinion. We therefore modified the Annodis manual in order to make precise all the constraints annotators should respect while building the discourse graph. In particular, we made explicit the constraints concerning segment attachment and accessibility of complex segments. We stipulated in the manual that each segment in the graph should be connected and that the attachment should normally follow the reading order of the document and the right frontier principle (cf. Section 2.3). CDU constraints detailed how EDUs can be grouped to form complex units. Figure 2 shows an example of a complex discourse unit constraint. Suppose [1,2] and [2,3] are CDUs. The configurations on the right and in the middle are correct whereas the one on the left is not allowed for two main reasons: an EDU cannot belong to two distinct CDUs (as the EDU 2 in the CDUs [1,2] and [2,3]) and the head of a CDU 23 cannot appear as a second argument of a relation.
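The two CDU constraints just described can be made concrete with a small check. This is a minimal sketch, not the validation used in the campaign: the function name and the data layout (CDUs as lists of EDU ids, relations as triples over EDU ids) are hypothetical, and it assumes that a CDU fully nested inside another one is acceptable, so only partial overlaps are reported.

```python
def check_cdu_constraints(cdus, relations):
    """Return a list of human-readable constraint violations.

    cdus: list of CDUs, each a list of EDU ids in reading order, e.g. [[1, 2], [2, 3]]
    relations: list of (label, first_argument, second_argument) over EDU ids
    """
    errors = []

    # Constraint 1: an EDU cannot belong to two distinct CDUs.
    # Assumption: nesting is allowed, so only partial overlaps are flagged.
    for i, a in enumerate(cdus):
        for b in cdus[i + 1:]:
            shared = set(a) & set(b)
            nested = set(a) <= set(b) or set(b) <= set(a)
            if shared and not nested:
                errors.append(f"EDUs {sorted(shared)} belong to both {a} and {b}")

    # Constraint 2: the head of a CDU (its first EDU) cannot be
    # the second argument of a relation.
    heads = {cdu[0] for cdu in cdus}
    for label, first, second in relations:
        if second in heads:
            errors.append(f"{label}({first}, {second}): {second} is the head of a CDU")

    return errors
```

With the CDUs [1,2] and [2,3] from the example, the call `check_cdu_constraints([[1, 2], [2, 3]], [("Elaboration", 1, 2)])` reports both problems: EDU 2 is shared, and EDU 2, the head of [2,3], is used as a second argument.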
During the writing of this manual, we faced another decision: should we (1) annotate opinion texts using a small set of discourse relations or (2) use a larger set (i.e., the 19 relations already used in the Annodis project)? The first solution is more convenient and has already been investigated in previous studies. For example, in Asher et al. (2008), we experimented with an annotation scheme where lexically-marked opinion expressions and the clauses involving these expressions are related to each other using five SDRT-like rhetorical relations: Contrast and Correction (introduced by signals such as although, but, contradict, protest, deny, etc.), Support, which groups together Explanation and Elaboration, Result (usually marked by so, as a result), which indicates that the second argument is a consequence or the result of the first argument, and finally, Continuation. Somasundaran (2010) proposed the notion of opinion frames as a representation of documents at the discourse level in order to improve sentence-based polarity classification and to recognize the overall stance. Two sets of homemade relations were used: relations between targets (same and alternative relations) and relations between opinion expressions (reinforcing and non-reinforcing relations). Finally, Trnavac and Taboada (2010) examined how some nonveridical markers and two types of rhetorical relations (Conditional and Concessive) contribute to the expression of appraisal in movie and book reviews. In our case, we chose not to use a predefined small set of rhetorical relations selected according to our intuitions because we did not know in advance which relations occur most frequently in opinion texts and how this frequency is correlated with corpus genre. Of course, this choice made the annotation harder, but we think that it was a necessary step to investigate the real effects of discourse relations on both polarity and subjectivity as well as to evaluate the impact of discourse structure when assessing document overall opinion.
23. The head of a CDU is the first EDU that composes it. For example, 1 is the head of the CDU [1,2].
Among the set of 19 relations used in the Annodis project, we focused our study on 17 relations that involve entities from the propositional content of the clauses 24 . These relations are grouped into coordinating relations (Contrast, Continuation, Conditional, Narration, Alternative, Goal, Result, Parallel, Flashback) and subordinating relations (Elaboration, E-Elab, Correction, Frame, Explanation, Background, Commentary, Attribution). Table 2 provides a detailed list of these relations along with their definitions. In this table, α and β stand respectively for the first and the second argument of a relation. (C) and (S) represent respectively coordinating and subordinating relations.
Annotators were asked to link constituents (EDUs or CDUs) through whichever discourse relation they felt appropriate, from our list above. In addition to this set of 17 relations, we also added the relation Unknown in case annotators were not able to decide which relation is the most appropriate to link two constituents.

The segment level
For each EDU in a document, annotators were asked to annotate its subjectivity orientation as well as its polarity and strength.
Subjectivity orientation. It can belong to five categories:
24. Meta-talk (or pragmatic) relations that link the speech acts expressed in one unit and the semantic content of another unit that performs it were discarded.

Discourse relations | Definitions
Causality
  Explanation (S) | the main eventuality of β is understood as the cause of the eventuality in α
  Goal (S) | β describes the aim or the goal of the event described in α
  Result (C) | the main eventuality of α is understood to cause the eventuality given by β
Structural
  Parallel (C) | α and β have similar semantic structures. The relation requires α and β to share a common theme
  Continuation (C) | α and β elaborate or provide background to the same segment
  Contrast (C) | α and β have similar semantic structures, but contrasting themes or when one constituent negates a default consequence of the other
Logic
  Conditional (C) | α is a hypothesis and β is the consequence. It can be interpreted as: if α then β
  Alternation (C) | α and β are related by a disjunction
Reported Speech
  Attribution (S) | relates a communicative agent stated in α and the content of a communicative act introduced in β
Exposition/Narration
  Background (S) | β provides information about the surrounding state of affairs in which the eventuality mentioned in α occurs
  Narration (C) | α and β introduce an event and the main eventualities of α and β occur in sequence and have a common topic
  Flashback (C) | is equivalent to Narration(β,α). The story is told in the opposite temporal order
  Frame (S) | α is a frame and β is in the scope of that frame
Elaboration
  Elaboration (S) | β provides further information (a subtype or part of) about the eventuality introduced in α
  Entity-Elaboration (S) | β gives more details about an entity introduced in α
Commentary
  Commentary (S) | β provides an evaluation of the content associated with α
Correction
  Correction (S) | α and β have a common topic. β corrects the information given in the segment α

Strength. Several types of scales have been used in sentiment analysis research, going from continuous scales (Benamara et al. (2007)) to discrete ones (Taboada et al. (2011)). In our case, we think that the chosen scale has to ensure a trade-off between a fine-grained categorization of subjectivity and the reliability of this categorization with respect to human judgments. For our annotation campaign, we chose a discrete 3-point scale, [1,3] where 1 indicates a weak strength. Objective segments (O) are associated by default with the strength 0.

The opinion expression level
After segment annotation, the next step is to identify within each EDU at least one of these elements: the opinion expression span, opinion topic, opinion holder, and operators that interact locally with opinion expressions. Once all these elements are identified, annotators have to link every operator, topic and holder to its corresponding opinion expression using the Scope relation. This relation aims to link: an operator to an opinion expression under its scope, a holder to its associated opinion expression, and an opinion expression to its related topic. Since most opinion expressions reflect the writer's point of view (i.e., the main holder), we decided not to annotate the Scope relation in this case so as not to make the annotation more laborious. Operators as well as topics are linked to the opinion in their scope only if several opinion expressions are present in an EDU. We detail the annotation scheme below.
Opinion expression span. Within each EDU, annotators can identify zero (in case of SI and O segments), one or several non-overlapping opinion spans. An opinion span is composed of subjective tokens (adjectives, verbs, nouns, or adverbs), excluding operators 25 . Its annotation includes: a polarity (positive, negative, or neutral), a strength (on a discrete 3-point scale, cf. above), a semantic category and a subcategory. According to the opinion categorization described in Asher et al. (2008), each opinion expression can belong to four main categories: Reporting, which provides, at least indirectly, a judgment by the author on the opinion expressed; Judgment, which contains normative evaluations of objects and actions; Advice, which describes an opinion on a course of action for the reader; and Sentiment-Appreciation, containing feelings and appreciations. Subcategories include, for example, inform, assert, evaluation, recommend, fear, astonishment, blame, etc.
Topics and holders. They are textual spans within a segment that are associated with a type. The opinion topic can have three types: main, indicating the main topic of the document, such as "the movie"; part of, in case of features related to the main topic, such as "the actors" or "the music"; and finally other, when the topic has no ontological relation with the main topic, for example "theater" in The movie was great. Shame that the theater was dirty. Also, we distinguish between two types of holders: main, which stands for the author of the review, and other (as in My mother loved the movie).
Operators. Finally, we deal with four types of operators: (i) negations that may affect the polarity and the strength of an expression, (ii) modals used to express the degree of belief of the holder, (iii) intensifiers used to strengthen (we use the operator Int+) or weaken (Int-) the prior polarity of a word or an expression, and (iv) restrictors that narrow the scope of the opinion in the sense that the positivity and/or negativity of the expression can be evaluated only under certain conditions, as in the restaurant is very good for children. Operators have to be annotated when opinion expressions are under their scope as well as in case of implicit segments when appropriate.

A complete example
Figure 3 gives the annotation at the opinion expression and the segment level of the review (10), taken from EMR. In this figure, we provide for each opinion expression its polarity and strength. Similarly, we associate with each segment a triple that indicates its type (among: SE, SI, O, SN, and SEI), polarity (among: +, -, neutral, both, and no polarity), and strength (on a three-level scale). Figure 4 provides the associated discourse graph.
Figure 3: Annotation of (10) at the segment and the opinion expression level.

In order to avoid errors in determining the basic units (which would thus make the inter-annotator agreement study problematic), we decided to discard the segmentation from the annotation campaign. Instead, EDUs were automatically identified. To train our segmenter, two annotators manually annotated a subset of F MR (henceforth F MR ′ ) by consensus. This yields a total of 130 documents and 1,420 EDUs, among which 1.33% were embedded. Automatic segmentation was carried out by adapting an already existing SDRT-like segmenter (Afantenos et al. (2010)), built on top of the Annodis corpus 26 . The features used in Afantenos et al. (2010) include the distance from sentence boundaries, the dependency path, and the chunk start/end. Since we used a different syntactic parser, we modified certain features accordingly, and discarded others. We performed a two-level segmentation. First, we constructed a feature vector for each word token, which is classified into: Right for words starting an EDU, Left for tokens ending an EDU, Nothing for words completely inside an EDU, and Both for tokens which constitute the only word of an EDU. Once all EDUs were found, subjective EDUs that contain at least one token belonging to our subjective lexicon 27 are set aside because they are good candidates for further segmentation. The proportion of such EDUs in F MR ′ was relatively small (around 12%). This second step was performed using symbolic rules which are mainly based on discourse connectives and punctuation marks.
26. The corpus used for training the parser was composed of 47 documents extracted from L'Est Républicain newspaper. This corpus is mainly objective and contains 1,400 EDUs, among them 10% were nested.
27. Our lexicon is manually built and is composed of 270 verbs, 632 adjectives, 296 nouns, 594 adverbs, 51 interjections.
Table 4: Evaluation of the symbolic rules in terms of precision (P), recall (R) and F-measure (F).
Our discourse segmentation followed a mixed approach using both machine learning and rule-based methods. We first evaluated the classifier and then the symbolic rules. We performed supervised learning using a Maximum Entropy model 28 in order to classify each token into the Right, Left, Nothing or Both classes as described above. We conducted three evaluations: (E1) a 10-fold cross validation on the Annodis corpus in order to compare our results to the ones obtained by Afantenos et al. (2010); (E2) training on Annodis and testing on F MR ′ to see to what extent our set of features was independent of the corpus genre; (E3) a 10-fold cross validation on F MR ′ . Table 3 shows our results for the Right, Left, and Nothing boundaries, in terms of precision (P), recall (R), and F-measure (F). Our results for the configuration (E1) are similar to those obtained by Afantenos et al. (2010) on Annodis. The best performance was achieved when training on our data (i.e., the configuration (E3)). Table 4 shows the results of the symbolic rules when applied to the outputs of the configuration (E3). Results concern both segment boundaries (averaged over all four classes) and the recognition of an EDU as a whole with a begin boundary and its corresponding end. We evaluated both on F MR ′ when subjective EDUs are given by manual annotation and on F MR ′ Lex when they are automatically identified using our lexicon. Again, our rules performed very well.
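To make the token-level formulation concrete, the sketch below derives the four boundary classes from a segmented text and computes precision, recall and F-measure for one class. It is only an illustration under assumptions: the function names are hypothetical, embedded EDUs are ignored (the segmentation is treated as flat), and no classifier is involved.

```python
def boundary_labels(edus):
    """Map a flat segmentation to one boundary label per token.

    edus: list of EDUs in reading order, each a non-empty list of tokens.
    """
    labels = []
    for edu in edus:
        last = len(edu) - 1
        for position in range(len(edu)):
            starts = position == 0
            ends = position == last
            if starts and ends:
                labels.append("Both")      # single-token EDU
            elif starts:
                labels.append("Right")     # token starting an EDU
            elif ends:
                labels.append("Left")      # token ending an EDU
            else:
                labels.append("Nothing")   # token strictly inside an EDU
    return labels


def precision_recall_f(gold, predicted, label):
    """Precision, recall and F-measure of `label` over aligned label sequences."""
    true_positives = sum(g == label and p == label for g, p in zip(gold, predicted))
    n_predicted = sum(p == label for p in predicted)
    n_gold = sum(g == label for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Gold labels would be obtained by calling `boundary_labels` on the manual segmentation, and the predicted labels by running a token classifier over the same tokens.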
This tool was used to automatically segment F MR and FNR documents. The resulting segmentation was manually corrected when necessary 29 . We did not design an automatic segmenter for English and segmentation in EMR was performed manually by two annotators by consensus.

Annotation campaign
We managed two annotation campaigns. The French one was the first and took six months. The English campaign came second and lasted three months. F MR and FNR were doubly annotated by three French native speakers while EMR was annotated by two English native speakers. French annotators were undergraduate linguistics students while English ones were teachers. Annotators benefited from a complete and revised annotation manual as well as an annotation guide explaining the inner workings of the GLOZZ platform 30 , our annotation tool. Since documents were already segmented, annotators first had to click on each EDU and specify its category, polarity, and strength (see Section 4.1.2), and then could isolate, within each EDU, spans of text corresponding to the annotation scheme described in Section 4.1.3. Discourse annotation was performed by inserting relations between selected constituents using the mouse. When appropriate, EDUs were grouped to form CDUs using GLOZZ schemata. GLOZZ also provides a discourse graph as part of its graphical user interface, which helps the annotator to better capture the discourse structure while linking constituents. Figure 5 illustrates how a document, extracted from EMR, is annotated under GLOZZ. The first segment includes the spans This and movie annotated as main topics, definitely and all time annotated as intensifier operators and the best annotated as an opinion expression. The annotation associated with the first segment is shown in the feature structure on the right. Segments 2 and 3 are related with a Continuation relation, and the structure Continuation(2, 3) is grouped into a CDU (the blue circle in the Figure).
The French annotation proceeded in two stages: first the annotation of the movie reviews, then the annotation of news reactions. For each stage, we performed a two-step annotation where an intermediate analysis of agreement and disagreement between the three annotators was carried out. Annotators were first trained on 12 movie reviews and then they were asked to annotate separately 168 documents from F MR. Then, they were trained on 10 news reactions. Afterwards, they continued to annotate separately 121 documents from FNR. The training phase for F MR was longer than for FNR since annotators had to learn about the annotation guide and the annotation tool.
Similarly, the English annotation campaign was done in two steps. Annotators were trained on 10 EMR documents and then the rest of the corpus (100 documents) was annotated separately. The time needed to fully annotate one text was about 1 hour.
28. http://www.cs.utah.edu/~hal/megam/
29. We mainly corrected unbalanced bracketing. To this end, we designed a script that recognizes if for each begin bracket, there is a corresponding end bracket. If not, we manually ensured correct bracketing. We also checked if the other segmentation cases that we defined were correctly handled. Overall, manual correction was very fast.
30. www.glozz.org
During training, we noticed that annotators often made the same errors. At the segment and the opinion expression level, these errors included: segments labelled as opinionated (SE and SEI) with no opinion expression inside; O or SI segments with an opinion expression inside; O and SN segments with a prior polarity; opinion expressions with no associated semantic category, etc. For example, if an annotator considered the segment I am a huge fan of Tintin to be subjective, they should annotate the span fan as an opinion expression. Some of the discourse-level errors include: violation of the right frontier constraint, cycles, overlapping CDUs, segments not attached to the discourse graph, etc. To ensure that the annotations were consistent with the instructions given in the manual, we designed a tool to automatically detect these errors. Among all the provided annotations, 15% of the French documents contained errors at the segment and the opinion level vs. 12% for the English documents. The annotators were asked to correct their errors before continuing to annotate new documents. With respect to discourse structure, just a few French documents were ill-formed. However, the English annotators felt uncomfortable with discourse annotation, and their annotations were full of errors. We retrained them but finally decided to annotate discourse in EMR by consensus.
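The segment-level checks listed above lend themselves to a simple rule-based validator. The following is a minimal sketch of what such checks could look like, not the tool actually used in the campaign: the class and field names are hypothetical, and the string "no polarity" is assumed to encode the absence of polarity.

```python
from dataclasses import dataclass, field


@dataclass
class OpinionExpression:
    span: str
    category: str = ""  # semantic category, e.g. "Judgment"; empty if not annotated


@dataclass
class Segment:
    segment_id: int
    kind: str                     # one of "SE", "SI", "O", "SN", "SEI"
    polarity: str = "no polarity"
    expressions: list = field(default_factory=list)


def check_segment(segment):
    """Return the list of consistency problems found in one annotated segment."""
    problems = []
    if segment.kind in {"SE", "SEI"} and not segment.expressions:
        problems.append("opinionated segment with no opinion expression inside")
    if segment.kind in {"O", "SI"} and segment.expressions:
        problems.append("O or SI segment with an opinion expression inside")
    if segment.kind in {"O", "SN"} and segment.polarity != "no polarity":
        problems.append("O or SN segment with a polarity")
    for expression in segment.expressions:
        if not expression.category:
            problems.append(f"opinion expression '{expression.span}' has no semantic category")
    return problems
```

For the Tintin example, a segment labelled SE without the span fan marked as an opinion expression would be reported by the first rule.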

Reliability of the annotation scheme
In this section, we report on inter-annotator agreements at the document, segment, and opinion expression levels. All statistics have been computed using the IRR library under R 31 .

At the document level
Recall that the document annotation level consists of two tasks: assigning to each document an overall opinion (on a discrete five-level scale) and then a discourse structure. Agreements have been computed on 152 F MR documents, 100 EMR, and 120 FNR.
Agreements on overall opinion. We used two different measures. First, Cohen's Kappa, which assesses the amount of agreement between annotators. Second, Pearson's correlation, which measures the linear correlation between two variables: the annotators' overall opinions (variable 1) and the original overall opinions as given by Allociné or MetaCritic users (variable 2). The aim is to identify whether the first variable tends to be higher (or lower) for higher values of the other variable. Pearson's correlation gives a value in [−1, +1] where +1 indicates a total positive correlation, 0 no correlation, and −1 a total negative correlation. Table 5 gives our results in terms of Cohen's Kappa when overall opinion has to be stated on the five-level scale 0 to 4 (Kappa multi-scale), the weighted Kappa (weighted Kappa multi-scale), and the Kappa after collapsing the ratings 0 to 2 and 3 to 4 into respectively negative and positive ratings (Kappa polarity). Compared to the non-weighted version, weighted Kappa makes it possible to compute agreement on ordinal labels: a disagreement of 0 vs. 4 is much more significant than a disagreement of 1 vs. 2. We also give the average Pearson's correlation between the overall opinion given by our annotators and the overall ratings already associated with each movie review document 32 .
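As an illustration of the measures just described, the sketch below computes Cohen's Kappa (plain and weighted) and Pearson's correlation in plain Python. It is not the computation reported here, which relied on the IRR library under R; the function names are hypothetical, and the weighted variant assumes linear weights, since the weighting scheme is not specified in the text.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(ratings_a, ratings_b, labels, weighted=False):
    """Cohen's Kappa for two raters; `labels` lists the categories in ordinal order.

    With weighted=True, disagreements are weighted linearly by their distance
    on the scale (an assumption). Undefined (division by zero) when the
    expected disagreement is 0.
    """
    n = len(ratings_a)
    k = len(labels)
    index = {label: i for i, label in enumerate(labels)}
    observed = Counter((index[a], index[b]) for a, b in zip(ratings_a, ratings_b))
    count_a = Counter(index[a] for a in ratings_a)
    count_b = Counter(index[b] for b in ratings_b)

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    observed_dis = sum(disagreement(i, j) * c for (i, j), c in observed.items()) / n
    expected_dis = sum(disagreement(i, j) * count_a[i] * count_b[j]
                       for i in range(k) for j in range(k)) / (n * n)
    return 1.0 - observed_dis / expected_dis


def to_polarity(rating):
    """Collapse the 0-4 scale: 0 to 2 become negative, 3 and 4 positive."""
    return "negative" if rating <= 2 else "positive"


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    var_x = sum((u - mean_x) ** 2 for u in x)
    var_y = sum((v - mean_y) ** 2 for v in y)
    return covariance / sqrt(var_x * var_y)
```

The polarity Kappa is obtained by mapping both rating lists through `to_polarity` and calling `cohen_kappa` with the labels `["negative", "positive"]`.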
Our results are good for movie reviews in polarity rating and weighted Kappa but moderate in multi-scale rating, with a lower value obtained for news reactions. This shows that news reactions are more difficult to annotate. Finally, when evaluating the correlation between the annotators' overall opinions and the authors' overall scores, we observe that correlations are good.
31. https://cran.r-project.org/web/packages/irr/irr.pdf
32. Correlations are given only for F MR and EMR documents since in news reactions (FNR), authors are not asked to give the overall opinion of their comments.
Table 5: Inter-annotator agreements on document overall opinion rating.
Agreements on discourse structure. As described in Section 4.1, discourse annotation depends on two decisions: a decision about where to attach a given EDU, and a decision on how to label the attachment link via discourse relations. Two inter-annotator agreements thus have to be computed, and the second one depends on the first because agreement on relations can be computed only on common links. For attachment, we obtained an F-measure of 69% for F MR and 68% for FNR assuming attaching is a yes/no decision on every EDU pair, and that all decisions are independent, which of course underestimates the results. When commonly attached pairs are considered, we get a Cohen's Kappa of 0.57 for the full set of 17 relations for F MR and 0.56 for FNR, which is moderate. Here again, this Kappa is computed without an accurate analysis of the equivalence between rhetorical structures 33 . Figure 6 shows two discourse annotations for the French movie review in Example (11). We observe that the annotator (on the left) formed more CDUs than the other annotator (on the right), which causes both attachment and relation labeling errors. Our goal being to study the effects of discourse on opinion analysis, a detailed analysis of inter-annotator attachment agreements is out of the scope of this study and is left for future work. Overall, our results are higher than those obtained by Annodis (66% F-measure for attachment and a Cohen's Kappa of 0.4 for relation labeling) mainly for two reasons. First, our annotation manual was more constrained since we provided annotators with a detailed description of how to build the discourse structure. Second, our documents are smaller (an average of 20 EDUs compared to 55 EDUs in Annodis), which implies fewer long distance attachments.
33. See ) for an interesting discussion on the difficulty of comparing rhetorical structures, especially when CDUs have to be taken into account.
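The two-stage agreement described above can be sketched as follows: attachment agreement is scored as an F-measure over attached pairs, and relation labels are then compared only on the pairs both annotators attached. This is a simplified illustration with hypothetical names; it assumes each annotation is a mapping from an ordered pair of constituent identifiers to a relation label, and it does not address the equivalence between structures that differ in their CDUs.

```python
def attachment_f_measure(relations_a, relations_b):
    """F-measure between two annotators' sets of attached pairs.

    relations_a, relations_b: dict mapping (first, second) -> relation label.
    Annotator A is arbitrarily taken as the reference; the F-measure itself
    is symmetric in the two annotators.
    """
    pairs_a, pairs_b = set(relations_a), set(relations_b)
    common = pairs_a & pairs_b
    if not common:
        return 0.0
    precision = len(common) / len(pairs_b)
    recall = len(common) / len(pairs_a)
    return 2 * precision * recall / (precision + recall)


def labels_on_common_links(relations_a, relations_b):
    """Aligned label lists restricted to the pairs attached by both annotators."""
    common = sorted(set(relations_a) & set(relations_b))
    return [relations_a[p] for p in common], [relations_b[p] for p in common]
```

The two lists returned by `labels_on_common_links` are the input on which a Kappa for relation labeling would then be computed.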
Table 6: Inter-annotator agreements on segment opinion type, polarity, and strength per corpus genre.
We observe that the inter-annotator agreements are better for movie reviews than for news reactions and that F MR achieves the best scores. We get very good Kappa measures for both explicit opinion segments SE (0.74) and the polarity (positive and negative) of a segment in French movie reviews (respectively 0.78 and 0.77). We get similar results in English with, for example, a Kappa of 0.67 for the SE class and Kappas of 0.75 and 0.74 for respectively positive and negative segment polarity. These results are in agreement with state-of-the-art results obtained in contemporary annotation campaigns (see e.g.
In (12), one annotator (A) considered that segments 3 and 4 conveyed positive implicit opinions towards the movie while the second annotator (B) labeled these segments as explicit by selecting the spans Good way and tie things up as positive opinion expressions. In (13), (A) and (B) agreed to put segments 4 and 5 into the SE category but disagreed on the category of the first three segments: for (A), segments 1 and 2 are implicit negative segments whereas for (B) they are purely objective. Similarly, for (B) segment 3 is objective and for (A) it is an explicit opinion because it contains the word slower, which has been annotated as a negative opinion expression.
The difficulty of discriminating between explicit, implicit, and objective segments is also reflected in the lower Kappa measure obtained for no polarity, with 0.60 in EMR and 0.68 in F MR, compared to the Kappa obtained on positive and negative segment polarity. This difficulty is, we believe, an artifact of the length of the texts. Indeed, the longer a text is, the greater the difficulty for human subjects to detect discourse context. However, the study of this hypothesis falls outside the scope of this paper and is therefore left for future work. Nonetheless, these results are within the range reported in state-of-the-art research for distinguishing between explicit and implicit opinions. For instance, Toprak et al. (2010) obtained a Kappa of 0.56 for polar fact sentences, which are close to our SI category.
In FNR, our results were moderate for the SE and SN classes (respectively 0.56 and 0.58) and weak for the SI and O classes (respectively 0.48 and 0.40). We make the same observations for the agreements on segment polarities, where we obtain moderate Kappas on all three classes (positive, negative, and no polarity). This shows that the newspaper reactions were more difficult to annotate because the main topic is more difficult to determine (even by the annotators): it can be one of the subjects of the article, the article itself, its author(s), a previous comment or even a different topic, related to various degrees to the subject of the article. Implicit opinions, which are very frequent, can be of a different nature: ironic statements, jokes, anecdotes, cultural references, suggestions, hopes and personal stances, especially for political articles. Here is an example of implicit segments extracted from FNR. Annotators disagreed on how to annotate the first segment: for (A), 1 is negative implicit while for (B) it is explicit (with the spans vraiment/really and plaindre/pity annotated respectively as an operator and an opinion expression):
(14) [
Finally, the Kappa for segment strength averaged over the scale [0, 3] is poor. However, the Kappas are good on the extreme values of this scale, and moderate when using a weighted measure. For example, we get a Kappa of 0.67 and 0.58 in respectively F MR and EMR on the strength 0 vs. 0.4 in FNR. These results confirm that multi-scale polarity annotation is a difficult task, as already observed in similar annotation schemes (cf. Toprak et al. (2010)). We think that low agreements were mainly due to the annotation manual, which failed to clearly explain strength annotation. Indeed, for the same "basic" opinion expression, we got different annotations. For example, in similar contexts, the adjective good got different scores (+1 or +2).
We think that the manual can be improved by explicitly stating the prior score of "basic" expressions (e.g., good (+1), brilliant (+2) and exceptional (+3)) and then asking annotators to score new expressions by comparing their strength to these expressions.

Results
We now give the results of the annotation campaign, focusing on quantitative results at each annotation level and, more importantly, on the impact of discourse on sentiment analysis.

Quantitative analysis at the document level
Our discourse annotations contain a total of 3,453 discourse relations for F MR, 1,740 for FNR and 1,677 relations for EMR. We analyzed our results according to two main axes: the distribution of relations per corpus genre and the importance of CDUs for sentiment analysis. Figure 7 shows these distributions, sorted according to their frequency in F MR, from the most frequent (on the left) to the least frequent one (on the right). The frequencies of each discourse relation across corpus genres are statistically different from what would be expected by chance using the χ 2 test. Note however that the difference between the observed and the expected frequencies of Conditional was not statistically significant. In this figure, we discarded the frequencies of the relations Flashback and Unknown for two reasons. First, Flashback was highly infrequent in all the corpora (0.12%, 0.06% and 0% for respectively F MR, FNR, and EMR) and second, the relation Unknown was not used in EMR since the discourse annotation in this corpus was performed by consensus. It is however interesting to note that this relation was more frequent in F MR (around 2.06%) than in FNR (0.69%), mainly because the "Reviews" corpus was annotated first, when the annotators were less experienced.
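For readers who want to reproduce this kind of test, the sketch below computes the Pearson χ 2 statistic and its degrees of freedom for a contingency table. It is an illustration under assumptions: the function name is hypothetical, and the table layout (one row for a given relation, one row for all other relations, one column per corpus) is one plausible way of setting up the test, not necessarily the one used here.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom.

    table: list of rows of observed counts, e.g. for one relation
           [[count in F MR, count in FNR, count in EMR],
            [other relations in F MR, in FNR, in EMR]]
    All row and column totals are assumed to be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom
```

With two rows and three corpora there are 2 degrees of freedom, so the difference is significant at the 0.05 level when the statistic exceeds the critical value 5.991.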

Distribution of discourse relations per corpus genre
Overall, the frequencies can be grouped into three classes: (1) Continuation, Elaboration and Commentary (more than 10%), (2) Contrast, Entity-Elaboration, Result, Explanation, Attribution and Frame (from 3% to 10%) and (3) Correction, Goal, Narration, Parallel, Background, Conditional and Alternation (less than 3%). We noticed that some relations are more present in certain corpora. For instance, Commentary, Entity-Elaboration, Explanation, Attribution, Frame, Goal, Parallel and Alternative are more frequent in news reactions than in reviews. The frequencies of Parallel, Alternative and Frame are consistent with a logically more structured discourse for news reactions than for movie reviews. Goal and Explanation are also more frequent, which confirms that FNR contains more argumentative structures than reviews. The same goes for the Attribution relation, which indicates that in FNR people tend to refer to what other people said, e.g. The president thinks that..., or even that people tend to be more reserved when stating opinions, e.g. I guess that this is a good measure, unlike in the reviews, where people might tend to be more categorical, e.g. This movie is great, without modalizing the statement. Also, Entity-Elaboration is more frequent in FNR (more than 10%), which confirms that news reactions are multi-topic opinion documents. Another interesting comparison between corpus genres is the frequency of Commentary, which is more frequent in news reactions, where commentaries are often ironic. Finally, the proportions of Elaboration, Contrast, Background, Narration and Result in the English corpus were higher compared to the two other corpora, possibly because English reviews tend to be more verbose.

Importance of CDUs
We have also analyzed the ratio of complex segments to the total number of rhetorical relation arguments in our annotations. Figures 8, 9 and 10 show the proportions of relations between EDUs, between an EDU and a CDU, and between CDUs, sorted according to the increasing frequencies of relations between EDUs (all the relations are shown except Unknown and Flashback). First, we see that some relations are local and tend to appear more often between EDUs (more than 70%), as in Example (15) taken from EMR. In news reactions, these local relations have the same distributions except for Attribution and Conditional, which link simple segments in 60% of cases. This is more salient for Background, with only 45% of instances. We will see in Section 5.3 that some of these local relations are very important for sentiment analysis while others can simply be ignored.
Background and Commentary behave differently in English reviews compared to French documents: Background seems to be more local in French documents whereas Commentary tends to be more local in English reviews. On the other hand, the following relations often have CDUs in at least one of their arguments: Elaboration, Explanation, Frame, Result, Contrast, Correction, Narration and Commentary. For example, Correction involves CDUs in more than 55% of cases. This relation links segments that share a common topic, where the second argument corrects the information given in the first argument (which is often attached at a long distance) (see the Correction in Example (16)). Another interesting behavior comes from the Contrast relation. Contrary to our expectations, only 40% of instances of this relation link EDUs in all the corpora. Example (17)

Quantitative analysis at the segment and opinion expression level
The total number of annotated segments was 3,825 for FMR, 2,071 for FNR and 2,578 for EMR. The histogram in Figure 11 gives a comparative analysis of how segments are distributed over the five classes (i.e., SE (explicit opinion), SI (implicit opinion), O (objective), SN (subjective non evaluative) and SEI (explicit and implicit segment)). A similar analysis is given in Figure 12, this time for segment polarity (i.e., positive, negative, neutral, no polarity and both). The frequencies of each segment opinion type and each segment polarity type across corpus genres are statistically different from what is expected by chance using the χ² test. We observed that the frequencies of the segments containing implicit opinions (SI) depend on the corpus genre: for FMR and EMR, frequencies are lower (respectively 26.5% and 24.5%) than for FNR (47.1%). Moreover, in the three corpora, the purely objective segments are not very widespread (less than 20% of all segments). The same goes for segments that contain both an explicit and an implicit opinion (SEI), with a yet lower frequency for ENR. As for the subjective non-evaluative segments (SN), they are rather infrequent as well, especially in French and English movie reviews. However, they are slightly more numerous for FNR, which shows that reported speech constructions are more frequent in reactions to newspaper articles than in movie reviews. Another interesting genre bias concerns the polarity of the segments: in French movie reviews, positive segments are a majority even though the corpus was balanced between overall positive and overall negative documents (in terms of their star ratings), whereas in the reactions to newspaper articles negative segments are a majority. In EMR however, segment polarity distribution is more balanced than for FMR.
We also observe that non evaluative segments (mainly from the objective and the subjective non evaluative segment types) are more numerous in English reviews than in French reviews. Finally, segments labelled both and neutral are a minority in all the corpora (respectively less than 3% and 2%). The last segments in Examples (18)
Within evaluative segments (i.e., SE, SEI and SN), 2,329 opinion expressions were annotated for FMR, 743 for FNR and 1,610 for EMR. Among explicit segments (i.e., SE and SEI), 97% contain a single opinion expression for FMR and EMR vs. 94% for FNR. This confirms the usefulness of the per-segment analysis, since it simplifies opinion fusion compared with, for instance, a per-sentence analysis. We further discuss this important result in Section 6.
The semantic categories of opinion expressions are similarly distributed for FMR and EMR, with around 3% for Advice and between 5 and 8% for Reporting. However, we observe that in English movie reviews most opinion expressions are from the Sentiment-Appreciation category (48.2% vs. 24.2% for French) while, in FMR, opinion expressions are mostly judgments and evaluations (66.4% vs. 36.4% for English). As expected, we get different distributions of semantic categories for FNR, with a greater number of Reporting (27.5%) and Advice expressions (6.9%) and no instances of the Sentiment-Appreciation category. Concerning the annotations of topics and holders, the total numbers were respectively 2,939 and 754 for FMR, 1,915 and 262 for FNR, and 1,981 and 499 for EMR. For movie reviews, topics are mainly from the part of category (around 60%) whereas few of them are out of topic (other) (around 10%). However, in FNR we observe a different distribution: the number of topics from the main category is lower (around 9%) whereas the number of other topics is greater (around 19.4%). For the holders, we get similar distributions over all the corpora: 2/3 of annotated holders are from the main category.
Lastly, we also noticed the importance of opinion operators: 1,371 for FMR, 924 for EMR and 488 for FNR. At least one such operator is present in 32% of subjective segments in news reactions vs. 40% for movie reviews. These operators are also present in implicit segments (18% for the French corpus vs. 25% for the English documents and 17% for news reactions), which indicates that valence shifters are good cues for detecting implicit opinions. The distribution of operators per category is shown in Figure 13. Most of them are intensifiers. Restrictors are of different types: they can be temporal (as some in [Some scenes are beautifully shot] and at times in [It can be entertaining at times]) or topic restrictions as in [This movie is made for 10 year old kids.].
In our previous work on using discourse in sentiment analysis, we annotated opinion semantic categories at the segment level in movie reviews and letters to the editor in English and French. Our past results, reported in (Asher et al. (2008)), showed that the distribution of semantic categories in these corpora is comparable to that observed in the corpora annotated in the current study. As far as the semantic categories are concerned, we can conclude that our observations are valid for French and English movie reviews and news reactions in general. We believe that our results on segment polarity and segment type can also be generalized. More annotations are however needed to validate this assertion.
Figure 10: The distributions of discourse relations in EMR according to the type of their arguments.

Impact of discourse on sentiment analysis
Figure 11: Frequencies of segments per opinion type.
In this section, we attempt to answer the challenges mentioned in the introduction of this paper: What is the role of discourse relations in subjectivity analysis? What is the impact of the discourse structure in determining the overall opinion conveyed by a document? Does a discourse based approach really bring additional value compared to a classical bag of words approach? Does this additional value depend on corpus genre? To this end, we explored the interactions between the discourse, the segment, and the opinion expression annotation layers. In particular,
• Section 5.3.1 investigates the correlation between discourse and opinion semantic category of subjective segments (mainly from the SE, SEI and SN categories). Recall that an opinion expression can belong to four semantic categories, namely: Sentiment-Appreciation, Judgment, Advice and Reporting. Our aim is to analyze to what extent semantic categories of opinion expressions can be an indicator for predicting discourse relations.
• Section 5.3.2 focuses on the impact of discourse on subjectivity analysis. Can discourse relations be used to predict subjectivity orientation of elementary discourse units?
• Section 5.3.3 analyzes the impact of discourse on polarity analysis. Can discourse relations be used to predict polarity of elementary discourse units?
• Section 5.3.4 studies the impact of segment opinion type and segment polarity on the determination of the document overall opinion. Do segments with implicit opinions contribute to the author's global opinion on the main topic of the document?
This section details the experiments addressing each of these challenges, while Section 6 summarizes the conclusions that answer these questions.

Discourse and opinion semantic categories
We tested two hypotheses: (H1) there is an association between the relative position of segments within the document and the semantic category of the opinion expressions they contain. If such an association is found, then the position can be used, for example, to identify the semantic category of segments conveying implicit opinions. (H2) there is an association between discourse relations and the semantic categories of the opinion expressions that appear within the relation arguments.
Position of segments vs. semantic categories. Table 7 gives the proportions (in percent) of opinion semantic categories according to the relative position of the segment they belong to. We considered two positions: the beginning and the end of the document. To compute them, we simply divided each document into three parts (beginning, middle, end), the first two segments being the beginning and the last two the end. In Table 7, the configurations Begin-x (resp. End-x) stand for segments containing an opinion expression from category x.
Using the χ² test, the hypothesis (H1) is confirmed at p < 0.05. We see that the proportion of the Advice category is higher when expressions of this type appear in segments at the end of the document. The proportion of the other categories is relatively stable. This increase is more pronounced in reviews (more than 10%) than in news reactions (around 5%), which confirms that users tend to end their reviews with expressions of recommendation, hope, or suggestion.
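The position-based counting described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names, the input layout (one list of category labels per segment) and the choice to normalize percentages within each position are assumptions made for the example.

```python
from collections import Counter

def position_of(index, n_segments):
    """Label a segment by its 0-based index: the first two segments are
    'Begin', the last two are 'End', everything else is 'Middle'.
    For very short documents, 'Begin' takes precedence (an assumption)."""
    if index < 2:
        return "Begin"
    if index >= n_segments - 2:
        return "End"
    return "Middle"

def category_proportions(documents):
    """documents: list of documents; each document is a list of segments;
    each segment is a list of semantic category labels (possibly empty).
    Returns, for 'Begin' and 'End', the percentage of each category."""
    counts = {"Begin": Counter(), "End": Counter()}
    for segments in documents:
        n = len(segments)
        for i, categories in enumerate(segments):
            pos = position_of(i, n)
            if pos in counts:
                counts[pos].update(categories)
    proportions = {}
    for pos, counter in counts.items():
        total = sum(counter.values())
        proportions[pos] = (
            {c: 100.0 * k / total for c, k in counter.items()} if total else {}
        )
    return proportions

# Hypothetical toy document with six segments.
toy = [[["Judgment"], [], ["Judgment"], [], ["Advice"], ["Judgment"]]]
result = category_proportions(toy)
```

With real annotations, the Begin-x and End-x counts would then be compared across categories, for instance with a χ² test.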
Table 7: Proportions (in percent) of opinion semantic categories according to the relative position of the segment they belong to.
Discourse relations vs. semantic categories of their arguments. For each corpus, we constructed three contingency tables:
• (T1) gives the number of discourse relations that have a right argument containing an opinion expression from a given semantic category. For each discourse relation R and for each semantic category c ∈ {Sentiment-Appreciation, Judgment, Advice, Reporting}, we counted all the pairs R(se_c, all) where se_c is an SE segment containing an opinion expression from a category c and all stands for an EDU whatever its type (i.e., SE, SEI, O, SN or SI).
• In Table (T2), we do the same by counting all the pairs R(all, se_c).
• Table (T3) provides the frequencies for each relation R and the frequencies of R(se_c, se_c).
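The three tables can be built with a single pass over the annotated relations. The sketch below is only illustrative: the data layout (a relation name plus two argument dictionaries) and the function name are hypothetical, and (T3) is read here as requiring the same category c in both arguments, which the text does not state explicitly.

```python
from collections import defaultdict

CATEGORIES = ("Sentiment-Appreciation", "Judgment", "Advice", "Reporting")

def build_tables(relations):
    """relations: iterable of (name, first_arg, second_arg), where each
    argument is a dict {"type": segment type, "categories": set of labels}.
    Returns three dicts relation -> category -> count:
      t1 counts pairs R(se_c, all), t2 counts R(all, se_c),
      t3 counts R(se_c, se_c)."""
    def new_row():
        return dict.fromkeys(CATEGORIES, 0)

    t1, t2, t3 = defaultdict(new_row), defaultdict(new_row), defaultdict(new_row)
    for name, first, second in relations:
        # Touch the rows so every relation appears, even with zero counts.
        row1, row2, row3 = t1[name], t2[name], t3[name]
        # Only SE segments count as se_c, following the definition above.
        first_cats = first["categories"] if first["type"] == "SE" else set()
        second_cats = second["categories"] if second["type"] == "SE" else set()
        for c in CATEGORIES:
            if c in first_cats:
                row1[c] += 1
            if c in second_cats:
                row2[c] += 1
            if c in first_cats and c in second_cats:
                row3[c] += 1
    return dict(t1), dict(t2), dict(t3)

# Hypothetical toy annotations.
se = lambda *cats: {"type": "SE", "categories": set(cats)}
obj = {"type": "O", "categories": set()}
toy_relations = [
    ("Result", se("Judgment"), se("Advice")),
    ("Result", obj, se("Advice")),
    ("Contrast", se("Judgment"), se("Judgment")),
]
t1, t2, t3 = build_tables(toy_relations)
```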
Tables 8, 9, and 10 give respectively the results of (T1), (T2) and (T3) for the French movie reviews corpus. The tables associated with the other two corpora looked similar.
Table 9: Frequency of discourse relations that have a left argument containing an opinion expression from a given semantic category.
Given the frequencies in these tables, the hypothesis (H2) was rejected using the χ² test. For each corpus genre, there is no statistically significant relationship between discourse relations and the opinion category of their arguments. However, in the French corpora, after removing the relations Goal, Conditional, Frame, Background and Attribution from the contingency table (T1), the association between discourse relations and opinion category of right arguments was significant at p < 0.05 using the χ² test 34 . For EMR, the association is significant when removing the same set of relations as above and when discarding, in addition, the categories Advice and Reporting. In (T2), the association between discourse relations and left arguments was significant when removing the Advice category and the same set of relations as above except Attribution. Finally, for (T3), we get a statistically significant association when removing both the same set of relations as above and the categories Advice and Reporting.
Overall, the absence of a strong correlation between discourse relations and opinion categories may be due to the categories themselves, which were not adequate to capture these relations well. To confirm or reject hypothesis (H2), it would be interesting to conduct a similar study using different categories.
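To make the procedure concrete, the sketch below computes Pearson's χ² statistic for a relation-by-category contingency table and recomputes it after removing some relations or categories. It is a plain-Python illustration with made-up counts; the helper names are hypothetical, and the statistic is compared with the standard 0.05 critical values rather than converted to a p-value.

```python
# Standard chi-square critical values at the 0.05 level, by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}

def chi_square(table):
    """table: dict row -> dict column -> count (same columns in every row,
    at least one non-zero count). Returns (statistic, degrees of freedom).
    Rows and columns whose total is zero are ignored."""
    all_cols = list(next(iter(table.values())))
    rows = [r for r in table if sum(table[r].values()) > 0]
    cols = [c for c in all_cols if sum(table[r][c] for r in rows) > 0]
    row_tot = {r: sum(table[r][c] for c in cols) for r in rows}
    col_tot = {c: sum(table[r][c] for r in rows) for c in cols}
    total = sum(row_tot.values())
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / total
            stat += (table[r][c] - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat, dof

def drop(table, rows=(), cols=()):
    """Return a copy of the table without the given rows and columns."""
    return {
        r: {c: v for c, v in row.items() if c not in cols}
        for r, row in table.items()
        if r not in rows
    }

# Hypothetical counts, not taken from the corpus.
toy = {
    "Result": {"Judgment": 10, "Advice": 20},
    "Contrast": {"Judgment": 20, "Advice": 10},
    "Goal": {"Judgment": 1, "Advice": 1},
}
stat, dof = chi_square(drop(toy, rows=("Goal",)))
significant = stat > CRITICAL_05[dof]
```

For tables of realistic size, a statistics library that returns exact p-values would be the more practical choice.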
Concerning the distribution of relations with regard to the opinion semantic category, the proportion of Attribution relations is relatively high when the first argument of this relation is from the Reporting category. We also have instances from Continuation and Elaboration. Similarly, the proportion of Result is high when its second argument contains an Advice expression. Examples like (20) are very frequent in our reviews corpora (here segments 4 and 5 contain explicit recommendations to see the movie and they are related to the first part of the document by a Result relation):
(20) [It it is the best adventure movie of
Other relations do not preserve subjectivity across our corpora: Background, Attribution and Frame. In news reactions, Attribution preserves subjectivity in 50% of cases whereas in reviews the proportion is about 20%. This might be because examples like [The chairman thought] [that it rained in his town yesterday] are more frequent in movie reviews than in news reactions, where attributions are more often used to introduce opinions and points of view. Subjectivity preservation in the case of Frame is about 40% in French documents vs. 87% in English reviews because, in the French corpora, this relation often relates non evaluative segments to evaluative ones. Correction seems to preserve subjectivity in reviews (60% in English reviews and 83% in French reviews) but not in news reactions, where the proportion is about 50%. We observe the contrary for Conditional and Entity-Elaboration, where subjectivity preservation is more frequent in news reactions. Indeed, in FMR, consequences are often objective even when their corresponding conditions are evaluative, as shown in (24).
preservation frequencies. The polarity preservation frequencies of each discourse relation across corpus genres are statistically different from what is expected by chance using the χ² test. Note however that the differences between the observed and the expected frequencies of Conditional, Correction and Contrast were not significant.
As far as polarity is concerned, our hypotheses seem by and large verified as well, for all corpora. However, contrary to expectations, Contrast seems to change polarity in reviews but not in news reactions. In reactions, this can be explained by examples of the type: [The economical situation is grim,] [but the cultural life is grim as well], where the connective but links the two segments and leads the annotators to place a Contrast between them. However, in this particular case it would be more appropriate to link the two segments by the Parallel relation or by both Parallel and Contrast 36, which is possible in SDRT and provides the right semantics for such relations (Asher (1993)). Note however that the frequencies of Contrast and Correction in all the corpora were not significant. We need more annotations to establish the relationships between these relations and polarity analysis.

Segment type, segment polarity, and overall opinion
36. When preparing the gold standard, we reconsidered the relation labels only in 5% of the cases.
We investigated whether implicit opinion segments contribute to the author's global opinion on the main topic of the document. We computed Pearson's correlation between the global opinion score (on a scale going from 0 for a strongly negative opinion, to 4 for a strongly positive opinion) and the subjectivity class and polarity of the segments. More specifically, for each of the three corpora, we constructed a vector with the global opinion scores for all the annotated document instances 37. Then, another set of four vectors was built for each corpus, with the counts of segments of a given subjectivity class and polarity: SE Pos for explicit positive opinion segments (SE and SEI classes with a positive polarity); SE Neg for explicit negative opinion segments; SI Pos for implicit positive opinion segments (SI class with positive polarity); and SI Neg for implicit negative opinion segments. Similarly, we computed the correlation between the overall opinion and segment polarity regardless of segment type: All Pos for positive segments and All Neg for negative segments. In addition, we measured the correlation between the overall opinion vector and the average segment scores (given between −3 and +3) of each document (All Avg). The results are shown in Table 11, averaged over all the annotators.
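As an illustration of this setup, the sketch below computes Pearson's correlation between a vector of global opinion scores and a vector of per-document segment counts. The numbers are invented for the example and the function name is hypothetical; only the correlation formula itself is standard.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length vectors.
    Returns NaN when one of the vectors is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return float("nan")
    return cov / (dev_x * dev_y)

# One entry per annotated document instance (hypothetical values).
global_scores = [0, 1, 2, 3, 4]   # overall opinion, 0 (very negative) to 4
se_pos_counts = [1, 2, 3, 4, 5]   # number of explicit positive segments
se_neg_counts = [8, 6, 4, 2, 0]   # number of explicit negative segments

r_pos = pearson(global_scores, se_pos_counts)
r_neg = pearson(global_scores, se_neg_counts)
```

In this toy case the positive counts rise with the global score and the negative counts fall with it, so the two coefficients have opposite signs, which matches the reading of "good correlation" for negative polarities given below.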
37. If one input document has been doubly annotated, we thus obtained two document annotation instances.
Table 11: Correlations between overall opinion and segment opinion type/polarity.
In movie reviews (FMR and EMR) there is a better correlation between global opinion score and explicit subjective segment counts (of both positive and negative polarities; for negative polarities, a good correlation means a negative Pearson's correlation of high absolute value) than between global opinion score and implicit subjective segment counts. In FNR, a different behavior is observed: the correlation is better for segments which contain implicit opinions. This brings us to the conclusion that the importance of implicit opinions varies, depending on the corpus genre: in movie reviews, which are more direct and sometimes terse, explicit opinions are better correlated with the global opinion score, whereas in news reactions, implicit opinions are more important. This could indicate a tendency to "conceal" negative opinions as apparently objective statements, which can be related to social conventions (politeness, in particular) (Pang and Lee (2008)). When we group segments by polarity (cf. All Pos and All Neg), we observe that the correlations with positive segments are better than those with negative segments. The politeness bias is more salient in news reactions than in movie reviews, where users tend to express their opinions in a more positive way. Finally, we see that correlations in All Avg are the lowest, which confirms that the overall opinion is not simply an aggregation of opinions taken in isolation. A more elaborate way of aggregation is needed.

Discussions

Interim conclusions
In this paper, we aimed at measuring the impact of discourse on sentiment analysis with a study of three corpora: French and English movie reviews as well as French news reactions. Here are the main conclusions of our corpus-based study:
(a) Segment-based opinion analysis is more appropriate to study opinions in discourse. Our results showed that more than 90% of segments contain only one opinion expression. This demonstrates that the segment level makes polarity analysis easier compared to the sentence or the clause level. In addition, our automatic discourse segmentation is feasible and yielded very good results.
(b) Complex discourse units (CDUs) are an important part of the discourse structure of a document. Across all corpora, our results showed that the proportion of relations involving CDUs is higher than the proportion of relations linking EDUs. In particular, we observed that the arguments of the relations Contrast, Elaboration, and Result are CDUs in more than 55% of cases. CDUs related with a Frame are more frequent in movie reviews (more than 57%) whereas those related with a Commentary are more frequent in the French corpora (more than 64%). These results demonstrate that CDUs are important for assessing the overall opinion of a document.
Figure 19: Discourse relations and polarity in FNR.
(c) Implicit opinions are important. Our results showed that the importance of implicit opinions varies, depending on the corpus genre: for movie reviews, explicit opinions are better correlated to the global opinion score, whereas for news reactions, implicit opinions are more important when negative opinions are concerned.
(d) Semantic categories of opinion expressions can be good indicators for identifying some discourse relations. Indeed, we observed that the discourse relations Contrast, Continuation, Narration, Alternative, Result, Parallel, Elaboration, Entity-Elab, Correction, Explanation, and Commentary are correlated with the semantic categories (Reporting, Judgment, Advice, and Sentiment-Appreciation) of the opinion expressions within their arguments.
(e) Discourse relations can be grouped according to their effects on the opinion orientation of elementary discourse units. We studied 17 discourse relations that involve entities from the propositional content of the clauses: 9 coordinating relations (Contrast, Continuation, Conditional, Narration, Alternative, Goal, Result, Parallel, Flashback) and 8 subordinating relations (Elaboration, E-Elab, Correction, Frame, Explanation, Background, Commentary, Attribution). Among these relations, some can be grouped according to their similar effects on both subjectivity and polarity analysis: Correction and Contrast, Elaboration and Explanation, Continuation, Parallel, Narration, and Alternative. Table 12 summarizes the effects of these relations. For a given relation (or group of relations), "√" (resp. "X") indicates that the relation preserves (resp. does not preserve) subjectivity (resp. polarity) in more than 75% of cases in at least two corpora. This table shows that some relations have no effect at all on sentiment analysis (Frame, Goal, Background, Conditional and Flashback) while others affect subjectivity analysis, polarity analysis, or both. These results confirm that discourse relations can help in identifying segments conveying implicit opinions or in retrieving segment contextual polarity, which can be very useful, for instance, in identifying ironic statements.
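The marking rule used for Table 12 can be written down as a small decision function. The sketch below is a hedged reading of that rule: it assumes that a relation instance "preserves" a property when both of its arguments carry the same label, and the function names and return values are hypothetical.

```python
def preservation_rate(instances):
    """instances: list of (label_of_first_arg, label_of_second_arg) for one
    relation in one corpus. Returns the share of instances whose two
    arguments carry the same label (assumed meaning of 'preserves')."""
    same = sum(1 for first, second in instances if first == second)
    return same / len(instances)

def table_mark(rates, threshold=0.75, min_corpora=2):
    """rates: preservation rates (between 0 and 1), one per corpus.
    Returns 'preserves' (the check mark), 'does_not_preserve' (the X),
    or None when neither holds in at least `min_corpora` corpora."""
    if sum(rate > threshold for rate in rates) >= min_corpora:
        return "preserves"
    if sum((1 - rate) > threshold for rate in rates) >= min_corpora:
        return "does_not_preserve"
    return None

# Hypothetical polarity labels for one relation in one corpus.
toy_instances = [("pos", "pos"), ("pos", "neg"), ("neg", "neg"), ("pos", "pos")]
toy_rate = preservation_rate(toy_instances)

# Hypothetical rates for three corpora.
mark_a = table_mark([0.8, 0.9, 0.5])
mark_b = table_mark([0.2, 0.1, 0.5])
mark_c = table_mark([0.5, 0.6, 0.8])
```

Relations that receive no mark for either task would be the ones described above as having no effect on sentiment analysis.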

Portability of the annotation scheme
The results reported in this study were obtained on manually annotated discourse structures when the annotation scheme was instantiated on two corpus genres: movie/product reviews and news reactions. These corpora have similar characteristics: they are texts and not discussions/dialogues (recall that letters to the editor that responded to other letters were removed from FNR), they are relatively small (less than 30 EDUs per document), and opinions are about one main topic and its related subtopics and are the viewpoints of one holder (mainly the author of the review). More importantly, the overall opinion is the result of a bottom-up aggregation process, from local opinions at the segment level to the global opinion at the document level. However, several other corpus genres do not meet these characteristics. Some are author-oriented, like blogs, where all the documents (posts and comments) are associated with the blogs' owners; others are both multi-topic and multi-holder documents, like news articles; while others are composed of follow-up opinions, as in discussion forums. To what extent is the CASOAR annotation scheme portable to these other sources of opinion?
Concerning blogs, we believe that our scheme can be easily applied. Blog comments are generally short and give the point of view of one author towards the main topic of the blog article, which is quite similar to news reactions. For news documents, things are more complicated since several viewpoints by several opinion holders are mentioned. Consider the following scenario. The author introduces and elaborates on a topic, 'switches' to other topics or reverts back to an older topic. This is known as discourse popping, where a change of topic is signaled by the fact that the new information does not attach to the prior clause, but rather to an earlier one that dominates it (Asher and Lascarides (2003)). In this case, our three-level annotation scheme needs to be adapted. Though the discourse annotation model incorporates discourse pops, their effect on the topics of opinions is presently not taken into account. Discourse pops often indicate shifts in topic, and so, instead of one topic, we will have to deal with many. At the expression level, we have to take this multi-topicality into account by modifying the annotation of topic spans. At the segment level, we would have to link each opinion expression to its topic. At the document level, the notion of overall opinion has to evolve towards (topic, holder) overall opinion scores. Each score can be computed using a bottom-up aggregation procedure over a discourse sub-graph focusing only on those segments that convey the opinions of a specific holder. This procedure needs however to be tested on news documents to show its feasibility.
Finally, adapting our scheme to discussion forums will require us to extend it to handle dialogues. A thorough linguistic analysis of the link between opinion and discourse in dialogue will be very interesting.
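The move from a single overall opinion to (topic, holder) scores can be pictured with the sketch below. It only shows the grouping step: the segment layout is hypothetical, the discourse sub-graph is not modeled, and the plain mean is a placeholder for an aggregation procedure that the text leaves open (the correlation results above suggest that simple averaging is a weak aggregator).

```python
from collections import defaultdict

def scores_by_topic_and_holder(segments):
    """segments: list of dicts with keys 'topic', 'holder' and 'score'
    (segment score, assumed here to lie between -3 and +3).
    Returns a dict (topic, holder) -> mean segment score."""
    grouped = defaultdict(list)
    for segment in segments:
        grouped[(segment["topic"], segment["holder"])].append(segment["score"])
    # Placeholder aggregation: a plain mean over the selected segments.
    return {key: sum(values) / len(values) for key, values in grouped.items()}

# Hypothetical segments from a multi-topic, multi-holder news document.
toy_segments = [
    {"topic": "reform", "holder": "author", "score": 2},
    {"topic": "reform", "holder": "author", "score": -1},
    {"topic": "reform", "holder": "minister", "score": 3},
    {"topic": "budget", "holder": "author", "score": -2},
]
overall = scores_by_topic_and_holder(toy_segments)
```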

Towards discourse-based sentiment analysis
The CASOAR corpus is a first step towards automated discourse-based opinion analysis. We have already used a subset of this corpus in order to investigate how discourse can help in different sentiment analysis stages. In Benamara et al. (2011), we investigated how discursive features could improve subjectivity analysis. We automatically distinguished between subjective non-evaluative (SN) and objective segments (O) and between implicit (SI) and explicit opinions (SE), by using both local and global context features. Chardon et al. (2013) exploited the French gold standard corpus to determine the best strategies for automatically computing the overall opinion of a document. Here we have carried out a complementary, in-depth, multilingual and multi-genre analysis, with a new corpus study for English and new results concerning the French corpus.
A final issue is how to validate our results on automatically parsed data. Since review style documents are relatively short, we believe that building such a discourse parser becomes easier. As far as we know, the only existing powerful discourse parser based on SDRT is the one that has been developed on top of the Annodis corpus. This parser achieves between 47 and 66% accuracy on the structure for the full set of 17 relations. We plan to adapt this parser to opinion texts. In particular, given our observations (cf. Table 12), we propose to discard certain relations from the learning process and to group others according to their similar effect on both subjectivity and polarity analysis. This will reduce the number of relations to be predicted to 10 instead of the current 17, which, we believe, will make our discourse parser more reliable.
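The text does not list the final label set, so the mapping below is only one possible reading of the interim conclusions, chosen because it happens to yield 10 labels: the five relations reported as having no effect are discarded, and two pairs (Correction with Contrast, Elaboration with Explanation) are merged. All names in the sketch are illustrative assumptions.

```python
ALL_RELATIONS = [
    # 9 coordinating relations
    "Contrast", "Continuation", "Conditional", "Narration", "Alternative",
    "Goal", "Result", "Parallel", "Flashback",
    # 8 subordinating relations
    "Elaboration", "Entity-Elaboration", "Correction", "Frame",
    "Explanation", "Background", "Commentary", "Attribution",
]

# Relations reported as having no effect on sentiment analysis.
DISCARDED = {"Frame", "Goal", "Background", "Conditional", "Flashback"}

# Assumed merges (one reading of the grouping, not confirmed by the text).
MERGED = {
    "Correction": "Correction/Contrast",
    "Contrast": "Correction/Contrast",
    "Elaboration": "Elaboration/Explanation",
    "Explanation": "Elaboration/Explanation",
}

def reduce_label(relation):
    """Map an original relation to its reduced label, or None if discarded."""
    if relation in DISCARDED:
        return None
    return MERGED.get(relation, relation)

reduced_labels = {reduce_label(r) for r in ALL_RELATIONS} - {None}
```

A parser trained on the reduced labels would skip instances mapped to None and predict one of the remaining labels for the rest.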

Conclusion
In this paper, we presented the CASOAR corpus, built with a multi-layered annotation scheme for analyzing opinion in discourse that includes: the complete discourse structure according to the Segmented Discourse Representation Theory, the opinion orientation of elementary discourse units and opinion expression annotation. For each layer, we presented the annotation model, annotation guide, and results of its annotation campaign. We explored the interactions between these different layers, in particular the impact of discourse structure on the overall opinion of a document and implicit opinions, the link between discourse and opinion semantic category, and the role of discourse relations in both subjectivity and polarity analysis. Our results demonstrate that opinion and discourse structure are strongly related and that discourse is an important cue for sentiment analysis, at least for the corpus genres we have studied.