Digging Communicative Intentions: The Case of Crises Events

In emergency situations, users of social networks convey all sorts of communicative intentions, well known since the work of Austin (1962) and Searle (1969) as speech acts (SA). While speech acts have been the focus of close scrutiny in the philosophical and linguistic literature (see Portner (2018) for extended discussion), their role has only rarely been understood and exploited in processing social media content about crisis events, our focus here. Current work on communicative intentions in social media is topic-oriented, focusing on the correlation between SA and specific topics such as crisis (e.g., earthquakes) but also politics, celebrities, cooking, travel, etc. It has been observed that people globally tend to react to natural disasters with SA distinct from those used in other contexts (e.g., celebrities, which are essentially made up of comments). Here, we explore the further hypothesis of a correlation between different SA types and urgency, and propose an in-depth linguistic and computational analysis of communicative intentions in tweets from an urgency-oriented perspective. Indeed, SA are mostly relevant to identify intentions, desires, plans and preferences towards action and to ultimately produce a system intended to help rescue teams. Our contribution is four-fold and consists of: (1) a two-layer annotation scheme of speech acts at both the tweet and sub-tweet levels, (2) a new French dataset of about 13K tweets annotated for both urgency and SA, targeting both expected (e.g., storms) and unexpected or sudden (e.g., building collapse, explosion) events, (3) a thorough analysis of the annotations, studying in particular the correlation between SA and the urgency of the message, SA and intention to act categories (e.g., human damages), and SA and crisis types, and finally (4) a set of deep learning experiments to detect SA in crisis-related corpora.
Our results show a strong correlation between SA and urgency annotations at both the tweet and sub-tweet levels, with a particularly salient correlation in the latter case. This constitutes a first important step towards SA-aware NLP-based crisis management on social media.


Introduction

Motivation
In ordinary interaction as well as in social networks, speakers unveil a variety of communicative intentions, among which making content known, expressing their own views and opinions, or enhancing action. Since Austin (1962) and later, more prominently, Searle (1975), these communicative intentions have been known as speech acts.
Before percolating into the computational literature, speech acts (henceforth SA) have been the object of extensive discussion in the philosophical and linguistic communities (Hamblin, 1970; Brandom, 1994; Sadock, 2004; Asher and Lascarides, 2008; Portner, 2018; Bach and Harnish, 1979, to mention just a few). According to the initial Austinian view, SA serve to achieve action rather than to convey information. When uttering I now baptize you, the priest accomplishes the action of baptizing rather than just stating a proposition. Beyond these prototypical cases, the literature has quickly broadened the understanding of the notion of SA as a special type of linguistic object that encompasses questions, orders and assertions and transcends propositional content, revealing communicative intentions on the part of the speaker (Bach and Harnish, 1979; Gunlogson, 2008; Asher and Lascarides, 2008; Giannakidou and Mari, 2021c): with an assertion, the speaker intends to present the propositional content and to add it to the common ground (Portner, 2018); with a question, the speaker asks the addressee to provide new information; with an order, the speaker asks that the content be realized; and with exclamatives, a subjective evaluation towards propositional content is conveyed.
Our study investigates the communicative intentions that SA convey in urgency situations and, more importantly, how intentions vary according to the degree of urgency of the information (urgent vs. not urgent vs. not useful; cf. examples below) when posted in social networks. We focus on messages posted on Twitter, as tweets are widely used to generate valuable information in crisis situations (Reuter et al., 2018). For example, the Notre Dame fire that occurred in France has been the most used on Twitter in 2019 1 and, in the recent earthquake in Turkey and Syria, some victims trapped in the rubble were saved thanks to the messages they posted (Toraman et al., 2023).
SA are particularly helpful in identifying urgent messages. These are messages that raise situational awareness over a crisis situation and some specific aspects that include human/infrastructure damages, security instructions, etc. They provide actionable information that will help human teams to set priorities and decide on appropriate actions (Vieweg et al., 2014; Castillo, 2016; Reuter and Kaufhold, 2018). Therefore, speaking subjects perform qualitatively very different language acts depending on the situation they find themselves in. They mostly aim to make interlocutors react (i.e., perlocutionary level) by different linguistic means (illocutionary level, the level at which speech acts are encoded), in view of achieving a purpose. 2

When Communicative Intentions Reveal Urgency
By revealing speakers' communicative intentions and aiming at triggering the addressee's reaction, speech acts become essential in emergency situations where action is to be enhanced. We have thus used two independent classifications: (i) a new, two-level classification of speech acts, and (ii) an independent classification for urgency and actionability elaborated in Kozlowski et al. (2020).
The following are two examples 3 of how these two classifications proceed. We use the → notation, with the tweet-level categories on its left and the sub-tweet level categories on its right. A precise definition of the labels will be provided later in the paper (see Section 3).

(1) a. [The fire situation in the Landiras area is getting worse.]1 [Please follow the instructions of the fire brigade and the police.]2
(SA annotation) JUSSIVE → 1. PROPER ASSERTIVE; 2. OPEN OPTION
(Urgency annotation) URGENT: WARNING/ADVICE

b. [5th day of fire fighting, about 6000 hectares of our forest charred here. Still the same means at the disposal of our firemen: 2 aircraft and 1 dash.]1 [What are you waiting for to give them the means to stop this fire? @EmmanuelMacron @GDarmanin #landiras]2
(SA annotation) SUBJECTIVE → 1: PROPER ASSERTIVE; 2: EVALUATION
(Urgency annotation) URGENT: MATERIAL DAMAGE

As shown in these examples, a tweet is composed of several parts that contribute to the construction of the communicative intention of the whole message. These parts may convey (and indeed often do) very different speech act types. Therefore, tweets also need to be analyzed at the sub-tweet level, in order to search for more precise and specific content that provides useful actionable information.
For (1-a), the writer publicly expresses an explicit demand (hence a JUSSIVE 4 speech act at the tweet level) for the population to follow the authorities' instructions as the wildfires in the Landiras region keep spreading. At the sub-tweet level, (1-a) first presents a description of the situation (cf. segment 1, which triggers a speech act of PROPER ASSERTIVE) and then provides advice on how to behave (see segment 2, which qualifies as an OPEN OPTION in our classification, cf. infra). The latter is the most useful piece of content as it provides new and actionable information triggering action expectation. For emergency and actionability, (1-a) qualifies as URGENT at the tweet level, specifically providing content that falls in the actionability category ADVICE.
As a further example, insofar as the speech act annotation is concerned, (1-b) expresses an intention to complain about the current means at the disposal of the fire brigades. The overall tweet is considered as expressing a subjective stance of the speaker (hence the overall label SUBJECTIVE) in virtue of the question, which reveals a complaint (the part containing the question is labeled as EVALUATION, cf. infra for details). The first segment is a PROPER ASSERTIVE. As for the emergency annotation of the same tweet, (1-b) qualifies as URGENT at the tweet level, providing content that is labeled MATERIAL DAMAGES at the actionability level. 5
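The two-layer annotation illustrated by (1-a) can be pictured as a small record that pairs one tweet-level label with a list of labeled segments. The following is a minimal sketch only: the class and field names (`Segment`, `AnnotatedTweet`, `sa1`, `sa2`, `intention_to_act`) are assumptions introduced here for illustration and do not describe the actual format of the dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """A sub-tweet span with its second-level speech act (hypothetical names)."""
    text: str
    sa2: str


@dataclass
class AnnotatedTweet:
    """A tweet carrying tweet-level SA, sub-tweet SA and urgency labels."""
    sa1: str               # tweet-level speech act, e.g. JUSSIVE
    urgency: str           # e.g. URGENT
    intention_to_act: str  # actionability category, e.g. WARNING/ADVICE
    segments: List[Segment] = field(default_factory=list)

    def arrow_notation(self) -> str:
        """Render the 'SA1 → 1. SA2; 2. SA2' notation used in the examples."""
        parts = [f"{i}. {seg.sa2}" for i, seg in enumerate(self.segments, start=1)]
        return f"{self.sa1} → " + "; ".join(parts)


# Example (1-a) from the text, encoded with the hypothetical classes above.
example_1a = AnnotatedTweet(
    sa1="JUSSIVE",
    urgency="URGENT",
    intention_to_act="WARNING/ADVICE",
    segments=[
        Segment("The fire situation in the Landiras area is getting worse.",
                "PROPER ASSERTIVE"),
        Segment("Please follow the instructions of the fire brigade and the police.",
                "OPEN OPTION"),
    ],
)
print(example_1a.arrow_notation())
```

Keeping the segments as an ordered list mirrors the numbering of sub-tweet units in the examples, so the arrow notation can be regenerated from the record.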

Previous Approaches and Research Questions
Since the introduction of dialogue acts (see, a.o., the DAMSL framework (Allen and Core, 1997; Core et al., 1998)), an extensive body of work has been dedicated to SA in the computational linguistics literature, where various approaches have been proposed to detect them in both synchronous (e.g., meeting, phone) (Stolcke et al., 2000; Keizer et al., 2002; Carvalho and Cohen, 2005; Joty and Mohiuddin, 2018) and asynchronous dialogues (e.g., emails, live chats, tweet threads) (Carvalho and Cohen, 2005; Joty and Mohiuddin, 2018; Bracewell et al., 2012). SA detection has been shown to be an important step in many downstream NLP applications such as strategic action prediction (Cadilhac et al., 2013), dialogue summarization (Goo and Chen, 2018) and conversational systems (Higashinaka et al., 2014). However, SA for emergency detection has received less attention in the literature, and most related work on communicative intentions in social media is topic-oriented, focusing on the correlation between SA and specific topics such as crisis (e.g., earthquakes, bombing, attacks) but also politics, celebrities, cooking, travel, etc. (Zhang et al., 2011; Vosoughi, 2015; Elmadany et al., 2018a; Saha et al., 2020b). These corpus-based studies show that there is a greater similarity of distribution between topics of the same type than between topics of different types. In particular, it has been observed that people globally tend to react to natural disasters with SA distinct from those used in other contexts (e.g., celebrities, which are essentially made up of comments).
Here, we explore the further hypothesis of a correlation between different SA types and urgency. We thus investigate whether SA can be used to sort urgent from not urgent messages. As far as we know, this is the first study that proposes an in-depth linguistic and computational analysis of communicative intentions in tweets from an urgency-oriented perspective: What are the most frequent intentions in urgent vs. not urgent messages? Are these intentions different from those found in non-useful messages? And more importantly, are they particularly correlated with fine-grained urgency categories (such as human/infrastructure damages, donations, security instructions, etc.)? Finally, are the observed SA stable across different types of crisis (flood, hurricane, fire, attack, etc.)? To answer these questions, and before moving to real scenarios that rely on SA-aware automatic detection of urgency (this is left for future work), we propose to (1) measure the impact of SA in detecting urgency during crisis events in manually annotated data, and (2) explore the feasibility of SA automatic detection in crisis corpora.
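To make the notion of a correlation between SA types and urgency labels concrete, the sketch below computes Cramér's V, one common association measure for two categorical variables, from a list of (SA, urgency) pairs. This is an illustrative assumption: the text does not state which statistic is used in the analysis, and the two tiny input lists are invented toy data, not figures from the dataset.

```python
from collections import Counter
from math import sqrt


def cramers_v(pairs):
    """Cramér's V between two categorical variables given as (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)              # observed counts per (x, y) cell
    row = Counter(x for x, _ in pairs)  # marginal counts of x
    col = Counter(y for _, y in pairs)  # marginal counts of y
    chi2 = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy data (invented): SA label perfectly predicts the urgency label.
perfect = [("JUSSIVE", "URGENT")] * 3 + [("SUBJECTIVE", "NOT URGENT")] * 3
# Toy data (invented): SA label carries no information about urgency.
independent = [("JUSSIVE", "URGENT"), ("JUSSIVE", "NOT URGENT"),
               ("SUBJECTIVE", "URGENT"), ("SUBJECTIVE", "NOT URGENT")]

print(cramers_v(perfect), cramers_v(independent))
```

The measure ranges from 0 (no association) to 1 (one variable fully determines the other), which makes it convenient for comparing the tweet-level and sub-tweet-level layers on the same scale.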

Overview of the Main Contributions
We build on Laurenti et al. (2022a), where we performed a preliminary analysis of the role of SA in urgency detection in about 6.6K tweets with a focus on natural disasters (flood, hurricane, storm, etc.). In Laurenti et al. (2022a), we relied on a new annotation scheme of SA that takes into account the variety of linguistic means whereby SA are expressed (including lexical items, punctuation, etc.), both at the message and sub-message level. We further extend this initial work by proposing:
• The first and largest French dataset of about 13,300 tweets annotated for both urgency and SA following the same annotation scheme. In addition, we extend the annotations to 6 new sudden crises, so that the dataset spans over 20 crises. 6
• A qualitative and quantitative analysis of the annotation campaign intersecting the two-level classification of speech acts with a classification of urgency. In particular, we explore the correlations between SA and urgency, SA and intention to act categories, as well as SA and the types of crises for both levels of SA annotations. Our results show a strong correlation between SA and urgency annotations at both the tweet and sub-tweet levels, with a particularly salient correlation in the latter case, which constitutes a first important step towards SA-aware NLP-based crisis management on social media.
• A set of deep learning experiments to detect speech acts, relying on architectures coupled with relevant linguistic features about how SA are linguistically expressed. We consider several experimental settings ranging from monotask to multitask learning, including multi-label classification. Our results show that SA detection achieves very encouraging results, offering the community a new state of the art for SA detection in French social media.
• An error analysis of the automatic detection at both SA levels, highlighting the main cases of misclassification.
6. The annotated dataset will be available for research purposes upon request.
This paper is organized as follows. Section 2 presents related work in SA detection in social media as well as the main existing crisis datasets. Section 3 provides the classification of SA we propose and the guidelines used to annotate them. Sections 4 and 5 respectively detail the dataset we relied on and the results of the annotation campaign. Section 6 focuses on the experiments we carried out to detect SA automatically. We end with some perspectives for future work.

Related Work
Speech acts have been extensively studied in the computational linguistics literature since the early 2000s. Most studies focus on SA in human-human dialogue, where several datasets have been annotated relying on various taxonomies of SA (also known as dialogue acts), such as QUESTION, ACKNOWLEDGMENT and FOLLOW-UP QUESTIONS (see Serban et al. (2018); Gonçalo et al. (2022) for recent surveys in the field). Dialogues being out of the scope of this paper, we focus in this section on SA for social media content, a relatively under-explored area of research compared to dialogue. We first provide an overview of SA used to annotate tweets about various events including crises as well as other domains (politics, offensive language, etc.). We then review the main approaches for SA automatic detection. As our dataset combines SA and urgency annotations for the first time, we end this section by presenting existing crisis-related datasets, highlighting the novelty of this study.

SPEECH ACTS IN THE CRISIS DOMAIN
The main line of analysis of the role of SA in tweets consists in unveiling how speech acts (as used on Twitter) vary qualitatively according to the topic discussed. In this line of questioning, SA have been studied as filters for new topics. Zhang et al. (2011), in particular, resort to a Searlian typology of SA that distinguishes between assertive STATEMENTS (description of the world) and expressive COMMENTS (expression of a mental state of the speaker). Zhang et al. (2011) also distinguish between interrogative QUESTIONS and imperative SUGGESTIONS. Finally, a category MISCELLANEOUS brings together the Searlian DECLARATIVES and the COMMISSIVES, used to make promises. Concerning the question of emergency, Zhang et al. (2011) showed that the distribution of SA on Twitter in the context of a natural disaster (e.g., earthquake in Japan) is distinctive: it is essentially composed of statements, associated with comments and suggestions/orders. In this context, new information or ideas on how to (re)act are indeed expected, and assertions are the most suitable to this aim. By contrast, discussion about a celebrity will mostly generate comments and almost no orders or suggestions. Indeed, in this context, subjectivity matters more than immediate action.
Also inspired by Searle's typology, Vosoughi (2015) and Vosoughi and Roy (2016) distinguish six categories: ASSERTIONS, RECOMMENDATIONS, EXPRESSIONS, QUESTIONS, REQUESTS and MISCELLANEOUS. The authors use the definitions of Zhang et al. (2011), distinguishing the topic discussed in the tweets from the type of topic (entity-oriented topics, event-oriented topics, or long-standing topics, which are topics about subjects that are commonly discussed). Six topics were then selected (2 of each type): for entity-oriented topics, they are interested in Ashton Kutcher and the Red Sox; for event-oriented topics, they studied the Boston bombings in 2013 and the Ferguson demonstrations in 2014; for long-standing topics, they considered cooking and travel. The distribution of speech acts shows a greater similarity of distribution between topics of the same type than between topics of different types. On the other hand, the entity-oriented and event-oriented types are closer to each other, with a majority of assertions and expressions, whereas for the long-standing types, assertions are less abundant and recommendations well represented.
In this same perspective of topic identification and relying on the same topic characterization as above, Elmadany et al. (2018b) manually annotate 21,000 tweets in Arabic according to their topic type and distinguish events like the Sinai bombings, the Gulf crisis, the Arab Spring and World Cup qualifications, entities (especially people) and various issues such as travel or cooking. Each tweet is associated with a speech act/sentiment pair according to the following classification: ASSERTIONS, RECOMMENDATIONS, EXPRESSIONS and REQUESTS for speech acts and, among sentiments, the standard Positive, Negative, Mixed and Neutral categories. Their study reveals a salient association between assertions, people/events and neutrality on the one hand, and an association between expressivity, long-standing topics and negativity on the other.

SA IN OTHER DOMAINS
In a recent and extensive study of SA in social media, Bell (2020) takes a different approach from other studies in the literature on Speech Act Theory and conducts an empirical investigation into the identity of illocutionary force indicating devices, which are the elements responsible for encoding a speaker's intentions. A corpus of 1,000 Twitter threads is collected, manually segmented by an expert and annotated at the sub-tweet level, allowing multiple speech acts per tweet, as opposed to most other studies. The following SA are considered: ASSERTIVE, DIRECTIVE, INTERROGATIVE, EXPRESSIVE, COMMISSIVE, EXERCITIVE (with a commissive, the speaker commits themselves; with an exercitive, the speaker requires someone else's commitment). This study distinguishes direct and indirect illocutionary acts (i.e., acts performed by way of performing another). Regarding the direct force, the majority of segments (64.5%) were annotated as assertive. The second most frequent category was expressive, with 16.3%. At the other end, the least frequent category was exercitive, with 0.25%. Regarding the indirect force, 83.9% of tweets were determined to perform no indirect act, and about 80% of those annotated as performing one were expressives.
In Plakidis and Rehm (2022), an annotation of SA is done using a subset of 600 tweets taken from a German corpus of offensive and non-offensive tweets. Mainly inspired by Searle (1975), and building upon Compagno et al. (2018) and Weisser (2018), the tweets are segmented into sentences, which are then annotated on two main levels: the syntactic level (e.g., declarative, exclamative, imperative, etc.), which describes the type of sentence, and the speech act level, consisting of a coarse-grained and a fine-grained level, which describes the type of speech act. The categories used for the first speech act level are as follows: ASSERTIVE, EXPRESSIVE, DIRECTIVE, COMMISSIVE, OTHER and UNSURE. They are subsequently detailed into 23 sub-classes at the sub-tweet level. For example, the category ASSERTIVE is further detailed into the following 6 sub-categories: ASSERT ("It costs 200$"), SUSTAIN ("I'm going to buy it because it's very convenient"), GUESS ("I'm unsure he's right for her"), PREDICT ("It will be a few hundreds at most"), AGREE ("You are right"), DISAGREE ("I don't think so"). The results suggest that offensive language contains more expressives and fewer assertives than non-offensive language. Tweets with implicit offensive language have a lower frequency of expressives and a higher frequency of assertives than tweets with explicit offensive language.
In view of the topic (offensive language), the distinction between assertives and expressives is reported as a prominent issue, which does not arise in the context of urgency detection, where the description of facts (assertives) and the evaluation of said facts (expressives) are more clearly distinct.
For completeness, we note that SA have also been studied in the context of political campaigns, notably by Subramanian et al. (2019), with a corpus of 258 official documents related to the 2016 Australian "federal election cycle": official statements, tweets, press clippings, etc., from which 7,641 utterances are extracted. Each utterance is annotated with a SA and a target party (liberal or conservative). The categorization of SA articulates: ASSERTIVES, COMMISSIVES-ACTION-SPECIFIC, COMMISSIVE-ACTION-VAGUE, COMMISSIVES-OUTCOME (about a future reality state), DIRECTIVES, EXPRESSIVES, PAST-ACTIONS and VERDICTIVES (an assessment on prospective or retrospective actions). They observe an over-representation of assertives (40%), followed by verdictives (25%) and specific action (12%). The other categories represent less than 10% of the annotations.
It is interesting to note that commissives make up almost a quarter of the assigned speech acts, whereas they are almost absent from our corpus, which is related to emergency.

SA AUTOMATIC DETECTION
SA prediction has been tackled either as a primary task (i.e., a multi-class classification problem) or as an auxiliary task where SA information is used to boost the performance of classification tasks such as sentiment analysis, emotion detection or hate speech detection. Some works consider speech acts at the message level while others consider dialogue acts when uttered in conversations.
At the message level, most state-of-the-art approaches make use of feature-based machine learning algorithms (SVM, Naive Bayes, Decision Tree) relying on various surface, lexicon and syntactic features such as unigrams, punctuation, POS, emoticons and sentiment words (Zhang et al., 2011; Rojas-Barahona et al., 2012; Franovic and Šnajder, 2012; Vosoughi and Roy, 2016; Sherkawi et al., 2018; Algotiml et al., 2019). Deep learning architectures have also been explored. Saha et al. (2021) propose a multi-modal approach for detecting SA in Arabic tweets relying on a multi-tasking framework based on a dyadic attention mechanism (Vaswani et al., 2017) and adversarial loss to simultaneously predict sentiment, emotion and speech acts. It employs intra-modal and inter-modal attention to fuse multiple modalities and learn generalized features across all the tasks. Subramanian et al. (2019) propose a target-based speech act classification on a dataset of political discourse using a semi-supervised learning approach (biGRU) that incorporates contextualized word representations (ELMo) and a cross-view training framework to augment the initial dataset with in-domain unlabeled text. Finally, Saha et al. (2020a) combine BERT and capsule networks (Sabour et al., 2017) to assess the intent of tweets (expression, statement, suggestion, threat, request, question).
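As an illustration of the kind of surface features mentioned above (unigrams, punctuation, emoticons), the following minimal sketch turns a tweet into a feature dictionary that could feed a classical classifier. The function name `surface_features`, the feature names and the small emoticon pattern are assumptions made for this example; they do not reproduce the feature sets of the cited systems.

```python
import re
from collections import Counter

# Tiny illustrative emoticon pattern, e.g. ":)", ";-(", "=D" (an assumption,
# far from exhaustive).
EMOTICON = re.compile(r"[:;=][\-']?[)(DPp]")


def surface_features(tweet: str) -> dict:
    """Extract unigram counts plus a few punctuation and emoticon cues."""
    # Tokens are word characters, optionally prefixed by '#' or '@'.
    tokens = re.findall(r"[#@]?\w+", tweet.lower())
    feats = Counter(f"unigram={tok}" for tok in tokens)
    feats["n_question_marks"] = tweet.count("?")
    feats["n_exclamation_marks"] = tweet.count("!")
    feats["has_emoticon"] = int(bool(EMOTICON.search(tweet)))
    feats["has_mention"] = int("@" in tweet)
    return dict(feats)


features = surface_features("Please follow the instructions! :)")
print(sorted(features))
```

A dictionary of named features is convenient because it can be vectorized later by whichever learning library is used, without fixing the vocabulary in advance.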
Another line of research focuses on predicting SA in social media conversational threads, casting the task as a sequence labeling problem. For example, Cerisara et al. (2018) use a two-level hierarchical recurrent network (Bi-LSTM and RNN) to predict dialogue acts and sentiments. Joty and Mohiuddin (2018) experiment with an LSTM-RNN architecture to represent the sentences of a conversation, then use CRF models to extract the inter-sentence dependencies. The approach has been evaluated on many synchronous and asynchronous corpora, including forum conversations from TripAdvisor. Other works propose to model SA in dialogues as a multi-label classification problem. For example, Xu et al. (2017) rely on a CNN model on top of pre-trained word vectors, utilizing a threshold learning mechanism. The model has been evaluated on the task of dialogue state tracking.
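The idea of multi-label prediction with per-label thresholds can be sketched as follows. This is not the method of Xu et al. (2017), where thresholds are learned: here the thresholds are supplied by hand, and the function name `predict_labels`, the default value and the scores are hypothetical values chosen for illustration.

```python
def predict_labels(scores, thresholds, default=0.5):
    """Return every label whose score reaches its threshold.

    If no label passes, fall back to the single best-scoring label so that
    each message receives at least one speech act.
    """
    chosen = [label for label, score in scores.items()
              if score >= thresholds.get(label, default)]
    if not chosen:
        chosen = [max(scores, key=scores.get)]
    return sorted(chosen)


# Hypothetical classifier scores for one tweet (invented numbers).
scores = {"PROPER ASSERTIVE": 0.8, "OPEN OPTION": 0.45, "EVALUATION": 0.1}
# Hypothetical per-label thresholds; unlisted labels use the default.
thresholds = {"OPEN OPTION": 0.4}

print(predict_labels(scores, thresholds))
```

Compared with taking only the top-scoring class, thresholding lets a single message carry several speech acts, which matches the observation that one tweet can combine, for instance, a description and a piece of advice.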

Crisis Datasets
The literature on emergency detection in social media has been growing fast in recent years and several datasets (mainly tweets) have been proposed to account for crisis-related phenomena such as floods, hurricanes, storms and attacks. 7 Messages are annotated according to relevant categories that are deemed to fit the information needs of various stakeholders like humanitarian organizations, local police and firefighters. Annotations are usually done at the text level, relying on crowd-sourced workers, humanitarian volunteers or domain experts. 8 Relevance criteria found in the literature can be grouped into the following dimensions:
• Relatedness (also known as usefulness or informativeness), to identify whether the message content is useful and provides valuable information that might be relevant to rescue teams. This is generally cast as a binary classification problem: is the message useful vs. not useful. This dimension is used in almost all state-of-the-art annotation guidelines (Imran et al., 2016; Kaufhold et al., 2020).
• Urgency (also known as criticality or priority), to distinguish on-topic relevant information that can aid people in making decisions, advise others or offer immediate post-impact help from on-topic irrelevant information, including offers, support and solicitations for donations to charities (Imran et al., 2013; McCreadie et al., 2019a; Sarioglu Kayi et al., 2020; Kozlowski et al., 2020; Kejriwal and Zhou, 2020).
• Intention to act, also known as humanitarian information type (Alam et al., 2021). Urgency is often associated with a taxonomy of intention to act categories such as: caution or advice, donations, people missing, found, or seen, and infrastructure damage (Imran et al., 2016; Olteanu et al., 2015).
• Eyewitness types, used to identify direct (first-hand knowledge and experience of an event), indirect (messages sharing valuable information from direct witnesses) and vulnerable direct eyewitnesses (users reporting warnings and alerts) (Zahra et al., 2020).

Annotations in most existing datasets are usually carried out at the message level. Existing datasets are either annotated according to one of the dimensions above or using several dimensions in cascade, like relatedness or urgency first, then information type for messages that have been identified as relevant. Most annotated datasets are in English. Well-known datasets include TREC-IS 9 (McCreadie et al., 2019b, 2020), a shared task that aims to develop real-time monitoring systems capable of monitoring the development of incidents such as natural disasters, terrorist incidents or public health crises from online text data feeds. We also cite the CrisisFACTS2022 dataset 10, which aims at generating a summary of a crisis. A few crisis datasets exist in other languages such as Spanish (Cobo et al., 2015), Arabic (Alharbi and Lee, 2019) and Italian (Cresci et al., 2015). For French, the only publicly available dataset is the one developed by Kozlowski et al. (2020), who propose a three-level classification of tweets (Relatedness, Urgency and Intention to act categories) to deal with missing people, human/infrastructure damage, etc. This dataset focuses on several natural disasters (hurricanes, floods, storms, etc.), going beyond the French portion of CrisisNLP 11, which only focuses on one type of crisis (landslide).

Contributions
As far as we are aware, communicative intentions have been explored in connection with urgency detection in two previous works. First, Laurenti et al. (2022a) propose a SA classification for French tweets in the crisis domain. They focus on ecological crises and propose a two-layer annotation scheme to manually annotate a dataset of 6,669 tweets both for urgency (URGENT, NOT URGENT and NOT USEFUL) and SA (tweet level: ASSERTIVE, SUBJECTIVE, INTERROGATIVE and JUSSIVE). Quantitative analysis of the annotations showed a correlation between tweet-level SA and urgency categories. This dataset has been used for supervised SA classification, where a set of deep learning experiments have been carried out based on the CamemBERT transformer architecture to classify each tweet into four SA categories at the tweet level. Laurenti et al. (2022b) built on this pre-trained classifier and propose SA-aware urgency detection models, showing that injecting SA as an external semantic feature is a promising direction to improve urgency detection in social media.
In the present paper, we rely on the annotation scheme initially proposed in Laurenti et al. (2022a) and advance these previous studies, making six new contributions: [...] critics, supports, etc., (2) the distribution of SA across crisis types (sudden vs. expected events), and (3) a study of the evolution of SA across time.
5. In addition to SA detection at the tweet level relying on baseline architectures, we newly: (1) address sub-tweet SA detection as well as joint tweet/sub-tweet predictions relying on monotask and multitask learning approaches, while evaluating how well models classify each message into a single class vs. multiple labels (as far as we know, handling SA in social media content as a multi-label problem has not been explored before), and (2) experiment with model adaptability across crisis types and layers. To the best of our knowledge, this is the first attempt at automatic SA detection in a French social media dataset.
6. Finally, we provide a detailed error analysis of our results at both tweet and sub-tweet levels.
Overall, this paper proposes an in-depth study of speech acts in view of their contribution to enhancing emergency detection. Before moving to real scenarios that rely on SA-aware automatic detection of urgency, which we leave for future work, our aim here is to (a) unveil the contribution of speech acts to emergency detection on a distributive basis, and (b) explore SA detection in French social media across various crisis types. This is, as far as we know, the first work that addresses the issue in such an exhaustive manner. The second step, which will consider injecting SA to improve urgency detection, is out of the scope of this paper.

A Two-level Annotation of Speech Acts for Urgency Detection
We deployed two layers of annotation for speech acts:
• SA1: at the first level, we use a classification including 5 distinct categories, which we apply to the tweet as an atomic unit.
• SA2: at the second level, 8 categories are used to annotate tweets at the sub-tweet level as opposed to the tweet as a whole.
The goal of this two-layer annotation is to allow us to dig out fine-grained information about the speaker's posture towards the event, to ultimately identify the main communicative intention of the tweet as a whole. In this section, all examples are taken from our corpus and provided in French together with their English translations. URLs and private user mentions have been replaced by <URL> and <USER> respectively. Each example comes with its SA1 (cf. Sections 3.1 and 3.2) and SA2 (cf. Section 3.2). Notation-wise, recall from Section 1.2 that we use arrows (→) to signify the relation between first-level and second-level SA categories, at the left and right of the arrow respectively. In addition, in order to show the interplay between SA and urgency annotations, all examples come with urgency annotations (URGENT vs. NOT URGENT vs. NOT RELEVANT) as well as six intention to act categories as follows: (1) URGENT applies to messages mentioning HUMAN and INFRASTRUCTURE DAMAGES as well as SECURITY INSTRUCTIONS to limit these damages during crisis events, (2) NOT URGENT groups SUPPORT messages to the victims, CRITICS or any OTHER MESSAGES that do not have an immediate impact on actionability but contribute to raising situational awareness, and finally (3) NOT RELEVANT is used for messages that are not related to the targeted crisis. Please note that all urgency annotations were removed during the SA annotation campaign (cf. Section 4.1).
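The grouping of the six intention to act categories under the urgency classes described above can be written down as a small lookup table. The sketch below is only an illustration: the exact label strings, the dictionary names and the helper `is_consistent` are assumptions made here, following our reading of the paragraph above rather than any released annotation tool.

```python
# Urgency classes and the intention to act categories grouped under each,
# following the description in the text (label spellings are assumptions).
URGENCY_TO_INTENTIONS = {
    "URGENT": ("HUMAN DAMAGES", "INFRASTRUCTURE DAMAGES", "SECURITY INSTRUCTIONS"),
    "NOT URGENT": ("SUPPORT", "CRITICS", "OTHER MESSAGES"),
    "NOT RELEVANT": (),  # messages unrelated to the targeted crisis
}

# Reverse lookup: which urgency class an intention to act category belongs to.
INTENTION_TO_URGENCY = {
    category: urgency
    for urgency, categories in URGENCY_TO_INTENTIONS.items()
    for category in categories
}


def is_consistent(urgency, intention=None):
    """Check that an (urgency, intention to act) pair respects the grouping."""
    if urgency not in URGENCY_TO_INTENTIONS:
        return False
    if intention is None:
        # Only NOT RELEVANT messages carry no intention to act category.
        return not URGENCY_TO_INTENTIONS[urgency]
    return INTENTION_TO_URGENCY.get(intention) == urgency


print(is_consistent("URGENT", "SECURITY INSTRUCTIONS"))
```

Such a check is a cheap way to catch annotation records whose fine-grained category contradicts their urgency class before any statistics are computed.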

Tweet level
Our classification of SA elaborates on the foundational Austinian and later Searlian distinction by (i) relying on propositional content and lexical clues such as modals (should, must, can, ...), evaluative adjectives, and attitude verbs (think, believe, want, hope, ...); (ii) introducing the category SUBJECTIVE, which reshuffles some of the earlier classifications ('wishes', for instance, are SUBJECTIVE rather than JUSSIVE in our classification; cf., e.g., Condoravdi and Lauer (2012)); and (iii) considering presuppositional content as well (see Mari (2016) on French).
We distinguish four first-level categories, which are mutually exclusive and define tweets as wholes, at a holistic level, as shown in Figure 1.
(1) JUSSIVE. Jussives, as defined by Zanuttini et al. (2012), enhance commitment to take action, as in (2). Importantly, there is no strict correlation between the imperative form and JUSSIVE. As the example shows, the imperative form is not needed to enhance action. In this respect, our classification aligns with accounts that do not ground speech acts in sentence types (see Portner (2018) for extended discussion).
(2) Incendies #Feuxdeforêt #Gironde
(2) ASSERTIVE. Assertions, as in (3), are considered to convey objective truth (as opposed to subjective truth; Giannakidou and Mari, 2021c). With ASSERTIVE, the speaker is committed to the truthfulness of the proposition that is being uttered (Portner (2018), a.o.) and requires their interlocutor to update the common ground (Ginzburg, 2012).
(3) DIRECT. Deux immeubles s'effondrent à Lille: les secours cherchent une victime dans les décombres <URL> via @lavoixdunord
(DIRECT. Two buildings collapse in Lille: rescue workers search for a victim in the rubble <URL> via @lavoixdunord)
At this level of the classification, this is a simplification of what assertions are. When asserting, speakers can lie, or they can rely on partial knowledge that undermines the likelihood that the assertion is true. To nuance this simplification, we elaborate on the notion of assertion at the second level of the annotation, where we introduce some evidentiality-based distinctions.
(3) INTERROGATIVE. This category is dedicated to those questions that require an informative answer, as in (4). Questions that, besides triggering an answer, reveal bias and expectations on the part of the speaker (see Ladd (1981)) are classified as SUBJECTIVE (see below).
(4) SUBJECTIVE. Finally, with SUBJECTIVE, as in (5), the speaker shares a mental state that can be either a personal evaluation or preference (see, among many others, Lasersohn (2005)) or an expressive state (an emotion or a feeling; Giannakidou and Mari, 2015). The interlocutor is asked to update the common ground not just with the content of the evaluation but with the evaluation itself (see Simons (2007), and for recent discussion on French, Mari and Portner (2021)). In our classification, 'wishes', for instance, are SUBJECTIVE rather than JUSSIVE, as they do not trigger any commitment to act so as to make the content of the wish true (this is the emotive content of the wish; Giannakidou and Mari, 2021a).
(6) Feu d'artifices du 14 Juillet @villedeputeaux <URL>
(14th of July fireworks @cityofputeaux)
SA1: OTHER Urgency: NOT USEFUL
One important feature of our classification is that it does not rely on sentence type, but on sentence interpretation. For instance, an imperative is not necessarily classified as a JUSSIVE. Imperatives that convey wishes, as we noted, are considered to be SUBJECTIVE. Likewise, the interrogative form does not necessarily correlate with the interrogative category. An interrogative can express a point of view, or even knowledge, as in the case of rhetorical questions. In (7), the speaker is not really asking a question, but rather wants to express their opinion that the authorities are not doing the right thing, hence expressing a subjective point of view.
(7) @Prefet974 #Berguitta .. Alerte orange pour rien hier qui a penalisé l'économie et pas d'alerte rouge pour ne pas pénaliser l'économie quand le danger est réel.. on marche sur la tête?
(@Prefet974 #Berguitta .. Orange alert for nothing yesterday that penalized the economy and no red alert to not penalize the economy when the danger is real .. are we walking on our heads?)

Sub-tweet Level
We consider each tweet as a discourse unit, composed of one or more statements or sub-segments, so that it can be classified not only at the holistic level but also at the level of its segments (identified in the following examples between '[ ... ]'). To achieve this, we have elaborated on each of the four categories at the tweet level to annotate the tweets at the segment level, relying on eight categories (see Figure 2).
For JUSSIVE, the annotation distinguishes between (a) OPEN-OPTION, where the speaker puts forward a possibility and leaves the addressee free to realize it or not (cf. (8)), and (b) OTHER JUSSIVE, i.e. utterances that enhance a direct commitment on the part of a discourse participant.
For ASSERTIVE, both second-level categories are determined by the source of knowledge that the speaker relies upon, i.e. the evidentiality condition as defined by Saurí and Pustejovsky (2009). If the speaker grounds their utterance on a third-party source, the assertive utterance is (a) a REPORTED ASSERTIVE, whereas if there is no such explicit source, it is (b) a PROPER ASSERTIVE, see (10) and (11).
It is important to note that the distinction between reported and proper assertive is meant to reveal a difference in degrees of commitment on the part of the speaker. On the assumption that a proper assertive reveals total commitment to the truthfulness of the content of the assertion, by signaling that the content of the assertion is reported, the speaker is considered as willing to distance themselves from the truth of that content (see discussion in Aikhenvald (2004); Giannakidou and Mari (2021a) and subsequent literature).
While we are aware that a certain amount of simplification remains (assertions can be lies, for instance; see extended discussion in Giannakidou and Mari (2021c)), this distinction allows us to introduce a certain degree of complexity in our treatment of the attitudinal domain.
For INTERROGATIVE, a distinction is made between (a) INFORMATIVE questions, which the speaker cannot answer and which require an answer providing new information, and (b) UNINFORMATIVE ones, which indicate that the speaker is biased towards an answer, as in (14) and (15).
As for SA1, we also add OTHER to the SA2 classification, for undecidable cases.
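The two-layer scheme can be written down as a small lookup structure. The sketch below is only an illustration: the dictionary layout and the helper name `parent_sa1` are assumptions, and the two SUBJECTIVE sub-categories (EVALUATIVE, EXPRESSIVE) are taken from the way they are named in the analysis of the annotation campaign rather than from Figure 2, which is not reproduced here.

```python
# Hypothetical encoding of the SA1 -> SA2 hierarchy described in the text.
# Label spellings follow the paper; the data structure itself is an assumption.
SA1_TO_SA2 = {
    "JUSSIVE": ["OPEN-OPTION", "OTHER JUSSIVE"],
    "ASSERTIVE": ["REPORTED ASSERTIVE", "PROPER ASSERTIVE"],
    "INTERROGATIVE": ["INFORMATIVE", "UNINFORMATIVE"],
    "SUBJECTIVE": ["EVALUATIVE", "EXPRESSIVE"],
    "OTHER": ["OTHER"],  # undecidable cases, present at both levels
}

# Reverse index: which first-level category does a second-level label refine?
SA2_TO_SA1 = {
    sa2: sa1 for sa1, sub_labels in SA1_TO_SA2.items() for sa2 in sub_labels
}


def parent_sa1(sa2_label: str) -> str:
    """Return the first-level category that a second-level label belongs to."""
    return SA2_TO_SA1[sa2_label]


# The eight second-level categories, leaving OTHER aside.
SA2_CATEGORIES = [label for label in SA2_TO_SA1 if label != "OTHER"]
```

Note that this only captures the hierarchy: the tweet-level label of a multi-segment tweet is decided by the annotators and need not be the parent of any particular segment.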

Data and Annotation
In this section, we provide details on the dataset used, the annotation procedure, and the results of the annotation campaign.

Dataset
Since our focus is on crises that occur in metropolitan France and its overseas departments, we rely on the only available corpus of French tweets, released by Kozlowski et al. (2020) (footnote 13) and later augmented by Bourgon et al. (2022b) with sudden crises (attacks, explosions, fires, etc.). The collection is composed of 19,595 tweets collected using dedicated keywords about ecological crises that occurred in France from 2016 to 2022 and posted 24h before, during (48h) and up to 72h after the crisis: 2 floods that occurred in the Aude and Corsica regions, 8 storms (Béryl, Berguitta, Fionn, Eleanor, Bruno, Egon, Ulrika, Susanna), 2 hurricanes (Irma and Harvey), 2 building collapses (Marseille, Lille), 2 chemical plant explosions (Lubrizol, Sanary), 2 fires (Notre-Dame fire, Gironde and Landes wildfires) and 1 terrorist attack (Trèbes) (footnote 14). The data comes with additional metadata including the number of likes, retweets, followers and followings of the user.
12. See Larrivée and Mari (2022) for French and Ginzburg (2012); Giannakidou and Mari (2021b) for a more general discussion and cross-linguistic observations.
13. https://github.com/DiegoKoz/french_ecological_crisis
In this dataset, each tweet is annotated following an urgency classification composed of three urgency categories as well as six intention to act categories: (1) URGENT, which applies to messages mentioning HUMAN/INFRASTRUCTURE DAMAGES as well as SECURITY INSTRUCTIONS to limit these damages during crisis events, (2) NOT URGENT, which groups SUPPORT messages to the victims, CRITICS or any OTHER MESSAGES that do not have an immediate impact on actionability but contribute to raising situational awareness, and finally (3) NOT USEFUL, for messages that are not related to the targeted crisis or for information pertaining to events occurring outside the French territories. This scheme has been used to annotate the dataset by two annotators, who achieved a Kappa inter-annotator agreement of 0.67 and 0.65 for urgency and intention to act classification respectively (Kozlowski et al., 2020). Table 1 presents the distribution by class for all available crises. Some crises (Plant Explosion Lubrizol, Plant Explosion Sanary, Notre-Dame Wildfire, Attack Trèbes) are only annotated for urgency. The ecological crises (flood, storm, hurricane) are the most represented, with 12,112 messages against 7,483 messages for sudden crises (collapse, wildfire, plant explosion, attack). We also notice that sudden crises contain fewer SECURITY INSTRUCTION messages than ecological crises, which is explained by the fact that the latter are predictable.
The collection is extremely imbalanced, with 57.96% NOT USEFUL and 20.26% URGENT messages. This is largely due to how tweets are collected. Indeed, since tweets posted 24 hours before the crisis have been collected, a large amount of them are NOT USEFUL. The corpus is also imbalanced regarding the sub-level of urgency categories: 1.93% of the tweets are annotated as HUMAN DAMAGE, with 306 messages, while SECURITY INSTRUCTION represents 9.88% of the corpus, with 1,470 tweets. These proportions are in line with the ones reported in other crisis corpora (see Section 2.1).

Annotation Procedure
A subset of this dataset composed of 13,378 tweets has been selected for SA1 annotations, among which 11,229 have been annotated for both SA1 and SA2. The SA1 dataset comprises almost all URGENT (3,857) and NOT URGENT (4,222) messages. Only 5,299 NOT USEFUL tweets have been selected, in order to reduce the size of that category while keeping it as the majority class. A similar split of urgency annotations holds for the SA2 dataset. Note that, during the annotation process, pre-existing urgency tags and metadata information are removed, so as not to bias the annotators.
The annotators were native French speakers, both master's degree students in Linguistics. The procedure was as follows. First, each segment in a given tweet is annotated at the sub-tweet level (i.e., SA2), then the tweet level annotation (i.e., SA1) is deduced accordingly:
• If the tweet is composed of one or several SA2 annotations that are subsumed by the same SA1 category, the final annotation is that SA1 category. For example, for a tweet composed of two segments annotated with SA2=[INFORMATIVE, UNINFORMATIVE], then SA1=INTERROGATIVE.
• In case of several segments annotated with SA2 categories that do not belong to the same SA1 category, annotators are asked to determine the main communicative purpose of the tweet, and which segment signifies the main communicative intention of the speaker (Simons (2007); Mari and Portner (2021), a.o.). The main criterion to identify the main intention relies on the distinction between background (known) and foreground (new) information. For example, in (16), a tweet is composed of two segments: a PROPER ASSERTIVE, followed by an UNINFORMATIVE question that conveys an evaluation. The annotators have considered the second segment to be dominant, as the first half is a description of a fact that occurred in the past and that is already part of the common ground. The main point of the tweet is the uninformative question about the present situation, as an expression of a criticism (footnote 15). The tweet is thus labeled at the first level as SUBJECTIVE.
The SA2 annotation and the background-foreground distinction provide a solid heuristic to identify the main point of the tweet. Furthermore, as we shall see in Section 5.1, the first segment is mostly responsible for determining the overall categorization, thus providing a reliable criterion to settle undecided cases. Finally, as we show in Section 5.3, specific sub-segments correlate with urgency, thus enhancing emergency detection.
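The deduction procedure described above can be sketched as follows. This is a minimal illustration, not the tooling used in the campaign: the function names `deduce_sa1` and `is_consistent` are hypothetical, and the SUBJECTIVE sub-categories are assumed to be EVALUATIVE and EXPRESSIVE. When the segments of a tweet point to different SA1 categories, the sketch deliberately returns `None`, since that decision is left to the annotators' judgment of the dominant segment.

```python
from typing import List, Optional

# Assumed SA2 -> SA1 hierarchy (see the annotation scheme in the text).
SA2_PARENT = {
    "OPEN-OPTION": "JUSSIVE",
    "OTHER JUSSIVE": "JUSSIVE",
    "REPORTED ASSERTIVE": "ASSERTIVE",
    "PROPER ASSERTIVE": "ASSERTIVE",
    "INFORMATIVE": "INTERROGATIVE",
    "UNINFORMATIVE": "INTERROGATIVE",
    "EVALUATIVE": "SUBJECTIVE",
    "EXPRESSIVE": "SUBJECTIVE",
    "OTHER": "OTHER",
}


def deduce_sa1(sa2_segments: List[str]) -> Optional[str]:
    """Return the SA1 label when all segments share one SA1 parent.

    Returns None for mixed tweets, where annotators must pick the
    dominant segment using the background/foreground criterion.
    """
    parents = {SA2_PARENT[label] for label in sa2_segments}
    if len(parents) == 1:
        return parents.pop()
    return None


def is_consistent(sa1: str, sa2_segments: List[str]) -> bool:
    """Check an SA1 label against the SA2 labels of the same tweet.

    Only tweets whose segments share a single parent can be checked
    automatically; mixed tweets are accepted as they stand.
    """
    expected = deduce_sa1(sa2_segments)
    return expected is None or expected == sa1
```

For instance, `deduce_sa1(["INFORMATIVE", "UNINFORMATIVE"])` yields INTERROGATIVE, while a one-segment tweet labeled SA1=INTERROGATIVE with SA2=PROPER ASSERTIVE is flagged as inconsistent.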
15. Recall that questions can convey a subjective stance rather than a request of information.
The annotation has been performed using the BRAT annotation tool (Stenetorp et al., 2012) (footnote 16). To rule out inconsistencies between annotations at the SA1 and SA2 levels (e.g., a tweet composed of one segment and annotated with SA1=INTERROGATIVE and SA2=PROPER ASSERTIVE), automatic checks have been conducted and annotators are asked to solve their errors before moving to the next tweet. Figure 3 shows an example of the tweet "A fire is currently in progress in #SaintDizier in the city center. Avoid the area" annotated in BRAT, highlighting both the tweet level (in red) and the sub-tweet level SA annotations (in white). The annotators performed a two-step annotation with an intermediate analysis of agreement and disagreement between the annotators. 448 tweets have been annotated in the first step by both annotators to compute the inter-annotator agreement (Cohen's Kappa=0.62 for SA1 and 0.48 for SA2; see footnote 17). This agreement is lower than what is typically encountered in similar studies involving SA annotations in tweets, with for example 0.78 in Vosoughi and Roy (2016) and between 0.72 and 0.92 depending on the task in Subramanian et al. (2019). We found that it is mostly caused by the level of subjectivity involved in this task: in particular, the choice of the dominant segment, as mentioned earlier, has been the source of many discrepancies. To address this issue, we encouraged regular feedback sessions and discussions between the two annotators to address discrepancies, clarify guidelines, and ultimately improve their agreement levels.
Another cause of disagreement was the difficulty of disentangling SUBJECTIVE from ASSERTIVE, in particular when attitude and modal expressions such as believe, think that, etc. are used. Indeed, either the subjective expression (think, believe, or even more complex modal-tense-aspect combinations such as fallait, which translates as 'should have been' with an additional implicature of preference in (17)) or its content can be targeted, according to their contextual relevance.
(<USER> And now there's hardly any smoke... Should have stopped the traffic this morning, not in the middle of the day.)
SA1: SUBJECTIVE Urgency: NOT URGENT
16. http://brat.nlplab.org
17. We computed SA2 inter-annotator agreement on the basis of the dominant segment.
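The agreement scores reported in this section are Cohen's Kappa values, which correct the raw agreement rate for the agreement expected by chance. The sketch below shows the standard computation on two parallel lists of labels; the function name and the toy labels are illustrative assumptions, not the data or the script used in the campaign.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example with made-up SA1 labels for four tweets.
annotator_1 = ["ASSERTIVE", "ASSERTIVE", "SUBJECTIVE", "JUSSIVE"]
annotator_2 = ["ASSERTIVE", "SUBJECTIVE", "SUBJECTIVE", "JUSSIVE"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

For the SA2 level, the same function would be applied to the label of the dominant segment of each tweet, following the convention stated in footnote 17.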

Results of the Annotation Campaign
We provide in this section a detailed analysis of the annotation campaign. We focus in particular on: (a) quantitative results of the SA annotations at both the tweet (SA1) and sub-tweet (SA2) levels, (b) an analysis of how SA are expressed across different types of crisis, (c) the correlation between SA and urgency annotations, and finally (d) the evolution of SA over time as the crisis unfolds.
We end this section highlighting the main findings of this corpus-based study.

SA Annotations: Quantitative Results
Table 2 shows the distribution of categories of SA1 annotations (i.e., tweet level). We observe that a majority of the tweets are classified as ASSERTIVE, with 53.42%. The second most frequent class is SUBJECTIVE, with 28.18%, followed by JUSSIVE with 11.72%. INTERROGATIVE and OTHER are the least frequent, with 3.36% and 3.32% respectively. These distributions indicate that in crisis situations, users predominantly tweet to assert their thoughts and views, to express their personal opinions and feelings, and to share information and updates on the given situation. Conversely, the low percentage of JUSSIVE and INTERROGATIVE suggests that they are less likely to give advice or ask questions in these circumstances (see also Zhang and Liu (2014)).
Figure 4 provides the distribution of the SA2 dominant labels (i.e., the ones that drive the SA1 annotations). We observe that PROPER ASSERTIVE is the most frequent with 37.19%, while the other ASSERTIVE sub-class, namely REPORTED ASSERTIVE, was dominant in 14.36% of the tweets. Regarding NON ASSERTIVE content, EVALUATIVE and EXPRESSIVE SA2 annotations obtained similar frequencies of 14.94% and 13.19% respectively.
Figure 5 combines the two previous distributions, illustrating the distribution of each SA2 sub-category with its corresponding SA1 annotation. We observe that the pattern (SA1 = ASSERTIVE, SA2 = PROPER ASSERTIVE) is the most frequent with 72.13%. For INTERROGATIVE, 72.58% of the segments are INFORMATIVE vs. 27.42% for UNINFORMATIVE, while for JUSSIVE, 63.17% are OPEN OPTION vs. 36.83% OTHER JUSSIVE. Similar observations hold for the two remaining SA1 categories. Finally, the very low percentage of OTHER (i.e., 0.43%) suggests that annotators were able to easily associate an SA2 category to a given segment. This is not the case for SA1 annotations, where this frequency increases to 3.32%, showing that sub-level SA annotations are important to better capture users' communicative intentions. The number of OTHER SA2 annotations being relatively low (48 instances), we discard them in the analysis below.
When analyzing tweet segmentation for SA2 annotations (recall that SA2 annotations consist of a sequence of segments [s_1, s_2, ..., s_n], each with its associated SA2 category), we observe (see Table 3) that, among the 11,229 tweets annotated for SA2, only about 23% are made up of more than one segment. Furthermore, 18.01% and 4.12% of tweets contain two and three segments respectively. While all SA2 classes display over 50% of presence in the first position, an interesting observation regarding the distribution of SA2 tags among possible positions is that it differs from class to class. Notably, while PROPER ASSERTIVE and REPORTED ASSERTIVE segments are overwhelmingly found in the first position (over 93%), all other classes display a much higher rate of non-first position in the sequence, ranging from 24.27% for INFORMATIVE to 44.60% for OTHER JUSSIVE.
Table 3, together with Table 4, which shows the distribution of the most frequent sequences within a tweet, suggests that relying only on the first label in the case of multi-label sequences might be a viable approach. However, this approach should consider two potential difficulties. First, it could introduce a bias in favor of the dominant class PROPER ASSERTIVE, which appears as the first element in multi-label sequences far more often than the other classes in our data (about 98% of the time). Second, a specific pattern is identified, where a PROPER ASSERTIVE is followed by a different type of SA2 that is considered dominant, with the latter being in relation or in reaction to that initial assertion. In these cases, the reaction is the main, new, informative content that the rescue teams might be interested in, whereas the assertive content provides background information.
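The positional analysis above boils down to counting, for each SA2 class, how often it opens a segment sequence. A minimal sketch of that count is given below; the function name and the toy sequences are made up for illustration and do not reproduce the figures of Table 3.

```python
from collections import Counter
from typing import Dict, List


def first_position_rate(sequences: List[List[str]]) -> Dict[str, float]:
    """Share of occurrences of each SA2 label found in first position."""
    first = Counter()
    total = Counter()
    for sequence in sequences:
        for position, label in enumerate(sequence):
            total[label] += 1
            if position == 0:
                first[label] += 1
    return {label: first[label] / total[label] for label in total}


# Toy SA2 sequences, one list per tweet (illustrative only).
toy_sequences = [
    ["PROPER ASSERTIVE", "EVALUATIVE"],
    ["PROPER ASSERTIVE"],
    ["EVALUATIVE"],
    ["PROPER ASSERTIVE", "EXPRESSIVE"],
]
rates = first_position_rate(toy_sequences)
```

A first-label heuristic would simply take `sequence[0]` as the tweet's dominant SA2, which is exactly where the two caveats discussed above apply.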
When analyzing the data further, we indeed observe that, for tweets composed of two segments, the forms [PROPER ASSERTIVE, EVALUATIVE] and [PROPER ASSERTIVE, EXPRESSIVE] are the most frequent, with 414 and 388 tweets respectively, followed by [PROPER ASSERTIVE, OTHER JUSSIVE] and [PROPER ASSERTIVE, OPEN-OPTION] with 189 and 169 tweets respectively. For tweets composed of three segments, the patterns [PROPER ASSERTIVE, EVALUATIVE, EXPRESSIVE] and [PROPER ASSERTIVE, OTHER JUSSIVE, EVALUATIVE] have been observed in 88 and 44 tweets respectively.
Examples (18) and (19) illustrate the observed patterns. In (18), the INTERROGATIVE is a direct follow-up to the assertion, while in (19), the JUSSIVE is a reminder or directive given directly in reaction to the assertion. A final interesting observation concerns the OTHER class, where PROPER ASSERTIVE is not over-represented. This is likely due to the fact that this category is used to classify tweets that do not fit any of the other classes, and it should therefore be expected that such tweets follow a different pattern than the other classes.

SA Annotations vs. Crisis Types
Our dataset is composed of 7 types of crisis, among which five are unexpected or sudden events: Floods, Storms, Hurricanes, Building Collapses, Explosions and Fires/Wildfires, and Terrorist Attacks. In this section, we analyze whether the type of crisis impacts the distribution of SA annotations. Table 5 shows the results. Overall, the distribution is quite similar across all crises (some more fine-grained observations will be provided in Section 5.4), and is in line with those observed in Tables 2 and 5. The only exception is the Trèbes Attack, with 39.36% ASSERTIVES (the lowest frequency of Assertives in the corpus) and 45.88% SUBJECTIVES (the highest frequency). Tweets posted during the Sanary Explosion display the polar opposite distribution: 79.92% ASSERTIVES (the highest frequency of Assertives in the corpus) and 9.38% SUBJECTIVES (the lowest frequency of Subjectives in the corpus). Those two events, despite having both resulted in several deaths and injuries, have, according to the difference in SA distribution, elicited vastly different reactions on Twitter. A possible interpretation is that, in the case of the incident in Sanary, users simply shared and discussed facts, as opposed to the terrorist attack in Trèbes, where users expressed their emotions and sentiments.
The types of crises seem to highlight certain tendencies related to SA1 annotations. For example, the distribution is quite similar between the 3 Floods, with 3 of the highest numbers of ASSERTIVES, averaging 64.73%, and the 3 lowest numbers (besides Sanary) of SUBJECTIVES, averaging 17.78%. Similarly, the distribution for the 2 Fires is such that both sub-corpora display, by quite a margin (besides Trèbes), the lowest numbers for ASSERTIVES and the highest for SUBJECTIVES with, respectively, 42.18% and 39.26%. Moreover, the frequency of OTHERS is consistently very low for the whole corpus (3.32% on average), with two exceptions: Ulrika with 11.39% and NotreDame with 17.53%. Finally, when looking into the distributions of SA2 annotations across crisis types, we observe that PROPER ASSERTIVES are the most frequent first segment of the sequence for all the crises except the building collapse and terrorist attack, where REPORTED ASSERTIVES were a majority. We also observe a high proportion of INFORMATIVES, OPEN OPTIONS and EVALUATIVES. The distributions for the storms, hurricanes, terrorist attack and fire crises are different, with more EXPRESSIVES than EVALUATIVES.

SA vs. Urgency Annotations
Our dataset is annotated both for urgency and speech acts. All the tweets in our corpus (13,378) have been annotated for SA1 and urgency (i.e. URGENT, NOT URGENT and NOT USEFUL), whereas 11,229 tweets have been annotated for SA1, SA2 and intentions to act, namely SECURITY INSTRUCTION, HUMAN DAMAGE and INFRASTRUCTURE DAMAGE for Urgent messages, and SUPPORT, OTHER and CRITICS for Not Urgent messages.
SA1 vs. Urgency. Table 6 details the frequency of SA1 tags compared with the original urgency annotations. Regarding the two most frequent SA1 (ASSERTIVE and SUBJECTIVE), two observations emerge: (1) among 3,857 URGENT messages (resp. 4,222 NOT URGENT), 86.13% (resp. 33.82%) are ASSERTIVE; and (2) only 5.81% of URGENT messages are SUBJECTIVE while 44.69% of NOT URGENT messages are. Similarly, we observe that 6.82% of JUSSIVE are URGENT vs. 15.99% NOT URGENT. Regarding NOT URGENT messages, ASSERTIVES mainly occur when messages contain information that is irrelevant to the crisis. It is interesting to note that the proportion of INTERROGATIVES is higher for NOT URGENT messages when compared to the URGENT ones (3.79% vs. 0.93%). Finally, among the 444 messages that have been annotated as OTHER messages, 81.08% are NOT USEFUL. These frequencies are statistically significant using the χ2 test (χ2 = 2,831.84, df = 8, p < 0.01). When measuring the dependency strength between urgency and SA1 categories using Cramér's V, we get V = .32 (df = 8), which confirms the statistical correlation between these two classifications. These observations indicate a strong correlation between assertivity and urgency when removing the NOT USEFUL class (V = .54, df = 4).
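The two statistics used in this section can be computed directly from a contingency table of counts (urgency labels as rows, SA labels as columns). The sketch below uses the textbook definitions of Pearson's χ2 and Cramér's V; the function names are ours, the 2x2 toy table is made up, and the last lines merely plug the reported χ2 into the formula under the assumption of a 3x5 table over all 13,378 tweets, which is consistent with df = 8.

```python
import math
from typing import List


def chi_square(table: List[List[float]]) -> float:
    """Pearson chi-square statistic of a contingency table.

    Assumes every row and every column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


def cramers_v(chi2: float, n: float, n_rows: int, n_cols: int) -> float:
    """Cramér's V: chi-square normalized by sample size and table shape."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Toy 2x2 table (made-up counts) to exercise both functions.
toy_table = [[10, 20], [20, 10]]
toy_chi2 = chi_square(toy_table)
toy_v = cramers_v(toy_chi2, n=60, n_rows=2, n_cols=2)

# Plugging in the reported statistic, assuming a 3x5 table and n = 13,378.
reported_v = cramers_v(2831.84, n=13378, n_rows=3, n_cols=5)
```

Under these assumptions `reported_v` comes out at roughly 0.325, close to the reported V = .32; the exact figure depends on the table actually used for the test.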

Table 7 provides the same analysis, this time with SA1 vs. intention to act annotations. For all URGENT subcategories, ASSERTIVE has the highest frequency with a total of 5,243 tweets, among them 90.52% are HUMAN DAMAGES, 86.87% INFRASTRUCTURE DAMAGES, and 83.33% SECURITY INSTRUCTIONS. Regarding the 2,416 NOT URGENT messages, SUBJECTIVE make up 70.75% of CRITICS and 70.34% of SUPPORTS vs. 17.30% and 21.80% for ASSERTIVES respectively. However, for the 1,309 OTHER MESSAGES, which are not urgent messages that do not fall in either of the previous two categories, only 10.31% are classified as SUBJECTIVE, while 20.93% are JUSSIVES, and 59.21% ASSERTIVES. These frequencies are statistically significant using the χ2 test (χ2 = 2,502.17, df = 24, p < 0.01). When measuring the dependency strength between intention and SA1 categories using Cramér's V, we get V = .25 (df = 24), which confirms the statistical correlation between these two classifications.
SA2 vs. Urgency. Table 8 presents the frequency of sub-tweet SA tags (excluding OTHER) when paired with urgency labels. In this table, the frequencies of SA2 are statistically significant (χ2 = 2,378.84, df = 16, p < 0.01, V = .32), showing that SA2 annotations are of particular importance for urgency detection.

When looking into the distributions of SA2 tags against intention to act categories (cf. Table 9), we again observe an over-representation of PROPER ASSERTIVE with a total of 3,133 instances, among them 153 are about INFRASTRUCTURE DAMAGES whereas 72 are about HUMAN DAMAGES. UNINFORMATIVE has the lowest frequency, with 112 instances. Overall, the relationship between SA2 and the urgency categories suggests that the degree of urgency of a message is correlated to the type of speech act used (χ2 = 2,928.24, df = 42, p < 0.01, V = .25). The strength of the correlation increases to V = 0.40 (p < 0.01) when excluding the NOT USEFUL class.

Evolution of Speech Acts Over Time
Recall that all the tweets in our dataset have been collected in three periods: 24h before, during (48h) and up to 72h after the crisis. Our aim here is to analyze the evolution of speech acts over time, focusing on three periods: BEFORE, DURING and AFTER the event happened (footnote 18). Table 10 shows the distribution of SA1 categories per period in terms of percentage.
When looking at tweets over time as the crisis unfolds, we notice some interesting trends. Before a crisis, tweets are a mix of assertions and, to a lesser extent, subjective content. During the crisis, tweets become more focused and include a lot of strong statements and questions, showing that people intend to provide informative content and express opinions and evaluations. There is still a focus on sharing information, but fewer opinions are shared. After the crisis, assertive language remains substantial, suggesting a continued focus on conveying information. The proportion of subjective and interrogative tweets decreases post-crisis. This nuanced understanding highlights the shifting dynamics in communication styles across different phases of a crisis, with assertiveness and information-seeking becoming more pronounced during heightened situations. Jussives are observed as more prominent before the crises happen, which is in line with the interpretation of the jussive: the speakers intend to enhance action most notably when preventing casualties is still possible, that is to say, before the crisis happens.
We further detail our analysis, this time by studying the distribution of SA1 per crisis type and period (see Table 11) (footnote 19). In the case of storms, floods and hurricanes, there is a notable surge in assertive messages, particularly after the event, indicating a shift toward providing clear information and directives to address the aftermath. Concurrently, there is an increase in subjective expressions, possibly reflecting the emotional impact on individuals. Likewise, Collapse sees a notable increase in assertive messages post-crisis, suggesting a focus on clear statements and instructions once the immediate danger has passed. For Explosion/Attack and Fire-related communication, there is a significant uptick in assertive messages during the event, possibly aimed at providing immediate guidance.
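The temporal analysis requires each tweet to be assigned to one of the three periods. A minimal sketch of such a bucketing is given below. It assumes that DURING covers the 48 hours following the onset of the crisis, which is our reading of the collection window and not a definition taken from the paper; the function name, the `during_hours` parameter and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def crisis_period(
    tweet_time: datetime, crisis_start: datetime, during_hours: int = 48
) -> str:
    """Assign a tweet to BEFORE, DURING or AFTER relative to the crisis onset.

    Assumption: the DURING window starts at crisis_start and lasts
    during_hours hours; anything later counts as AFTER.
    """
    if tweet_time < crisis_start:
        return "BEFORE"
    if tweet_time < crisis_start + timedelta(hours=during_hours):
        return "DURING"
    return "AFTER"


# Made-up onset time, used only to exercise the function.
onset = datetime(2018, 1, 18, 6, 0)
early = crisis_period(onset - timedelta(hours=3), onset)
middle = crisis_period(onset + timedelta(hours=10), onset)
late = crisis_period(onset + timedelta(hours=60), onset)
```

Grouping the SA1 labels by the returned period and normalizing the counts gives per-period percentages of the kind reported in Table 10.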

Interim Conclusions
The corpus-based study of speech acts in tweets annotated for urgency allows for multiple statistically relevant observations:
• The vast majority of tweets are ASSERTIVES, seconded by SUBJECTIVES. More specifically, PROPER ASSERTIVE is the dominant class at the sub-tweet level. These results seem to indicate that, in reaction to a crisis, French Twitter users mostly tweet to share information, generally in the form of a single utterance. This corroborates the findings in Zhang and Liu (2014), tending to show that in an emergency, factual information is more relevant than the expression of a personal viewpoint.
• PROPER ASSERTIVES are over-represented in the first position in every SA1 category of tweets. In particular, the high frequency of PROPER ASSERTIVES in the INTERROGATIVE, JUSSIVE and SUBJECTIVE tweets is explained by the fact that a significant part of those tweets follow a format comprising an assertion, followed by the speaker's reaction to said assertion, which constitutes the dominant SA, as shown in (16). This reveals an interesting finding: in crisis situations, speakers tend to assert or re-assert a piece of (already known) information, followed by their personal comment in relation to it, thus sharing their perspective.
• The distribution of SA1 annotations highlights a general consistency in the data across the different crises, as well as similarities in the SA distribution of similar crises.Finally, we found a statistically significant relationship between ASSERTIVITY and URGENCY, and between SUBJECTIVITY and absence of URGENCY.
The picture that emerges is one in which speakers favor (what they consider) truthful information over orders and commands to enhance action (on the part of the rescuing teams, for instance). Indeed, in our classification ASSERTIVES do not include subjective evaluations, and thus convey content that is informationally reliable and objectively veridical (i.e. conforming to the outer reality and not to a mental state) (Giannakidou and Mari, 2017, 2018, 2021c) and thus ready for uptake and endorsement (e.g. Ginzburg (2012), Krifka (2019)) on the part of those who will bring help. The fact that speakers favor PROPER ASSERTIVES to indicate urgency reveals that they are fully committed to the truthfulness of the message, of which they might present themselves as the primary informational source.
On the contrary, we observe that SUBJECTIVES correlate with absence of urgency. Among subjectives, EVALUATIVES/EXPRESSIVES are largely used to convey truths that are relativized to a 'judge' or an individual (Lasersohn (2005); Stephenson (2007), a.o.) and are not eligible to function as reliable information for the rescuing services. A minority of subjectives encompass attitudes, whereby truth is also relativized to a particular mental state and cannot (without further negotiation) immediately become common ground (e.g., Gunlogson (2008); Mari and Portner (2021)) and be ready for uptake on the part of the helpers. Collapses, while qualifying as sudden crises, behave like non-sudden ones, probably in virtue of the long searches for casualties that make them similar to non-sudden ones.
Finally, we have discovered that assertives are more prominent after the crises when these are non-sudden, and during the crises when these are sudden. This points to the fact that speakers are active in providing information during the aftermath of non-sudden crises, which, most of the time, requires sustained efforts in view of the intensity of the damages. Speakers are keener to use assertives during sudden crises, aiming at providing contentful information as the unexpected crisis unfolds and no knowledge has previously been made available by media or other sources.

Experimental Settings
Now that the dataset has been annotated, the next step is to detect SA automatically. We cast the problem as a classification task, leaving the complex task of discourse-based tweet segmentation into non-overlapping units to future work (Morabia et al., 2019; Aljebreen et al., 2021). We propose the following experimental settings:
• SA1 detection: Classify each tweet into one of our five SA1 categories, namely ASSERTIVE, SUBJECTIVE, INTERROGATIVE, JUSSIVE, and OTHER.
• SA2 detection: Classify each tweet into one of our eight SA2 categories. Note that the OTHER instances (48 tweets) have been removed from the dataset for the experiments, as they are much less frequent in urgent tweets and have no regular linguistic patterns. We propose two settings:
- Multi-class classification. Given a tweet $t = [s_1, \ldots, s_n]$ and its associated SA annotations $SA1 = [SA2_1, \ldots, SA2_j, \ldots, SA2_n]$, where $SA2_j$ is the annotation of the dominant segment, the model predicts the tweet's SA2 category $SA2_{pred}$. We evaluate the results considering (i) a strict match where $SA2_{pred} = SA2_j$ (this is similar to a binary classification), as well as (ii) a partial match such that $SA2_{pred} \in \{SA2_1, \ldots, SA2_j, \ldots, SA2_n\}$.
- Multi-label classification. Multi-class classification only focuses on the dominant segment, ignoring the speech acts conveyed by the other tweet segments (we recall that a tweet can be composed of up to 5 segments, see Table 3). We believe this information can be of particular importance for urgency detection. Therefore, we aim at capturing label dependencies among the segments by assigning multiple labels to each instance simultaneously (Zhang and Zhou, 2013; Liu et al., 2021).
• Detecting SA1 and SA2 simultaneously. This is a multi-task learning framework with two classification tasks (SA1 and SA2). The classifiers for both tasks share and update the same lower layers; only the final classification layer is task-specific.
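The strict and partial match criteria of the multi-class setting can be illustrated with a minimal sketch. The data layout (a list of segment labels plus the index of the dominant segment) and the function names are assumptions made for illustration; they are not taken from the authors' code.

```python
from typing import Callable, List, Tuple

# Hypothetical gold format: (SA2 labels of the tweet's segments, index of the dominant segment).
GoldTweet = Tuple[List[str], int]


def strict_match(pred: str, gold: GoldTweet) -> bool:
    """Correct only if the prediction equals the label of the dominant segment."""
    labels, dominant = gold
    return pred == labels[dominant]


def partial_match(pred: str, gold: GoldTweet) -> bool:
    """Correct if the prediction equals the label of any segment of the tweet."""
    labels, _ = gold
    return pred in labels


def match_accuracy(preds: List[str], golds: List[GoldTweet],
                   matcher: Callable[[str, GoldTweet], bool]) -> float:
    """Share of tweets whose prediction satisfies the chosen matching criterion."""
    hits = sum(matcher(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)


# Toy data, invented for the example.
golds = [
    (["PROPER ASSERTIVE", "OTHER JUSSIVE"], 1),
    (["PROPER ASSERTIVE"], 0),
    (["OTHER SUBJECTIVE", "PROPER ASSERTIVE"], 0),
]
preds = ["PROPER ASSERTIVE", "PROPER ASSERTIVE", "OTHER JUSSIVE"]

strict_acc = match_accuracy(preds, golds, strict_match)
partial_acc = match_accuracy(preds, golds, partial_match)
```

By construction, the partial score can never be lower than the strict one, since the dominant label is always among the segment labels.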

Models
We rely on FlauBERT base (Le et al., 2019), the base cased French Transformer model (Martin et al., 2020) pre-trained on French texts from various general-domain sources (e.g., Wikipedia and books), as implemented in HuggingFace. In addition, we use FlauBERT tuned, a FlauBERT model that was further pre-trained on 358,834 unannotated tweets from the crisis domain (Kozlowski et al., 2020) and that achieves better performance than FlauBERT for urgency detection. We also experimented with CamemBERT base (Martin et al., 2019), the other French transformer architecture.20 Following Laurenti et al. (2022b), we also experiment with two multi-input models that use extra features added on top of pre-trained contextual word embeddings, among which21: the presence of URLs, punctuation (exclamation marks and question marks) and the presence of numbers, as they are often used in tweets to indicate phone numbers of emergency rescue services or weather forecasts. We refer to these models as FlauBERT tuned+Feat and FlauBERT base+Feat.
In addition to the cross entropy loss (hereafter +C), and to fight class imbalance, we consider the focal loss (hereafter +focal) (Lin et al., 2017) and the weighted cross entropy (hereafter +W). Our aim here is to compare with one of the most effective approaches for handling imbalanced data (Cui et al., 2019). For all our models, a linear classification layer was added on top of the encoder, and training ran for four epochs with a learning rate of 2e-5. For better convergence, we use the Adam optimizer during backpropagation. To avoid exploding gradients, we use a gradient clipping of 1.0.
For the multi-label task, we adapt the FlauBERT architecture22 to account for multi-label outputs, relying on a sequence classification head on top of the pooling layer. The input sequence comprises characters, sub-words, and words, which are processed by the transformer layers. On top of the pooled output, a linear layer is added for the classification task. We then examine each label independently for every message and determine whether the label is predicted by the model. We rely on label-based metrics (macro F1), following the general trend in multi-label classification (Zhang and Zhou, 2013). This architecture has been successfully employed for multi-label classification in other NLP tasks, including judicial documents (Dai and Liu, 2020), sentiment analysis (Tang et al., 2020) and patient diagnosis prediction (Hart, 2022).
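The surface features listed above can be sketched as a small extractor. This is only an illustration: encoding each cue as a binary indicator, the regular expressions, and the function name `surface_features` are assumptions, and the text names these cues as a subset of the features actually used.

```python
import re
from typing import Dict

# Assumed patterns; a real system may need a more careful URL matcher.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")


def surface_features(tweet: str) -> Dict[str, int]:
    """Binary indicators for URLs, '!' and '?' marks, and numbers in a tweet."""
    # Remove URLs first so that digits or '?' inside a link are not counted.
    text_without_urls = URL_RE.sub(" ", tweet)
    return {
        "has_url": int(URL_RE.search(tweet) is not None),
        "has_exclamation": int("!" in text_without_urls),
        "has_question": int("?" in text_without_urls),
        "has_number": int(NUMBER_RE.search(text_without_urls) is not None),
    }


# Invented example tweet.
example = surface_features("Appelez le 112 ! https://t.co/abc")
```

In a multi-input model, such a vector would typically be concatenated with the pooled encoder output before the classification layer; how the authors combine them is not specified in this section.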
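To make the three losses concrete, here is a minimal single-example sketch in plain Python. The focusing parameter `gamma = 2.0` and the class weights are assumptions chosen for illustration; this section does not report the values actually used.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int,
                  class_weights: Optional[Sequence[float]] = None) -> float:
    """Cross entropy (+C); with class_weights it becomes the weighted variant (+W)."""
    p_t = softmax(logits)[target]
    weight = 1.0 if class_weights is None else class_weights[target]
    return -weight * math.log(p_t)


def focal_loss(logits: Sequence[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss (+focal): down-weights examples the model already classifies well."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Toy logits for a well-classified example (class 0 is the gold class).
logits = [2.0, 0.5, -1.0]
ce = cross_entropy(logits, 0)
fl = focal_loss(logits, 0)
```

With `gamma = 0` the focal loss reduces to the plain cross entropy, which is a quick sanity check for any implementation.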
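The label-wise decision and the label-based macro F1 can be sketched as follows. The 0.5 decision threshold, the function names, and the toy probabilities are assumptions made for illustration, not details reported by the authors.

```python
from typing import List, Sequence, Set


def decode(probs: Sequence[float], labels: Sequence[str], threshold: float = 0.5) -> Set[str]:
    """Each label is decided independently: keep it if its probability reaches the threshold."""
    return {label for label, p in zip(labels, probs) if p >= threshold}


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: Sequence[str]) -> float:
    """Label-based macro F1: compute F1 per label, then take the unweighted mean."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three of the SA2 labels.
labels = ["PROPER ASSERTIVE", "OTHER JUSSIVE", "OTHER SUBJECTIVE"]
gold = [{"PROPER ASSERTIVE", "OTHER JUSSIVE"}, {"PROPER ASSERTIVE"}, {"OTHER SUBJECTIVE"}]
probs = [[0.9, 0.7, 0.1], [0.8, 0.2, 0.6], [0.3, 0.1, 0.4]]
pred = [decode(p, labels) for p in probs]
score = macro_f1(gold, pred, labels)
```

Because every label contributes equally to the mean, rare labels weigh as much as frequent ones, which is why low scores on infrequent classes pull the macro F1 down noticeably.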

Evaluation Protocol
To evaluate SA1 and SA2 models, we designed two evaluation protocols:
• Random sampling. We mix the tweets from all crises and randomly select 80% for training and 20% for testing. For SA1 classification (resp. SA2), the final dataset is composed of 11,181 tweets (resp. 13,378) split 80-20 into train and test, with the test set keeping the same class distribution as the train set.
• Out-of-event. Following the general trends in crisis management (Kersten et al., 2019; Algiriyage et al., 2021; Bourgon et al., 2022a), we designed an out-of-type evaluation protocol by training on a pool of events related to different types of crises (e.g., Hurricane, Storm) and testing on a particular different type (e.g., Earthquake). The aim is to evaluate whether a model can deal with new types of crisis, which is crucial to ensure the portability of the models to unseen events. To this end, we consider the distinction between expected vs. sudden events and test whether the use of speech acts differs according to the type of crisis. Indeed, compared to ecological disasters like hurricanes and floods, sudden events (like earthquakes, terror attacks, explosions, technological incidents) are difficult to predict (Björck, 2016). These events, over which organizations have virtually no control, influence social behavior and the ways the emergency services are organized (James and Wooten, 2005; Coombs, 2014; Quarantelli et al., 2017). We propose two evaluation settings: (a) train on expected events and test on sudden ones, (b) train on sudden events and test on expected ones. We consider flood, storm and hurricane as expected events, and collapse, wildfire, plant explosion and attack as sudden events (see Table 1), which corresponds to a total of 6,311 tweets for the former vs. 7,067 for the latter.
All the SA1 (resp. SA2) models have been run five times on randomly selected instances from the test set, with a standard deviation of $5.3 \times 10^{-6}$ (resp. $7.8 \times 10^{-4}$). We therefore report the averaged scores (accuracy, precision, recall, and macro F1). Finally, due to the high number of experiments, we only provide the results achieved by the best configurations. Below we present our results. We end with a qualitative analysis highlighting the main causes of misclassification.
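The two out-of-event settings amount to partitioning the corpus by crisis type. A minimal sketch follows; the dictionary layout, the `crisis_type` field, and the function name are hypothetical, while the grouping of crisis types follows the text above.

```python
from typing import Dict, List, Tuple

# Grouping of crisis types as described in the protocol.
EXPECTED = {"flood", "storm", "hurricane"}
SUDDEN = {"collapse", "wildfire", "plant explosion", "attack"}

Tweet = Dict[str, str]  # hypothetical record with "text" and "crisis_type" keys


def out_of_event_split(tweets: List[Tweet], train_on: str) -> Tuple[List[Tweet], List[Tweet]]:
    """Train on one family of crises and test on the other, with no overlap in crisis types."""
    if train_on not in ("expected", "sudden"):
        raise ValueError("train_on must be 'expected' or 'sudden'")
    train_types = EXPECTED if train_on == "expected" else SUDDEN
    test_types = SUDDEN if train_on == "expected" else EXPECTED
    train = [t for t in tweets if t["crisis_type"] in train_types]
    test = [t for t in tweets if t["crisis_type"] in test_types]
    return train, test


# Invented toy corpus.
corpus = [
    {"text": "tweet a", "crisis_type": "flood"},
    {"text": "tweet b", "crisis_type": "attack"},
    {"text": "tweet c", "crisis_type": "storm"},
    {"text": "tweet d", "crisis_type": "collapse"},
]
train_a, test_a = out_of_event_split(corpus, "expected")  # setting (a)
train_b, test_b = out_of_event_split(corpus, "sudden")    # setting (b)
```

Splitting by crisis type rather than by tweet guarantees that no event seen during training leaks into the test set.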

RANDOM SAMPLING RESULTS
The experimental results are presented in Table 12, showcasing the accuracy (A), precision (P), recall (R), and macro-averaged F1-scores (F1).The results are grouped according to mono vs. multitask learning and whether these models use extra-features.Best scores are highlighted in bold font.
The results show that FlauBERT tuned consistently achieves the best scores across all settings and that mono-task learning models outperform their multitask counterparts. However, it is interesting to note that FlauBERT MultitaskTuned+C+Feat resulted in the best accuracy of 81.81%. Injecting additional features was very helpful when coupled with the focal loss for FlauBERT tuned+focal+Feat.
The results of the fine-grained SA experiments are presented in Table 14. The results indicate that partial evaluation leads to an improvement of approximately 8% in terms of accuracy, with FlauBERT tuned+W achieving the best performance, resulting in an F-score of 65.67. This shows that a strict evaluation is not well suited to determining the dominant segment, which is to be expected given the pragmatic nature of selecting this type of segment. Regarding multitask architectures, the results are in line with those observed with SA1, making them less productive for multi-level SA detection. More importantly, the multi-label classification was the best, with FlauBERT base+C yielding the highest scores. Finally, although features have been very productive for SA1 classification, their injection into the FlauBERT architecture for SA2 achieved mixed results: see for example the boost in the F1-scores achieved by FlauBERT MultitaskBase+C+Feat vs. FlauBERT MultitaskBase+C, while we observe a drop when comparing FlauBERT MultitaskTuned+C vs. FlauBERT MultitaskTuned+C+Feat.
We end this section with detailed results per class, as given by the multi-label model (cf. Table 15). Overall, the model achieves very good results for all the classes except the least frequent ones (see for example OTHER SUBJECTIVE and UNINFORMATIVE).

OUT-OF-EVENT RESULTS
Table 16 shows the results of our best SA1 (resp. SA2) models when tested following the out-of-event protocol, i.e., FlauBERT tuned+focal+Feat (resp. FlauBERT base+C), in terms of precision (P), recall (R) and the averaged F1-score (F1).
For SA1, and when compared to random sampling, we observe a small drop in performance, and this is more salient when the model is trained on sudden events. When we look into the results per class, we notice that three out of the five SA1 categories achieved similar results when trained vs. tested on expected events: 82.30 vs. 85.40 F1-score for ASSERTIVE, 51.58 vs. 54.16 for INTERROGATIVE, and 59.76 vs. 60.11 for JUSSIVE. Note that these scores are close to those reported in Table 13, where the SA1 model has been evaluated in a random sampling scenario. OTHER and SUBJECTIVE, however, exhibit a different behavior: OTHER scores 44.44 vs. 31.58, resulting in an important decrease in performance of up to 7.3% in terms of F1-score when compared to random sampling. For SUBJECTIVE, the drop depends on the test set: when trained on expected events and tested on sudden ones, this category scored around 6% lower than in the random test. On the other hand, when trained on sudden and tested on expected events, the scores were similar (76.85 vs. 76.14 in the random configuration). This drop can be explained by the diverse linguistic means by which speech acts are expressed in sudden events (we recall that the distributions of each class in both settings are quite similar, see Table 5). The SUBJECTIVE class encompasses a series of expressions that belong to different grammatical categories (verbs, adjectives, particles, interjections, ...) at different levels of semantic and pragmatic interpretation (sentence, discourse, ...) and can have either an informational or just an expressive function. A finer-grained typology of expressions is to be established to pin down the linguistic differences between the subjective expressions involved in sudden crises and those used in non-sudden crisis situations.
Finally, regarding the SA2 results, we notice that testing on expected (resp. sudden) events does not significantly impact the results when compared to random sampling. This shows that casting SA2 detection as a multi-label problem is quite effective.
Table 16: SA1 (resp. SA2) best models results when tested in the out-of-event protocol.

Error Analysis
We end this paper by analyzing the main causes of misclassification as given by our best SA1 (resp. SA2) models, namely FlauBERT tuned+focal+Feat, the fine-tuned FlauBERT with focal loss and feature injection (resp. the multi-label FlauBERT base+C trained with a cross entropy loss). Figure 6 presents our results. We also provide the confusion matrix for SA1 classification (see Table 17).23 It shows that most errors come from the difficult distinction between ASSERTIVE and SUBJECTIVE, as also observed during the annotation campaign (see Section 4.2). In practice, the objective-subjective distinction may not always be clear-cut, leading to a preference for ASSERTIVE classification; see examples (9) and (10) in Table 18. We also observe other complex cases, like (2), (3), (7) and (8), that contain declarative statements and for which the model fails to distinguish between assertives and jussives. Some other examples lack context, and the gold label may not always be accurate (e.g., (1)), but they can still express assertive sentiments (e.g., (4)). Finally, the model struggles with some interrogative texts that are phrased declaratively (e.g., (5)) or affirmatively (e.g., (6)), leading to difficulties in identifying them as interrogative: the model has trouble taking into account the question mark present at the end of the text.
23. We only report the confusion matrix for SA1 classification, as the one for SA2 is too sparse.

Figure 1: A classification for tweets that makes use of four illocutionary categories.

Figure 2: Two-layer annotation for tweets and inner segments.

Figure 3: Example of a tweet annotated in BRAT. Jussif stands for JUSSIVE, while Propre and Autre-jussif stand for PROPER ASSERTIVE and OTHER JUSSIVE respectively.

Figure 5: Distribution of SA1 and SA2 annotations in our dataset.

Figure 6: Distribution of misclassified examples.

Table 1: Urgency distribution in our dataset per crisis.

Table 3: Distribution of SA2 labels based on their position in the sequence.

Table 7: Intention to act categories vs. SA1 annotation pairs statistics.

Table 9: Intention to act categories vs. SA2 annotation pairs (percentage of each SA2 category per intention category).

Table 11: SA1 annotation vs. crisis period vs. crisis type, in percentage. The two best scores for each period are in bold font.

Table 12: SA1 classification results.
Model                           A      P      R      F1
FlauBERT MultitaskBase+C+Feat   79.17  59.43  56.23  57.65
FlauBERT MultitaskTuned+C+Feat  81.81  61.21  58.28  59.60

Feature injection also helped with the cross entropy loss for FlauBERT MultitaskTuned+C+Feat. When looking into the detailed results per class (cf. Table 13), we observe that the predictions are closely aligned with the distribution of each class in the dataset. In particular, ASSERTIVE and SUBJECTIVE were well predicted, with an F-score of 85.32 and 76.14 respectively, whereas JUSSIVE and INTERROGATIVE exhibit lower scores.

Table 13: SA1 results per class as given by FlauBERT tuned+focal+Feat, our best model.

Table 15: SA2 results per class as given by the multi-label FlauBERT base+C.