Characterizing the Response Space of Questions: data and theory

The main aim of this paper is to provide a characterization of the response space for questions using a taxonomy grounded in a dialogical formal semantics. As a starting point we take the typology for responses in the form of questions provided in Łupkowski and Ginzburg (2016). That work develops a wide coverage taxonomy for question/question sequences observable in corpora including the BNC, CHILDES, and BEE, as well as formal modeling of all the postulated classes. This paper extends that work to cover all types of responses to questions. We present the extended typology of responses to questions based on studies of the BNC, BEE, Maptask and CornellMovie corpora which include 607, 262, 460, and 911 question/response pairs respectively. We compare the data for English with data from Polish using the Spokes corpus (694 question/response pairs), providing detailed accounts of annotation reliability and disagreement analysis. We sketch how each class can be formalized using a dialogical semantics appropriate for dialogue management, concretely the framework of KoS (Ginzburg, 2012).


Introduction
There are various theories of what questions are (Groenendijk and Stokhof, 1997; Wiśniewski, 2015), and several computational theories of dialogue (Poesio and Rieser, 2010; Asher and Lascarides, 2003; Ginzburg, 2012), but no attempt yet at a comprehensive characterization of the response space of questions. Thus, our aim in this paper is to provide a characterization of the response space for questions using a taxonomy grounded in a dialogical formal semantics. 1 This task is of considerable theoretical and practical importance: it is an important ingredient in the design of dialogue systems, spoken or text-based; it provides benchmarks for dialogue/question theories; and it is of course a component in explicating the intelligence needed to pass the Turing test (see Turing, 1950). 2 Łupkowski and Ginzburg (2013, 2016) tackled one part of this problem, offering an empirical and theoretical characterization of the range of query responses to a query (q-responses). Based on a detailed analysis of the British National Corpus and three other corpora, two task-oriented (BEE (Rosé et al., 1999) and AmEx (Kowtko and Price, 1989)) and a sample from CHILDES (MacWhinney, 2000), they identified 7 classes of questions that a given query gives rise to; we refer to these classes as the L(upkowski)G(inzburg) classes of query responses. The study sample consisted of 1,466 query/query response pairs.
As an outcome the following query responses taxonomy was obtained: (1) CR: clarification requests; (2) DP: dependent questions, i.e., cases where the answer to the initial question depends on the answer to a q-response; (3) MOTIV: questions about the underlying motivation behind the initial question; (4) NO ANSW: questions whose aim is to avoid answering the initial question; (5) FORM: questions which consider how to answer the initial question; (6) IND: questions which indirectly convey an answer; (7) IGNORE: responses ignoring the initial question, but addressing a shared situation; for more details see (Łupkowski and Ginzburg, 2016, p. 255). We take their work as a starting point and make the following hypothesis: (1)(H) Main hypothesis: responses drawn from or concerning the LG query classes plus direct answerhood exhaust the response space of a query.
Specifically this amounts to the following general types of responses (we present the detailed taxonomy in Section 3).

Evasion responses:
1 This paper is a substantially extended version of a paper that was presented at SigDial 2019 ("Characterizing the Response Space of Questions: a Corpus Study for English and Polish"). It includes a significantly broader review of the literature, the corpus study, manually annotated given the complexity of the categories, includes an additional corpus (the Cornell Movie corpus) and many more q/r pairs analyzed for English (1,235 vs. 2,240) and Polish (205 vs. 694); the discussion of annotation reliability is more extensive; the formal section has been rewritten and expanded considerably and the paper is also accompanied by two appendices covering formal background and the annotation guidelines.
2 For the analysis of the Turing test as a question-response system see, e.g. (Łupkowski and Wiśniewski, 2011).

(a) Ignore (address the situation, but not the question); (b) Change the topic ('Answer my question'); (c) Motive ('Why do you ask?'); (d) Difficult to provide a response.
The hypothesis has to be understood relationally: one is not really interested in the extension of the semantic entities (primarily propositions and questions) that can be given as responses. Rather, one is interested in the class each such entity is classified as, since that is what determines the subsequent contextual evolution.
(2) I do not want to talk about that question. (Direct answer to what do you not want to do? Evasion answer to Where were you last night?).
We survey the existing literature in Section 2. Following this, we provide a description of the proposed taxonomy, in Section 3. In the sections that follow we proceed to test our main hypothesis using four corpora in English (BNC (Burnard, 2007), BEE (Rosé et al., 1999), HCRC MapTask (Anderson et al., 1991), CornellMovie (Danescu-Niculescu-Mizil and Lee, 2011)) and one corpus in Polish (Spokes; Pęzik 2014). Section 4 discusses respectively the corpora we used and data selected therefrom. Section 5 describes our annotation method. The hypothesis achieves wide coverage, as we discuss in Section 6; in Section 7 we discuss in extensive detail the reliability of the results.
In Section 8 we consider the requirements on semantic frameworks for a formal characterization of the various classes of the taxonomy. We sketch an account of the different classes in the framework of KoS (Ginzburg, 2012), building on though departing in some respects from the account developed in (Łupkowski and Ginzburg, 2016). We point to problems other existing frameworks face in providing a comprehensive account. A concluding Section 9 outlines a variety of natural extensions to the work described here. There are two appendices: Appendix A offers basic notions from the type logical framework TTR (Cooper, 2012, 2023) used in the paper, whereas Appendix B provides the annotation guidelines.

Related work
As Enfield (2010, p. 2658) points out, 'While the grammatical and information structural properties of questions have received widespread attention in linguistics literature, there has been relatively little attention paid to the relationship between questions and their responses.' Let us start with Berninger and Garvey (1981), who introduce three terms to refer to a reaction to a question: (1) response, which is any verbal production emitted by a partner following a question; (2) reply, which is a response relevant to the question; and (3) answer, a reply that directly or indirectly provides the missing information. In what follows, the authors introduce their rich taxonomy of possible replies for children's conversation in a nursery school. The taxonomy covers six categories: (1) Possible answers; (2) Indirect answers; (3) Confessions of ignorance; (4) Clarification questions; (5) Evasive replies; (6) Miscellaneous.
In particular, we find questions as a form of replying to questions among the proposed types (in the form of clarification questions). Further replies of this kind may be observed among the proposed sub-types of evasive replies (see 8 and 9; however, they are not as fine-grained as the LG typology of q-responses). These cover the following (see Berninger and Garvey, 1981, p. 407-408).
1. Selecting own reference in making an assertion: (3) X: Where the morrow's house?
Y: Nope, well the morrow house has sniffles.
2. Selecting own reference in rejecting the presupposition of the question: (4) X: What's his name?
Y: Um pretend that he didn't have a name.
4. Temporarily stalling in providing an answer, but acknowledging that question has been heard:
8. Rejecting question as stated: (10) X: Do you hear the man that is with Lisa?
Y: They're not with Lisa. I'm Lisa.
One may observe that the presented categories are co-extensive with the ones mentioned in the introduction to this paper. Possible and indirect answers are subsumed by the question-specific: answerhood category. Clarification questions correspond directly to the category of Clarification responses. And evasive replies and confessions of ignorance fall under our richer category dubbed evasion responses. Our proposed typology also identifies other types of question responses that are not tackled by Berninger's and Garvey's proposal.
In later work, an interesting typology of question responses was proposed as a result of an extensive 10-language comparative project on question-response sequences in ordinary conversation. The project was carried out from 2007 as part of the Multimodal Interaction Project at the Max Planck Institute for Psycholinguistics-see an overview in . The study adopted certain restrictions with respect to the questions which were taken into account. In order for a question-response pair to be coded the question had to be a formal question or a functional question. Questions seeking acknowledgment, offered in reported speech and requests for immediate physical action were not coded (Enfield, 2010, p. 2621). The coding scheme for the response types presented in (Enfield, 2010, p. 2624) is the following:
Non-response was coded if the person did nothing in response, directed his/her attention to another competing activity, or initiated a wholly unrelated sequence.
Non-answer response covers a verbal or visible response that failed to directly answer the question as put. This includes laughter, 'I don't know', initiation of repair (e.g., 'What?') or other inserted sequences, gestural responses such as shrugs that do not answer the question. Non-answer responses also include 'Maybe', 'Possibly' or responses that deal with the question indirectly (e.g., A: 'Do you see Jack much?' B: 'He moved').
Answer: answers the question directly. Answers can be gestural (e.g., a head nod or shake) or verbal ('Uh huh', 'Yeah', or longer, more involved answers including partial repeats of the question to confirm or disconfirm).
Can't determine: can't hear/see participants, etc.
As with the previous typology, one can observe that our categories of question response cover those discussed above. The types which are not covered (like parts of the 'non-response' or 'can't determine' categories) are a consequence of the set-up of the Multimodal Interaction Project, where annotators had video-taped conversations at their disposal. Our study is based on a wide range of already existing corpora (without access to video).
Another interesting issue concerns what constitutes the most frequent type of response. Berninger and Garvey (1981) observe that the vast majority of responses provided (for polar and for Wh-questions) were the possible answers. Other types were rare: 'The only other classes of replies that occurred with sizeable frequency were evasive replies and confessions of ignorance following Wh-questions and indirect answers following yes/no questions' (Berninger and Garvey, 1981, p. 410).
Analogous results are reported in Stivers and Robinson (2006) for the group of adult American English speakers. The corpus gathered for the analysis consisted of 260 instances of question sequences in a multi-party interaction (retrieved from video recordings of naturally occurring interactions). In this case the authors do not provide an extensive typology of replies as discussed above, but focus only on answer / non-answer patterns. The conclusion of the study is that an answer is the alternative preferred over a non-answer (Stivers and Robinson, 2006, p. 371): 85% of the cases in the analyzed sample were answers. Stivers and Robinson provide several explanations for such a distribution. One is that the form of response turns supplying a non-answer reflects their ranking as dispreferred (they are frequently delayed both within and between turns, prefaced by filled pauses and discourse markers such as 'Well', and expanded with accounts; see Stivers and Robinson 2006, p. 372). Moreover, conversational participants typically treat a non-response as indicating disalignment, rather than indicating that no response will be forthcoming. Another reason for the obtained distribution, according to Stivers and Robinson, is that speakers perform interactional work to provide answers and, despite the fact that non-answers are a readily available alternative category of response, they 'struggle to receive and provide answers if at all possible' (Stivers and Robinson, 2006, p. 374). Stivers et al. ( , p. 2616) point out that 'In English there is a strong normative order surrounding questions. In the first place, responses are normatively required [. . . ], and answers are preferred over non-answer responses'. This claim is confirmed in the study of 350 questions drawn from spontaneous conversation in American English presented in (Stivers, 2010). The results are that 76% of responses were answers, only 19% were non-answers and 5% non-responses (Stivers, 2010, p. 2778).
This is in line with previous results reported in (Stivers and Robinson, 2006) discussed above. Interestingly, Yoon (2010) reports results for Korean which, though indicative of a similar pattern (Answer > Non-Answer > Non-response), indicate a markedly different distribution: of the sample of 326 questions-responses, 52% were answers, 33% non-answers and 15% non-responses (Yoon, 2010, p. 2790). In this study, the question sample was limited to questions that functionally sought information, confirmation or agreement (Yoon, 2010, p. 2783). Enfield et al. (2019) present results of a fourteen-language (including e.g., English, Lao, Korean) study concerning the issue of how people answer polar questions. The data-set consisted of 172 videotaped interactions. The authors point out that they focus only on answers: "In our quantitative study of responses, we examine only confirming answers (rather than non-answers such as I don't know, I can't remember, or laughter; or disconfirming answers). This is because confirmations are more frequent than disconfirmations (...)" (Enfield et al., 2019, 288-289); it is worth noting that the non-answer examples acknowledged above are covered by our taxonomy. Enfield et al. (2019) conclude that the answers to polar questions may be of two possible types: (i) interjection-type answers (such as 'uh-huh' or equivalents 'yes', 'mm', 'head nods', etc.) 3 and (ii) repetition-type answers. Wang (2020) uses the proposed taxonomy of polar-question answers in a study of Mandarin data, adding a 15th language to the already existing data.
Another notable source is Enfield (2010), who provides an analysis of questions and responses in Lao for a corpus of 351 questions drawn from 8 separate recordings. The results reported in this paper are interesting for the discussion of what counts as an answer to a question. The focus of the analysis is the structural fit between questions (wh-questions and polar ones) and their responses. The author offers the following hypothesis as to what answers to wh-questions are optimally coherent: '[the answer] should supply a referent of the relevant ontological category (i.e. a thing for a 'what' question, a person for a 'who' question, etc.).' (Enfield, 2010, p. 2661).
Green and Carberry (1999) provide useful insights into indirect answering. They study 25 dialogue examples originating in (Stenstrom, 1984), where 13% of responses to polar questions were indirect answers. On this basis one can highlight four possible reasons for using indirect answers (see Green and Carberry, 1999, p. 392).
The works discussed in this section indicate the need for a wider corpus study of the whole spectrum of responses to questions. These studies are limited in terms of the examples that were analyzed. They also impose certain limitations concerning the number of response categories to be identified. This is understandable, as their main aim was to explicate the answer/non-answer difference. We believe that an extensive corpus study should bring a fine-grained characterization of the entire response space of questions. Moreover, we aim at providing an explicit dialogical semantics for each category in our corpus-based typology. One should also acknowledge here the existence of various question answer typologies created within the field of Question Answering (QA).
QA may be characterized as 'a sophisticated form of information retrieval (IR), in which the system processes questions queried in a natural language format and provides either the content containing the answer or the answer itself' (Shah et al., 2019, p. 611-612). Usually, such typologies are proposed for well-structured knowledge bases (or extracted with the use of various NLP methods from the unstructured data sets). What is different from our approach is that these typologies focus on the non-dialogical context of large amounts of texts. As such, QA addresses the interaction between computer systems and users (see Shah et al., 2019, p. 612). The resulting answer typology usually takes the form of an ontology of the related data on which a given QA system is to operate; an example of such an approach is presented in (Hovy et al., 2002).

A taxonomy of responses to questions
Our taxonomy with its three main sub-partitions is displayed in Figure 1. The classes in red are those that were added by comparison with the query response taxonomy of Łupkowski and Ginzburg (2016). 4 We start with the most general division of question responses into those that are specific to the question asked and those that are not, as discussed in the introduction. In the question-specific class we distinguish direct from indirect answers and dependent questions. 5 Direct answers (DA) provide an answer straightforwardly. 6,7 This is clearly visible in the following example, where B is providing information required by A:
4 For an explicit presentation of the taxonomy sans answers, see (Łupkowski and Ginzburg, 2016, p. 256).
5 An anonymous reviewer for Dialogue and Discourse suggests that directness may be understood as a separate dimension, which is independent from the others. They suggest that any type of response may be presented in a direct or indirect manner (not just answers). This is a hypothesis we think is worth testing, though we do not do so in the current paper.
6 We give a more explicit characterization of answerhood in Section 8; for a thorough, historically based discussion see (Wiśniewski, 2015).
7 For the direct answers category we allow for additional sub-categories, which we did not use in the annotation, but which we return to discuss briefly in Section 8. These include: (1) no/yes answer to polar questions; (2) simple answers to wh-questions; (3) partial polar answers; (4) partial wh-question answers.
In (16) A needs to infer the answer to his/her questions from B's suggestion that this issue has been addressed before. One also encounters IND being a question-response, as in (17), which is rhetorical and in this sense does not need to be answered and indirectly provides an answer to the initial question (q1). The other two remaining super-categories reuse the classes proposed in (Łupkowski and Ginzburg, 2013, 2016) with some minor renaming. We start with the metacommunicative class, involving Clarification responses and acknowledgments.
Clarification responses (CR) address something that was not completely understood in the initial question (q1) 9 , like:
Moving on to evasive question-responses, we mention first the type which addresses the motivation underlying asking q1 (MOTIV). Whether an answer to q1 will be provided depends on a satisfactory answer to q2, as in (21a); (21b) is an instance where the responder offers an answer negatively resolving the motivation issue:
A related class, which was subsumed within MOTIV in (Łupkowski and Ginzburg, 2016), 11 but which we separate out here, involves cases where the speaker states that it is difficult to provide an answer (DPR), points at a different information source, etc., or states that s(he) does not know the answer.
Another type of evasive question-response is change-the-topic (CHT). Instead of answering q1, the agent directly provides q2 and attempts to turn the table on the original querier. The original querier is pressured to answer q2 and put q1 aside, as exemplified in (24a) and most explicitly in (24b). 12
(24) a.
An IGNORE type of query-response appears when q2 relates to the situation described by q1 but not directly to the initial question. This can be observed in (26). A and B are playing Monopoly. A asks a question, which is ignored by B. It is not that B does not wish to answer A's question and therefore asks q2. Rather, B ignores q1 and asks a question related to the situation (in this case, the board game).

Corpus data used for the study
In order to test our main hypothesis, we used corpora from two languages: English and Polish.

English: BNC, BEE, MapTask, CornellMovie
The data for English come from the BNC (Burnard, 2007), BEE (Rosé et al., 1999), MapTask (Anderson et al., 1991; Skantze et al., 2006) and the CornellMovie corpora (Danescu-Niculescu-Mizil and Lee, 2011). Although both self-answering and multiparty turns figured in the initial development stage of the taxonomy, we restricted attention to two-person dialogue in the study reported here. 607 Q-R turns were taken from the BNC, 262 Q-R turns from BEE, 460 Q-R turns from the MapTask, and 911 Q-R turns from the CornellMovie corpus. The BNC data covers mainly free conversations: initially 864 Q-R pairs from BNC were annotated, but after elimination of multi-party segments, 607 Q-R pairs were retained. As for BEE, 37 undergraduate students with little background in electricity or electronics participated in conversations with a tutor. We randomly selected the students' numbers (23, 25, 27, 28, 29, and 31) and annotated the dialogues generated between those students and the tutor. In this way, we obtained 262 Q-R pairs. The MapTask consists of dialogues recorded for a route following task in which one participant directs a second participant along a route in a map, though the route giver and route follower maps are not identical. 297 of the 460 MapTask Q-R pairs are from the HCRC MapTask corpus (Anderson et al., 1991), whereas 163 of them are from the Higgins pedestrian navigation and guiding project (Skantze et al., 2006). The HCRC MapTask corpus contains 128 dialogues, 64 of which involve eye contact between participants, while the remaining 64 dialogues involved no eye contact. In this study, we chose 28 dialogues, 14 with eye contact and 14 without. The filenames of these dialogues were selected randomly. We annotated all Q-R pairs occurring in each dialogue and obtained 297 annotated Q-R pairs after eliminating cases involving self-answering, incomplete questions, and overlapping.
As for the Q-R turns from the Higgins project, we annotated six folders (files no. 1 to no. 6), each of which contains 4 or 5 different dialogue files. As a result, we also annotated 28 dialogues and obtained 163 Q-R turns. The CornellMovie corpus is a collection of fictional conversations extracted from raw movie scripts. We annotated all available two-person dialogues from the first 8 movies listed in the corpus, ranging from the movie ID m0 to m7, thereby obtaining 911 Q-R pairs. This covers various genres such as comedy, romance, adventure, biography, history, action, crime, science fiction, thriller, fantasy, and horror.
The Polish data used for this study were retrieved from the Spokes corpus. The corpus currently contains 247,580 utterances (2,319,291 words) in transcriptions of spontaneous conversations. For the purposes of this paper, two studies were conducted (with two different sets of annotators). For the first study, we selected four files from the corpus (10,244 words). For the second study, 21 files were selected (86,052 words). The files cover casual conversations concerning, e.g., youth, TV shows, children, wine, or travel plans. Within each file, the question-response pairs (Q-R) were selected manually. In total, we obtained 694 Q-R pairs for the two studies. 13
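The sample sizes quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below only restates figures given in the paper; the split of the Polish data into 205 and 489 pairs is taken from the annotation section, and the variable names are illustrative.

```python
# Q-R pair counts as reported in the text (no new data).
english_counts = {"BNC": 607, "BEE": 262, "MapTask": 460, "CornellMovie": 911}
maptask_parts = {"HCRC": 297, "Higgins": 163}        # the two MapTask sources
polish_samples = {"study_1": 205, "study_2": 489}    # sizes given in the annotation section

english_total = sum(english_counts.values())
polish_total = sum(polish_samples.values())

# BNC pairs dropped when multi-party segments were eliminated (864 annotated, 607 kept).
bnc_dropped = 864 - english_counts["BNC"]

print("English Q-R pairs:", english_total)
print("Polish Q-R pairs:", polish_total)
print("MapTask parts add up:", sum(maptask_parts.values()) == english_counts["MapTask"])
print("BNC pairs dropped as multi-party:", bnc_dropped)
```

The totals agree with the 2,240 English and 694 Polish pairs mentioned in footnote 1.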

Annotation method
For the annotation, all the question-response pairs were supplemented with a full context. The guideline for annotators contained explanations of all the classes and examples for each category. Moreover, the OTHER category was included. The complete annotation guidelines are presented in Appendix B of this paper.
English data annotation: The 607 BNC Q-R turns used in this study were randomly extracted from the British National Corpus (BNC) and manually annotated by one English L1 speaker and two English L2 speakers who have master's degrees in Linguistics and underwent several training sessions with one of the authors, a native speaker of English with significant experience in dialogue annotation.
Among the 607 Q-R turns, 334 of them were annotated by the first and second annotators, whereas the remaining 273 Q-R turns were annotated by the first and the third annotators. Therefore, an inter-annotator study was conducted in two groups: first vs. second annotators, and first vs. third annotators.
Polish data annotation: The first sample of 205 Q-Rs was annotated by the main annotator and two other annotators (one of whom has previous experience in corpus data annotation; all annotators were Polish native speakers). The annotators received the annotation guidelines and underwent a short training phase based on selected examples. The second sample of 489 Q-Rs was annotated by the main annotator and two other annotators, different from those of the first sample (the main annotator remained the same as in the first sample; all annotators were Polish native speakers). As in the previous case, annotators received the annotation guidelines and underwent a short training based on selected examples.

Results
The detailed results of the annotation are presented in Figure 2. We discuss the annotation reliability in Section 7. We also provide additional data for this paper (covering annotated Q-R pairs and disagreement cases) which are hosted on the OSF web-page (https://osf.io/mq6r7/).

English
In all four cases, the OTHER class is less than 0.5%, hence coverage is above 99.5%. The most frequent response classes in all four corpora are direct answers; the second most frequent class in the BNC is Difficult to provide an Answer (DPR=7.91%), while in CornellMovie, the next biggest is indirect answers (IND=18.33%), whereas for the MapTask and BEE these are IGNORE (6.09% and 3.82% respectively).

Polish
The two most frequent classes of responses for Spokes are answers: direct ones (DA=64.27%) and, much less frequent, indirect ones (IND=10.66%). The next two most frequent classes are DPR (stating that a person does not know the answer to the question, or it is difficult to provide one, DPR=7.78%) and utterances ignoring the question asked (questions and declaratives, IGNORE=6.92%).

Discussion
When comparing results for English and Polish, it is apparent that the largest category is direct answers (DA). Also, indirect answers constitute a large group among recognized response types in both languages. This result is in line with the findings reported by Stivers and Robinson (2006) and Yoon (2010), summarized in Section 2.
As might be expected given the previous results presented in Łupkowski and Ginzburg (2016), the most frequent question-response for English and Polish data is the clarification request. What is interesting is the relatively high number of ignoring responses observed for English and Polish. In (Łupkowski and Ginzburg, 2016) we analyzed only question-responses and this type was observed rarely (0.57% for N=1,051 for BNC). This time, IGNORE has been used also to classify declaratives, which may explain the higher number observed; we discuss a possible semantic explanation for this in Section 8, where we suggest that it is in some sense only "weakly evasive". Other evasive responses (relatively) frequent in both languages are CHT and DPR.
We can also make some comments concerning cross-corpus differences. As we already mentioned in Section 4, our BNC and Spokes data cover mainly free conversations, while BEE and MapTask contain task-oriented dialogues. One might expect differences between these dialogue genres. These expectations are indeed fulfilled: the MapTask and BEE corpora have the highest number of direct answers in our study sample (80.87% and 88.93% respectively). In contrast, for the BNC and Spokes corpora these numbers are substantially lower (respectively 69.36% and 64.27%). When it comes to clarification responses, we observe that the numbers are lower for task-oriented corpora than for the BNC and Spokes (this is in line with our previous results for BNC and BEE, reported in (Łupkowski and Ginzburg, 2016, p. 256-257)). We also observe that, for the evasive response types discussed above, the tendency is analogous, i.e., we observed lower numbers for task-oriented dialogues than for free conversations. For the CornellMovie corpus, which is a collection of fictional, scripted conversations extracted from raw movie scripts, we observe tendencies akin to the BNC and Spokes. This is more or less expected, as elite scriptwriters aim for and succeed in mimicking natural conversation. One notable exception is the CHT response category (11.75% vs 2.31%-3.03%). One may hypothesize that such an evasive response is especially useful for movie dialogue writers; however, this observation needs further investigation.
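The direct-answer percentages above can be related back to the sample sizes from Section 4. The following sketch derives the DA counts implied by the reported percentages and checks that each count rounds back to the reported figure. The counts are an inference made here for illustration, under the assumption that each percentage is taken over the corpus's full Q-R total; they are not values stated in the text.

```python
# Q-R totals per corpus and direct-answer (DA) percentages, both as reported in the text.
totals = {"BEE": 262, "MapTask": 460, "BNC": 607, "Spokes": 694}
da_percent = {"BEE": 88.93, "MapTask": 80.87, "BNC": 69.36, "Spokes": 64.27}
genre = {
    "BEE": "task-oriented",
    "MapTask": "task-oriented",
    "BNC": "free conversation",
    "Spokes": "free conversation",
}


def implied_count(percent, total):
    """Nearest whole number of Q-R pairs corresponding to a percentage (an inference)."""
    return round(percent * total / 100)


# Assumption: each percentage is computed over the corpus's full Q-R total.
da_counts = {name: implied_count(da_percent[name], totals[name]) for name in totals}


def round_trips(name):
    """True if the implied count reproduces the reported two-decimal percentage."""
    return round(100 * da_counts[name] / totals[name], 2) == da_percent[name]


for name in totals:
    print(f"{name:8s} {genre[name]:18s} DA = {da_counts[name]}/{totals[name]} "
          f"consistent: {round_trips(name)}")
```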

Annotation reliability

Inter-annotator study
We conducted the following inter-annotator reliability study on the English BNC and Polish Spokes corpora, as they were double annotated by multiple annotators. However, other English corpora such as BEE, MapTask, and CornellMovie were annotated only once.
English: The reliability of the annotation was evaluated using the κ (Carletta, 1996) and α (Krippendorff, 2011) coefficients. We used the Scikit-learn (Pedregosa et al., 2011) data mining and data analysis tool in Python with its sklearn.metrics package for calculating Cohen's kappa, and also used the Python implementation Krippendorff 14 for the calculation of Krippendorff's alpha. In this case, Cohen's Kappa for the first and second annotators is 0.7053 (substantial), whereas for the first and third annotators it is 0.6430. Krippendorff's alpha for the first group is 0.7022, while it is 0.6373 for the second group. All disagreements were then discussed in detail by one of the annotators and the aforementioned author and resolved. As a result, we obtained a gold standard for this BNC annotation task. In addition, as seen in Figure 3 and Figure 4, we created a confusion matrix for each of these three annotators by comparing their annotations with the gold standard. We also calculated precision, recall, and F-1 measures of each class for all three annotators as shown in Table 2. All were calculated by using the data analysis tool Scikit-learn in Python with its sklearn.metrics package.
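To make the two coefficients concrete, here is a minimal sketch of both for two annotators and nominal labels. It is a plain-Python re-implementation for illustration, not the tooling used in the study (which relied on sklearn.metrics and a Krippendorff package); the function names and the ten toy labels are hypothetical, and the alpha function assumes no missing annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items (nominal labels)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha, nominal metric, two annotators, no missing values."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n_values = 2 * len(labels_a)                       # every item contributes two values
    value_counts = Counter(labels_a) + Counter(labels_b)
    # Ordered mismatching pairs in the coincidence matrix: two per disagreeing item.
    mismatches = 2 * sum(a != b for a, b in zip(labels_a, labels_b))
    # Expected mismatches from the pooled label distribution.
    expected = sum(
        value_counts[c] * value_counts[k]
        for c in value_counts for k in value_counts if c != k
    )
    return 1 - (n_values - 1) * mismatches / expected


# Hypothetical toy annotations using class names from the taxonomy (not corpus data).
annotator_1 = ["DA", "DA", "DA", "IND", "CR", "DA", "DPR", "DA", "CR", "IGNORE"]
annotator_2 = ["DA", "DA", "IND", "IND", "CR", "DA", "DPR", "DA", "CR", "DA"]

kappa = cohen_kappa(annotator_1, annotator_2)
alpha = krippendorff_alpha_nominal(annotator_1, annotator_2)
print(f"kappa = {kappa:.4f}, alpha = {alpha:.4f}")
```

Kappa estimates chance agreement from each annotator's own label distribution, whereas alpha pools the labels of both annotators, so the two values differ slightly on the same data.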
We can assess the annotation performance of each annotator by investigating the results shown in the confusion matrices in Figure 3 and Figure 4, as well as from the precision, recall, and F-1 scores reported in Table 2. For the largest categories, on the whole, DA and CR were easy to annotate; IND was more tricky. In more detail: for the first group annotation in English, there are 334 annotated Q-R pairs in total, and among them 232 are DA, 33 are CR, and 16 are IND. The first annotator correctly annotated 206 of DA as DA but misannotated 19 of them as IND, 2 as ACK, one as CR, 3 as DPR, and one as IGNORE. Therefore, the first annotator obtained a precision of 0.99, recall 0.89, and an F-1 score of 0.94 for the response type DA. The second annotator, on the other hand, correctly annotated 219 out of 232 actual DA as DA but misannotated 2 of them as CR, 1 as CHT, 7 as IND, 2 as IGNORE, and 1 as OTHER. Therefore, the second annotator obtained a precision of 0.98, recall 0.94, and an F-1 score of 0.96 for the response type DA. As for the response type CR, the first and second annotators obtained recall scores of 0.94 and 0.97 respectively. That is, the first annotator correctly identified 31 out of 33 CR, and only misclassified 2 as IND. The second annotator correctly identified 32 of the 33 CR, and only misclassified one as IGNORE. The precision and F-1 score of CR are both 0.94 for the first annotator, and 0.94 and 0.96 respectively for the second annotator. Regarding the response type IND, the first annotator correctly annotated 15 out of 16 as IND but misclassified 1 The third annotator obtained a precision score of 0.93, recall 0.87, and F-1 score of 0.90. As for the response type CR, the first annotator correctly annotated 9 cases, but misidentified 3 as IGNORE, and 2 as IND. The precision, recall, and F-score for CR are 1.00, 0.64, and 0.78 respectively for the first annotator.
The third annotator performed similarly on CR: she correctly annotated 8 out of 14 cases but misclassified 5 as DA and 1 as OTHER, obtaining 1.00, 0.57, and 0.73 for precision, recall, and F-1 score respectively. As to the response type IND, the first annotator annotated 5 out of 8 cases correctly and misclassified 2 as DA and 1 as DPR; the resulting precision, recall, and F-1 score are 0.19, 0.62, and 0.29 respectively. IND was also difficult for the third annotator, who correctly identified 4 out of 8 IND cases but misclassified 3 as DA and 1 as IGNORE; her precision, recall, and F-1 score for IND are 0.13, 0.50, and 0.21 respectively. The two annotators performed similarly on the response type IGNORE. However, the F-1 scores for CHT are 0.95 and 0.70 for the first and the third annotator respectively, and those for DP are 1.00 and 0.00. In addition, the annotators of the second group performed better than those of the first group on the response classes DPR and ACK.
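The DA scores reported for the first English group can be re-derived from the counts given in the text, which serves as a quick consistency check. The sketch below takes recall from the number of correctly annotated cases over the gold total and combines it with the reported precision; the precision itself cannot be recomputed here, since the column totals of the confusion matrices are not given in the text. The helper names are assumptions made for this example.

```python
def recall(true_positives, gold_total):
    """Share of gold-standard cases of a class that the annotator found."""
    return true_positives / gold_total

def f1(precision, recall_value):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall_value / (precision + recall_value)

# DA, first English group: (correct, gold total, precision as reported in the text)
da_counts = {
    "first annotator": (206, 232, 0.99),
    "second annotator": (219, 232, 0.98),
}

da_scores = {}
for name, (correct, total, precision) in da_counts.items():
    rec = round(recall(correct, total), 2)
    da_scores[name] = (rec, round(f1(precision, rec), 2))
    print(name, da_scores[name])
```

The sketch reproduces the reported recall and F-1 values: 0.89 and 0.94 for the first annotator, 0.94 and 0.96 for the second.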
Polish The reliability of the annotation for Polish was also evaluated using the κ (Carletta, 1996) and α (Krippendorff, 2011) coefficients. As mentioned above, the main annotator was the same person in both samples, whereas the other annotators differed between the two annotation groups. The reported values were calculated using the same method and tools as for English. For the first sample, the best inter-annotator κ and α scores, 0.6579 and 0.6582 respectively, were achieved by the second and the main annotators. For the second sample, the highest inter-annotator agreement was observed between the first and the main annotators, with 0.5467 for κ and 0.5466 for α, as shown in Table 3. All disagreements were discussed in detail by the main annotator and resolved. We again used the sklearn.metrics package of Scikit-learn to create a confusion matrix for each of the five annotators by comparing their annotations with the gold standard, and to calculate the precision, recall, and F-1 measures of each response type. The annotation performance of each Polish annotator is presented in detail in Table 4, Table 5, and Figure 5. The frequencies of the response types in the first Polish group are DA: 107, IND: 29, CR: 9, DPR: 22, IGNORE: 23, ACK: 3, CHT: 11, DP: 1, and MOTIV: 0. Regarding the most frequent response type, DA, the first annotator correctly annotated 87 out of 107 DA cases but misclassified 13 as IND, 3 as IGNORE, 2 as DPR, 1 as CR, and 1 as OTHER.
As a result, the first annotator obtained a precision of 0.94, a recall of 0.81, and an F-1 score of 0.87. The second annotator correctly identified 6 more DA cases than the first annotator but misannotated 5 as IND, 3 as CHT, 1 as CR, 2 as IGNORE, and 1 as DPR. The performance scores are close to those of the first annotator: 0.94, 0.89, and 0.91 for precision, recall, and F-1 score respectively. As to the response type IND, the first annotator correctly annotated 21 out of 29 IND cases but misclassified 3 as DA and the other 5 as CHT, CR, IGNORE, MOTIV, and OTHER respectively. The precision, recall, and F-1 score of IND for the first annotator are 0.48, 0.72, and 0.58. The second annotator correctly annotated 18 out of 29 IND cases but misclassified 5 as IGNORE, 2 as CHT, and the other 4 as DA, DP, DPR, and OTHER respectively. The precision, recall, and F-1 score for the second annotator are 0.75, 0.62, and 0.68. Regarding the response type CR, the first annotator successfully identified 5 cases, whereas the second annotator identified 7; the F-1 scores are 0.59 and 0.82 respectively. As for IGNORE, the first annotator correctly identified only 12 cases out of 23, whereas the second annotator correctly classified 21; the F-1 scores are 0.53 and 0.78. Comparing the F-1 scores for each response type, we see that the second annotator generally performed better than the first.
The second Polish group comprises 489 annotated Q-R pairs. The frequencies of the response types are DA: 339, IND: 45, CR: 20, DPR: 32, IGNORE: 25, ACK: 15, CHT: 10, DP: 2, and MOTIV: 1. As for the response type DA, the first annotator correctly annotated 328 out of 339 DA cases but misclassified 4 as DPR, 2 as CHT, 2 as CR, and the remaining 3 as DP, IND, and IGNORE. The precision, recall, and F-1 score for the first annotator are 0.93, 0.97, and 0.95 respectively. The second annotator, on the other hand, correctly identified 266 out of 339 DA cases and misclassified 33 as IND, 13 as DPR, 14 as IGNORE, 10 as CHT, 2 as ACK, and 1 as CR. As a result, the second annotator obtained a precision of 0.99, a recall of 0.78, and an F-1 score of 0.88. Regarding the response type IND, the first annotator correctly annotated 34 out of 45 IND cases but misclassified 10 as DA and 1 as CHT, obtaining a precision of 0.85, a recall of 0.76, and an F-1 score of 0.80. The second annotator correctly identified 38 out of 45 IND cases but misannotated 7 as IGNORE; the precision, recall, and F-1 score are 0.49, 0.84, and 0.62 respectively. As to the response type CR, the first annotator correctly annotated 9 out of 20 CR cases but failed to identify the remaining 11; the precision, recall, and F-1 score are 0.82, 0.45, and 0.58. The second annotator correctly annotated 11 out of 20 CR cases but misclassified 2 as DA, 3 as IND, and 4 as IGNORE; the precision, recall, and F-1 score are 0.92, 0.55, and 0.69. Regarding the response type IGNORE, the first annotator correctly identified 14 out of 25 IGNORE cases but misclassified 6 as CHT, 3 as DA, and 2 as IND; the precision, recall, and F-1 score are 0.82, 0.56, and 0.67. The second annotator correctly annotated 23 out of 25 IGNORE cases and misclassified only 2 of them as CHT.
Even though the second annotator obtained a high recall of 0.92, he has a low precision and F-1 score, which are 0.43 and 0.58 respectively. In addition, the first and the second annotators performed similarly on the annotation of the response types DPR, ACK, CHT, and MOTIV. However, as for DP, the second annotator obtained 1.00 for all the precision, recall, and F-1 scores, and the first annotator obtained 0.40, 1.00, and 0.57 respectively.
The main annotator outperformed all the other annotators in most cases in both groups of annotation samples. In the first group of samples, however, the main annotator failed to capture the response type DP, which has only one case in this sample. For all other response types, he obtained very high F-1 scores, above 0.90 in most cases, with 0.83 and 0.87 for IND and CHT respectively. In the second group of samples, he did not perform as well as the other annotators on ACK: he obtained an F-1 score of only 0.55, while the other two obtained 0.97 and 0.94 respectively. He also obtained an F-1 score of only 0.50 for the response type DP. Moreover, the first annotator outperformed the main annotator on IND, with an F-1 score of 0.80 against 0.73 for the main annotator.

The previous inter-annotator reliability study was carried out on the full taxonomy of question responses. We also performed inter-annotator reliability tests on several subsets of the taxonomy, to learn which subsets can be reliably annotated. We again used Cohen's kappa for this task.
English The detailed Cohen's kappa scores on different subsets of the taxonomy for English are presented in Table 6. As shown in the table, the response types DA, CR, ACK, DPR, DP, and MOTIV were annotated with an almost perfect agreement level (above 0.9) (McHugh, 2012) between annotators in both groups of the experiment. However, adding the response types IGNORE, CHT, and IND caused a sharp decrease in the agreement level. The indirect answer (IND) is the type that lowers the agreement level between annotators most markedly.

Polish The agreement levels among annotators on different subsets of the taxonomy for the two groups of Polish annotation are displayed in Table 7 and Table 8 respectively. Comparing the overall results in the two tables, we found that the agreement level among the annotators in the first group is generally higher than in the second group. In the first group, the response types DA, CR, ACK, DPR, DP, and MOTIV were annotated with a strong agreement level (0.8-0.9) (McHugh, 2012) between the first and the main annotators, and also between the second and the main annotators. However, those response types were annotated with a moderate agreement level (0.60-0.79) (McHugh, 2012) between the first and the second annotators. In the second group (Table 8), the response types DA, CR, ACK, DPR, DP, and MOTIV were annotated with a moderate agreement level (0.60-0.79) among nearly all annotators. In both groups, the agreement level dropped clearly when IGNORE, CHT, and IND were added.

To sum up, response types such as DA, CR, ACK, DPR, DP, and MOTIV can be reliably annotated by all annotators in both languages, whereas response types such as IGNORE, CHT, and IND cause more confusion to the annotators. Among all response types, the indirect answer (IND) is the one that is most difficult to annotate.
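The agreement bands cited from McHugh (2012) in this passage can be written down as a small lookup, which makes the boundaries explicit. This is only a sketch: the function name `agreement_level` is an assumption, the exact treatment of the boundary values is our reading of the ranges quoted above, and the label "below moderate" is a placeholder for everything under 0.60, for which the text names no band.

```python
def agreement_level(kappa):
    """Map a kappa value to the agreement bands quoted in the text."""
    if kappa > 0.90:
        return "almost perfect"
    if kappa >= 0.80:
        return "strong"
    if kappa >= 0.60:
        return "moderate"
    return "below moderate"  # placeholder label, not a band named in the text

# Best full-taxonomy kappa values reported for the two Polish samples.
for value in (0.6579, 0.5467):
    print(value, agreement_level(value))
```

Under these bands, the best full-taxonomy κ for the first Polish sample (0.6579) counts as moderate, while that for the second sample (0.5467) falls below the moderate band.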

Disagreement analysis
For English: Among the 607 commonly annotated BNC Q-Rs, there are 108 cases of annotation disagreement between two annotators, as shown in Table 9. The main disagreements concerned DA versus IND (52), IGNORE versus CHT/ACK/DP/DA/DPR/IND (33), and ACK versus [...] (30). Invariably, the direct/indirect disagreements occurred with 'why', 'how' and 'what is X doing' questions, where answers are by and large sentential and for which there has been significant controversy in the theoretical literature on how to characterize answerhood (Kuipers and Wiśniewski, 1994; Asher and Lascarides, 1998). In the above conversations, (30a) is an example of DA versus IND: the first annotator categorized it as IND, while the second annotated it as DA. After discussion, we decided to classify it as IND, given that a certain amount of inference is needed to know the exact time of the bus service. For (30b), the first annotator annotated it as IGNORE, while the second marked it as DA; after discussion, we decided that it should be categorized as DA, since the response emphasizes the fact that "because it is actually not nice". For (30c), the first annotator annotated the answer as IGNORE, while the second categorized it as CHT; after discussion, we kept IGNORE as the correct annotation, since the answer is also related to the main topic "sock". (30d) is an example of ACK versus OTHER: the first annotator annotated it as OTHER, while the second treated it as ACK. However, considering the surrounding context, we concluded that it is actually a direct answer to the question.

For Polish: For the whole annotated sample, we observed 41 cases with disagreement between all three annotators (as shown in Table 10). The main disagreements concerned DA versus DPR (12), a notable difference from the English data. 15 We also observed some DA versus IND disagreements, but these were much less common (4).
It is also the case that the IGNORE category appears often in the disagreements summary (versus DA, CR, IND, CHT, and DPR).
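Disagreement summaries of the kind reported in Table 9 and Table 10 can be tallied by counting, for every item on which two annotators differ, the unordered pair of labels they chose. The sketch below illustrates this; it is not the procedure used in the study, the function name is an assumption, and the toy labels merely mirror the four disagreement types discussed for (30a-d) plus one agreeing item.

```python
from collections import Counter

def disagreement_pairs(labels_a, labels_b):
    """Count unordered label pairs on items where two annotators differ."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs

# Toy labels: four disagreements of the types in (30a-d) and one agreement.
first = ["IND", "IGNORE", "IGNORE", "OTHER", "DA"]
second = ["DA", "DA", "CHT", "ACK", "DA"]

tally = disagreement_pairs(first, second)
for pair, count in tally.most_common():
    print(" versus ".join(pair), count)
```

Sorting each pair means that "DA versus IND" and "IND versus DA" are counted as the same disagreement type, which matches how the categories are reported in the text.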
Among the analyzed disagreement cases, two are especially interesting as the disagreement of all annotators is observed for consecutive turns in a dialogue. The first problematic case is for [016O, 62-65]. A and B are discussing B's application for a scholarship.
A: a w tej twojej szkole ty jako <PAUSE> twoja kandydatura została złożona tylko <PAUSE> czy jeszcze jakiś innych osób też [and in your school it is you <PAUSE> you are the only candidate <PAUSE> or maybe there are some other people who also applied]

15 We hypothesize that the reason for this may be the background of annotators as logicians. From a logical perspective the exhaustiveness of an answer is important (see e.g. Wiśniewski, 2013). Thus, certain partial answers provided by dialogue participants were probably tagged as DPR. This may be due to the fact that partial answers were not explicitly pointed out in the guidelines.

In this case, the annotators disagreed on whether B's first utterance should be classified as 'it is difficult to provide an answer' (DPR) or as an indirect answer (IND). For B's second utterance, the suggested types were DPR and IGNORE.
Another example where the disagreement was observed for two consecutive utterances is [01AO, 256-259]. Most probably, this is caused by the fact that four participants took part in this dialogue (which makes an interpretation of question responses much more difficult). Here C's utterance was tagged as OTHER, DP, and CR. It seems that in this case, C's utterance may be treated as a simple repetition of B's question, and as such, it should not be recognized as DP. As for A's utterance, it was tagged as IND, DPR, and DA by the annotators. In this case, the answer does not require any form of inference. It simply states that it will be the same amount of money you can earn in certain places. The place and the amount of money are then pointed out by the following D's utterance. That speaks for interpreting A's utterance as a DA (however, a partial one). 16

Formal Analysis
There is a two-way relationship between corpus studies of questions and responses and formal semantic theories of questions and of dialogue. Notions from the latter play an important role in the design of the former. And one can strive to show that the categories posited are coherent formally using formal theories. Conversely, the ability to fully describe the data that emerges from such corpus studies can be used as a means for evaluating different approaches. Our aim in this section is to address both directions alluded to above.
Our explication is formulated using the frameworks of TTR (Cooper and Ginzburg, 2015; Cooper, 2023) for the semantic ontology and KoS (Ginzburg, 2012) for the theory of dialogue context; the relevant notions of TTR are sketched in Appendix A, whereas those of KoS are introduced in the text.

The classes DA, DP, IND
We assume that questions are propositional abstracts; extensive motivation for this view is provided in (Ginzburg, 1995; Ginzburg and Sag, 2000; Krifka, 2001), and the particular implementation of this view in TTR can be found in (Ginzburg, 2012; Cooper and Ginzburg, 2015). 17 (33) exemplifies the denotations (contents) we can assign to a unary wh-interrogative, a binary wh-interrogative, and polar questions. We use r_ds here to represent the record that models the described situation in the context. The meaning of the interrogative would be a function defined on contexts which provide the described situation and which return as contents the functions given in (33). The unary question ranges over instantiations by persons of the proposition "x runs in situation r_ds". The binary question ranges over pairs of persons x and things y that instantiate the proposition "x touches y in situation r_ds":

(33) a. who ran → λr: x: [...]

Polar questions are analyzed, following an initial proposal of Ginzburg and Sag (2000), as 0-ary abstracts, which in TTR is a question whose domain is the empty record type [] (that is, the type Rec of records). 18 This makes a 0-ary abstract a constant function from the universe of all records. This allows us to distinguish the denotations of positive and negative polar questions, as exemplified in (33c,d) and as motivated by a variety of linguistic phenomena (Hoepelmann, 1983; Cooper and Ginzburg, 2012). At the same time, it ensures that the answerhood relations they give rise to are (truth conditionally) equivalent, given that the simple answerhood relations they give rise to are equivalent and other answerhood relations are defined in terms of these. 19 Simple answerhood is the range of the propositional abstract, plus their negations.
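The view of questions as propositional abstracts, and of simple answerhood as the range of the abstract plus negations, can be illustrated with ordinary functions. The sketch below is a deliberately crude approximation and not the TTR formalization: propositions are plain tuples, the described situation r_ds is left out, and the domains `PERSONS` and `THINGS` as well as all function names are assumptions made for this example.

```python
from itertools import product

PERSONS = ["bo", "millie"]   # toy domain of persons
THINGS = ["vase"]            # toy domain of things

def who_ran(x):
    """Unary abstract: maps a person to the proposition that they ran."""
    return ("run", x)

def who_touched_what(x, y):
    """Binary abstract: maps a person and a thing to a 'touch' proposition."""
    return ("touch", x, y)

def did_bo_run():
    """0-ary abstract (polar question): a constant function."""
    return ("run", "bo")

def simple_answers(abstract, *domains):
    """Range of the abstract over its domains, plus the negations."""
    atomic = [abstract(*args) for args in product(*domains)]
    return atomic + [("not", p) for p in atomic]

print(simple_answers(who_ran, PERSONS))
print(simple_answers(did_bo_run))
```

Because a 0-ary abstract takes no arguments, its range contains a single proposition, so the simple answers to a polar question are just that proposition and its negation.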
We exemplify what this amounts to for some cases in (34), using, as we mostly do in the sequel, familiar λ-notation for wh-questions and p?-notation for polar questions, rather than the official TTR notation above: 20

Assuming questions to be propositional abstracts means that they can be used to underspecify answerhood. This is important given that NL requires a variety of answerhood notions, both for classifying responses and also for the role questions play as arguments to predicates such as 'know', 'tell', and 'depends', which in turn play a role in associated discourse reasoning (Groenendijk and Stokhof, 1997; Wiśniewski, 2015). In fact, simple answerhood, though it has good coverage in practice, is not sufficient. It does not accommodate conditional, weakly modalized, and quantificational answers, all of which are pervasive in actual linguistic use (Ginzburg and Sag, 2000):

Thus, we suggest that the semantic notion relevant to direct answerhood is the relation aboutness, a relation between propositions and questions that any speaker of a given language can recognize, independently of domain knowledge and of the goals underlying an interaction.
The most detailed discussion of Aboutness we are aware of is (Ginzburg and Sag, 2000, pp. 129-149), which offers (36a) (reformulated here in TTR) 21 as a characterization of Aboutness that can accommodate data such as (35). 22 This requires the situational type component of the proposition to be a subtype of the join of the situational type of the question's simple answer set. As it stands, this definition allows in principle very informationally strong types as direct answers, since nothing bounds the proposition from above. Plausible upper bounds for direct answerhood, familiar in the semantics of questions from the classic proposal of Karttunen (1977), are the meets of the question's atomic and negative atomic answer set. 23 This condition is formulated in (36b): 24,25

21 See Appendix A for some additional details.

22 Ginzburg (1995) suggested that Aboutness is closed under conditionalization: i.e., for any r, p if p is about q, then so is if r, then p:
(i) A: Who will win tomorrow's match? B: If it isn't raining, the French.
(ii) A: Did someone switch the oven off? B: Unless you explicitly told them to, no one did.
The definition given in the text covers non-conditionalized answers. One crude strategy to obtain the latter, as proposed by Ginzburg (1995), is to extend the definition for non-conditionalized answers by closing it under conditionalization.

23 For a polar question p? the meets of the question's atomic and negative atomic answer set are respectively p and ¬p, whereas for a wh-question λx.P(x) (e.g., 'who left') they are respectively ⋀_i P(a_i) ('Bo left and Millie left ...') and ⋀_i ¬P(a_i) ('Bo did not leave and Millie did not leave ...', i.e., equivalent to 'No one left').

24 For a wide ranging discussion of a variety of answerhood relations, see (Wiśniewski, 2015). He leaves the composition of his "base answer set", the Principal possible answers (PPAs), as a parameter of the theory, to be fixed independently from the questions, since his account is stated in an artificial logical language that is not directly tied to linguistic forms. Hence, his account is compatible in principle with most semantic approaches to questions.

25 Our use of subtyping as a means of characterizing aboutness reflects that, as an anonymous reviewer for Dialogue and Discourse points out, both direct and indirect answerhood involve inference. As we discuss below, for the latter the notion of inference is an agent-relative notion.

Despite the proposals mentioned above for explicating direct answerhood, a comprehensive, empirically-based, experimentally tested account for a variety of wh-words is still elusive and an important task for future work.
An additional important notion a theory of questions needs to provide for is a notion of exhaustiveness or resolvedness, though this is in general pragmatically parametrized (Ginzburg, 1995;Asher and Lascarides, 1998;van Rooy, 2003). Whether a response is resolving (or merely goal fulfilling without so doing) can determine whether the response will be accepted as sufficient to end discussion of the question or requires a follow up. Hence, the need for a finer-grained subdivision of the answer categories, as we hinted in footnote 7.
Given a notion of aboutness and some notion of (partial) exhaustiveness/resolvedness, one can then define question dependence (needed for the class DP), for instance, as in (37), though various alternative definitions have been proposed (Groenendijk and Stokhof, 1997; Groenendijk and Roelofsen, 2011; Wiśniewski, 2013). For all these definitions, as with aboutness, their coverage awaits testing on empirical data:

(37) Depend(q1, q2) iff any proposition p such that p resolves q2, also satisfies p entails r for any r such that r is about q1 (Ginzburg, 2012, (61b), p. 57).
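For readers who prefer a quantified statement, the definition of question dependence can be rendered schematically as below. This is a sketch under a stated assumption: the clause "p entails r" is read existentially over r, since a universal reading would require every resolving proposition to entail all propositions about q1, including mutually incompatible ones. The predicate names Resolve, About, and Entail are shorthand introduced here for the relations named in the prose.

```latex
\[
\mathrm{Depend}(q_1, q_2) \iff
\forall p\,\Bigl[\mathrm{Resolve}(p, q_2) \rightarrow
\exists r\,\bigl[\mathrm{About}(r, q_1) \wedge \mathrm{Entail}(p, r)\bigr]\Bigr]
\]
```

On this reading, q1 depends on q2 when any way of resolving q2 yields at least some information about q1.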
We have introduced answerhood notions corresponding to direct answerhood and to questiondependence, two of the three response categories we identified as Question-Specific in Section 3. Before we introduce the third notion, indirect answerhood, we sketch an account of dialogue context, which will allow us to integrate all three in a semantics for dialogue.
The simplest model of context, going back to Montague (1974), is one which specifies the existence of a speaker, addressing an addressee at a particular time. This can be captured in terms of the type in (38).

However, over the last four decades it has become clearer how much more pervasive reference to context in interaction is. Illocutionary acts create expectations: one act (querying, assertion, greeting) gives rise to anticipation of an appropriate response (answer, acceptance, counter-greeting), a pairing also known as adjacency pairs (Schegloff, 2007). Extended interaction gives rise to shared assumptions or presuppositions (Stalnaker, 1978), whereas epistemic differences that remain to be resolved across participants (questions under discussion) are a key notion in explaining coherence and various anaphoric processes (Ginzburg, 1994, 2012; Roberts, 1996). These considerations, among several additional significant ones we discuss below, lead work in KoS to two strategic moves: (i) instead of assuming a single context to be operative, a distributed notion is emergent from individual Total Cognitive States (TCS), one per participant. A TCS has two partitions, namely a private one, about which we will not elaborate here (for details see Larsson, 2002), and a public one.
(39) $\mathit{TCS} = \left[\begin{array}{lcl} \text{public} & : & \mathit{DGBType} \\ \text{private} & : & \mathit{Private} \end{array}\right]$

(ii) we posit a significantly richer structure to represent each participant's view of publicized context, dubbed the dialogue gameboard (DGB), whose basic make up to process question-specific moves is given in (40):

Here facts represents the shared assumptions of the interlocutors, identified with a set of propositions. The parameters spkr and addr together with the addressing condition (at a given time) track verbal turns and mutual engagement. The remaining fields concern locutionary and illocutionary interaction. Within moves the first element has a special status given its use to capture adjacency pair coherence and it is referred to as LatestMove. The current question under discussion is tracked in the qud field, whose data type is a partially ordered set (poset). Vis-sit represents the visual situation of an agent, including his or her focus of attention (foa), which can be an object (Ind), or a situation or event (Sit), relevant inter alia for processing gestural answers.

We call a mapping between DGB types a conversational rule. Conversational rules are the means for specifying how DGBs evolve. The types specifying its domain and its range we dub, respectively, the pre(conditions) and the effects, both of which are subtypes of DGBType: they apply to a subclass of records that constitute possible DGBs and modify them to records that constitute possible DGBs. Conversational rules are written here in a form where the preconditions represent information specific to the preconditions of this particular interaction type and the effects represent those aspects of the preconditions that have changed.
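The idea of a conversational rule as a mapping from gameboards satisfying some preconditions to updated gameboards can be sketched with a small data structure. The code below is an illustration only, not the KoS formalization: the field names follow the prose description of the DGB, vis-sit is omitted, QUD is simplified to a tuple whose first element is maximal (the text assumes a poset), moves are encoded as (type, content) pairs, and the names `Rule` and `query_rule` are assumptions made for this example. The example rule makes a queried question the maximal element of QUD.

```python
from dataclasses import dataclass, replace
from typing import Callable

@dataclass(frozen=True)
class DGB:
    spkr: str
    addr: str
    facts: frozenset = frozenset()  # shared assumptions (propositions)
    moves: tuple = ()               # moves[0] plays the role of LatestMove
    qud: tuple = ()                 # simplification: qud[0] is the maximal element

@dataclass(frozen=True)
class Rule:
    pre: Callable[[DGB], bool]      # preconditions on the current gameboard
    effects: Callable[[DGB], DGB]   # only the changed aspects are rewritten

    def apply(self, dgb: DGB) -> DGB:
        return self.effects(dgb) if self.pre(dgb) else dgb

def latest_move_is_query(dgb: DGB) -> bool:
    return bool(dgb.moves) and dgb.moves[0][0] == "Ask"

def make_question_maximal(dgb: DGB) -> DGB:
    question = dgb.moves[0][1]
    return replace(dgb, qud=(question,) + dgb.qud)

query_rule = Rule(pre=latest_move_is_query, effects=make_question_maximal)

before = DGB(spkr="A", addr="B", moves=(("Ask", "who ran"),))
after = query_rule.apply(before)
print(after.qud)
```

Keeping the gameboard immutable and returning a modified copy mirrors the description of rules as mappings between records, rather than as in-place updates.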
The first conversational rule we formulate relates to the basic effect a query has on the DGB: as a consequence of a query, a question becomes the maximal element of QUD:

With this initial view of context and context change in hand, we can return to discuss indirect answerhood. The notion of direct answer is clearly complex and, as we have indicated, probably needs, at least for dialogue management purposes, to be refined. With indirect answers the situation seems even more tricky, which in part reflects why this category is one of those with most inter-annotator variability. Indirectness encapsulates various notions, as we have already discussed in Section 2. There is a considerable literature on indirect speech acts, building on and reacting to initial notions from Grice (1975) and Searle (1975). Roughly speaking, these involve cases where the speaker's intention is not transparently reflected in an utterance's grammatically governed content, that is, the content whose resolution is driven by conventional mechanisms. 26 The classic Gricean model involves initial recognition of a literal content (corresponding to what we have referred to above as 'grammatically governed content') 27 and then, via domain-specific means, inference of the speaker's intention. Significant doubts about this time course, about the necessity of actually consulting a/the literal content, and about what should be viewed as the literal/direct content have been debated extensively in the pragmatics literature, much of it in recent years on an experimental basis; for detailed review see (Noveck, 2018). Indirect speech acts are of course also an important theme in the AI planning literature, e.g., (Cohen and Perrault, 1979), incorporated in dialogue semantic frameworks in (Larsson, 2002; Asher and Lascarides, 2003; Ginzburg, 2012).
While a detailed analysis is beyond our scope here, one can distinguish at least two cases, which we might label as shallow and deep indirect answers. The former corresponds to cases like (11) and (13), repeated here as (42a,b) respectively, where the entailment of a direct answer is due to shallow shared knowledge (for (42a): find(a,b,t_1) → look_for(a,b,t_0), so by contraposition ¬∃t look_for(a,b,t) → ¬find(a,b,t_1)) or to domain-independent erotetic reasoning (Wiśniewski, 2013), which adjusts the question asked to a close variant (Larsson, 1998) (e.g., ?∃x.P(x) → λx.P(x), for (42b)). Some initial refinement of IND along these lines is hinted in footnote 8 above. This contrasts with the deep indirect answers, exemplified in (42c), which involve reasoning about the speaker's intentions, most often though not invariably based on domain-specific information. For detailed discussion of deep indirect answers within SDRT, see (Asher and Lascarides, 2001, 2003); for an account within KoS, see (Ginzburg, 2012, §8.3).
(42) a. Q: And also did you find my blue and green striped tie? R: I haven't looked for it.
c. (Context: in queue for toilet on an aircraft) ANON WOMAN: How desperate are you? ME: (shrugs), Go ahead. (Ginzburg, 2012, p. 304)

26 By this we mean content whose contextual parameters are conventionally specified, e.g., 'Jill left' conventionally specifies predication of some concept of leaving applying to a person the speaker refers to as 'Jill'; resolving which concept of leaving and which Jill is less clearly rule-driven, though is a complicated mix of speaker/audience interaction, contextual salience etc.

27 We use the latter somewhat pedantic term to differentiate it from the former, which has a variety of problematic associations. As will become clear in section 8.2, we do not assume that in general speaker and addressee need identically resolve even the grammatically governed content.

Two basic conditions seem to characterize these cases: first, the indirect answer p is NOT a direct answer to the question q in the sense of the definition in (36b); second, p together with some shared knowledge, i.e., an element of FACTS for some dialogue gameboard dgb, the bridging proposition bridgeprop, entails r, which is a direct answer to q: 28

(43) Given p : Prop, q : Question, dgb : DGBType
InDirectAns(p, q, dgb) iff ¬DirectAns(p, q) and there exist bridgeprop, r : Prop such that DirectAns(r, q) and In(dgb.FACTS, bridgeprop) and →(p ∧ bridgeprop, r).
To exemplify: for (44a) asked by A who B knows needs to get up after sunrise, we could assume that the indirect answer p conjoined with (presumably shared) bridgeprop entails r: 29

(44) a. A: Is it time to rise? B: It is still dark outside.
b. p = Dark(here, now)
c. bridgeprop = If it is dark here now, the time now is before A needs to rise.
d. r = ¬NeedRise(A, now)
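The two conditions in (43) can be checked mechanically on the example in (44) if propositions are modelled as sets of possible worlds and entailment as set inclusion. The sketch below makes several simplifying assumptions that are not part of the account above: a world fixes only two facts (whether it is dark, and whether A needs to rise now), direct answerhood is approximated by the two simple answers to the polar question, and bridgeprop is rendered as the material conditional "if dark, then A does not need to rise". All names are introduced for this example only.

```python
from itertools import product

# A world fixes two facts: (is it dark?, does A need to rise now?)
WORLDS = frozenset(product([True, False], repeat=2))

def prop(test):
    """A proposition is the set of worlds in which `test` holds."""
    return frozenset(w for w in WORLDS if test(*w))

def entails(premise, conclusion):
    return premise <= conclusion

dark = prop(lambda d, n: d)                        # p
need_rise = prop(lambda d, n: n)
not_need_rise = prop(lambda d, n: not n)           # r
bridgeprop = prop(lambda d, n: (not d) or (not n)) # dark -> not need_rise

# Approximation: the direct answers to "Is it time to rise?"
direct_answers = {need_rise, not_need_rise}

def indirect_ans(p, direct, facts):
    """p is not a direct answer, but with some shared fact entails one."""
    if p in direct:
        return False
    return any(entails(p & b, r) for b in facts for r in direct)

print(indirect_ans(dark, direct_answers, {bridgeprop}))  # bridge is shared
print(indirect_ans(dark, direct_answers, set()))         # nothing is shared
```

The second call shows the agent-relative character of the notion: without the bridging proposition in FACTS, "It is still dark outside" no longer counts as an indirect answer in this toy model.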
We can now formulate a rule that explicates how answers and depended-upon questions get introduced in dialogue. This rule characterizes the contextual background of reactive queries and assertions: if q is MaxQUD, then subsequent to this either conversational participant may make a move which is either a (direct or indirect) answer or a question on which q depends.
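The licensing condition just described can be sketched as a simple check on a candidate move, given the maximal question in QUD. This is a schematic illustration rather than the rule itself: moves are encoded as (type, content) pairs, the answerhood and dependence relations are passed in as parameters, and the toy relations `ANSWERS` and `DEPENDS` are invented for the example.

```python
def licensed_response(move, max_qud, is_answer, depends_on):
    """True if `move` is licensed as a reaction to the maximal question in QUD."""
    kind, content = move
    if kind == "Assert":
        return is_answer(content, max_qud)    # direct or indirect answer
    if kind == "Ask":
        return depends_on(max_qud, content)   # max_qud depends on the new question
    return False

# Toy relations, invented for illustration only.
ANSWERS = {("Bo ran", "who ran")}
DEPENDS = {("who ran", "who was at the track")}

def is_answer(p, q):
    return (p, q) in ANSWERS

def depends_on(q1, q2):
    return (q1, q2) in DEPENDS

print(licensed_response(("Assert", "Bo ran"), "who ran", is_answer, depends_on))
print(licensed_response(("Ask", "who was at the track"), "who ran", is_answer, depends_on))
```

Passing the relations in as parameters keeps the check independent of any particular definition of answerhood or dependence, both of which the text treats as open to refinement.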

The classes CR and ACK
MetaCommunicative utterances, including acknowledgements, clarification responses (CRs) (also known as other repair and as other communication management) and (metacommunicative) corrections, are challenging for most existing frameworks for dialogue semantics. For a start, given the mismatch they reveal between the dialogue interlocutors, they require a distributed approach to context. This rules out accounts where all semantic rules are assumed to apply to the common ground, made prominent in the view of QUD due to Roberts (1996). 30 This was also the case for the view of discourse structure in earlier work in SDRT (e.g., Asher and Lascarides 1998, 2003).
In more recent work (e.g., Lascarides and Asher 2009), SDRT adopts a view advocated in KoS and also in the framework of PTT (Poesio and Rieser, 2010) that associates a distinct contextual entity with each conversational participant. A deeper challenge is that the analysis/generation of metacommunicative utterances requires access to the entire sign associated with a given interrogative utterance. This is for two main reasons. On the one hand, any constituent, certainly down to the word level, can be the object of an acknowledgement and a clarification response, as exemplified for clarification responses in (46). Moreover, as discussed in detail in (Ginzburg, 2012), there are a variety of parallelism constraints relating to the form of such utterances that require reference to the non-semantic representation of the utterance. An illustration of this is given in (47).

This issue, first discussed in some detail in (Ginzburg and Cooper, 2004), rules out the lion's share of logic-based frameworks where reasoning about coherence operates solely at the level of content. For instance, in SDRT the semantics/pragmatics interface has no access to linguistic form, but only to a partial description of the content that is derived from linguistic form. This has been argued to be necessary to ensure the decidability of SDRT's glue logic (see e.g., Asher and Lascarides 2003, p. 77).
In order to accommodate this class of utterances, it is crucial that the cognitive states keep track of the utterance associated with the question. In KoS this is handled via the field PENDING whose type (LocProp) is a record with two fields, one instantiated by an utterance token u, the other by an utterance type T_u (the sign classifying u); this allows, inter alia, access to the individual constituents of an utterance.
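The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class and field names (`Sign`, `LocProp`, `DGB`, `constituents`) are hypothetical, the sign is reduced to a phonological string, a category label and a tuple of sub-signs, and the example utterance is invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Sign:
    """Hypothetical stand-in for an utterance type T_u."""
    phon: str
    cat: str
    constituents: Tuple["Sign", ...] = ()  # sub-signs, down to the word level


@dataclass(frozen=True)
class LocProp:
    """An utterance token paired with the sign that classifies it."""
    sit: str        # identifier of the utterance token u
    sit_type: Sign  # the utterance type T_u

    def sub_utterances(self) -> List[Sign]:
        """Return the sign and all its constituents in left-to-right order."""
        out, stack = [], [self.sit_type]
        while stack:
            sign = stack.pop()
            out.append(sign)
            stack.extend(reversed(sign.constituents))
        return out


@dataclass
class DGB:
    """One participant's gameboard, reduced to the fields discussed here."""
    owner: str
    pending: List[LocProp] = field(default_factory=list)  # ungrounded utterances
    qud: List[str] = field(default_factory=list)


# Invented example: B has not yet grounded A's question "did Bo leave".
words = tuple(Sign(w, "word") for w in ["did", "Bo", "leave"])
u = LocProp("u1", Sign("did Bo leave", "S[+q]", words))
dgb_b = DGB(owner="B", pending=[u])
phons = [s.phon for s in dgb_b.pending[0].sub_utterances()]
```

Because the pending entry keeps the whole sign rather than just its content, any sub-utterance (here the word "Bo") remains available as the target of a clarification response.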
This leads to the following modified architecture for DGBs: they are distributed across dialogue participants (in other words, each participant is assigned their own DGB) and they include the field Pending, consisting of ungrounded utterances.

Ginzburg and Cooper (2004), Purver (2004), and Ginzburg (2012) show how to account for the main classes of CRs using rule schemas of the form "if u is the interrogative utterance and u0 is a constituent of u, allow responses that are co-propositional 31 with the clarification question CQ_i(u0) into QUD", where 'CQ_i(u0)' is one of the three types of clarification question (repetition, confirmation, intended content) specified with respect to u0.
For instance, responses such as (46b) can be explicated in terms of the schema in (49):

(49) If A's utterance u is yet to be grounded and u0 is a sub-utterance of u, QUD can be updated with the question What did A mean by u0?

More formally: the issue q0, what did A mean by u0, for a constituent u0 of the maximally pending utterance with speaker A, can become the maximal element of QUD, licensing follow-up utterances that are CoPropositional with q0. Assuming a propositional function view of questions, CoPropositionality admits propositions from Range(q0) and questions whose range intersects Range(q0). Since CoPropositionality is reflexive, the inferred clarification question is itself a possible follow-up utterance, as are confirmations and corrections, as exemplified in (51).
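A toy rendering of this licensing condition may help. The sketch below assumes, purely for illustration, that a question can be represented by a finite set of propositions (its range) and that a polar question ?p has the range {p}; the function names and the example propositions are hypothetical.

```python
from typing import FrozenSet, Union

Proposition = str
Question = FrozenSet[Proposition]  # a question is represented by its range


def questions_copropositional(q_a: Question, q_b: Question) -> bool:
    """Two questions are co-propositional when their ranges intersect."""
    return bool(q_a & q_b)


def licensed_followup(move: Union[Proposition, Question], q0: Question) -> bool:
    """A follow-up is licensed if it is a question whose range intersects
    Range(q0), or a proposition drawn from Range(q0)."""
    if isinstance(move, frozenset):
        return questions_copropositional(move, q0)
    return move in q0


# q0: "What did A mean by 'Bo'?" (invented range)
q0 = frozenset({"A meant Bo Smith by 'Bo'", "A meant Bo Jones by 'Bo'"})
confirmation = frozenset({"A meant Bo Smith by 'Bo'"})  # "Did you mean Bo Smith?"
correction = "A meant Bo Jones by 'Bo'"                 # "No, Bo Jones."
unrelated = frozenset({"It is raining"})                # "Is it raining?"
```

On this encoding the inferred clarification question `q0` licenses itself (reflexivity holds for any non-empty range), the confirmation and the correction are licensed, and the unrelated question is not.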
(50) Parameter identification:

Łupkowski and Ginzburg (2016) suggest that common to all classes of evasion utterances is a lack of acceptance of q1 as an issue to be discussed. In MOTIV-type responses the need or desirability of discussing q1 is explicitly posed; in CHT-type responses there is an implicature that q1 is of lesser importance or urgency than r2 (expressing either a proposition or a question); and for IGNORE-type responses there is an implicature that q1 as such will not be addressed. Łupkowski and Ginzburg (2016) also note that whereas q1 is not accepted for discussion, it remains implicitly in the context. In (52), where move (2) could involve either a MOTIV query (2a) or a CHT query (2b), the original question has definitely not been re-posed and yet B still has the option to address it, which s/he should be unable to do if it is not added to his/her context before (52(2)). Similar remarks apply, mutatis mutandis, to the DPR utterance in (52b).

This basic characteristic can be captured in the cognitive state architecture discussed above, given that QUD is assumed to be partially ordered; this is a crucial difference from a view of QUD as a stack or similar (Roberts, 1996; Farkas and Bruce, 2010).
Concretely, Łupkowski and Ginzburg (2016) proposed to handle metadiscursive utterances such as MOTIV by viewing them as responses specific to the issue ?WishDiscuss(B,q) for a given question q and responder conversational participant B. This same approach can be applied to DPR, which Łupkowski and Ginzburg (2016) did not analyze, assuming that these involve responses specific to the issue λxKnowAnswer(x, q). We assume this formulation of the issue given the possibility of responses along the lines of 'Sam knows', 'You don't know?', etc. 32 In fact, we will deviate somewhat from the account of Łupkowski and Ginzburg (2016) in proposing a more uniform account of all four classes than they did, for reasons we explain below. In order to do this, we will define a single type EvasiveResp that encompasses the commonalities between the four classes; each class will then be specified by merging EvasiveResp with information specific to that particular class. In all cases, in line with the fact that q remains accessible, as exemplified in (52), QUD is specified to include both q and a pertinent 'metaquestion'. An additional commonality for all except DPR is turn change, which is underspecified for QSPEC, where it is not required, whereas in these cases it is more or less essential for coherence; this specification will be defused for DPR by using asymmetric merge.
pre : QUD = q1, Q : poset(Question)
effects :

Given this, MOTIV and DPR are specified as follows: 33

32 Utterances like 'I don't know' and other DPR are differentiated from some other metadiscursive utterances in that the former can be used by the same speaker as a follow-up, whereas the latter can be used only if the speaker is correcting herself for having asked the question:
(i) A: Who should we invite?
(iii) . . . # Do we need to talk about this now?
(iv) . . . # I don't wish to discuss this now.
Note also that 'I don't know' can be used as an editing phrase (Tian et al., 2015): 'She's I don't know 29.'

33 The basic idea of merge for record types is illustrated by the examples in (i,ii).
(i) [f : T1] ∧. [g : T2] = [f : T1, g : T2]
(ii) [f : T1] ∧. [f : T2] = [f : T1 ∧. T2]
In asymmetric merge, T1 ∧. T2, the second argument takes priority over the first (Cooper, 2012, 2023).

With respect to both CHT and IGNORE, we adopt a somewhat different perspective from that offered by Łupkowski and Ginzburg (2016), for both empirical and conceptual reasons. Given the much larger dataset considered in this paper, their view of CHT seems too "cooperative" and that of IGNORE too "hostile". The analysis they offered for IGNORE built on an earlier analysis in Ginzburg (2012) intended to capture Gricean irrelevance, that is, floutings of the Gricean maxim of relevance, as in (55). That analysis was designed to explain how the initial utterance in effect gets expunged from the DGB.

(55) A: Rozzo just gave a terrible talk.
B: It's really hot and unpleasant here.
However, IGNOREs often occur in quite cooperative environments such as the MapTask, where under time pressure the response is driven by the observed situation. Indeed, Table 9 indicates that IGNOREs were most frequently confused with answers (direct and indirect) and with CHTs; the former datum suggests that IGNOREs can be viewed as addressing something related to the question asked. As far as CHT goes, on the other hand, the analysis of Łupkowski and Ginzburg (2016) was, arguably, too "cooperative". Łupkowski and Ginzburg (2016) assume that r2 is constrained to be unifiable with q1 via a question q3 (e.g., q1 = what do you (B) like?, r2 = what do you (A) like?, q3 = who likes what?). This assumption was motivated by a certain parallelism that seems to occur frequently between q1 and r2 when the latter has the form of a question. Imposing this condition, which requires a question inference mechanism for testing unifiability, significantly constrains the CHT relation. However, in the more general case, where responses are not constrained to be questions, this condition seems less justified and, even restricting attention to question responses, the constructed example (56) seems quite natural: 34

(56) A: When are you going to respond to the allegations?
B: Anyway, when are we going to get credit for our world leading vaccination program?
The simplest analysis for IGNORE would make the pertinent meta-question an arbitrary question about entities in the visual situation. Similarly, for CHT the simplest analysis would allow a response specific to an arbitrary question. The obvious problem this raises in both cases is massive ambiguity, since many responses from other classes would be analyzable in such terms. To avoid this problem, we need to introduce an additional restriction, for instance along the lines of the aforementioned irrelevance; in other words, lack of coherence with the current context. What would this amount to? The response must be neither QSpecific with respect to q1 (uttered by A to B), nor co-propositional with a clarification question generated by q1's utterance, nor QSpecific with respect to ?WishDiscuss(B,q1) or λxKnowAnswer(x,q1). Putting these conditions together amounts to the IrRel relation of Ginzburg (2012), which holds between an utterance and a DGB.
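The conjunction of negative conditions just listed can be made concrete with a small Python sketch. It is not the formal definition of IrRel: the `Context` class is a hypothetical, flattened stand-in for the precondition DGB, the QSpecific and co-propositionality relations are passed in as black-box predicates, and the lookup table used to exercise the function is invented.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Relation = Callable[[str, str], bool]  # (response, question) -> does it hold?


@dataclass(frozen=True)
class Context:
    """Hypothetical, flattened view of the precondition DGB."""
    q1: str                            # the question posed by A to B
    clarification_qs: Tuple[str, ...]  # CQs generated by q1's utterance
    meta_qs: Tuple[str, ...]           # ?WishDiscuss(B,q1), λx.KnowAnswer(x,q1)


def irrel(response: str, ctx: Context,
          q_specific: Relation, co_prop: Relation) -> bool:
    """True only if none of the coherence relations links response to ctx."""
    if q_specific(response, ctx.q1):
        return False
    if any(co_prop(response, cq) for cq in ctx.clarification_qs):
        return False
    if any(q_specific(response, mq) for mq in ctx.meta_qs):
        return False
    return True


# Invented table: which questions each response counts as addressing.
TOY = {
    "At noon.": {"q1"},
    "Which allegations?": {"cq(allegations)"},
    "I'd rather not discuss that.": {"?WishDiscuss(B,q1)"},
    "Sam knows.": {"λx.KnowAnswer(x,q1)"},
}


def toy_relation(response: str, question: str) -> bool:
    return question in TOY.get(response, set())


ctx = Context("q1", ("cq(allegations)",),
              ("?WishDiscuss(B,q1)", "λx.KnowAnswer(x,q1)"))
cht = "Anyway, when do we get credit for the vaccination program?"
```

With this toy table, only the topic-changing response comes out as irrelevant to the context; the answer, the clarification request, the MOTIV-style response and the DPR-style response are each caught by one of the three checks.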
Given this, we formulate the rules for CHT and IGNORE as in (57a) and (57b). The fact that in both cases the topic addressed is irrelevant (IrRel) to the (precondition) DGB in the sense just discussed captures a similarity between the two. At the same time, there is a significant difference: IGNORE intrinsically uses material from the DGB, namely at least one entity from the visual situation as a constituent of the propositional nucleus of the question, to establish coherence with the question posed. A further difference between the two, and a deviation from Łupkowski and Ginzburg (2016), is an emergent presupposition in the case of CHT that the responder does not wish to discuss q1.

Conclusions and Future Work
In this paper, we have presented an initial study for what is, as far as we are aware, the first detailed, formally underpinned characterization of the response space of questions. Concretely, our initial hypothesis, stated in the introduction as (1), is repeated here as (58):

(58) (H) Main hypothesis: responses drawn from or concerning the LG query classes plus direct answerhood exhaust the response space of a query.
We think the data provided in previous sections validate this hypothesis, though we have made some small adjustments, conflating several classes. Achieving such a characterization is a fundamental challenge for semantics with a very wide variety of applications. It establishes theoretical benchmarks for theories of dialogue, for dialogue systems, and for semantic theories of questions.
Apart from the need to scale up the evidence quantitatively, we are currently engaged in work on the following strands:
• Extending the characterisation of response spaces to other moves: we have partitioned the response space into question-specific and non-question-specific (Metacomm, CHT, IGNORE, MOTIV, DPR). This suggests that other moves such as assertions and commands can be characterized in similar terms, where the non-question-specific class is applicable to all.
• The account we have developed is domain general, abstracting over differences between different conversational types/genres/language games etc. To what extent the current account will change once one takes such differences into account is an important question.
• Cross-question type comparison: the Q-R pairs annotated in the current study were selected randomly, whereas it is clearly of interest to consider the distribution of responses relative to fixed classes of questions (e.g., different classes of wh-questions, polar questions, etc.).
• Applying machine learning to acquire the response classification scheme: Yusupujiang et al. (2022) provide an initial study comparing classical machine learning algorithms with pretrained language models such as BERT (Devlin et al., 2018). This achieves encouraging results on some classes (e.g., DA and CR), while struggling with heavily inference-based classes such as indirect answers and IGNORE/CHT. This learnability trend is closely in line with the performance of the human annotators in the current paper.
• Spoken dialogue system implementation: we plan to test the usability of these categories in dialogue systems. For this, one needs dialogue systems with sophisticated NLU, along the lines sketched in Maraev et al. (2018, 2020).
• Cross-linguistic testing: a significant challenge is how to test the classification with languages that have few or hardly any speech corpora. We anticipate using online games with a purpose to this end (see e.g. Łupkowski and Ignaszak 2017; Łupkowski et al. 2018; Yusupujiang and Ginzburg 2020). For an initial study concerning the response space of queries in Uyghur, see Yusupujiang and Ginzburg (2022).
Finally, it is worth mentioning that at least part of our response typology can be straightforwardly related to one of the well-known annotation standards for dialogue, namely ISO 24617-2 (Bunt, 2019). 35 The standard focuses on functional segments of dialogue acts. These segments are understood as "minimal stretches of communicative behavior that have a communicative function, 'minimal' in the sense of not including material that does not contribute to the expression of the function or the semantic content of the dialogue act" (Bunt, 2019, p. 4). When it comes to the general-purpose functions, dialogue acts may be information-providing (making certain information available to the addressee) or information-seeking (where information to be obtained can be of any kind, relating to the underlying task or activity, or even relating to the interaction itself). Among the information-providing functions, two sub-categories are distinguished: answer functions (where the speaker is providing information in response to an information need) and informing functions (where the speaker wants the addressee to know or be aware of something). One may notice that parts of our typology relate to the scope of the information-providing functions. DA and IND fall under answer functions, and ACK, IDK, DPR as well as CR may be categorized as informing functions. It would be interesting to find a place for evasive responses in the DIT++ scheme (probably among the dimension-specific functions). How to incorporate question-responses into this scheme remains an open question.

35 Stemming from the DIT++ annotation scheme (Bunt, 2009).

The basic relationship between records and record types is that a record $r$ is of type $RT$ if each value in $r$ assigned to a given label $l_i$ satisfies the typing constraints imposed by $RT$ on $l_i$. More precisely:

\[
\text{The record }
\begin{bmatrix} l_1 = a_1 \\ l_2 = a_2 \\ \ldots \\ l_n = a_n \end{bmatrix}
\text{ is of type }
\begin{bmatrix} l_1 : T_1 \\ l_2 : T_2 \\ \ldots \\ l_n : T_n \end{bmatrix}
\text{ iff } a_1 : T_1,\ a_2 : T_2,\ \ldots,\ a_n : T_n.
\]
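The typing condition just stated can be illustrated with a short Python sketch. It makes several simplifying assumptions that go beyond the text: records are plain dictionaries, each field type is approximated by a predicate on values, dependent fields are ignored, and a record is allowed to carry labels that the type does not mention. The labels and values in the example are hypothetical.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]
# A record type maps each label to a test that decides whether a value
# is a witness of the type associated with that label.
RecordType = Dict[str, Callable[[Any], bool]]


def of_type(r: Record, rt: RecordType) -> bool:
    """r : RT iff every label l_i of RT occurs in r and its value a_i
    passes the test for T_i."""
    return all(label in r and check(r[label]) for label, check in rt.items())


def is_number(v: Any) -> bool:
    return isinstance(v, (int, float))


def is_text(v: Any) -> bool:
    return isinstance(v, str)


# Hypothetical type: the temperature of a location at a time.
temp_type: RecordType = {"x": is_number, "e_time": is_text, "e_loc": is_text}

good = {"x": -28, "e_time": "2 a.m.", "e_loc": "Nome", "note": "extra field"}
bad_value = {"x": "cold", "e_time": "2 a.m.", "e_loc": "Nome"}
missing_label = {"x": -28, "e_time": "2 a.m."}
```

Here `good` is of type `temp_type` even though it has an additional field, while `bad_value` fails on the label `x` and `missing_label` lacks `e_loc`.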
To exemplify this, (64a) (the temperature of a given location at a given time) is a possible type for (61b), assuming the conditions in (64b) hold. Record types are used to model utterance types (Saussurean/Formal Grammar signs) and to express rules of conversational interaction. Sometimes one needs to partially specify a general type by tying down one or more of the fields to a specific value. For this we use a manifest field as in (65).

TTR assumes in addition the following type construction operations:

(67) a. Function types: $(T_1)T_2$ is the type of functions from elements of type $T_1$ to elements of type $T_2$.
b. Set and list types: Set($T$) and List($T$).
c. Boolean types: (i) Given a type $T$, there exists $\neg T$.
(ii) Given a set $X$ of types $T_i$, there exist $\bigvee_X T_i$ and $\bigwedge_X T_i$.

$\bigvee_X T_i$ and $\bigwedge_X T_i$ have "classical" witnessing conditions:

(68) a. $r : \bigvee_X T_i$ iff for at least one $i \in X$, $r : T_i$
b. $r : \bigwedge_X T_i$ iff for all $i \in X$, $r : T_i$

In contrast, negation is a notion based on incompatibility that is a classical-intuitionist hybrid:

(69) a. $a : \neg T$ iff there is some $T'$ such that $a : T'$ and $T'$ precludes $T$
b. $T'$ precludes $T$ iff:
• $T = \neg T'$, or
• $T$ and $T'$ are non-negative and there is no $a$ such that $a : T$ and $a : T'$

One can show that $T$ and $\neg\neg T$ are equivalent, but the former is a positive type, the latter a negative one. On the other hand, $a$ need not be of type $T$, and there need not be a type $T'$ with $a : T'$ that precludes $T$; in other words, $a : T \vee \neg T$ is not a tautology. The basic reasoning for this goes back to Barwise and Perry (1983):

(70) a. If I observe Jo cutting onions, the situation I observe neither tells me that B. Johnson is smoking a cigar, nor that he is not smoking a cigar.
b. Hence, $s_{visual}$ : Cutting(j, o) and $s_{visual} \not:$ CigarSmoke(b.johnson); hence, it is not the case that $s_{visual}$ : CigarSmoke(b.johnson), but neither is it the case that $s_{visual}$ : $\neg$CigarSmoke(b.johnson).

The final notion we mention is that of propositions. 37 Propositions are construed as typing relations between records (situations) and record types (situation types), or Austinian propositions (Austin, 1961; Barwise and Etchemendy, 1987); more formally:

In all other cases, put the OTHER tag. In other cases, please use the OTHER tag and, in the adjacent column, describe the function that this response to the question serves in the particular case.