Visual attention-capture cue in depicted scenes fails to modulate online sentence processing

Everyday communication is enriched by the visual environment that listeners concomitantly link to the linguistic input. If and when visual cues are integrated into the mental meaning representation of the communicative setting is still unclear. In our earlier findings, the integration of a linguistic cue (i.e., topic-hood of a discourse referent) reduced the costs of updating the mental discourse representation, as indicated by reduced sentence-initial processing costs for the noncanonical word order in German. In the present study we aimed to replicate these findings by replacing the linguistic cue with a visual attention-capture cue that directs participants' attention to a depicted referent but is presented below the threshold of conscious perception. While this type of cue has previously been shown to modulate word order preferences in sentence production, we found no effects on sentence comprehension. We discuss possible theory-based reasons for the null effect of the implicit visual cue as well as methodological caveats that should be considered in future research on multimodal meaning integration.


Introduction
Everyday communication is multimodal, comprising linguistic as well as extra-linguistic (e.g., visual) information. A growing branch of psycholinguistic research highlights effects of extralinguistic cues (e.g., eye gaze or gestures) and the visual environment on language processing (e.g., Crocker, Knoeferle, & Mayberry, 2010; Nappa & Arnold, 2014; Spevack, Falandays, Batzloff, & Spivey, 2018; Staudte, Crocker, Heloir, & Kipp, 2014). By contrast, traditional models of sentence comprehension do not explicitly account for the role of visual attention during the comprehension process (e.g., Friederici, 2002; Marslen-Wilson & Tyler, 1980). However, discourse models (or situational/mental models) go beyond sentence-level processing. Discourse models propose that during communication, interlocutors build a non-linguistic mental representation of relevant discourse referents and events based on, amongst multiple other factors, the incoming linguistic and visual perceptual input (e.g., Bower & Morrow, 1990; Gernsbacher, 1991; Grosz & Sidner, 1986; Johnson-Laird, 1980; Van Dijk & Kintsch, 1983; Zwaan, 2004; Zwaan & Radvansky, 1998). Therein, referents of high attentional state are assumed to be mentally represented with a higher degree of mental accessibility and/or a higher activation level than referents of low attentional state (e.g., Arnold, 2010; Arnold & Lao, 2015; Givón, 1988; Gundel, Hedberg, & Zacharski, 1993). We will refer to such referents of high attentional state as being more salient. For the present study, we differentiate between linguistic salience, which is verbally induced by, for instance, subject-hood or topic-hood of a referent, and visual salience, which is induced by, for instance, exogenous visual cues to a depicted referent. Exogenous visual cues initiate a reflexive attention shift of the addressee to the location of a stimulus. To distinguish more precisely between the exogenous visual cues used in previous studies, we use the terms implicit vs. explicit visual cues (analogous to the distinction by Myachykov, Thompson, Garrod, & Scheepers, 2012, p. 3). Implicit visual cues are presented below the threshold of perception (i.e., subconsciously); explicit visual cues are presented above the threshold of perception (i.e., consciously) (for an overview of neuronal modulations by stimulus-driven (i.e., sensory cue-based) visual attention mechanisms, see Corbetta & Shulman, 2002). With the present study, we aim to test whether visual salience induced by an implicit visual attention-capture cue to a depicted referent impacts online sentence-initial processing in a similar way as has previously been shown for linguistic salience (Burmester, Spalek, & Wartenburger, 2014). Hence, we raise the underlying question whether the degree of accessibility of mentally represented discourse referents is affected by this type of implicit visual cue or whether such effects are limited to linguistic cues.
In the linguistic domain, information structure is used to make certain entities of the discourse more salient. For instance, topic or aboutness topic is an information structural concept describing the entity (e.g., a referent) the sentence is about; that is, topic is attributed to that part of information about which the speaker intends to increase the listener's knowledge (Gundel, 1985; Reinhart, 1981). Hence, topic is regarded not solely as a formal linguistic concept but also as a cognitive one that activates the listener's mental representation at the beginning of a sentence (Portner, 2007). In the majority of languages, salient information (in terms of the grammatical subject and/or topic of the sentence) predominantly occupies the sentence-initial position, because subjects and topics have a higher degree of accessibility than their complements, that is, objects and comments (e.g., Bock & Warren, 1985; Dryer, 2013; Tomlin, 1995). German is a language with a strong subject-first preference (e.g., Hemforth, 1993; Weber & Müller, 2004): The canonical word order in German main clauses is subject-verb-object (SO) (see example sentence (1)). Morphological case marking at the respective noun phrases enables the identification of the grammatical functions of subject (via nominative case (NOM)) and object (via accusative case (ACC)) for masculine nouns. (Note that in the example sentences (1) and (2), the nouns "Wal" [whale] and "Hai" [shark] lack overt case affixes, while the determiners are overtly case marked, which nevertheless allows the unequivocal identification of subject and object.)

(1) SO: Der Wal streichelt den Hai.
    [the[NOM] whale strokes the[ACC] shark]
    "The whale is stroking the shark."

Despite the strong subject-first preference in German, information structural characteristics allow reordering of sentential constituents such that the object can precede the subject (see example sentence (2) for a non-canonical object-verb-subject (OS) main clause).

(2) OS: Den Hai streichelt der Wal.
    [the[ACC] shark strokes the[NOM] whale]
    "The whale is stroking the shark."

However, OS sentences in German are much less frequent than SO sentences (e.g., Bader & Häussler, 2010) and need a suitable context which increases the salience of the sentence-initial object. In our previous work, for instance, linguistic salience in short, fictitious stories about two animals was induced by a topic question (i.e., "What about the shark?"), which revealed one of two previously mentioned (i.e., discourse-given) referents as the topic of the scene (Burmester et al., 2014). Compared to a neutral cue not indicating topic-hood but a wide focus (i.e., "What exactly is going on?"), subsequent online sentence-initial processing of OS sentences was eased. This facilitating impact of linguistic salience (i.e., topic-hood) is reflected in the event-related potentials (ERPs) in the form of a sentence-initial Late Positivity, which is attributed to reduced discourse updating costs (e.g., Schumacher & Hung, 2012).
In 2018 we directly compared linguistic and visual salience cues (Burmester, Sauermann, Spalek, & Wartenburger, 2018): Visual salience induced via an explicit gaze shift of a virtual person to a depicted referent speeded up sentence-initial reading times of German SO and OS sentences similar to linguistic salience induced via a topic cue. Hence, the sentence-initial processing ease was evident 1) independent of whether salience was induced linguistically or visually compared to a preceding neutral cue, and 2) independent of whether the salient referent was mentioned as the sentence-initial subject or object (Burmester et al., 2018). This is in line with other studies supporting the view that utterance comprehension is facilitated when the speaker's gaze increases the visual salience of depicted referents (e.g., Hanna & Brennan, 2007; Knoeferle & Kreysa, 2012; Staudte & Crocker, 2011). However, not only speakers' eye gaze, which provides explicit information about referential intentions (henceforth: intentional information), influences utterance comprehension, but also various other visual salience cues. Staudte et al. (2014) showed that listeners benefit from an explicit (non-gaze) arrow cue (henceforth: attentional information) during utterance comprehension similar to eye gaze. Both the arrow and the gaze cue effectively direct listeners' visual attention to a depicted object, such that listeners finally anticipate this salient object as an upcoming verbal reference (Staudte et al., 2014). Arnold and Lao (2015) showed that another abstract type of visual attentional cue (i.e., a black rectangle with a size of approximately 1.0° x 1.0° of visual angle¹ presented for 200 ms at the target referent's location) together with the position of the referent in the visual display manipulates listeners' trial-initial attention in depicted scenes.
Still, when listeners interpret a subsequent pronoun, their trial-initial attention only secondarily influences which antecedent they select as the most accessible referent in discourse. Instead, pronoun interpretation is primarily driven by the linguistic cue of sentence-initial mention. Overall, such studies provide evidence that explicit visual attentional cues affect sentence- and discourse-level processes, although to a different extent than linguistic cues.
Evidence in favour of the impact of implicit visual cues comes from language production studies. Here, implicit as well as explicit visual cues effectively manipulate speakers' attention to referents in a depicted scene. This manipulation of the speakers' attention is reflected in the sentential structure they choose in picture descriptions: Gleitman, January, Nappa, and Trueswell (2007) used an implicit visual cue by means of a black rectangle with a size of approximately 0.5° x 0.5° of visual angle presented for about 60 - 75 ms at the location of one of two subsequently depicted referents. Other sentence production studies used explicit visual cues by means of a black arrow (Tomlin, 1995), a red dot, or referent preview (e.g., Myachykov et al., 2012; Turner & Rommetveit, 1968) followed by the presentation of referents performing a simple transitive action. As a result, these implicitly or explicitly cued referents are more salient or accessible² than other, uncued referents, as reflected in a greater likelihood of salient referents being mentioned sentence-initially as the grammatical subject and/or topic of the sentence (e.g., Arnold, 1998, 2010; Tomlin, 1997). Both cue types even lead to the production of otherwise disfavoured linguistic structures (in English).

¹ Note that in order to establish comparability of visual cues published in previous research, we calculated the visual angle of the cues used by Arnold and Lao (2015) and Myachykov et al. (2012) post hoc, as these studies did not report the visual angle. Calculations of the visual angle account for the visual cue's size and distance from participants' eyes. For instance, for Arnold and Lao (2015) the calculation was based on a screen distance of 650 mm (22 - 34 inches reported), a screen width of 390 mm, a screen resolution width of 1280 pixels, and a cue size of 38 pixels: visual angle = 2 · arctan((38/2) / (650 · 1280/390)) ≈ 1.0°.
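The post hoc visual-angle calculation described in footnote 1 can be sketched as follows (the function name is illustrative; the input values are those reported for Arnold and Lao, 2015):

```python
import math

def visual_angle_deg(size_px, screen_px, screen_mm, distance_mm):
    """Visual angle (degrees) subtended by a stimulus of size_px pixels."""
    size_mm = size_px * screen_mm / screen_px  # convert cue size to mm
    return 2 * math.degrees(math.atan((size_mm / 2) / distance_mm))

# Arnold and Lao (2015): 38-px cue on a 1280-px-wide, 390-mm-wide screen,
# viewed from 650 mm.
angle = visual_angle_deg(38, 1280, 390, 650)
print(round(angle, 1))  # prints 1.0
```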
For instance, in cases where the patient of the transitive action is cued, speakers produce the less frequent passive voice with the salient referent (i.e., the patient) in sentence-initial position. While production and/or eye-tracking data indicate shifts in the addressee's attention, ERPs allow us to investigate whether and when during the course of sentence processing increased effort is needed. Numerous ERP studies have provided insights into underlying discourse-level mechanisms elicited by different types of linguistic cues during online sentence processing (e.g., Bornkessel, Schlesewsky, & Friederici, 2003; Burkhardt, 2006; Burmester et al., 2014; Kaan, Dallas, & Barkley, 2007). Based on specific neural correlates, the Syntax-Discourse Model (Schumacher & Hung, 2012), as an instance of a neurocognitive account of discourse processing, specifies two temporally distinct processing mechanisms of meaning computation: discourse linking (N400) and discourse updating (Late Positivity). In Burmester et al. (2014), the facilitative impact of the linguistic salience cue (i.e., topic-hood) elicited a reduced Late Positivity around 500 - 700 ms time-locked to the sentence-initial position of OS sentences, but not of SO sentences. In line with the assumptions of the Syntax-Discourse Model, the reduced Late Positivity in the non-canonical OS word order is attributed to reduced processing costs for updating the current discourse model following the linguistic topic cue compared to the neutral cue. This interpretation of the Late Positivity as an index of integration and updating processes of mental representations is further supported by recent studies (e.g., Delogu, Drenhaus, & Crocker, 2018) and within the neurocomputational model of language comprehension by Brouwer, Crocker, Venhuizen, and Hoeks (2017).
However, the assumptions of the Syntax-Discourse Model as well as of other discourse models (Hagoort & Van Berkum, 2007) go beyond the impact of purely sentential context on meaning computation and include situational context information. Even more explicitly, the Coordinated Interplay Account (Crocker et al., 2010) highlights the role of visual attention for listeners' mental representations. This account assumes closely temporally synchronized stages of visual and linguistic information processing during sentence comprehension, as supported by multiple "visual world" eye-tracking studies (e.g., Knoeferle & Kreysa, 2012) and also ERP studies (e.g., Knoeferle, Habets, Crocker, & Münte, 2007). For instance, visual cues reduced online processing costs of OS sentences: Facilitating cues included explicit, intentional, speech-aligned (beat) gesture cues indicating a specific sentence part as salient (Holle et al., 2012), or explicit visual presentations of the depicted event of the target sentence (Knoeferle et al., 2007). To the best of our knowledge, it has not yet been reported how implicit visual cues that purely direct the addressee's attention to depicted referents impact online sentence processing.
Using ERPs to investigate the impact of implicit visual cues on sentence comprehension might contribute to our understanding of the underlying neurophysiological mechanisms during sentence processing, which might be comparable to those evoked by linguistic cues. Our study aims to answer the question whether, parallel to our earlier findings concerning linguistic salience, a referent in sentence-initial (i.e., topic) position is easier to process if visual salience is induced via an implicit attention-capture cue. Hence, by using an implicit visual cue in the present study we intend to conceptually replicate our earlier ERP findings, that is, the sentence-initial Late Positivity modulation evoked by an (explicit) linguistic cue (Burmester et al., 2014). The implicit visual cue of the current study was presented for 66 ms, analogously to the Gleitman et al. (2007) study in which a similar type of cue significantly manipulated speakers' attention in depicted scenes, and hence modulated what speakers mentioned first during sentence production. In accordance with the earlier findings concerning linguistic salience, we predict that visual salience of a depicted referent induces modulations of the Late Positivity at the sentence-initial position of subsequent OS sentences. Besides the Late Positivity, the linguistic topic cue in Burmester et al. (2014) elicited an early perceptual repetition effect due to word repetition in the topic but not in the neutral condition. This effect was reflected in a reduced early positivity around 200 ms at the sentence-initial position of both SO and OS sentences. In the present visual cueing paradigm, no word repetition occurs. Therefore, we do not expect any modulations of this early positivity. In addition, the Burmester et al. (2014) study revealed a word order effect in terms of generally greater processing costs for OS than SO sentences, which we expect to replicate in the present study.

Participants
Thirty-one native speakers of German participated after giving informed consent. Except for one participant, all were right-handed as assessed by a German version of the Edinburgh Handedness Inventory (Oldfield, 1971). All had normal or corrected-to-normal vision and no reported neurological disorder. Participants were reimbursed or received course credits for participation. Data of two participants were excluded from further analysis: one due to left-handedness and one due to a technical error during the recording of the electroencephalogram (EEG). The analysed group consisted of 29 participants (15 female, mean age 24.8 years, age range 19.4 - 25.2 years).

Design and material
In the present study (analogous to Burmester et al., 2014), participants were presented with short stories of two animals that were going to perform a fictitious transitive action (e.g., a whale and a shark, one of which is going to stroke the other) while an EEG was recorded to investigate ERPs during online sentence processing. In contrast to Burmester et al. (2014), stories were additionally depicted by pictures of the two animals and the action instrument (cf. Figure 1). The study used a 2 x 2 within-subject design with the fully crossed factors CUE (TOPIC vs. NEUTRAL) and WORD ORDER (SO vs. OS sentences), resulting in four conditions: TOPIC SO, NEUTRAL SO, TOPIC OS, NEUTRAL OS. A total of 160 different stories (40 per condition) was created based on coloured pictures of 40 animals (monomorphemic nouns of masculine gender, which were monosyllabic (n = 18) or disyllabic (n = 22)) and 10 actions (monomorphemic, disyllabic, transitive, accusative-assigning verbs). For 90% of the nouns, NOM and ACC case were overtly marked only at the determiner. For the remaining 10% of the nouns, NOM case was overtly marked only at the determiner, but ACC case was overtly marked at both the determiner and the noun (e.g., "den Löwen" [the[ACC] lion[ACC]]object). Nouns and verbs were controlled for normalised written lemma and type frequency values according to the dlex database (Heister, Würzner, Bubenzer et al., 2011). Moreover, other semantic and discourse factors such as animacy and discourse-givenness of sentential arguments impact referent accessibility and hence the ordering principles at the sentence level (e.g., Clark & Clark, 1977; Grewe, Bornkessel, Zysset et al., 2006). We controlled for these factors by exclusively choosing animate referents that were explicitly mentioned in the lead-in sentence.
Each trial started with a red fixation cross signalling the beginning of a new story. Afterwards, a blank screen for 500 ms was followed by a phrase-wise presented lead-in sentence (see Figure 1 (1)) introducing the two relevant animals of the scene, the action instrument, and a corresponding prepositional phrase (e.g., the place where the animals were finding the action instrument). With regard to information structure, the lead-in revealed both animals as discourse-given (Prince, 1981) and the action as inferable based on the mentioned instrument (Prince, 1992). The following visual context (2) consisted of the implicit visual attention-capture cue located at one of three picture positions: (i) the upper left or (ii) the upper right animal (TOPIC CUE), or (iii) the bottom centre position of the action instrument (NEUTRAL CUE). The visual cue was presented in the form of a black square of approximately 0.3° x 0.3° of visual angle against a light greyish background colour for a duration of 66 ms. The cue was immediately followed by the pictures at the respective positions, similar to previous visual cueing paradigms (Gleitman et al., 2007; Myachykov et al., 2012). Hence, either one of the two animals was cued in order to direct participants' attention to the topic referent (TOPIC CUE), or the action instrument was cued in order to direct participants' attention to a wider scope of the scene, the to-be-performed action of the two animals (NEUTRAL CUE). The coloured pictures of the animals and actions were presented with 9.3° x 9.3° of visual angle. After a black fixation cross, the target sentence (3) was presented phrase-wise either in SO or OS WORD ORDER, describing the thematic role relations of the depicted animals (i.e., who is performing the action with whom), followed by a blank screen for 200 ms.
The target sentence consisted of a first determiner phrase (DP1), verb, second determiner phrase (DP2), and a prepositional phrase specifying the animals' location or the action instrument. DP1 was either both subject and agent of the action or object and patient/undergoer of the action. DP2 always carried the inverse syntactic and thematic role of DP1. Note that in our study syntactic and thematic role always coincided. With the closing prepositional phrase (e.g., "with the brush") we aimed to prevent processing of DP2 from being contaminated by "wrap-up" effects typically occurring at the end of sentences (e.g., Just & Carpenter, 1980). The phrase-wise presentation durations were chosen in analogy to previous studies: DPs and prepositional phrases were presented for 500 ms each; conjunctions, auxiliary verbs, and main verbs for 450 ms each; with a 100 ms interstimulus interval. In 20% of trials a sentence-picture-verification task (4) probed participants' attentive reading of the stories. For this, 32 pictures (eight per condition) depicting the content of the preceding target sentence were created: half with correct and half with exchanged (i.e., incorrect) thematic role assignments (e.g., shark stroking whale vs. whale stroking shark). Pictures of the sentence-picture-verification task were presented for 2 s before participants had to press the corresponding button within a 2 s time window. The verification task was followed by a blank screen for 500 ms. The experimental items were identical to the ones in Burmester et al. (2014), except that the lead-in sentence was presented automatically rather than in a self-paced-reading manner and that the linguistic context question was replaced by the presentation of the visual context.
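As a minimal sketch (not the presentation script used in the study), the phrase-wise timing just described can be expressed as an onset/offset schedule; phrase labels and the function name are illustrative:

```python
# Durations (ms) reported in the text for each phrase category.
DURATIONS = {"DP": 500, "PP": 500, "verb": 450, "conj": 450, "aux": 450}
ISI = 100  # interstimulus interval in ms

def schedule(phrases):
    """Return (label, onset, offset) in ms for each phrase of a sentence."""
    t, out = 0, []
    for label in phrases:
        d = DURATIONS[label]
        out.append((label, t, t + d))
        t += d + ISI  # next phrase starts after the interstimulus interval
    return out

# A target sentence: DP1, verb, DP2, closing prepositional phrase.
for label, onset, offset in schedule(["DP", "verb", "DP", "PP"]):
    print(label, onset, offset)
```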
Due to the fictitious character of the stories, as in children's books, both animals could be plausible agents or patients of the action. In the visual context, the animals of a story were always facing each other. In the target sentence, animals occurred equally often as the agent or patient of an action. Animals were distributed equally across conditions and were always performing the action with a different animal. Introducing an animal first or second in the lead-in sentence as well as presenting the animals on the left or right side of the screen was counterbalanced across conditions. To avoid possible effects of structural priming (e.g., Scheepers & Crocker, 2004), trials were presented in pseudo-randomized order with maximally two consecutive trials of the same condition and word order in the target sentence. Preferences of thematic role assignment or topic continuity due to preceding trials were minimized by at least five intermediate trials before an animal was repeated. There were four lists of 160 trials each. Lists were created such that within each list each item (i.e., animal pair and action) occurred once and across the four lists each item occurred in each condition. Each participant was presented with one of these lists of 160 trials. The lists did not include any filler trials, in order to keep the experimental session within a time frame appropriate for participants' motivation and ability to concentrate, and hence to minimize artifacts and alpha waves in the EEG signal.
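The list construction described above (each item once per list, each item in a different condition on each list) can be sketched as a simple Latin-square rotation; the condition names are from the design, while the rotation scheme is illustrative rather than the authors' actual randomization code:

```python
CONDITIONS = ["TOPIC SO", "NEUTRAL SO", "TOPIC OS", "NEUTRAL OS"]

def build_lists(n_items=160, n_lists=4):
    """Assign each item to one condition per list, rotating across lists."""
    lists = []
    for l in range(n_lists):
        # Shift the item-to-condition mapping by one step for each list.
        lists.append([(item, CONDITIONS[(item + l) % 4])
                      for item in range(n_items)])
    return lists

lists = build_lists()
# Within each list: 40 items per condition (160 / 4).
# Across the four lists: every item occurs once in every condition.
```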

Procedure
Participants were tested individually, seated in a sound-attenuated booth at 80 cm distance from a computer screen (1680 x 1050 pixels screen resolution). After the preparation for EEG recording, participants were visually presented (on screen) with all pictures used in the subsequent experiment together with their corresponding word forms, that is, the 40 animals and 10 actions, to become familiar with the pictures. Afterwards participants received a written instruction asking them to read each story attentively and silently and to answer the sentence-picture-verification task after some of the stories as accurately and as fast as possible. Participants were asked to sit relaxed and to avoid eye movements, blinks, and other muscle movements. Participants had a button box (Cedrus® response pad model RB-830) on their lap and performed three practice trials to become familiar with the procedure. To answer the sentence-picture-verification task, the green and red response buttons (corresponding to correct vs. incorrect pictures) were assigned to the right forefinger and middle finger (counterbalanced across participants). Participants were instructed that a new story would be presented as soon as they saw a red fixation cross and pressed the yellow button of the button box, on which they kept their left thumb throughout the whole experiment. The experiment was visually presented by means of the Presentation® software (version 14.1; www.neurobs.com). The whole experiment included pauses after every 40 trials and lasted approximately 30 minutes. In a post-experiment questionnaire, participants were asked whether they had an idea about the purpose of the study and whether they noticed anything in the course of the experiment, for instance, any cues or disturbances during picture presentation.

Figure 1. (1) Participants read the phrase-wise presented lead-in sentence followed by a blank screen (200 ms), fixation cross (1000 ms), and blank screen (200 ms). (2) The implicit visual cue presented for 66 ms (TOPIC vs. NEUTRAL CUE) was directly followed by the pictures. After another fixation cross, (3) the SO or OS WORD ORDER target sentence was presented phrase-wise. In the TOPIC cue condition, the SO or OS sentence mentioned the cued referent (i.e., whale) first (i.e., as DP1). (4) In 20% of trials participants had to answer a sentence-picture-verification task afterwards. Abbreviations: SO = subject-verb-object, OS = object-verb-subject, DP = determiner phrase.

ERP data analysis
For ERP data analysis, the Brain Vision Analyzer software (version 2.1, Brain Products, Gilching, Germany) was used. To exclude slow signal drifts and muscle artifacts from the raw EEG data, a Butterworth zero-phase filter (low cutoff: 0.3 Hz; high cutoff: 70 Hz; slope: 12 dB/oct) was applied in addition to a 50 Hz notch filter. For the correction of artifacts caused by vertical eye movements, the algorithm by Gratton, Coles, and Donchin (1983) was applied. We applied an automatic artifact rejection to reject blinks and drifts in the time window of -200 to 2150 ms relative to the onset of the target sentence as well as -200 to 500 ms relative to the onset of the visual cue (rejection criteria: max. voltage step of 30 µV/ms, max. 200 µV difference of values in intervals, lowest activity of 0.5 µV in intervals). On average, 1.64% of trials were rejected. ERPs were averaged for each participant and each condition within a 2150 ms time window time-locked to the onset of the target sentence and within a 500 ms time window time-locked to the onset of the visual cue, with a 200 ms pre-stimulus onset baseline, respectively. For the statistical analysis, IBM SPSS Statistics (version 25.0) was used. The chosen parameters for ERP analysis were identical to the ones used by Burmester et al. (2014) to maintain comparability with the ERP results on the impact of linguistic context information on online sentence processing, in which the same sentences were presented to participants. Hence, based on previous psycholinguistic research, we analysed language-related ERP components of the target sentence in the following time windows time-locked to the onset of DP1, verb, and DP2, respectively: 100 - 300 ms (P200), 300 - 500 ms (N400), 500 - 700 ms (Late Positivity). Mean amplitudes were computed over three electrodes each for nine regions of interest (ROIs), which were identical to the ones in Burmester et al. (2014).
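The three rejection criteria can be illustrated with a small sketch on synthetic data. This is one interpretation of the thresholds (assuming a 1000 Hz sampling rate, amplitudes in µV, and peak-to-peak "activity"), not the Brain Vision Analyzer implementation:

```python
import random

MAX_STEP_UV_PER_MS = 30.0  # max. voltage step
MAX_DIFF_UV = 200.0        # max. difference of values in the interval
MIN_ACTIVITY_UV = 0.5      # lowest allowed activity in the interval

def reject_epoch(epoch_uv):
    """Epoch = list of channels, each a list of samples (1 sample per ms)."""
    for channel in epoch_uv:
        steps = [abs(b - a) for a, b in zip(channel, channel[1:])]
        peak_to_peak = max(channel) - min(channel)
        if max(steps) > MAX_STEP_UV_PER_MS:
            return True  # sudden jump, e.g., a blink onset
        if peak_to_peak > MAX_DIFF_UV:
            return True  # drift or large-amplitude artifact
        if peak_to_peak < MIN_ACTIVITY_UV:
            return True  # flat channel: implausibly little activity
    return False

random.seed(0)
clean = [[random.gauss(0, 2) for _ in range(500)] for _ in range(3)]
blink = [list(ch) for ch in clean]
blink[0] = [v + (300.0 if i >= 250 else 0.0) for i, v in enumerate(blink[0])]
print(reject_epoch(clean), reject_epoch(blink))  # prints False True
```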
For statistical ERP analysis, mean amplitude values of ERPs within each condition were analysed following a hierarchical schema (e.g., Burmester et al., 2014). Firstly, we computed a fully crossed repeated measures analysis of variance (ANOVA) with the fixed factors CUE (TOPIC vs. NEUTRAL), WORD ORDER (SO, OS), and ROI (nine levels, see above) for each of the three time windows time-locked to the onset of DP1, verb, and DP2, respectively.
In addition to the analyses of the target sentence, we analysed early ERP components (i.e., N1, P2) relative to the onset of the visual cue, henceforth termed CUE POSITION (i.e., TOPIC LEFT, TOPIC RIGHT, and NEUTRAL BOTTOM), in the time windows of 100 - 200 ms and 250 - 350 ms (see, e.g., Luck & Hillyard, 1994; Mangun, 1995, for attention-based early visual processing changes reflected in different early evoked potentials). Notably, the pictures followed the cue immediately; hence, these time windows started 100 ms and 250 ms after cue onset, but also 34 ms and 184 ms after picture onset.
We report Greenhouse and Geisser (1959) corrected F- and p-values, the original degrees of freedom (df) in brackets, and the Greenhouse and Geisser epsilon (ε) factor for non-sphericity adjustments of the original df according to Jennings and Wood (1976) (only for F-tests with more than one df in the numerator). Statistically significant effects (i.e., p < .05) involving an interaction with ROI were resolved by computing post hoc paired t-tests to reveal the topographical distribution of the effect. We controlled for the Type I error due to multiple pairwise t-tests of levels of the fixed effects in the nine ROIs by adjusting the significance level according to the Bonferroni correction. Thus, for post hoc t-tests the following Bonferroni-adjusted p-values (two-tailed) were considered statistically significant at α = .05: p < .006 to resolve the WORD ORDER x ROI interaction, and p < .002 to resolve the CUE POSITION x ROI interaction. For presentation purposes the displayed ERPs in Figures 2, 3, and B-1 in Appendix B are 10 Hz low-pass filtered.
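The two adjusted thresholds follow from dividing α by the number of post hoc comparisons. Assuming one t-test per ROI for the WORD ORDER contrast (9 tests) and three pairwise CUE POSITION comparisons per ROI (27 tests), the arithmetic can be sketched as:

```python
alpha = 0.05

# WORD ORDER x ROI: one SO-vs-OS t-test in each of the nine ROIs.
word_order_threshold = alpha / 9       # reported as p < .006

# CUE POSITION x ROI: three pairwise cue-position comparisons per ROI.
cue_position_threshold = alpha / (3 * 9)  # reported as p < .002

print(round(word_order_threshold, 4), round(cue_position_threshold, 4))
```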

Behavioural data analysis
For the statistical analysis of the response accuracy in the sentence-picture-verification task, logit mixed models fitted by the Laplace approximation were calculated using the lme4 package (Bates, Mächler, Bolker, & Walker, 2015) within the R environment (version 3.5.1, R Core Team, 2013). To analyse the binary response accuracy data (correct vs. incorrect) with the logit mixed models, CUE, WORD ORDER, and their interaction were defined as fixed effects, and Participants and Items were defined as random effects. Fixed effects were coded as +/-.5 to resemble the contrast coding of traditional ANOVA analyses. Model fitting started with the simple model (i.e., the two fixed effects and their interaction, and Participants and Items as random intercepts). In a step-wise manner, slope adjustments were included if they significantly improved the explanatory power of the simpler model without that slope adjustment, as revealed by log-likelihood tests (e.g., Baayen, 2008). The statistics of the fixed effects of the final models are reported with estimates (b), standard errors (SE), z-, and p-values.
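A minimal sketch of the +/-.5 contrast coding described above (the level-to-value mapping and function name are illustrative; in lme4 syntax the starting model would resemble accuracy ~ cue * order + (1 | participant) + (1 | item)):

```python
# Map each factor level to a centred +/-.5 contrast, so that main effects
# are evaluated at the grand mean, as in a traditional ANOVA.
def code_trial(cue, word_order):
    c = 0.5 if cue == "TOPIC" else -0.5        # CUE: TOPIC vs. NEUTRAL
    w = 0.5 if word_order == "SO" else -0.5    # WORD ORDER: SO vs. OS
    return {"cue": c, "order": w, "interaction": c * w}

print(code_trial("TOPIC", "OS"))
```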

Results
As reported in the post-experiment questionnaire, participants did not notice any manipulation of their visual attention, suggesting that the cue was not consciously perceived and hence truly implicit. In this section, we first describe the ERP results with respect to initial and subsequent processing of the target sentence, before reporting the results of an additional analysis of the present ERP data following the visual cue together with the published data following the linguistic cue (Burmester et al., 2014). Secondly, we present the ERP results with respect to the onset of the visual cue. Thirdly, we report the behavioural results of the probe sentence-picture-verification task. Figure 2 illustrates the grand average ERPs at one representative electrode time-locked to the onset of the target sentence (i.e., DP1) followed by the subsequent sentence positions (i.e., verb and DP2) for both CUES (TOPIC and NEUTRAL) and both WORD ORDERS (SO and OS). For grand average ERPs at selected electrodes of each ROI, see Figure B-1 in Appendix B, illustrating the CUES within each WORD ORDER.

ERP results of sentence processing following the implicit visual cue
For sentence-initial processing, statistical analyses in the time windows of 100–300 ms and 300–500 ms time-locked to the onset of DP1 revealed neither statistically significant main effects of CUE (TOPIC vs. NEUTRAL) or WORD ORDER (SO vs. OS), nor significant interactions of CUE, WORD ORDER, and/or ROI [p > .1] (see Appendix A for the complete statistical output). The analysis in the following time window of 500–700 ms revealed a statistically significant main effect of WORD ORDER [F(1, 28) = 6.254, p = .019] and a significant interaction of WORD ORDER x ROI [F(8, 224) = 3.004, p = .029, ε = 0.424], but no statistically significant effects or interactions of the factor CUE. Separate post hoc analyses for SO and OS sentences (averaged across the cue conditions) within each ROI yielded a statistically significant enhanced positive-going ERP for OS compared to SO sentences in the LEFT CENTRAL ROI [t(28) = -3.605, p = .001] and in the MIDLINE CENTRAL ROI [t(28) = -3.369, p = .002]. In summary, the ERP results at all three sentence positions (i.e., DP1, verb, DP2) did not show any statistically significant modulation by the preceding visual cue. An impact of the varying word order, with enhanced positive-going ERPs for OS compared to SO sentences, was evident in multiple time windows time-locked to the onset of DP1 (i.e., 500–700 ms) and the verb (i.e., 100–300 ms, 300–500 ms, and 500–700 ms).

ERP results compared to sentence-initial processing of the linguistic cue (Burmester et al., 2014)
Since the present study with the visual cue used sentence material identical to that of the study with the linguistic cue (Burmester et al., 2014), we aimed to directly compare the impact of the visual vs. linguistic cue modality on sentence-initial processing. For this purpose, we computed additional comparisons of the published ERP data following the linguistic cue with the ERP data following the visual cue by adding the between-subject factor MODALITY (VISUAL vs. LINGUISTIC; see Figure 2 and Figure B-1 in Appendix B of the present study, and Figure 2 and Table 3 in Burmester et al., 2014).
In summary, the visual cue had no impact on sentence processing in the present study; it thus did not effectively increase the salience of the cued referent (i.e., TOPIC CUE). To examine whether the visual cue per se modulated participants' processing, we computed a further ERP analysis time-locked to the onset of the visual cue. If the cue was processed, this should be reflected in differential, especially early, sensory-evoked potentials depending on the CUE POSITION on screen (i.e., TOPIC LEFT, TOPIC RIGHT, and NEUTRAL BOTTOM).

ERP results of the visual cue per se
With the following ERP analysis we aimed to test whether the implicit visual cue modulated participants' early sensory-evoked potentials (i.e., N1, P2) time-locked to the onset of the visual cue. The visual cue was presented for 66 ms and was directly followed by the pictures. However, since the timing and the position of the pictures were always the same, early processing differences are likely to be related to the differing prior CUE POSITIONs. Therefore, ERP analyses were calculated to assess the impact of CUE POSITION (i.e., TOPIC LEFT, TOPIC RIGHT, and NEUTRAL BOTTOM) in the N1 (100 to 200 ms) and P2 (250 to 350 ms) time windows post onset of the visual cue, and its topographical distribution by the factor ROI. As can be seen in the figure, post hoc analyses show that the TOPIC LEFT cue elicited a significantly more pronounced positive deflection compared to the NEUTRAL BOTTOM cue (p < .001). Note that we cannot clearly disentangle the response to the cue from the response to the picture, as the pictures were always presented 66 ms after the cue. The modulation by CUE POSITION might therefore reflect either the direct effect of the cues themselves or their impact on the processing of the subsequently presented pictures.
In both cases, the results indicate that participants processed the implicit visual cues. Still, given that none of the participants noticed the presence of the cues, and that sentence processing was not influenced by the cue, we can assume that the cues were processed only subconsciously.

Behavioural results
In the sentence-picture-verification task (presented in 20% of trials), participants showed a high response accuracy across conditions, indicating that they were attentive throughout the experiment.

Discussion
We aimed to answer the question of whether an implicit visual cue to a depicted referent impacts sentence-initial processing similarly to a purely linguistic cue, as revealed by ERPs (Burmester et al., 2014). With regard to the linguistic cue, the indication of the aboutness topic referent, which was subsequently mentioned in sentence-initial position, reduced the Late Positivity during online processing of OS sentences. In the present study, the linguistic topic cue was replaced by an implicit visual cue to a depicted referent in a visual scene. However, the impact of the linguistic cue on sentence-initial processing was not replicated; in fact, none of our analyses revealed any statistically significant effects of the visual cue on sentence processing. The findings concerning the linguistic cue (Burmester et al., 2014) were interpreted within the Syntax-Discourse Model (Schumacher, 2014), which highlights the role of context information for meaning computation during sentence comprehension. Within this model, the impact of context information (including, amongst others, sentential and situational context) is reflected in modulations of processing costs for discourse linking and for updating the listener's hitherto built mental representation. Following this model, the impact of topic-hood of the sentence-initial referent in OS sentences was attributed to reduced discourse updating costs: in contrast to a neutral cue, the linguistic topic cue marked one referent as more salient than the other, which rendered this referent more likely to be mentioned sentence-initially. Evidently, the visual topic cue used in the present study did not increase the salience of the cued referent to a similar extent and hence did not elicit a facilitative effect on sentence processing in terms of reduced discourse updating costs.
However, the long-neglected role of extra-linguistic (e.g., visual) information in traditional accounts of sentence comprehension (e.g., Frazier & Fodor, 1978; Friederici, 2002; Marslen-Wilson & Tyler, 1980) has been complemented by recent models underlining the close temporal integration of multimodal (e.g., visual and linguistic) information in the listener's current mental representation (e.g., Bower & Morrow, 1990; Zwaan, 2004). For instance, as supported by neuroscientific methods, a one-step model of sentence comprehension, integrating concomitant information from different modalities within one step, has been suggested (Hagoort & Van Berkum, 2007). More specifically, this model postulates that linguistic information (for instance, sentential structure, semantics) and pragmatic information (from prior discourse, or from extra-linguistic sources such as the speaker's gestures or the visual world) are immediately processed by the same brain regions (namely the left inferior frontal gyrus) in order to directly map all information onto a discourse model as the basis for sentence interpretation. Similarly, the Coordinated Interplay Account by Crocker et al. (2010) explicitly outlines the integration of visual scene information during sentence processing, especially in cases where visual information becomes highly relevant for the interpretation and disambiguation of spoken sentences. Linking to the so-called "blank screen paradigm", in which attention shifts during sentence comprehension occurred even when depicted objects were no longer presented (Altmann, 2004), Crocker et al. (2010) suggest that mental representations of a previously presented scene still influence the comprehension process. However, none of these models specifies the weighting of (competing) visual and linguistic information or the role of specific inherent features of visual salience cues in determining the strength of their impact during language comprehension.
In the following, we discuss the absence of a visual cue effect on sentence processing against the background of previous research using different visual cues, while raising possible issues with respect to the present study design. Afterwards, we briefly discuss the word order effect, which we replicated (Section 4.2).

Null effect of visual cue on sentence-initial processing
Against the background of previous research investigating the interaction of visual cues with linguistic processing, we discuss the absence of a cue effect in the present study 1) with respect to different aspects of the informativity of the visual scene for the comprehension process, and 2) with respect to the differing impact of visual cue types (implicit vs. explicit) on sentence comprehension and production.
With respect to the informativity of the visual scene, some previous studies emphasise that visual information needs to be relevant for meaning computation during language processing. For instance, listeners use depicted events, similarly to linguistic cues (i.e., case marking), for syntactic reanalysis of locally structurally ambiguous German sentences, as reflected in a reduced P600 (Knoeferle et al., 2007). Further evidence from the field of referential processing shows that visual information impacts the accessibility of discourse referents only if the linguistic context is moderate, uninformative, or ambiguous (e.g., Nappa, Wessel, McEldoon et al., 2009; Vogels, Krahmer, & Maes, 2013). With respect to the present study design, the visual scene might not have added crucial information to the comprehension process, for instance, for assigning thematic roles in the subsequent target sentence. Compared to visual scenes used in production studies (e.g., Gleitman et al., 2007; Myachykov et al., 2012), the visual scenes in the present study did not depict thematic role relations, for the following reasons: We aimed to minimize confounding effects of prominence-related factors known to affect participants' gaze fixations on depicted transitive events (e.g., agent-directed fixations followed by patient-directed fixations, Ganushchak, Konopka, & Chen, 2017) as well as linear ordering preferences of sentential constituents (e.g., agent theta roles precede patient theta roles, Jackendoff, 1972). In short, we tested how a single implicit cue indicating the subsequent sentence topic would affect referent accessibility, and therefore we eliminated all additional factors that could have masked this single cue.
Indeed, previous research shows that the more (additional) information is conveyed by the prior discourse, whether by the linguistic context or by depicted scenes, the more predictable specific upcoming words become, as reflected in an immediate ease of sentence processing (e.g., Burmester et al., 2018; Otten & Van Berkum, 2008; Van Berkum, Brown, Zwitserlood et al., 2005). More specifically, in an experimental design rather similar to the present one, we showed that depicted thematic role information of the salient referent boosted the cue-based ease of sentence processing compared to non-predictable thematic role information (Burmester et al., 2018). Moreover, for linguistic context information, Otten and Van Berkum (2008) found greater priming effects of the exact discourse message than of mere word primes. Drawing the parallel to our earlier findings, this could speak in favour of predictive processing mechanisms following the linguistic cue explicitly indicating the upcoming sentence-initial topic (Burmester et al., 2014), while with the visual cue we rather manipulated accessibility in a way similar to a word prime. Hence, additional depicted information such as thematic role relations might have activated further semantic features, constrained a more precise and coherent discourse context, and ultimately supported predictive processing.
Moreover, with respect to the informativity of the visual scene, the simultaneous visual presence of multiple referents has been argued to reduce referent accessibility (as revealed by reduced pronoun use; e.g., Arnold & Griffin, 2007, or Fukumura, Van Gompel, & Pickering, 2010). In the present study, two possible referents were depicted simultaneously, which, parallel to the preceding argument, could have caused a competition in referent accessibility. Such competition did not arise in our ERP study, in which the linguistic topic cue exclusively increased the accessibility of one referent while not mentioning the other (i.e., "What about the 'topic referent'?"; Burmester et al., 2014). However, in our reading time study (Burmester et al., 2018), multiple referents were presented simultaneously, and nevertheless an explicit visual (gaze) cue increased the accessibility of one amongst three depicted referents, as reflected in sentence-initial processing ease similar to that following a linguistic topic cue. But, in contrast to the present study, participants in our reading time study were already familiarised with the visual scene by a multimodal lead-in before the gaze cue was presented. Therefore, participants might have had greater attentional capacities at the moment of processing the gaze cue than during processing of the implicit cue with the subsequently presented pictures in the present study. Alternatively, the type of cue matters, as gaze cues are more explicit and intentional in nature than the implicit abstract cue used here.
Taking previous studies using different visual cue types into account, we can assume that visual cues do modulate meaning computation during sentence comprehension, while they seem to differ with respect to their impact on referent accessibility. As just mentioned, one type of visual cue that has clearly been shown to influence language processing is eye gaze. This cue is a strong social-communicative cue signalling shared attention of speaker and listener; it is hence an intentional cue. Crucial evidence from the few studies that compared both attentional and intentional cues supports the importance of the intentional component of visual cues for listeners: Similarly to a speaker's gaze or pointing gesture, a visual cue (i.e., a black square presented for 50 ms at the location of a possible referent) influences listeners' pronoun interpretation, but only if listeners were previously instructed that this abstract visual cue was intentionally created by the speaker (Nappa & Arnold, 2014). The same visual cue without the prior instruction that it was intentionally created by the speaker has no impact. Analogously to this finding, Holle et al. (2012) showed that an intentional, conversational gesture co-occurring with the speaker's speech facilitates comprehension of ambiguous German SO and OS sentences: This short hand movement (i.e., a beat gesture) emphasising the subject of the sentence reduces additional processing costs for disambiguation towards the OS word order, as indicated by a reduced P600. In contrast to this gestural cue, listeners do not make use of an explicit visual attentional cue, that is, a moving point. Hence, the intentional component of visual cues plays a significant role in meaning computation during the comprehension process. This significance could be explained by assumptions of Pickering and Garrod's (2004) interactive alignment model.
According to this model, interlocutors are assumed to develop aligned situational models for a successful dialogue. Thus, intentional cues such as eye gaze may offer a window into the interlocutors' mental representations in dialogue and might trigger attentional mechanisms other than those elicited by purely attentional, visual cues. Nevertheless, the present study is a first step towards a better understanding of the impact of implicit attentional visual cues on sentence-initial processing, and it demonstrates the subtle differences and difficulties involved in adapting study designs across sentence comprehension (Burmester et al., 2014) and sentence production (Gleitman et al., 2007).
Indeed, studying the impact of implicit visual cues on sentence comprehension involves some caveats that may weaken the measurable outcome during later sentence processing. As the visual cue of the present study, we used a black square subtending 0.3° × 0.3° of visual angle against coloured pictures, similar to the cue of Gleitman et al. (2007), who used a slightly bigger cue (i.e., 0.5° × 0.5° of visual angle) against full-colour clip-arts. Moreover, Myachykov et al. (2012) used a red dot subtending 0.7° × 0.7° against black-and-white line drawings that, similarly to the cue of Gleitman et al. (2007), influenced sentence-initial mention during production. The visual cue in our study may thus have been too subliminal, and hence its impact too short-lived, to be measurable during subsequent sentence processing. Following this line of thought, modulations of participants' attention by the implicit visual cue might be weaker and less long-lasting in their impact on referent accessibility compared to the high degree of accessibility indicated by the explicit mention of the topic in the linguistic context. This explanation is in accordance with eye-tracking data reported by Arnold and Lao (2015): Listeners' trial-initial attention to depicted referents was indeed modulated by multiple visual scene-based factors, that is, different visual attentional cues, the order of the depicted referents from left to right, and listeners' idiosyncratic biases. But the attentional cues themselves did not significantly predict subsequent pronoun interpretation; instead, the linguistic cue of sentence-initial mention was the strongest predictor. In addition, in an experimental task that requires sentence understanding, linguistic cues such as those in Burmester et al. (2014) might be less likely to be ignored by the reader than visual cues and picture contexts such as those in the present study.
Hence, it is important to check whether visual cues and depicted scenes are indeed processed by the addressee of the stimuli.
However, based on the absent impact of the implicit visual cue on sentence processing in the present study, we suggest that this type of visually induced salience of discourse referents has a different, weaker impact on the accessibility of mentally represented discourse referents compared to linguistically induced salience via topic-hood. This line of reasoning is supported by ERP and behavioural studies showing that linguistic stimuli (e.g., sentences, words) affect the accessibility of entities more strongly than pictures do (e.g., Bögels, Schriefers, Vonk, & Chwilla, 2011; Fukumura et al., 2010). Moreover, differential neural processing of linguistic and visual stimuli has been suggested (e.g., Brandon & Andrew, 2007; Zhang, Begleiter, Porjesz, & Litke, 1997), which, for instance, has been explained by the more efficient semantic access and/or memory retrieval for words compared to pictures (Dorjee, Devenney, & Thierry, 2010). However, words and pictures revealed a very similar time course of the N400 (congruity) effect, with differences only in the topographical distribution (Ganis, Kutas, & Sereno, 1996).
In summary, the null effect of the implicit cue in our study might, on the one hand, be traced back to a weaker, less long-lasting impact on referent accessibility compared to intentional or linguistic cues. On the other hand, it might to some degree be related to methodological differences from production studies in which implicit cues modulated sentence-initial mention. Disentangling these reasons and testing the impact of other types of visual attentional cues is left to future research.

Word order effect
In the present study, word order effects are reflected in sustained positive deflections (at the DP1 and verb positions) which were more pronounced for OS compared to SO sentences across multiple ROIs. Hence, we replicated the differential processing costs for SO and OS sentences found in Burmester et al. (2014). A body of neurocognitive research on the processing of German sentences with varying word order demonstrates increased processing costs for OS sentences compared to their canonical (SO) counterparts, as reflected in different ERP components, time windows, and sentence positions (e.g., Holle et al., 2012; Knoeferle et al., 2007; Matzke, Mai, Nager et al., 2002; Schlesewsky, Bornkessel, & Frisch, 2003). For this paper, the word order effect is not of primary relevance, and we take it merely as a "sanity check" of our ERP data. In line with the previous literature, we argue that the processing differences between OS and SO sentences are engendered by the subject/nominative-first preference in German, leading to increased processing demands for the non-canonical and less frequent OS word order. Note that in our study the impact of word order might be confounded with the ordering of theta roles, as grammatical role and theta role coincided in the sentence constituents, that is, subject and agent, object and patient. The increased online processing difficulties were, however, not visible in the behavioural (accuracy) results of the subsequent probe sentence-picture-verification task. In summary, we argue that the replicated word order effect during online sentence processing confirms the validity of the design: The replication of the word order effect shows that we are not dealing with a replication failure per se, but rather that the differing findings can be traced back specifically to cue modality.

Conclusion
All in all, previous findings speak in favour of the integration of visual and linguistic cues into listeners' mental representations, although with different magnitudes for the two cue modalities. In our study, the implicit visual cue to a depicted referent followed by subsequent sentence-initial mention did not influence online processing of German SO and OS sentences. Hence, the impact of the linguistic topic cue on the identical sentence material could not be replicated with the present study design, although a similar type of cue was effective in previous production studies (Gleitman et al., 2007). It therefore remains an open question whether comprehension and production are influenced by similar underlying processes. We conclude that the role of visual, purely attention-directing cues for meaning computation during sentence processing needs further clarification. Future research should shed more light on the role of different visual cues and their interaction with intentional and attentional aspects in guiding information packaging preferences during utterance comprehension, in order to disentangle experimental task-specific effects.

Acknowledgements
This work was supported by the German Research Foundation (DFG) under Grant SFB 632 'Information structure'. We thank Franziska Machens and Tobias Busch for assistance in material preparation and data collection as well as Jan Ries for his help in preparation of the Figures.

Appendix A
Results of the analyses of variance (ANOVAs) of the ERPs for the different time windows time-locked to the onset of the first determiner phrase (DP1). Note. Greenhouse & Geisser (1959) corrected significance levels: * p < .05; ** p < .01; *** p < .001. df = degrees of freedom. ε = Greenhouse & Geisser epsilon factor for non-sphericity used to adjust the original df according to Jennings and Wood (1976).

Appendix C
Results of the overall analyses of variance (ANOVAs) of the ERPs for the different time windows time-locked to the onset of the first determiner phrase (DP1) of target sentences following both cue modalities (MODALITY), that is, the implicit visual cue (i.e., VISUAL: data of the present study) and the linguistic cue (i.e., LINGUISTIC: data published in Burmester et al., 2014). Note. Greenhouse & Geisser (1959) corrected significance levels: # p < .06; * p < .05; ** p < .01; *** p < .001. df = original degrees of freedom. ε = Greenhouse & Geisser epsilon factor for non-sphericity used to adjust the original df according to Jennings and Wood (1976).