This study demonstrates the role of chunking in second language writing. It presents a novel approach based on the analysis of large amounts of keystroke-logging data. The frequency statistics of the multi-word sequences in these texts provide direct evidence for a chunk-based mechanism in written production.
Although chunking has been a central concept within the field of learning and memory for over half a century (Miller, 1956), it has only recently resurfaced in theoretical approaches to language learning and processing. These approaches assume that, through exposure to language, humans learn to rapidly recode incoming language material into chunks to overcome the fleeting nature of the linguistic input (Now-or-Never bottleneck and chunk-and-pass language processing; Christiansen & Chater, 2016). Moreover, such chunking contributes significantly to fluent language production (see, e.g., Christiansen & Arnon, 2017, for an overview). This is supported by an accumulating body of evidence from experimental measurements as well as computational models (Christiansen, 2019). Here we present a novel approach to studying online chunking in language production based on keystroke logging. Previously, keystroke measures have been successfully linked to the cognitive processes executed during (keyboard) writing, with pauses reflecting different levels of planning and bursts reflecting initial formulation of thought (see, e.g., Baaijen, Galbraith & de Glopper, 2012). The keystroke data used for the study come from 660 university students (second language learners of English) tasked with writing weekly lecture summaries over the course of a semester, yielding 7,012 texts amounting to 2.79 million words. The data were analyzed with respect to (a) person-centered mean inter-keystroke intervals, average pause lengths and pause distributions, and (b) the frequency statistics of all unigrams, bigrams and trigrams produced, as computed from a 570-million-word corpus of contemporary English usage (Davies, 2008). Statistical modelling revealed (1) that median pause time decreased as word frequency increased and (2) that n-gram frequency was significantly higher for n-grams within bursts than for n-grams across bursts.
These results provide direct evidence for a chunk-based mechanism in written production.
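
The within-burst versus across-burst comparison can be illustrated with a minimal sketch. It is not the study's actual pipeline. It assumes that each word comes with the pause (in milliseconds) that preceded it and that a pause at or above a fixed threshold starts a new burst. The 2,000 ms threshold, the input format and the function names are assumptions made for illustration; none of them is specified above.

```python
from typing import List, Tuple

# Assumed threshold: the pause length that separates bursts is not
# stated in the abstract, so this value is purely illustrative.
PAUSE_THRESHOLD_MS = 2000

Bigram = Tuple[str, str]


def split_into_bursts(
    words: List[Tuple[str, int]], threshold_ms: int = PAUSE_THRESHOLD_MS
) -> List[List[str]]:
    """Group words into bursts; a pause >= threshold_ms opens a new burst.

    `words` is a list of (word, pause_before_ms) pairs (hypothetical format).
    """
    bursts: List[List[str]] = []
    for word, pause_ms in words:
        if not bursts or pause_ms >= threshold_ms:
            bursts.append([])
        bursts[-1].append(word)
    return bursts


def classify_bigrams(
    bursts: List[List[str]],
) -> Tuple[List[Bigram], List[Bigram]]:
    """Return (within-burst bigrams, across-burst bigrams)."""
    within: List[Bigram] = []
    across: List[Bigram] = []
    previous_last = None
    for burst in bursts:
        if previous_last is not None:
            # This bigram spans the pause between two consecutive bursts.
            across.append((previous_last, burst[0]))
        within.extend(zip(burst, burst[1:]))
        previous_last = burst[-1]
    return within, across


# Invented toy input: one long pause before the second "the".
logged = [("the", 0), ("end", 150), ("of", 120), ("the", 2500), ("lecture", 130)]
bursts = split_into_bursts(logged)
within, across = classify_bigrams(bursts)
```

Under these assumptions, the bigrams in `within` and `across` could each be looked up in a reference corpus and their frequencies compared. A chunk-based account would predict higher frequencies for the within-burst set.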