This paper addresses methodological and conceptual issues related to one of the most commonly used research designs in corpus-based studies of L2 use: the comparison of corpora representing L1 and L2 speakers. In these studies, patterns from an L2 corpus are contrasted with findings from an L1 corpus, which acts as a reference point against which the L2 data are interpreted. For this comparison to be meaningful, the corpora should share their major characteristics (e.g. the type of speakers, linguistic setting and genre) and differ only in the status (L1 vs L2) of the speakers (e.g. Leech, 1998; Gilquin, 2015). In practice, this is often difficult to achieve, so it is frequently unclear to what extent the differences found between L1 and L2 speech genuinely reflect a difference between the two populations and to what extent they are merely an artefact of corpus design. This talk examines the impact of corpus comparability on the interpretation of results and the validity of conclusions. It draws on data from two corpora representing interactive spoken production: the Trinity Lancaster Corpus (TLC), which contains more than 4 million words of spoken L2 English from over 2,000 L2 speakers, and the Trinity Lancaster L1 Corpus (TLC-L1), which contains over 500,000 words from 150 L1 speakers of English. The two corpora were built along the same principles, use the same speaking tasks and include speakers with different social characteristics.