Here is a list of my most recent (collaborative) research projects. For past projects, please check the website of the Centre for English Corpus Linguistics.

  • Project title: Core metadata scheme for learner corpora
    Status: ongoing
    Collaborators: Alexander König (CLARIN ERIC), Jennifer-Carmen Frey (EURAC), Egon Stemle (EURAC)
    See list of presentations

The main objective of this project is to design a core metadata scheme for learner corpora. In a first study presented at LCR2022, we applied Granger & Paquot’s (2017) initial proposal (see project below) to a set of learner corpora. We then adapted the initial metadata scheme to better meet the needs of a varied set of learner corpora and to make the scheme more FAIR. We shared the scheme with the research community in September and are now finalizing version 1 on the basis of colleagues’ feedback.
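For illustration only, here is a minimal sketch of what a metadata record conforming to a core scheme of this kind might look like. The field names and values are hypothetical placeholders inspired by common learner corpus variables, not the actual scheme.

```python
# Hypothetical sketch of a core metadata record for one learner text.
# Field names and values are illustrative placeholders, NOT the actual scheme.
text_metadata = {
    "corpus_name": "ExampleLearnerCorpus",   # administrative metadata
    "text_id": "ELC-0001",
    "target_language": "eng",                # ISO 639-3 codes support FAIR reuse
    "learner_l1": "fra",
    "proficiency": {
        "framework": "CEFR",
        "level": "B2",
        "assignment_method": "text-based rating",
    },
    "task": {"type": "argumentative essay", "timed": False},
}
```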

The draft version presented at LCR is available from:

Paquot, M., König, A., Stemle, E. & Frey, J.-C. (2023). Core Metadata Schema for Learner Corpora, https://doi.org/10.14428/DVN/4CDX3P, Open Data @ UCLouvain, V1, UNF:6:WhLZTg+knFe2FjjgxGg3Uw== [fileUNF]

A revised version of the core metadata scheme will be presented at EUROSLA in August 2023.

  • Project title: Adaptive Comparative Judgment for large-scale proficiency rating in L2 research: task conditions and their impact on reliability and validity (Crowdsourcing Language Assessment Project (CLAP))
    Status: ongoing
    Collaborators: Peter Thwaites, Nathan Vandeweerd
    See list of presentations and publications

Though proficiency is one of the most important constructs in Second Language Research, its measurement has not always received the attention it deserves, and practices of proficiency level assignment have been the subject of continued criticism. In many learner corpora, proficiency is still operationalized in ways that are widely regarded as unreliable (e.g. institutional status). While some learner corpora improve on the current state of the art by including text-based proficiency scores, the time and cost typically associated with analytical scoring mean that such scores are often absent from currently available large-scale learner corpora, or operationalized in unreliable ways.

The main objective of this research project is to investigate whether Adaptive Comparative Judgment (ACJ), a method used in educational assessment, can be used for reliable and valid large-scale proficiency rating of learner language. The project adopts an experimental design to investigate the effect of judges’ language assessment training and expertise, as well as of text features such as topic and length, on the reliability and validity of ACJ tasks. Answering a call for more research into CJ validity, the project focuses on its construct validity, content validity, and criterion validity by collecting and analyzing the judges’ justifications that underpin their selection of one text over the other, and by examining correlations between ACJ rank orders and CEFR levels of learner texts from the International Corpus of Learner English.

Theoretically, the project will contribute to a better understanding of proficiency in L2 research by allowing for an empirical investigation into the components that factor into reliable assessments, the relative importance of the various components, and the potential dynamic relationship between these components as learner proficiency develops. Practically, it will generate valuable resources to be shared open access with the research community.

  • Project title: The Crowdsourcing Language Assessment Project (CLAP) – pilot study
    Collaborators: Rachel Rubin, Nathan Vandeweerd
    Publication: Paquot, M., Rubin, R. & Vandeweerd, N. (2022). Crowdsourced Adaptive Comparative Judgment: A community-based solution for proficiency rating. Language Learning 72(3), 853-885. https://onlinelibrary.wiley.com/doi/10.1111/lang.12498

Proficiency is one of the most important constructs in Second Language Research. Yet its measurement has not always received the attention it deserves, and practices of proficiency level assignment have been the subject of continued criticism (e.g. Hulstijn et al. 2010). In Learner Corpus Research more specifically, Carlsen (2012) noted that learner corpus compilation projects still relied on variables such as ‘institutional status’ or ‘year of study’ to assign a proficiency level to learners (and even learner groups), even though these external or learner-centred criteria are largely regarded as unreliable (Thomas, 1994).

To address these criticisms, the AndreSpråksKorpus (ASK) compilers were among the first to adopt a systematic a posteriori text-based approach to proficiency grounded in the professional field of language assessment. The ASK texts were rated by a minimum of five trained raters; estimates of rater severity and rater reliability were also used to provide feedback to the raters and evaluate the rating procedure (see Carlsen, 2012 for more details). While this approach is undoubtedly exemplary, the drawback is “the time and costs required” (ibid.:179).

The main objective of the Crowdsourcing Language Assessment Project (CLAP) is to investigate whether crowdsourcing can offer practical solutions to the time and cost difficulties associated with a text-based approach to proficiency assessment in learner corpus research. Crowdsourcing refers to the practice of soliciting contributions from a large group of people or the general public rather than from traditional employees or suppliers. In this specific project, we are soliciting evaluations of argumentative essays written by English as a Foreign Language (EFL) learners from members of the European Network of Combining Language Learning with Crowdsourcing Techniques (enetcollect; Lyding et al., 2018) instead of professional raters.

The project relies on the technique of adaptive comparative judgement (ACJ; Pollitt, 2012), a modification of Thurstone’s method of comparative judgement that has been reported to produce reliable and valid assessment of written performances (Van Daal et al., 2019): learner texts are presented in pairs to raters, who have to choose the ‘best’ one. This technique is based on the assumption that people can compare two performances more easily and reliably than they can assign a score to an individual performance (Lesterhuis et al., 2017). A second critical assumption underpinning CJ is its reliance on holistic judgment: “Judges do not receive criteria to guide their judgment process, but only a general description regarding the writing competence to be assessed” (Van Daal et al., 2019). By means of an iterative and adaptive algorithm, the ACJ platform used (here ComPAIR; Potter et al., 2017) then produces a scaled distribution of student performances.
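To make the scaling step concrete, here is a minimal sketch of the non-adaptive core of such an algorithm: a Bradley-Terry model fitted to pairwise judgments. This is an illustration under stated assumptions, not ComPAIR’s actual adaptive algorithm.

```python
import numpy as np

def bradley_terry(n_items, comparisons, iters=200):
    """Estimate latent quality scores from pairwise judgments.

    comparisons: list of (winner, loser) index pairs.
    Returns log-strengths, i.e. a scaled distribution of performances
    (a sketch of the Bradley-Terry core, not ComPAIR's algorithm).
    """
    wins = np.zeros(n_items)              # times each item was preferred
    n = np.zeros((n_items, n_items))      # times each pair was compared
    for winner, loser in comparisons:
        wins[winner] += 1
        n[winner, loser] += 1
        n[loser, winner] += 1
    pi = np.ones(n_items)
    for _ in range(iters):                # MM updates (Hunter, 2004)
        denom = (n / (pi[:, None] + pi[None, :] + 1e-12)).sum(axis=1)
        pi = wins / np.maximum(denom, 1e-12)
        pi /= pi.sum()                    # fix the arbitrary scale
    return np.log(pi)

# Toy judgments over three texts: text 0 is preferred most often.
print(bradley_terry(3, [(0, 1), (1, 2), (0, 2), (2, 1)]))
```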

The study design was piloted locally in 2019 and we will launch the project in February 2020. Unlike in traditional crowdsourcing tasks, we will ask raters to fill in a short questionnaire about their familiarity with English and the task of writing assessment. The results will make it possible to answer the following research questions:

– Can a crowd of people be used to assess learner texts reliably and validly?
– How do the scores generated by the crowdsourced language assessment task compare with the scores assigned by experts to the same learner texts?
– What is the impact of different language skills and language assessment expertise on the task of language assessment?

We believe addressing these questions in the fields of SLA and LCR is of crucial importance given the centrality of the construct of proficiency. If results are satisfactory, we hope that the approach will be used by the SLA/LCR community to add text-based proficiency levels to a variety of learner corpora for different languages.

  • Project title: Phraseological complexity: a promising construct for second language research
    Status: ongoing
    Collaborators: Alex Housen, Rachel Rubin, Nathan Vandeweerd
    See list of presentations and publications

Research in corpus linguistics, psycholinguistics and cognitive linguistics has provided convergent evidence that lexis and grammar are inextricably intertwined and that word combinations play crucial roles in language acquisition, processing, fluency, idiomaticity and change. Second language (L2) research has been relatively slow to follow suit. Phraseology, formulaic language and constructions are now at the forefront of debates in foreign language learning and teaching but not all domains of L2 research have navigated the transition. One particular field that has remained impervious to current developments is interlanguage complexity: linguistic complexity has traditionally been narrowed down to syntactic complexity and lexical complexity is still very much regarded as its poor relation.

The overarching aim of the project is to define and circumscribe the linguistic construct of phraseological complexity within the framework of usage-based theories of language, and to theoretically and empirically demonstrate its relevance for L2 complexity research, and more generally for theories of L2 use and development. The project centres on four main objectives: (1) determine the dimensions of phraseological complexity, (2) establish the construct validity of new phraseological complexity measures automatically calculated using NLP techniques and corpus data, (3) chart the development of phraseological complexity in L2 writing and speech, and (4) identify the best set of complexity measures to adequately capture the dynamics of language development over time. More generally, the research project will contribute to giving phraseology (and the lexis-grammar interface) the place it deserves in theories of language proficiency and language development. It also opens up promising research avenues for linguistic theory and description in other areas such as contrastive linguistics, where linguistic complexity has also traditionally been conceived of as syntactic complexity.

(Open) online education poses a variety of challenges for higher education, one of which is how to foster social interactions and induce beneficial socio-cognitive conflicts (i.e. differences in point of view that are socially experienced and cognitively resolved) to promote learning in an environment where interactions are primarily written and asynchronous.

To address this challenge, we adopt a multidisciplinary perspective that builds on theories from several disciplines in the humanities and social sciences (linguistics, natural language processing, communication sciences, education, and management studies) to analyse social interactions and investigate the presence and unfolding of socio-cognitive conflicts in massive open online courses (MOOCs). MOOCs are a unique environment where people from all over the world – with different professional experience, first language, cultural background, etc. – are invited to discuss disciplinary concepts and/or societal issues that can potentially induce socio-cognitive conflicts and/or controversies. As such, they provide an as yet unexplored but promising empirical ground for multidisciplinary research on online interactions, social learning and conflict regulation. Methodologically, the project also innovates by building on various disciplinary methodological toolkits (from content analysis to natural language processing, corpus linguistics and social media analytics), thus showcasing a unique mixed-methods approach and answering repeated calls for an expansion of MOOC research’s methodological toolbox.

  • Project title: Inter-rater reliability in Learner Corpus Research: Insights from a collaborative study on adverb placement
    Status: Published in the International Journal of Learner Corpus Research
    Collaborators: Tove Larsson, Magali Paquot and Luke Plonsky

In Learner Corpus Research (LCR), a common source of error is the manual coding and annotation of linguistic features. To estimate the amount of error present in a coded data set, coefficients of inter-rater reliability are used. However, despite the importance of reliability and internal consistency for validity and, by extension, for study quality, interpretability and generalizability, it is surprisingly uncommon for studies in the field of LCR to report such reliability coefficients. In this Method Report, we use a recent collaborative research project to illustrate the pertinence of considering inter-rater reliability. In doing so, we hope to initiate methodological discussion on instrument design, piloting and evaluation. We also suggest some ways forward to encourage increased transparency in reporting practices.
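As a concrete illustration of the kind of coefficient at issue, here is a minimal sketch computing Cohen’s kappa for two raters’ codings of the same tokens (the data and labels are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Invented codings of ten adverb tokens by two raters
# (labels for illustration: pre- vs. post-verbal placement).
rater_1 = ["pre", "pre", "post", "pre", "post", "post", "pre", "pre", "post", "pre"]
rater_2 = ["pre", "post", "post", "pre", "post", "pre", "pre", "pre", "post", "pre"]

# Cohen's kappa corrects raw percentage agreement for chance agreement.
print(f"Cohen's kappa: {cohen_kappa_score(rater_1, rater_2):.2f}")
```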

  • Project title: Adverb placement in EFL academic writing: Going beyond syntactic transfer
    Status: Published in the International Journal of Corpus Linguistics
    Collaborators: Tove Larsson, Marcus Callies, Hilde Hasselgård, Natalia Judith Laso, Sanne van Vuuren, Isabel Verdaguer, Magali Paquot

The present study looks at adverb placement in expert writing and in L1 and L2 novice spoken and written production. The extent to which first-language transfer is still present in advanced learners’ written production is also investigated. The study uses data from one expert corpus (LOCRA), two native-speaker student corpora (BAWE and LOCNEC) and two learner corpora (VESPA and LINDSEI). The results highlight the importance of taking register into consideration, as clear distributional differences were found between spoken and written production. In addition, while considerable differences could be noted across L1 backgrounds in the spoken data, factors such as presence/absence of auxiliary, verb type (e.g. intransitive, copular/linking) and lexis were found to be most important for predicting adverb placement in the written data. Only very limited evidence of L1 transfer was found in the learners’ writing, suggesting that advanced learners have largely mastered the distributional preferences of adverbs.

  • Project title: Assessing L2 oral performance with the help of relational co-occurrences
    Collaborators: Vaclav Brezina, Dana Gablasova
    Status: Published – Paquot, M., Gablasova, D., Brezina, V. & Naets, H. (2022). Phraseological complexity in EFL learners’ spoken production across proficiency levels. In Leńko-Szymańska, A. & Götz, S. (eds.). Complexity, Accuracy and Fluency in Learner Corpus Research. Benjamins.

    Presented at AACL2018 and COLING2018

Recent studies have shown that statistical co-occurrences, i.e. co-occurrences extracted and ranked with the help of association measures such as the mutual information (MI) score, can be used to describe EFL learner performance across proficiency levels (e.g. Durrant & Schmitt, 2009; Granger & Bestgen, 2014). Paquot (2017), for example, focused on relational co-occurrences (i.e. where the co-occurring words appear in a specific structural relation such as verb + direct object or adjective + noun) in French EFL learner academic writing and showed that phraseological indices based on the MI are better able to gauge language proficiency than traditional measures of syntactic and lexical complexity.
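For readers unfamiliar with the measure, here is a minimal sketch of how such an MI score is computed from reference corpus frequencies (the counts below are invented, not actual ENCOW or BNC figures):

```python
import math

def mi_score(pair_freq, w1_freq, w2_freq, corpus_size):
    """Pointwise mutual information: log2 of the observed co-occurrence
    frequency over the frequency expected if the words were independent."""
    expected = (w1_freq * w2_freq) / corpus_size
    return math.log2(pair_freq / expected)

# Invented counts for a verb + direct object pair such as (make, decision):
# the higher the score, the more strongly associated the pair.
print(mi_score(pair_freq=1200, w1_freq=400000, w2_freq=90000,
               corpus_size=100000000))
```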

Most studies so far, however, have explored EFL learner use of statistical co-occurrences in upper-intermediate to advanced writing and focused on learner groups representing a limited number of first languages or language families. The main objective of this study is therefore to explore learners’ use of co-occurrences in the Trinity Lancaster Learner Corpus (Gablasova et al. 2017), i.e. roughly 4 million transcribed words from the Trinity College London spoken language exams, and answer the following main research questions:

  • To what extent can relational co-occurrences be used to describe L2 oral performance at different proficiency levels (from B1 to C1/C2)?
  • Does phraseological competence develop in the same way across different learner groups (i.e. Chinese vs. Hindi vs. Spanish speakers of English)?

In this study, we focus on verb + object co-occurrences as these structures have repeatedly been shown to be a major hurdle for English L2 learners (e.g. Nesselhauf, 2005). Co-occurrences were extracted from the learner spoken corpus with the help of regular expressions and evaluated on the basis of MI scores computed from the ENCOW16AX corpus (see Paquot, 2017 for more information on the methodology). Preliminary results suggest that phraseological competence develops (slowly) from B1 to C1/C2 in spoken language but with variability within and across proficiency levels and L1 groups.

In the very near future, we aim to investigate whether, as shown by Paquot (2017) for L2 writing, co-occurrences are also better than measures of lexical diversity and sophistication at discriminating between L2 spoken samples at various proficiency levels in the Trinity Lancaster Learner Corpus.

  • Project title: The role of the reference corpus in studies of EFL learners’ use of statistical collocations
    Collaborator: Hubert Naets (CENTAL)
    Presented at ICAME38, Prague, 24-28 May 2017.

In learner corpus research (LCR), there has been a recent boom in the number of studies that have investigated English as a Foreign Language (EFL) learners’ use of statistical collocations (e.g. Bestgen & Granger, 2014; Granger & Bestgen, 2014; Paquot & Naets, 2015; Paquot, forthcoming a & b). These studies have adopted an approach first put forward by Schmitt and colleagues (e.g. Durrant & Schmitt, 2009) to assess whether and to what extent the word combinations used by learners are ‘native-like’ by assigning to each pair of words in a learner text an association score (typically a pointwise mutual information score and/or a t-score) computed on the basis of a large reference corpus.

The reference corpus differs across studies. Thus, Granger & Bestgen (2014) made use of the British National Corpus (BNC) to evaluate EFL learners’ use of bigrams in the International Corpus of Learner English (Granger et al., 2009); Paquot (forthcoming a & b) extracted statistical collocations from the L2 Research Corpus (L2RC), i.e. a large specialized corpus of research articles in applied linguistics, to assess learners’ use of adjective + noun and verb + object combinations in term papers in linguistics written by French EFL learners sampled from the Varieties of English for Specific Purposes Database (VESPA); and Paquot & Naets (2015) used the web corpus ENCOW14 (http://corporafromtheweb.org/encow14/) to analyze statistical collocations in the Longitudinal Database of Learner English (LONGDALE, Meunier, 2015).
The main objective of this study is to investigate the role of the reference corpus in LCR studies of statistical collocations in learner writing. It is driven by the following research questions:

  • To what extent are results replicable if another reference corpus is used to calculate association scores?
  • Depending on the learner corpus data investigated, should we use a general reference corpus or a specialized corpus to compute association measures?

To answer our research questions, we replicate the method used in Paquot & Naets (2015) and Paquot (forthcoming a & b): we extract relational co-occurrences (i.e. adjective + noun, adverb + adjective, adverb + verb and verb + direct object relations) from dependency parsed versions of the BNC, ENCOW14 and L2RC and compute their mutual information (MI) scores with the Ngram Statistics Package (NSP). We then use MI scores computed on the basis of the three reference corpora to analyze the same relational co-occurrences in learner texts rated at different CEFR levels (i.e. B2, C1, C2) sampled from ICLE and VESPA. We compute mean MI scores for each dependency relation in each learner text (cf. Bestgen & Granger, 2014) and compare their distribution across proficiency levels. Distributions in the CEFR-based learner data sets are tested for normality and accordingly compared with ANOVAs followed by Tukey contrasts or Kruskal-Wallis rank sum tests followed by pairwise comparisons using Wilcoxon rank sum tests.
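The distributional comparison follows a standard decision rule, which a short sketch may make concrete (a hedged illustration with invented variable names; in practice the pairwise tests would also receive a multiplicity correction):

```python
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_levels(mi_by_level):
    """Compare per-text mean MI distributions across CEFR levels:
    ANOVA + Tukey contrasts if all groups look normal, otherwise
    Kruskal-Wallis + pairwise Wilcoxon rank sum tests (a sketch)."""
    groups = list(mi_by_level.values())
    if all(stats.shapiro(g).pvalue > 0.05 for g in groups):
        print(stats.f_oneway(*groups))
        values = [v for g in groups for v in g]
        labels = [lvl for lvl, g in mi_by_level.items() for _ in g]
        print(pairwise_tukeyhsd(values, labels))
    else:
        print(stats.kruskal(*groups))
        levels = list(mi_by_level)
        for i, a in enumerate(levels):          # p-value adjustment omitted
            for b in levels[i + 1:]:
                print(a, b, stats.ranksums(mi_by_level[a], mi_by_level[b]))
```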

Preliminary results confirm previous research by demonstrating that the more advanced learners use more native-like collocations irrespective of the reference corpus. However, MI scores computed on the basis of the three different reference corpora seem to reveal different aspects of phraseological proficiency in learner writing, most notably the use of general vs. genre-specific collocations.

Szmrecsanyi et al. (2016) explored three syntactic alternations (the particle placement, genitive and dative alternations) in four varieties of English (British, Canadian, Indian and Singapore English) as represented in the International Corpus of English and reported that the varieties studied share a core probabilistic grammar, i.e. the choice between syntactic alternations is motivated by probabilistic constraints rather than categorical rules (cf. Bresnan, 2007). However, they also showed that grammatical variation is subject to indigenization “at various degrees of subtlety, depending on the abstractness and the lexical embedding of the syntactic pattern involved” (p. 2), with particle placement alternation exhibiting the most robust variety effects.

The main objective of the case study presented here is to shed some light on whether English as a Foreign Language (EFL) learners share a core probabilistic grammar with users of first and second language varieties of English. The study focuses on particle placement (as this alternation is more likely to exhibit variety effects, cf. Szmrecsanyi et al., to appear) and is driven by the following research questions:

  • What factors influence EFL learners’ particle placement alternations?
  • How do EFL learners’ particle placement preferences compare with those of users of first and second language varieties of English as described in Szmrecsanyi et al. (to appear)?

The study makes use of the French and German L1 components of the Louvain International Database of Spoken English Interlanguage (LINDSEI) (Gilquin et al., 2010) and largely replicates the methods used in Szmrecsanyi et al. (to appear) to identify transitive phrasal verbs and code particle placement alternations in EFL learner speech. Unlike in Szmrecsanyi et al. (to appear), however, identification and annotation of particle placement alternations are done fully manually for two main reasons: (1) tagging learner speech as represented in the LINDSEI proves unreliable and (2) the LINDSEI components are much smaller (about 50,000 words each) than the sub-corpora of the International Corpus of English.
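To make the notion of a probabilistic grammar concrete, here is a hedged sketch of the kind of model typically fitted to such coded alternation data: a logistic regression predicting the chosen variant from a few constraints. The predictors and data are invented for illustration; studies in this tradition use richer mixed-effects models.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented toy data: one row per transitive phrasal verb token.
# split = 1 for "pick the book up", 0 for "pick up the book".
df = pd.DataFrame({
    "split":       [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
    "obj_length":  [1, 4, 2, 5, 1, 3, 6, 2, 3, 2],  # object length in words
    "obj_pronoun": [1, 0, 0, 0, 1, 0, 0, 1, 0, 1],  # pronominal object?
})

# Probabilistic constraints: longer objects tend to favour the joined
# order, pronominal objects the split order.
model = smf.logit("split ~ obj_length + obj_pronoun", data=df).fit()
print(model.summary())
```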

  • Project title: Quantitative research methods and study quality in learner corpus research
    Status: Published
    Collaborator: Luke Plonsky (University College London)
    Presented at: Learner Corpus Research Conference (2015)

Learner corpus research has seen major development since its inception some 25 years ago. Nevertheless, theoretical, methodological and empirical advances have been summarized in the literature only rarely and, in such cases, selectively rather than systematically. To the authors’ knowledge, in fact, there is no meta-analysis to date that summarizes and synthesizes the body of knowledge resulting from learner corpus research in a specific area of study (e.g. English as a Foreign Language learners’ use of collocations or tense, aspect and modality in learner writing). Equally concerning is that relatively little attention has been paid to the state or development of the field’s methodological practices, an unfortunate circumstance given the empirical rigor needed to reliably and accurately make use of corpus data and analyse frequencies of (co-)occurrence (Gries, 2013; Gries, forthcoming; Gries & Deshors, 2014).

Progress in any discipline, however, crucially “depends on sound research methods, principled data analysis, and transparent reporting practices” (Plonsky & Gass, 2011: 325). This study thus aims to provide the first empirical assessment of quantitative research methods and study quality in learner corpus research. Study quality is defined rather broadly as “(a) adherence to standards of contextually appropriate methodological rigor in research practices and (b) transparent and complete reporting of such practices” (Plonsky, 2013: 657). Specifically, we systematically review all quantitative, primary studies referenced in the Learner Corpus Bibliography (LCB), a representative bibliography of learner corpus research maintained by the Learner Corpus Association (http://learnercorpusassociation.org), which currently contains approximately 1180 references.

The techniques used to retrieve, code, and analyze this body of primary research are characteristic of research synthesis and meta-analysis. Following Plonsky (2013), however, this study differs from those traditions of synthetic research in that the focus here is almost exclusively methodological (i.e. the “how” of learner corpus research) rather than substantive (i.e. the “what”). Each reference in the LCB is surveyed using a coding scheme inspired by the protocol developed and first used by Plonsky & Gass (2011) to assess methodological quality in second language acquisition, and more particularly interaction research. The coding scheme is, however, revised and expanded to account for the methodological characteristics of corpus linguistics. Quantitative studies are coded for over 50 categories representing five dimensions: (a) publication type (i.e. conference paper, book chapter, journal article), (b) research focus (e.g. lexis, grammar), (c) methodological features (e.g. Contrastive Interlanguage Analysis, keyword analysis, error analysis, use of reference corpus), (d) statistical analyses (e.g. χ², t-test, regression analysis), and (e) reporting practices (e.g. reliability coefficients, means). The 25-year span of research represented in the LCB provides a unique opportunity to examine the resulting data cumulatively and also permits analyses of changes taking place over time in the research and reporting practices of this domain.

Preliminary results point to several systematic strengths as well as many flaws, such as the absence of research questions or hypotheses, incomplete and inconsistent reporting practices (e.g. means without standard deviations), and low statistical power (i.e. LCR studies generally over-rely on tests of statistical significance such as the χ² test, do not report effect sizes, rarely check or report whether statistical assumptions have been met, and rarely use multivariate analyses). Improvements over time are however clearly noted and there are signs that, like other related disciplines, learner corpus research is slowly “undergoing a change to becoming much more empirical, much more rigorous, and much more quantitative/statistical” (Gries, 2013: 287).
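To illustrate the reporting practice these results call for, here is a minimal sketch that pairs a chi-square test with an effect size, Cramér’s V (the frequency table is invented):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 2x2 frequency table: a feature's presence/absence
# in a learner corpus vs. a native reference corpus.
table = np.array([[120, 880],
                  [60, 940]])

chi2, p, dof, expected = chi2_contingency(table)

# Cramer's V: the effect size that should accompany the significance test.
v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, Cramer's V = {v:.3f}")
```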

In addition to providing direction for future research and research practices, the study’s findings will also be discussed and contextualized within the research cultures of corpus linguistics, second language acquisition, and applied linguistics more generally.

  • Project title: Adopting a relational model of co-occurrences to trace phraseological development
    Status: published
    Collaborator: Hubert Naets (Université catholique de Louvain) 

Learner corpus research has witnessed a boom in the number of studies that investigate learners’ use of multi-word combinations (see Paquot & Granger, 2012 for a recent overview). Several recent studies have adopted an approach first put forward by Schmitt and colleagues (e.g. Durrant & Schmitt, 2009) to assess whether and to what extent the word combinations used by learners are ‘native-like’ by assigning to each pair of words in a learner text an association score computed on the basis of a large reference corpus. Bestgen & Granger (2014), for example, used this procedure to analyse the Michigan State University Corpus of second language writing (MSU) and showed that mean Mutual Information (MI) scores of the bigrams used by L2 writers are positively correlated with human judgment of proficiency. Most studies so far have investigated positional co-occurrences, where words are said to co-occur when they appear within a certain distance from each other (Evert, 2004) and focused more particularly on adjacent word combinations (often in the form of adjective + noun combinations) (e.g. Li & Schmitt 2010, Siyanova & Schmitt 2008). Corpus linguists such as Evert & Krenn (2003), however, have argued strongly for a relational model of co-occurrences, where the co-occurring words appear in a specific structural relation (see also Bartsch, 2004).
Paquot (2014) adopted a relational model of co-occurrences to evaluate whether such co-occurrences are good discriminators of language proficiency. She made use of the Stanford CoreNLP suite of tools to parse the French L1 component of the Varieties of English for Specific Purposes dAtabase (VESPA) and extract dependency relations in the form of triples of a relation between pairs of words such as dobj(win,lottery), i.e. “the direct object of win is lottery” (de Marneffe and Manning, 2013). She then used association measures computed on the basis of a large reference corpus to analyse pairs of words in specific grammatical relations in three VESPA sub-corpora made up of texts rated at different CEFR levels (i.e. B2, C1 and C2). Findings showed that adjective + noun relations discriminated well between B2 and C2 levels; adverbial modifiers separated out B2 texts from the C1 and C2 texts; and verb + direct object relations set C2 texts apart from B2 and C1 texts. These results suggest that, used together, phraseological indices computed on the basis of relational dependencies are able to gauge language proficiency.
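A hedged sketch of this extraction step is given below, using the stanza library as a present-day stand-in for the Stanford CoreNLP pipeline mentioned above; note that current Universal Dependencies models label the relation obj rather than the older dobj.

```python
import stanza  # stand-in here for the Stanford CoreNLP pipeline

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

def verb_object_pairs(text):
    """Extract (verb lemma, object lemma) pairs, i.e. triples such as
    dobj(win, lottery) in the older Stanford dependency notation."""
    doc = nlp(text)
    pairs = []
    for sent in doc.sentences:
        for word in sent.words:
            if word.deprel == "obj":               # UD label for direct objects
                head = sent.words[word.head - 1]   # heads are 1-indexed
                pairs.append((head.lemma, word.lemma))
    return pairs

# e.g. [('win', 'lottery'), ('make', 'decision')]
print(verb_object_pairs("She won the lottery and made a quick decision."))
```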
The main objective of this study is to investigate whether relational co-occurrences also constitute valid indices of phraseological development. To do so, we replicate the method used in Paquot (2014) on data from the Longitudinal Database of Learner English (LONGDALE, Meunier 2013, forthcoming). In the LONGDALE project, the same students are followed over a period of at least three years and data collections are typically organized once per year. The 78 argumentative essays selected for this study were written by 39 French learners of English in Year 1 and Year 3 of their studies at the University of Louvain. Unlike in Year 2, students were requested to write on the same topic in Year 1 and Year 3, which allows us to control for topic, a variable that has been shown to considerably influence learners’ use of word combinations (e.g. Cortes, 2004; Paquot, 2013).

As in Paquot (2014), relational co-occurrences are operationalized in the form of word combinations used in four grammatical relations, i.e. adjective + noun, adverb + adjective, adverb + verb and verb + direct object, and extracted from the learner and reference corpora with the Stanford CoreNLP suite of tools. We then assign to each pair of words in the LONGDALE corpus its MI score computed on the basis of the British National Corpus, and compute mean MI scores for each dependency relation in each learner text (cf. Bestgen & Granger, 2014). Distributions in the two learner data sets (i.e. Year 1 and Year 3) are tested for normality and accordingly compared with ANOVAs followed by Tukey contrasts or Kruskal-Wallis rank sum tests followed by pairwise comparisons using Wilcoxon rank sum tests. To explore the links between individual and group phraseological development trajectories, a detailed variability analysis using the method of individual profiling and visualization techniques will also be presented (cf. Verspoor & Smiskova, 2012).

  • Project title: Towards standardization of metadata for L2 corpora
    Collaborator: Sylviane Granger
    Presented at CLARIN Workshop on interoperability of L2 resources and tools, 6-8 December 2017, Gothenburg, Sweden

Although there have been a large number of learner-corpus-based studies in recent years, the results are often inconclusive and at times seemingly contradictory because the data they are based on are not comparable. This highlights the importance of drawing up a standardized system of metadata for L2 texts. In this research project, we will take stock of a range of current metadata sets and make suggestions for minimal and maximal design principles.

  • Project title: The impact of genre on EFL learner writing: A Multi-Dimensional Analysis Perspective
    Collaborators: Prof. Doug Biber (Northern Arizona University), Tove Larsson
    Status: published – Larsson, T., Paquot, M., & Biber, D. (2021). On the importance of register in learner writing: A multi-dimensional approach. In E. Seoane & D. Biber (Eds.), Corpus based approaches to register variation (pp. 235-258). Amsterdam: Benjamins.
    Presented at: ICAME36, ICAME40

Previous learner corpus-based studies have shown that EFL learner languages exhibit shared linguistic features irrespective of the learners’ first languages. For example, it has repeatedly been reported that EFL learner writing is characterized by a more involved style than the writing of their native peers, as evidenced by a high number of writer/reader (W/R) visibility features such as first and second person pronouns, let’s imperatives, epistemic modal adverbs (e.g. certainly, maybe) and questions (e.g. Petch-Tyson, 1998; Altenberg & Tapper, 1998; Aijmer, 2002; Narita & Sugiura, 2006; Neff et al., 2007; Gilquin & Paquot, 2008; Hasselgård, 2009).

Most of these learner corpus research (LCR) studies, however, have focused almost exclusively on argumentative writing and it is therefore questionable whether their results can be generalized to other genres and ultimately be used to inform English for Academic Purposes (EAP) pedagogical materials (Gilquin et al., 2007). To use the example of involvement features again, as noted by Recski (2004), in the case of argumentative essays such as those contained in the widely used International Corpus of Learner English (ICLE, Granger et al. 2009), “personal references and subjective attitudes are certainly hard to avoid”, since learners are explicitly prompted to give their personal opinions.
This study is part of a larger body of research that seeks to investigate whether the features commonly attributed to EFL learner writing are genuine characteristics of interlanguages or whether they are prompted by the argumentative type of texts that has usually been analysed in LCR. Paquot et al. (2013), for example, compared argumentative texts from the ICLE with discipline-specific texts from the Varieties of English for Specific Purposes dAtabase (VESPA). The VESPA learner corpus project aims at building a large corpus of English for Specific Purposes (ESP) texts written by L2 writers from various mother tongue backgrounds. The corpus currently contains papers and reports produced by BA and MA students in the context of a variety of content courses in linguistics, business, medicine, and engineering (for more details, see http://www.uclouvain.be/en-cecl-vespa.html). Paquot et al. (2013) analysed French and Norwegian learners’ use of a variety of W/R visibility features in ICLE argumentative texts and VESPA papers written for linguistics courses. They showed that, when compared to native speakers’ writing within the same genre and discipline, texts produced by French and Norwegian learners systematically displayed an overuse of W/R visibility features.

We adopt a broader perspective and build on a multi-dimensional analysis of a variety of native and learner corpora to further investigate the impact of genre on EFL learner writing. We make use of Biber’s (1988) linguistic features and dimensions to compare and contrast the same ICLE and VESPA sub-corpora as used in Paquot et al. (2013), two corpora of native student writing (comparable subsets of the Louvain Corpus of Native English Essays (LOCNESS) and of the British Academic Written English (BAWE) corpus), and a 1-million-word corpus of published research articles in linguistics.
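For readers unfamiliar with multi-dimensional analysis, the sketch below shows how a Biber-style dimension score is derived for a single text: feature rates are standardized against corpus-wide norms and summed according to their loadings. The feature names and numbers are illustrative examples of Dimension 1 loadings, not the full 1988 feature set.

```python
def dimension_score(rates, means, sds, positive, negative):
    """Biber-style dimension score for one text: z-score each feature's
    per-1,000-word rate against corpus norms, then sum positively
    loading features and subtract negatively loading ones (a sketch)."""
    z = {f: (rates[f] - means[f]) / sds[f] for f in rates}
    return sum(z[f] for f in positive) - sum(z[f] for f in negative)

# Illustrative Dimension 1 ('involved vs. informational') features only.
score = dimension_score(
    rates={"private_verbs": 28.0, "first_person_pron": 52.0, "nouns": 160.0},
    means={"private_verbs": 18.0, "first_person_pron": 27.0, "nouns": 180.5},
    sds={"private_verbs": 7.0, "first_person_pron": 18.0, "nouns": 35.0},
    positive=["private_verbs", "first_person_pron"],  # involved features
    negative=["nouns"],                               # informational features
)
print(f"Dimension 1 score: {score:.2f}")
```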

Preliminary results suggest that French and Norwegian learners’ argumentative and discipline-specific texts are characterized by higher degrees of involvement (Dimension 1) and persuasiveness (Dimension 4) when set against comparable native speakers’ writing in terms of genre and discipline. They also point to strong L1-based differences (e.g. Norwegian learners’ argumentative and discipline-specific texts are much more involved than French learners’ texts). However, the various corpora used also cluster by genre, irrespective of L1 background, thus suggesting that learners adapt to genre requirements to some extent (cf. Paquot et al., 2013).

The theoretical and pedagogical implications of this study will be discussed. We will also consider its implications in terms of corpus comparability and selection of a reference corpus in learner corpus research.