Here is a list of my most recent (collaborative) research projects. For past projects, please check the website of the Centre for English Corpus Linguistics.
- Project title: Towards standardization of metadata for L2 corpora
Collaborator: Sylviane Granger
To be presented at CLARIN Workshop on interoperability of L2 resources and tools, 6-8 December 2017, Gothenburg, Sweden
Although there have been a large number of learner-corpus-based studies in recent years, the results are often inconclusive and at times seemingly contradictory because the data they are based on are not comparable. This highlights the importance of drawing up a standardized system of metadata for L2 texts. In this research project, we will take stock of a range of current metadata sets and make suggestions for minimal and maximal design principles.
- Project title: Phraseological complexity: a promising construct for second language research
Research in corpus linguistics, psycholinguistics and cognitive linguistics has provided convergent evidence that lexis and grammar are inextricably intertwined and that word combinations play crucial roles in language acquisition, processing, fluency, idiomaticity and change. Second language (L2) research has been relatively slow to follow suit. Phraseology, formulaic language and constructions are now at the forefront of debates in foreign language learning and teaching but not all domains of L2 research have navigated the transition. One particular field that has remained impervious to current developments is interlanguage complexity: linguistic complexity has traditionally been narrowed down to syntactic complexity and lexical complexity is still very much regarded as its poor relation.
The overarching aim of the project is to define and circumscribe the linguistic construct of phraseological complexity within the framework of usage-based theories of language, and to theoretically and empirically demonstrate its relevance for L2 complexity research, and more generally for theories of L2 use and development. The project centres around four main objectives: (1) determine the dimensions of phraseological complexity, (2) establish the construct validity of new phraseological complexity measures automatically calculated using NLP techniques and corpus data, (3) chart the development of phraseological complexity in L2 writing and speech, and (4) identify the best set of complexity measures to adequately capture the dynamics of language development over time. More generally, the research project will contribute to giving phraseology (and the lexis-grammar interface) the place it deserves in theories of language proficiency and language development. It also opens up promising research avenues for linguistic theory and description in other areas such as contrastive linguistics, where linguistic complexity has also often been traditionally conceived of as syntactic complexity.
- Project title: The role of the reference corpus in studies of EFL learners’ use of statistical collocations
Collaborator: Hubert Naets (CENTAL)
Presented at ICAME38, Prague, 24-28 May 2017.
In learner corpus research (LCR), there has been a recent boom in the number of studies that have investigated English as a Foreign Language (EFL) learners’ use of statistical collocations (e.g. Bestgen & Granger, 2014; Granger & Bestgen, 2014; Paquot & Naets, 2015; Paquot, forthcoming a & b). These studies have adopted an approach first put forward by Schmitt and colleagues (e.g. Durrant & Schmitt, 2009) to assess whether and to what extent the word combinations used by learners are ‘native-like’ by assigning to each pair of words in a learner text an association score (typically a pointwise mutual information score and/or a t-score) computed on the basis of a large reference corpus.
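As an illustration of the two association scores mentioned above, the following minimal sketch computes pointwise mutual information and a t-score from raw reference-corpus frequencies. It uses the standard textbook definitions (observed vs. expected co-occurrence frequency); the function names and the toy counts are hypothetical and are not taken from any of the studies cited.

```python
import math


def expected_frequency(freq_w1, freq_w2, corpus_size):
    """Co-occurrence frequency expected if the two words were independent."""
    return freq_w1 * freq_w2 / corpus_size


def pmi(pair_freq, freq_w1, freq_w2, corpus_size):
    """Pointwise mutual information (base 2): log2(observed / expected)."""
    expected = expected_frequency(freq_w1, freq_w2, corpus_size)
    return math.log2(pair_freq / expected)


def t_score(pair_freq, freq_w1, freq_w2, corpus_size):
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = expected_frequency(freq_w1, freq_w2, corpus_size)
    return (pair_freq - expected) / math.sqrt(pair_freq)


if __name__ == "__main__":
    # Toy counts (assumed, for illustration only): a pair seen 8 times,
    # each word seen 100 times, in a reference corpus of 10,000 pairs.
    print(pmi(8, 100, 100, 10_000))
    print(t_score(8, 100, 100, 10_000))
```

MI rewards exclusive pairings of relatively rare words, whereas the t-score favours frequent combinations, which is one reason why studies often report both.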
The reference corpus differs across studies. Thus, Granger & Bestgen (2014) made use of the British National Corpus (BNC) to evaluate EFL learners’ use of bigrams in the International Corpus of Learner English (Granger et al., 2009); Paquot (forthcoming a & b) extracted statistical collocations from the L2 Research Corpus (L2RC), i.e. a large specialized corpus of research articles in applied linguistics, to assess learners’ use of adjective + noun and verb + object combinations in term papers in linguistics written by French EFL learners sampled from the Varieties of English for Specific Purposes Database (VESPA); and Paquot & Naets (2015) used the web corpus ENCOW14 (http://corporafromtheweb.org/encow14/) to analyze statistical collocations in the Longitudinal Database of Learner English (LONGDALE, Meunier, 2015).
The main objective of this study is to investigate the role of the reference corpus in LCR studies of statistical collocations in learner writing. It is driven by the following research questions:
- To what extent are results replicable if another reference corpus is used to calculate association scores?
- Depending on the learner corpus data investigated, should we use a general reference corpus or a specialized corpus to compute association measures?
To answer our research questions, we replicate the method used in Paquot & Naets (2015) and Paquot (forthcoming a & b): we extract relational co-occurrences (i.e. adjective + noun, adverb + adjective, adverb + verb and verb + direct object relations) from dependency parsed versions of the BNC, ENCOW14 and L2RC and compute their mutual information (MI) scores with the Ngram Statistics Package (NSP). We then use MI scores computed on the basis of the three reference corpora to analyze the same relational co-occurrences in learner texts rated at different CEFR levels (i.e. B2, C1, C2) sampled from ICLE and VESPA. We compute mean MI scores for each dependency relation in each learner text (cf. Bestgen & Granger, 2014) and compare their distribution across proficiency levels. Distributions in the CEFR-based learner data sets are tested for normality and accordingly compared with ANOVAs followed by Tukey contrasts or Kruskal-Wallis rank sum tests followed by pairwise comparisons using Wilcoxon rank sum tests.
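The step of computing mean MI scores for each dependency relation in each learner text can be sketched as follows. This is a simplified illustration, not the pipeline actually used in the study: the function name, the relation labels and the toy scores are hypothetical, and the decision to skip pairs that are absent from the reference corpus is an assumption made here for the sake of the example.

```python
from collections import defaultdict
from statistics import mean


def mean_mi_by_relation(text_triples, reference_mi):
    """Average the reference-corpus MI scores of one learner text, per relation.

    text_triples: iterable of (relation, head, dependent) found in the text.
    reference_mi: dict mapping such triples to MI scores from a reference corpus.
    Triples not attested in the reference corpus are skipped (an assumption).
    """
    scores = defaultdict(list)
    for triple in text_triples:
        if triple in reference_mi:
            relation = triple[0]
            scores[relation].append(reference_mi[triple])
    return {relation: mean(values) for relation, values in scores.items()}


if __name__ == "__main__":
    # Toy MI scores (invented for illustration only).
    reference = {
        ("amod", "role", "crucial"): 6.0,
        ("amod", "house", "big"): 2.0,
        ("dobj", "win", "lottery"): 8.0,
    }
    learner_text = [
        ("amod", "role", "crucial"),
        ("amod", "house", "big"),
        ("dobj", "win", "lottery"),
        ("dobj", "eat", "idea"),  # unattested in the toy reference data
    ]
    print(mean_mi_by_relation(learner_text, reference))
```

Running the same function with MI scores derived from different reference corpora yields one set of per-text means per corpus, which can then be compared across proficiency levels.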
Preliminary results confirm previous research by demonstrating that the more advanced learners use more native-like collocations irrespective of the reference corpus. However, MI scores computed on the basis of the three different reference corpora seem to reveal different aspects of phraseological proficiency in learner writing, most notably the use of general vs. genre-specific collocations.
- Project title: Particle placement alternation in EFL learner speech: core probabilistic grammar and/or EFL-specific preferences?
Collaborators: Benedikt Szmrecsanyi’s team – ‘Exploring probabilistic grammar(s) in varieties of English around the world’ project
Presented at: ‘Probabilistic variation across dialects and varieties’ workshop, 4-5 April 2016, Leuven, Belgium. (invited)
Szmrecsanyi et al. (2016) explored three syntactic alternations (the particle placement, genitive and dative alternations) in four varieties of English (British, Canadian, Indian and Singapore English) as represented in the International Corpus of English and reported that the varieties studied share a core probabilistic grammar, i.e. the choice between syntactic alternations is motivated by probabilistic constraints rather than categorical rules (cf. Bresnan, 2007). However, they also showed that grammatical variation is subject to indigenization “at various degrees of subtlety, depending on the abstractness and the lexical embedding of the syntactic pattern involved” (p. 2), with particle placement alternation exhibiting the most robust variety effects.
The main objective of the case study presented here is to shed some light on whether English as a Foreign Language (EFL) learners share a core probabilistic grammar with users of first and second language varieties of English. The study focuses on particle placement (as this alternation is more likely to exhibit variety effects, cf. Szmrecsanyi et al., to appear) and is driven by the following research questions:
- What factors influence EFL learners’ particle placement alternations?
- How do EFL learners’ particle placement preferences compare with those of users of first and second language varieties of English as described in Szmrecsanyi et al. (to appear)?
The study makes use of the French and German L1 components of the Louvain International Database of Spoken English Interlanguage (LINDSEI) (Gilquin et al., 2010) and largely replicates the methods used in Szmrecsanyi et al. (to appear) to identify transitive phrasal verbs and code particle placement alternations in EFL learner speech. Unlike in Szmrecsanyi et al. (to appear), however, identification and annotation of particle placement alternations are done fully manually for two main reasons: (1) tagging learner speech as represented in the LINDSEI proves unreliable and (2) the LINDSEI components are much smaller (about 50,000 words each) than the sub-corpora of the International Corpus of English.
- Project title: Quantitative research methods and study quality in learner corpus research
Collaborator: Luke Plonsky (University College London)
Presented at: Learner Corpus Research Conference (2015)
Learner corpus research has seen major development since its inception some 25 years ago. Nevertheless, theoretical, methodological and empirical advances have been summarized in the literature only rarely and, in such cases, selectively rather than systematically. To the authors’ knowledge, in fact, there is no meta-analysis to date that summarizes and synthesizes the body of knowledge resulting from learner corpus research in a specific area of study (e.g. English as a Foreign Language learners’ use of collocations or tense, aspect and modality in learner writing). Equally concerning is that relatively little attention has been paid to the state or development of the field’s methodological practices, an unfortunate circumstance given the empirical rigor needed to reliably and accurately make use of corpus data and analyse frequencies of (co-)occurrence (Gries, 2013; Gries, forthcoming; Gries & Deshors, 2014).
Progress in any discipline, however, crucially “depends on sound research methods, principled data analysis, and transparent reporting practices” (Plonsky & Gass, 2011: 325). This study thus aims to provide the first empirical assessment of quantitative research methods and study quality in learner corpus research. Study quality is defined rather broadly as “(a) adherence to standards of contextually appropriate methodological rigor in research practices and (b) transparent and complete reporting of such practices” (Plonsky, 2013: 657). Specifically, we systematically review all quantitative, primary studies referenced in the Learner Corpus Bibliography (LCB), a representative bibliography of learner corpus research maintained by the Learner Corpus Association (http://learnercorpusassociation.org) which currently contains approximately 1180 references.
The techniques used to retrieve, code, and analyze this body of primary research are characteristic of research synthesis and meta-analysis. Following Plonsky (2013), however, this study differs from those traditions of synthetic research in that the focus here is almost exclusively methodological (i.e. the “how” of learner corpus research) rather than substantive (i.e. the “what”). Each reference in the LCB is surveyed using a coding scheme inspired by the protocol developed and first used by Plonsky & Gass (2011) to assess methodological quality in second language acquisition, and more particularly interaction research. The coding scheme is however revised and expanded to account for the methodological characteristics of corpus linguistics. Quantitative studies are coded for over 50 categories representing six dimensions: (a) publication type (i.e. conference paper, book chapter, journal article), (b) research focus (e.g. lexis, grammar), (c) methodological features (e.g. Contrastive Interlanguage Analysis, keyword analysis, error analysis, use of reference corpus), (d) statistical analyses (e.g. X², t-test, regression analysis), and (e) reporting practices (e.g. reliability coefficients, means). The 25-year span of research represented in the LCB provides a unique opportunity to examine the resulting data cumulatively and also permits analyses of changes taking place over time in the research and reporting practices of this domain.
Preliminary results point to several systematic strengths as well as many flaws, such as the absence of research questions or hypotheses, incomplete and inconsistent reporting practices (e.g. means without standard deviations), and low statistical power (i.e. LCR studies generally over-rely on tests of statistical significance such as the X² test, do not report effect sizes, rarely check or report whether statistical assumptions have been met, rarely use multivariate analyses). Improvements over time are however clearly noted and there are signs that, like other related disciplines, learner corpus research is slowly “undergoing a change to becoming much more empirical, much more rigorous, and much more quantitative/statistical” (Gries, 2013: 287).
In addition to providing direction for future research and research practices, the study’s findings will also be discussed and contextualized within the research cultures of corpus linguistics, second language acquisition, and applied linguistics more generally.
- Project title: Adopting a relational model of co-occurrences to trace phraseological development
Status: Manuscript in preparation
Collaborator: Hubert Naets (Université catholique de Louvain)
Presented at: Learner Corpus Research Conference (2015)
Learner corpus research has witnessed a boom in the number of studies that investigate learners’ use of multi-word combinations (see Paquot & Granger, 2012 for a recent overview). Several recent studies have adopted an approach first put forward by Schmitt and colleagues (e.g. Durrant & Schmitt, 2009) to assess whether and to what extent the word combinations used by learners are ‘native-like’ by assigning to each pair of words in a learner text an association score computed on the basis of a large reference corpus. Bestgen & Granger (2014), for example, used this procedure to analyse the Michigan State University Corpus of second language writing (MSU) and showed that mean Mutual Information (MI) scores of the bigrams used by L2 writers are positively correlated with human judgment of proficiency. Most studies so far have investigated positional co-occurrences, where words are said to co-occur when they appear within a certain distance from each other (Evert, 2004) and focused more particularly on adjacent word combinations (often in the form of adjective + noun combinations) (e.g. Li & Schmitt 2010, Siyanova & Schmitt 2008). Corpus linguists such as Evert & Krenn (2003), however, have argued strongly for a relational model of co-occurrences, where the co-occurring words appear in a specific structural relation (see also Bartsch, 2004).
Paquot (2014) adopted a relational model of co-occurrences to evaluate whether such co-occurrences are good discriminators of language proficiency. She made use of the Stanford CoreNLP suite of tools to parse the French L1 component of the Varieties of English for Specific Purposes dAtabase (VESPA) and extract dependency relations in the form of triples of a relation between pairs of words such as dobj(win,lottery), i.e. “the direct object of win is lottery” (de Marneffe and Manning, 2013). She then used association measures computed on the basis of a large reference corpus to analyse pairs of words in specific grammatical relations in three VESPA sub-corpora made up of texts rated at different CEFR levels (i.e. B2, C1 and C2). Findings showed that adjective + noun relations discriminated well between B2 and C2 levels; adverbial modifiers separated out B2 texts from the C1 and C2 texts; and verb + direct object relations set C2 texts apart from B2 and C1 texts. These results suggest that, used together, phraseological indices computed on the basis of relational dependencies are able to gauge language proficiency.
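To make the triple notation concrete, the sketch below parses strings of the form dobj(win,lottery) into a relation, a head and a dependent. It only handles the simplified notation used in the example above; actual parser output typically carries additional information such as token indices, which this illustration ignores. The function name and the regular expression are hypothetical and are not part of the Stanford CoreNLP tools.

```python
import re

# relation(head,dependent), with optional whitespace after the comma.
TRIPLE_PATTERN = re.compile(r"^(\w+)\(([^,()]+),\s*([^,()]+)\)$")


def parse_triple(text):
    """Split a string such as 'dobj(win,lottery)' into (relation, head, dependent)."""
    match = TRIPLE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a dependency triple: {text!r}")
    relation, head, dependent = match.groups()
    return relation, head.strip(), dependent.strip()


if __name__ == "__main__":
    relation, head, dependent = parse_triple("dobj(win,lottery)")
    # Reads as: "the direct object of win is lottery".
    print(relation, head, dependent)
```

Keeping the relation label alongside the word pair is what allows association scores to be analysed separately for each grammatical relation.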
The main objective of this study is to investigate whether relational co-occurrences also constitute valid indices of phraseological development. To do so, we replicate the method used in Paquot (2014) on data from the Longitudinal Database of Learner English (LONGDALE, Meunier 2013, forthcoming). In the LONGDALE project, the same students are followed over a period of at least three years and data collections are typically organized once per year. The 78 argumentative essays selected for this study were written by 39 French learners of English in Year 1 and Year 3 of their studies at the University of Louvain. Unlike in Year 2, students were requested to write on the same topic in Year 1 and Year 3, which allows us to control for topic, a variable that has been shown to considerably influence learners’ use of word combinations (e.g. Cortes, 2004; Paquot, 2013).
As in Paquot (2014), relational co-occurrences are operationalized in the form of word combinations used in four grammatical relations, i.e. adjective + noun, adverb + adjective, adverb + verb and verb + direct object, and extracted from the learner and reference corpora with the Stanford CoreNLP suite of tools. We then assign to each pair of words in the LONGDALE corpus its MI score computed on the basis of the British National Corpus, and compute mean MI scores for each dependency relation in each learner text (cf. Bestgen & Granger, 2014). Distributions in the two learner data sets (i.e. Year 1 and Year 3) are tested for normality and accordingly compared with ANOVAs followed by Tukey contrasts or Kruskal-Wallis rank sum tests followed by pairwise comparisons using Wilcoxon rank sum tests. To explore the links between individual and group phraseological development trajectories, a detailed variability analysis using the method of individual profiling and visualization techniques will also be presented (cf. Verspoor & Smiskova, 2012).
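The contrast between group and individual trajectories can be illustrated with a minimal sketch that computes each learner's change in mean MI score between Year 1 and Year 3 alongside the group-level change. This is only a schematic stand-in for the variability analysis mentioned above: the function names, learner identifiers and scores are invented for illustration.

```python
from statistics import mean


def individual_changes(year1, year3):
    """Per-learner change in mean MI score (Year 3 minus Year 1).

    year1, year3: dicts mapping a learner identifier to a mean MI score.
    Only learners present in both data collections are compared.
    """
    return {
        learner: year3[learner] - year1[learner]
        for learner in year1
        if learner in year3
    }


def group_change(year1, year3):
    """Mean of the individual changes, i.e. the group-level trajectory."""
    return mean(individual_changes(year1, year3).values())


if __name__ == "__main__":
    # Invented scores for two hypothetical learners.
    year1 = {"L01": 2.0, "L02": 3.0}
    year3 = {"L01": 2.5, "L02": 2.5}
    print(individual_changes(year1, year3))
    print(group_change(year1, year3))
```

In this toy case one learner improves and the other declines by the same amount, so the group mean shows no change at all, which is precisely the kind of pattern that individual profiling is meant to bring to light.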
- Project title: The impact of genre on EFL learner writing: A Multi-Dimensional Analysis Perspective
Status: Manuscript in preparation
Collaborator: Prof. Doug Biber (Northern Arizona University)
Presented at: ICAME2015
Previous learner corpus-based studies have shown that EFL learner languages exhibit shared linguistic features irrespective of the learners’ first languages. For example, it has repeatedly been reported that EFL learner writing is characterized by a more involved style than the writing of their native peers, as evidenced by a high number of writer/reader (W/R) visibility features such as first and second person pronouns, let’s imperatives, epistemic modal adverbs (e.g. certainly, maybe) and questions (cf. e.g. Petch-Tyson 1998, Altenberg and Tapper 1998; Aijmer, 2002; Narita & Sugiura, 2006; Neff et al. 2007; Gilquin & Paquot, 2008; Hasselgård, 2009).
Most of these learner corpus research (LCR) studies, however, have focused almost exclusively on argumentative writing and it is therefore questionable whether their results can be generalized to other genres and ultimately be used to inform English for Academic Purposes (EAP) pedagogical materials (Gilquin et al., 2007). To use the example of involvement features again, as noted by Recski (2004), in the case of argumentative essays such as those contained in the widely used International Corpus of Learner English (ICLE, Granger et al. 2009), “personal references and subjective attitudes are certainly hard to avoid”, since learners are explicitly prompted to give their personal opinions.
This study is part of a larger body of research that seeks to investigate whether the features commonly attributed to EFL learner writing are genuine characteristics of interlanguages or whether they are prompted by the argumentative type of texts that has usually been analysed in LCR. Paquot et al. (2013), for example, compared argumentative texts from the ICLE with discipline-specific texts from the Varieties of English for Specific Purposes dAtabase (VESPA). The VESPA learner corpus project aims at building a large corpus of English for Specific Purposes (ESP) texts written by L2 writers from various mother tongue backgrounds. The corpus currently contains papers and reports produced by BA and MA students in the context of a variety of content courses in linguistics, business, medicine, and engineering (for more details, see http://www.uclouvain.be/en-cecl-vespa.html). Paquot et al. (2013) analysed French and Norwegian learners’ use of a variety of W/R visibility features in ICLE argumentative texts and VESPA papers written for linguistics courses. They showed that, when compared to native speakers’ writing within the same genre and discipline, texts produced by French and Norwegian learners systematically displayed an overuse of W/R visibility features.
We adopt a broader perspective and build on a multi-dimensional analysis of a variety of native and learner corpora to further investigate the impact of genre on EFL learner writing. We make use of Biber’s (1988) linguistic features and dimensions to compare and contrast the same ICLE and VESPA sub-corpora as used in Paquot et al. (2013) as well as two corpora of student native writing (comparable subsets of the Louvain Corpus of Native English Essays (LOCNESS) and the British Academic Written English (BAWE) corpus) and a 1-million-word corpus of published research articles in linguistics.
Preliminary results suggest that French and Norwegian learners’ argumentative and discipline-specific texts are characterized by higher degrees of involvement (Dimension 1) and persuasiveness (Dimension 4) when set against comparable native speakers’ writing in terms of genre and discipline. They also point to strong L1-based differences (e.g. Norwegian learners’ argumentative and discipline-specific texts are much more involved than French learners’ texts). However, the various corpora used also cluster by genre, irrespective of L1 background, thus suggesting that learners adapt to genre requirements to some extent (cf. Paquot et al., 2013).
The theoretical and pedagogical implications of this study will be discussed. We will also consider its implications in terms of corpus comparability and selection of a reference corpus in learner corpus research.