Modelling score variation in student writing with a big data system: Benefits, challenges and ways forward

Lee McCallum

Research output: Contribution to journal › Article › peer-review



Aim: This theoretically oriented research note has several aims. First, it sets out approaches to studying the relationship between linguistic features and essay grades, and shows how this relationship has influenced approaches to score modelling in the domain of student university writing. Second, and more pressingly, it makes the case for this work to continue, with scholars questioning how they have gone about uncovering this language use and how they have used this knowledge to construct models of writing assessment and/or model what lies behind assessment scores. As part of this questioning, the note introduces mixed-effects modelling as a robust alternative to traditional linear regression techniques. This method allows us to determine how language use influences assessment scores while taking account of individual writer and rater variables, as well as contextual variables such as task and topic. The note encourages this research in the context of a first-year university writing program at a large U.S. university that has set up a “big data” text repository system to give researchers the opportunity to carry out large-scale corpus examinations of language use. The note concludes by outlining some of the key challenges that scholars need to be aware of when undertaking such work.
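
As a rough illustration of the kind of model described above, a crossed random-intercepts specification could be written as follows. This is a generic textbook form, not the note's own model, and every symbol is an assumption introduced here: $y_{ij}$ is the score given to writer $i$ by rater $j$, $\mathbf{x}_{ij}$ collects linguistic-feature measures together with contextual covariates such as task and topic, and $u_i$ and $v_j$ are writer and rater random intercepts.

```latex
\begin{align}
y_{ij} &= \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij} + u_i + v_j + \varepsilon_{ij}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
v_j \sim \mathcal{N}(0, \sigma_v^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2).
\end{align}
```

Under this sketch, a traditional linear regression corresponds to dropping $u_i$ and $v_j$, which treats every score as independent even when the same writer or rater contributes many observations.
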
• Problem Formation: This section explores how previous studies have examined the influence of various assessment variables on the rating process. These studies focus on two interconnected research strands: the role of the writer in completing the assessment task(s), and the role of the rater in the assessment process. Across these strands, there has been a focus on the relationship between the linguistic features that students use to complete assessments and the grade awarded, and on how this relationship may be mediated by other rater and writer characteristics. The review then narrows to analyze the methodological approaches in these studies and sets the scene for introducing and promoting mixed-effects modelling as a viable method for modelling these assessment constructs and relationships.
• Information Collection: Building on the review of studies, this section begins by showcasing work that has used mixed-effects modelling in an attempt to minimize previous studies’ methodological shortcomings. It outlines how such exemplary work takes account of statistical dependency in corpus data sets and highlights the feasibility of using mixed-effects modelling on the “big data” system at the University of South Florida (USF). Several theoretical and empirical points are made about practicality and the caveats involved in working with big data systems.
• Conclusions: Mixed-effects modelling appears to offer a reliable method that First-Year Composition (FYC) researchers can use to study the numerous course and learner variables that influence multiple outcome variables in FYC programs. When applied to the “big data” system at USF, the method appears to offer a robust and more accurate estimation of the relationship between student writers’ language use, grades, and mediating course and learner variables. However, both the method and the treatment of the data contained in such a system need to be approached cautiously, as the effects of sample size discrepancies across variable levels and the issue of missing data need further exploration.
• Directions for Future Research: Although the use of mixed-effects modelling is warranted on both theoretical and empirical grounds, researchers need to take this work forward by also asking questions of the data structure and the variables contained within such data warehouse systems. Future research needs to examine how unbalanced big data systems can be used and how uneven data collection can influence the use of mixed-effects modelling.
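
The statistical dependency mentioned in the abstract can be made concrete with a small sketch. The Python below simulates essay scores clustered by writer and estimates the intraclass correlation (ICC) with the standard one-way ANOVA estimator, using an adjusted group size so that unbalanced groups are handled. Everything here is an assumption for illustration: the function names `simulate` and `icc_oneway`, the score scale, and the variance values are hypothetical, the data are simulated rather than drawn from the USF system, and only writer clustering is modelled (rater, task and topic are left out).

```python
import random
from statistics import mean


def simulate(n_writers=200, essays_per_writer=4,
             sd_writer=1.0, sd_resid=1.0, seed=0):
    """Simulated scores: a shared writer effect plus essay-level noise."""
    rng = random.Random(seed)
    data = {}
    for writer in range(n_writers):
        writer_effect = rng.gauss(0.0, sd_writer)  # same for all of a writer's essays
        data[writer] = [70.0 + writer_effect + rng.gauss(0.0, sd_resid)
                        for _ in range(essays_per_writer)]
    return data


def icc_oneway(scores_by_writer):
    """One-way ANOVA estimate of the intraclass correlation, ICC(1).

    Values near 0 suggest scores from the same writer are close to
    independent; values near 1 suggest strong within-writer dependency.
    """
    groups = [g for g in scores_by_writer.values() if g]
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least two groups and some replication")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Adjusted group size; equals the common size when groups are balanced.
    k0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (n_groups - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)


if __name__ == "__main__":
    print(f"Estimated ICC: {icc_oneway(simulate()):.2f}")
```

With equal writer and residual variances, as in the defaults above, the true ICC is 0.5, so roughly half of the score variance sits between writers. An ordinary regression that ignores this clustering would treat the essays as independent observations, which is the shortcoming that random effects for writers (and, in a fuller model, raters) are meant to address.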
Original language: English
Pages (from-to): 286-311
Number of pages: 26
Journal: Journal of Writing Analytics
Publication status: Published - 2019
Externally published: Yes

Bibliographical note

Licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 United States License.


  • first-year writing
  • linguistic features
  • mixed-effects models
  • writing analytics

