Information density in a corpus of university student writing

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding › peer-review


This paper provides an overview of the British Academic Written English (BAWE) corpus, and reports on two multidimensional analyses of the corpus, focussing on the findings relating to informational production and density. The corpus contains about 6.5 million words of proficient university student writing collected from British universities in the first decade of the 21st century, categorised in terms of ‘genre families’ and distributed fairly equally across levels of study and disciplinary groupings. BAWE has been named as a major data source in more than 80 publications, and has been examined from many linguistic perspectives, for example in terms of tense, modality and lexical cohesion.
The techniques of multidimensional analysis (MDA) complement studies of individual corpus aspects, because they permit multiple aspects to be examined simultaneously and enable mapping of their distribution across different datasets. In the case of BAWE, we can use MDA to compare linguistic features across levels of study, disciplines and genre families. The first BAWE MDA study was made with reference to the dimensions identified by Biber (1988) when comparing spoken and written registers. The second, BAWE2016, identified new BAWE-specific dimensions and is a more delicate characterisation of the writing produced by British university students.
Biber’s Dimension 1 (1988), ‘Involved versus Informational Production’, contrasts verbal and nominal styles: more ‘informational’ texts have negative scores on this dimension, with lower frequencies of present tense verbs, private verbs, the pro-verb DO, contractions, and 1st and 2nd person pronouns, and greater frequencies of longer words, nouns, attributive adjectives and prepositions. Scores on this dimension were entirely negative across all subsets of BAWE, and became increasingly so in the more advanced levels of study, reaching equivalence with published academic prose. This might indicate progression towards a more ‘academic’ writing style. The BAWE2016 Dimension 4, ‘Informational Density’, gave high scores to texts containing more noun groups and fewer verb groups, more nominalisations of verbs and adjectives, and a greater number of abstract nouns and long words. Again, texts at higher levels of study tended to cluster at the informational end of this dimension.
Informationally dense texts are associated with writing rather than spontaneous speech, because they require more pre-planning and more attentive decoding on the part of the reader. The information load tends to be heavier in written texts because they are permanent and relatively context free; they can be edited and revised, and they do not have to be immediately understood at the time of production. However, although informational density may be equated with academic maturity, the register is not suitable for all the purposes of university student writing. Texts ostensibly written for a non-expert readership, and/or describing human actions rather than abstractions, are likely to be more successful if their information load is lighter. This helps to explain why, in both analyses, informational density was most strongly associated with the Social Sciences in contrast to Arts and Humanities disciplines, and with research-oriented genre families rather than reflective writing and writing oriented towards non-expert readers.
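The dimension scores discussed above can be illustrated with a minimal sketch. It assumes the usual Biber-style procedure: each feature's normalised frequency is standardised across the texts, then the standardised values of the features on the positive pole are summed and those on the negative pole are subtracted. The feature names, the frequencies and the function names are hypothetical, chosen only for illustration. The sketch standardises within the supplied texts rather than against a reference corpus, and it is not the procedure or data used in either BAWE analysis.

```python
from statistics import mean, pstdev

# Hypothetical feature sets, loosely following the two poles of
# 'Involved versus Informational Production' described in the abstract.
INVOLVED = ["private_verbs", "contractions", "first_second_person_pronouns"]
INFORMATIONAL = ["nouns", "attributive_adjectives", "prepositions"]


def z_scores(values):
    """Standardise a list of frequencies to mean 0 and unit deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A feature that does not vary carries no information here.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def dimension_scores(texts, positive, negative):
    """Return one score per text: sum of standardised positive-pole
    features minus sum of standardised negative-pole features.

    texts maps a text name to a dict of feature -> rate per 1,000 words.
    """
    names = list(texts)
    standardised = {}
    for feature in positive + negative:
        zs = z_scores([texts[name][feature] for name in names])
        standardised[feature] = dict(zip(names, zs))
    return {
        name: sum(standardised[f][name] for f in positive)
        - sum(standardised[f][name] for f in negative)
        for name in names
    }


# Made-up rates per 1,000 words, for illustration only.
sample_texts = {
    "reflective_piece": {
        "private_verbs": 22.0, "contractions": 6.0,
        "first_second_person_pronouns": 40.0,
        "nouns": 210.0, "attributive_adjectives": 55.0, "prepositions": 100.0,
    },
    "research_report": {
        "private_verbs": 8.0, "contractions": 0.0,
        "first_second_person_pronouns": 5.0,
        "nouns": 300.0, "attributive_adjectives": 85.0, "prepositions": 140.0,
    },
}

scores = dimension_scores(sample_texts, INVOLVED, INFORMATIONAL)
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

With this sign convention, a more negative score marks a more 'informational' text, matching the direction reported for Dimension 1.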
Original language: English
Title of host publication: Proceedings of the International Conference CORPUS LINGUISTICS 2017
Place of Publication: St Petersburg
Publisher: St. Petersburg State University
Number of pages: 6
Publication status: Published - 27 Jun 2017
Event: International Scientific Conference Corpus Linguistics 2017 - St Petersburg, Russian Federation
Duration: 27 Jul 2017 → 30 Jul 2017


Conference: International Scientific Conference Corpus Linguistics 2017
Country/Territory: Russian Federation
City: St Petersburg