This paper explores some of the challenges in working with archive material to produce language corpora. It takes as a case study the British Telecom Correspondence Corpus (BTCC), which contains a selection of the letters held in the BT Archives, housed in Holborn Telephone Exchange. One of the essential differences between a corpus and an archive is that a corpus is intended to be representative of a language variety. Material makes its way into historical archives in a variety of ways, and whilst archives may preserve a breadth of material, they are not generally collected to be representative, nor are they primarily designed to facilitate linguistic investigation. Work on the BTCC began as part of a Jisc-funded project to digitise the BT Archives and create a ‘research resource for the higher education sector’ (Hay, 2014:12). The BT Digital Archives became available to the public in July 2013. Our experiences using this resource inform the second half of the paper, in particular regarding the identification of corpus material and the difficulty of identifying letters at an item level. This leads to a wider discussion of how best to digitise physical archives.
|Title of host publication||Proceedings of the Digital Humanities Congress 2014|
|Editors||Clare Mills, Michael Pidd, Jessica Williams|
|Place of Publication||Sheffield|
|Publisher||HRI Online Publications|
|Publication status||Published - 2016|
|Event||Digital Humanities Congress 2014 - Sheffield University, Sheffield, United Kingdom|
|Duration||4 Sep 2014 → 6 Sep 2014|
Bibliographical note: The full text is also available from http://www.hrionline.ac.uk/openbook/chapter/dhc2014-morton
This is an open access publication with a Creative Commons Attribution-NoDerivatives 4.0 International License.