The University of Sheffield

BLN600: A Parallel Corpus of Machine/Human Transcribed Nineteenth Century Newspaper Texts

BLN600

Human transcriptions of 600 images selected from the British Library Newspapers parts 1 & 2 dataset.

To discourage automated web-scraping for AI training purposes, BLN600 is released as a password-protected ZIP archive. The password is BLN600.
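As a minimal sketch of unpacking the archive programmatically, the snippet below uses Python's standard `zipfile` module. The function name `extract_bln600` and the archive and destination paths are hypothetical. It also assumes the archive uses legacy ZIP encryption, which `zipfile` can decrypt; if the archive is AES-encrypted, a tool such as 7-Zip would be needed instead.

```python
import zipfile
from pathlib import Path


def extract_bln600(archive_path, dest_dir, password="BLN600"):
    """Extract a password-protected ZIP archive into dest_dir.

    Returns the list of member names found in the archive.
    Assumption: the archive uses legacy ZIP encryption, the only
    kind that the standard-library zipfile module can decrypt.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        # zipfile expects the password as bytes, not str.
        zf.extractall(path=dest, pwd=password.encode("utf-8"))
        return zf.namelist()
```

A call might look like `extract_bln600("BLN600.zip", "bln600")`, where both paths are placeholders for wherever the download was saved.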

Directory Layout

  • Images/ - cropped images downloaded from the BLN platform in mixed JPG and TIFF format - indexed by GALE document ID
  • Ground Truth/ - human transcriptions of the images - indexed by GALE document ID to match up with images
  • OCR Text/ - GALE's OCR transcriptions of the images - indexed by GALE document ID to match up with images
  • metadata.json - structured document data linking document ID with publication information, article count, and non-crime counts
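Because the three folders share document IDs, the files for one document can be lined up by name. The sketch below assumes that each file's name, minus its extension, is the document ID; the exact naming and extensions are not stated here, so treat that as an assumption. The names `SUBDIRS`, `index_corpus`, and `complete_triples` are hypothetical helpers.

```python
from pathlib import Path

# Folder names as given in the directory layout; the dictionary keys are our own labels.
SUBDIRS = {"image": "Images", "ground_truth": "Ground Truth", "ocr": "OCR Text"}


def index_corpus(root):
    """Map each document ID to the files found for it in each folder.

    Assumption: the file stem (name without extension) is the document ID.
    """
    index = {}
    for key, subdir in SUBDIRS.items():
        folder = Path(root) / subdir
        if not folder.is_dir():
            continue  # tolerate a partially extracted archive
        for path in sorted(folder.iterdir()):
            if path.is_file():
                index.setdefault(path.stem, {})[key] = path
    return index


def complete_triples(index):
    """Keep only IDs that have an image, a ground truth file, and an OCR file."""
    return {doc_id: files for doc_id, files in index.items()
            if len(files) == len(SUBDIRS)}
```

Filtering with `complete_triples` gives a quick way to check whether any document is missing one of its three files before comparing OCR text against the ground truth.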

Document IDs

Documents within the GALE BLN system are indexed by a "document ID": a 10-digit number prefixed with a 1- or 2-letter collection ID. Two versions of this ID appear to exist: a short form without the collection ID, which is how data has been returned from GALE, and a longer form with the collection ID, which must be used when searching the platform site. For example, the 1834-07-07 issue of the Morning Chronicle was returned from GALE as document ID 3207163457; to find this document again in the BLN online platform, you will need the longer form BA3207163457. The two forms are bridged in metadata.json.
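The two ID forms can be told apart and converted with a small pattern match. This is a sketch under assumptions: the collection ID is taken to be uppercase ASCII letters (as in the BA example), and the function names are hypothetical. The collection ID cannot be derived from the short form alone, so `to_long` expects it to be supplied, for instance after looking it up in metadata.json, whose schema is not described here.

```python
import re

# Optional 1- or 2-letter collection ID followed by exactly 10 digits.
# Assumption: collection IDs are uppercase ASCII letters, as in "BA".
_DOC_ID = re.compile(r"([A-Z]{1,2})?(\d{10})")


def split_document_id(doc_id):
    """Return (collection_id, digits); collection_id is "" for the short form."""
    match = _DOC_ID.fullmatch(doc_id.strip())
    if match is None:
        raise ValueError(f"not a recognised document ID: {doc_id!r}")
    return (match.group(1) or "", match.group(2))


def to_short(doc_id):
    """Drop the collection ID, if present: 'BA3207163457' -> '3207163457'."""
    return split_document_id(doc_id)[1]


def to_long(doc_id, collection_id):
    """Attach a known collection ID to form the platform-search ID."""
    if re.fullmatch(r"[A-Z]{1,2}", collection_id) is None:
        raise ValueError(f"not a recognised collection ID: {collection_id!r}")
    return collection_id + to_short(doc_id)
```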

Errors

Care has been taken to ensure the ground truth is of high quality through multiple error detection, image comparison, and error correction passes; however, errors may still remain. BLN600 claims a high, but not 100%, accuracy rate. If you notice an error in the ground truth, please report it to one of the authors.

Access, usage, and modification terms and license

Express permission was sought from and granted by GALE, on behalf of the company and the British Library partners, and communicated to the authors electronically, for the release of the OCR text of 600 individual excerpts from the British Library Newspapers corpus parts 1 and 2. The permission covers release under a non-commercial-use-only license (CC BY-NC-ND 4.0), publicly accessible with no additional access stipulations.

This research was funded by a UKRI EPSRC PhD studentship (1st author) and by The University of Sheffield's Centre for Machine Intelligence (2nd author).

BLN600 by Callum Booth, Alan Thomas, and Robert Gaizauskas is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/

Citation

Pending publication

Ethics

  • The dataset contains no personal data and nothing that requires ethical approval

Policy

  • The data complies with the institution and funders' policies on access and sharing

Sharing and access restrictions

  • The uploaded data can be shared openly

Data description

  • The file formats are open or commonly used

Methodology, headings and units

  • There is a file including methodology, headings and units, such as a readme.txt