The University of Sheffield
2 files

Exploring open access coverage of Wikipedia cited research across the White Rose Universities. and Unpaywall datasets

These data were collected as part of a project by Andy Tattersall, Kate O'Neill and Chris Carroll (University of Sheffield), Nick Sheppard (University of Leeds) and Thom Blake (University of York).

A data request was submitted on 16 April 2019 for entries that included authors from any of the three White Rose universities and that were cited at least once in a Wikipedia entry. Data were tabulated and descriptive statistics were produced. The implications of the data are then discussed. We looked only at the White Rose universities of Leeds, Sheffield and York because they have their own shared open access repository, a long history of collaboration and research focus, and are all members of the Russell Group of universities.

The data were tabulated with discipline data extracted from university systems. Wikipedia page entries and embedded citations were collected using unique identifiers within the research, such as a DOI, PubMed ID or ISBN; this also included the date on which the research was cited within a Wikipedia entry. Further bibliographic data, including publication title and date, were also collected. Data collection also included the individual page corresponding to each Wikipedia citation.

We explored the number of Wikipedia citations by discipline for each of the three institutions. We note that the supplied data are only as good as the institutional and bibliometric sources from which they are harvested. As a consequence, we found that certain fields were incomplete and, based on a previous study (Tattersall and Carroll 2018), we anticipate that a percentage of the data on institutional affiliation and date of publication is inaccurate. Looking at citations within policy documents, Tattersall and Carroll (2018) found from a sample of their data that as much as one third of the data could be erroneous.
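The tabulation of Wikipedia citations by discipline and institution described above could be sketched roughly as follows. This is only an illustration, not the project's actual workflow: the column names `institution`, `discipline` and `doi`, the sample records and the `tabulate` helper are all assumptions, and the headings in the deposited files may differ. Records with a missing field are counted under "Unknown" so that incomplete rows remain visible rather than being silently dropped.

```python
from collections import Counter

# Hypothetical records; the real files may use different column names.
records = [
    {"institution": "Leeds", "discipline": "Medicine", "doi": "10.1000/example.1"},
    {"institution": "Leeds", "discipline": "Medicine", "doi": "10.1000/example.2"},
    {"institution": "Sheffield", "discipline": "History", "doi": "10.1000/example.3"},
    {"institution": "York", "discipline": "", "doi": "10.1000/example.4"},
]


def tabulate(rows):
    """Count cited outputs per (institution, discipline) pair."""
    counts = Counter()
    for row in rows:
        # Empty or missing fields are grouped under "Unknown".
        institution = (row.get("institution") or "").strip() or "Unknown"
        discipline = (row.get("discipline") or "").strip() or "Unknown"
        counts[(institution, discipline)] += 1
    return counts


counts = tabulate(records)
for (institution, discipline), n in sorted(counts.items()):
    print(f"{institution}\t{discipline}\t{n}")
```

Keeping an explicit "Unknown" group is one way to see how much of the data is affected by the incomplete fields noted above.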

Unpaywall data
DOIs of all papers that included a Wikipedia citation were subsequently run against the Unpaywall API. Unpaywall is a not-for-profit service that maintains a database of links to full-text articles harvested from a range of open access sources. Unpaywall's Simple Query Tool enabled us to submit a large number of DOIs, which returned a set of results that we placed into a spreadsheet comprising information on open access status, including ‘best_oa_url’ and ‘best_oa_licence’. For articles published under the gold model, these will typically be the resolvable DOI under a Creative Commons licence, whereas for accepted manuscripts from institutional repositories they tended to be the repository URL under a more restrictive licence, often with no specific licence.

For the purposes of this study, the primary field of interest is ‘is_oa’, which enables us to ascertain the proportion of articles that are available open access (is_oa = TRUE) compared to those that are not (is_oa = FALSE). It is also important to note that any repository record that was under embargo at the time of data collection was returned as is_oa = FALSE. Whether the OA version is gold (under a Creative Commons licence) or green (with a more restrictive or no specified licence) is also significant: Wikipedia citations to gold articles would necessarily be open access with no further intervention, whereas Wikipedia citations to articles made open access from a repository will only be accessible directly from that citation if it includes the appropriate ‘best_oa_url’, which may need to be added manually.
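As a rough illustration of how the proportion of open access articles might be derived from such a spreadsheet, the sketch below reads a small, made-up CSV export and counts records by ‘is_oa’. The sample rows, the `summarise_oa` and `open_share` helpers, and the rule that a licence value beginning with "cc" counts as gold are all assumptions made for this example; the rule mirrors the simplified gold/green distinction described above and is not Unpaywall's own classification.

```python
import csv
import io

# Made-up sample in the shape described above; real exports may differ.
SAMPLE = """doi,is_oa,best_oa_url,best_oa_licence
10.1000/example.1,TRUE,https://doi.org/10.1000/example.1,cc-by
10.1000/example.2,TRUE,https://repository.example.org/123/,
10.1000/example.3,FALSE,,
10.1000/example.4,TRUE,https://doi.org/10.1000/example.4,cc-by-nc-nd
"""


def summarise_oa(rows):
    """Count open and closed records, splitting open ones by licence."""
    summary = {"total": 0, "open": 0, "closed": 0, "gold": 0, "green": 0}
    for row in rows:
        summary["total"] += 1
        is_open = str(row.get("is_oa", "")).strip().upper() == "TRUE"
        if not is_open:
            summary["closed"] += 1
            continue
        summary["open"] += 1
        licence = (row.get("best_oa_licence") or "").strip().lower()
        # Assumption: a Creative Commons licence string starts with "cc".
        if licence.startswith("cc"):
            summary["gold"] += 1
        else:
            summary["green"] += 1
    return summary


def open_share(summary):
    """Proportion of records with is_oa = TRUE (0.0 for an empty set)."""
    return summary["open"] / summary["total"] if summary["total"] else 0.0


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarise_oa(rows)
print(summary, open_share(summary))
```

Because embargoed repository records were returned as is_oa = FALSE at the time of collection, a count of this kind reflects availability on the collection date rather than eventual availability.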

Tattersall, Andy, Nick Sheppard, Thom Blake, Kate O’Neill, and Christopher Carroll. 2022. “Exploring Open Access Coverage of Wikipedia-cited Research Across the White Rose Universities”. Insights 35: 3. DOI:



  • There are no personal data, nor any data that require ethical approval

  • The data comply with the institution's and funders' policies on access and sharing

Sharing and access restrictions

  • The data can be shared openly

Data description

  • The file formats are open or commonly used

Methodology, headings and units

  • Headings and units are explained in the files