Data Essay Criteria

To propose a dataset for inclusion in the collective, please submit a curatorial statement—a data essay—organized according to the criteria below and addressing these questions as relevant. The data essay should be around 1500 words, not including bibliography. It should make visible the choices and the labor that have gone into the creation of the dataset and demonstrate the scholarly value of this data in its field.

Basic Information

Include the following information about your dataset:

  • Title
  • Creator(s) names, institutions, and contact information
  • Funder(s)
  • Date of creation & date(s) of updates
  • Language(s)

Brief Project Description

Briefly describe your data and the research question that led you to create this dataset. This section should explain the value of the dataset, particularly its relevance to nineteenth-century studies (or another field). It pertains to the data itself and to possible or plausible analyses, not necessarily an analysis you have already undertaken. Cite sources to characterize the scholarly landscape (digital or traditional) that could benefit from this dataset.

How is the data relevant to nineteenth-century scholarship? Who might it be useful for? What could it be used for? Please suggest at least three specific uses.

For what purpose did you create the dataset? Was there a gap that needed to be filled? Has the data been used already? Do similar or overlapping data exist publicly? If so, please describe.

Collection & Creation Methodology

This section should describe the choices that structured the creation of the dataset, explain any categorical variables (if you have them), and discuss the labor and technology that went into data collection and creation. Please avoid passive voice.

How did you acquire or create the data? If you acquired it, were there licenses, MOUs, institutional subscriptions, or purchase agreements? Who paid for it or facilitated the transactions? If you created it, how? What mechanisms or procedures did you use to collect it (e.g. hardware apparatus, human curation, software, API)?

If the data was hand-curated, what organizational heuristic was adopted, and why? What aspects of the data are products of the researcher’s judgment or interpretation, and which aspects were inherited? What are the implications of these decisions?

Who was involved in the data collection process (e.g. students, crowdworkers, contractors) and how were they compensated? How long did it take to collect the data?

Did you hand-clean the data (e.g. removal of instances, processing of missing values), or did you use OpenRefine or another tool? Do you have a saved copy of the “raw” data in addition to the cleaned version (e.g. to support unanticipated future uses)?
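To illustrate the distinction between raw and cleaned data, here is a minimal, hypothetical sketch in Python. The field names and the cleaning rules are invented for the example and are not prescribed by these criteria; the point is only that the raw input is never modified, and the cleaned version is written as a separate derivative.

```python
# Hypothetical sketch: keep the "raw" data untouched and derive a
# cleaned copy from it. A real project would read from something like
# data/raw/ and write to data/clean/; here we use an in-memory CSV.
import csv
import io

def clean_row(row):
    """Trim stray whitespace and mark missing values explicitly."""
    return {k: ((v or "").strip() or "[missing]") for k, v in row.items()}

# Tiny in-memory stand-in for a raw CSV file (contents are invented).
raw_csv = io.StringIO("title,year\n Bleak House ,1853\nVillette,\n")

# The raw source is only ever read; the cleaned rows are a new object.
cleaned = [clean_row(r) for r in csv.DictReader(raw_csv)]
```

Saving both versions in this way supports the "unanticipated future uses" mentioned above: a later researcher can always return to the raw file and apply different cleaning decisions.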

Provide sufficient detail such that readers understand how the dataset was created and would, within reason, be able to recreate it.

Data Structure

This section should explain the parameters and categories of your dataset (what are we looking at, and how much of it are we looking at?).

What does the data describe? Are all instances included or a selection? If selected, what principles were used to justify inclusions and exclusions?

Is any information missing? If so, please provide a description, explaining why this information is missing (e.g. because it was unavailable). Are there any errors, sources of noise, or redundancies? If so, please describe.

What is the file type and size of the data? If you have multiple files, describe the relationships between them.

Describe any variable or non-standard features of your data. If your dataset uses categorical variables or other labels or fields that you have created, explain how they were constructed. Should the user be aware of any categories or fields that condense or erase information?
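One way to make constructed categorical variables auditable is to distribute a small machine-readable codebook alongside the data. The sketch below is hypothetical: the field name, its categories, and the provenance note are invented for illustration, not required by these criteria.

```python
# Hypothetical codebook for a constructed categorical field "genre".
# Documenting a catch-all category makes explicit where information
# is condensed or erased, as the criteria above ask.
import json

codebook = {
    "field": "genre",
    "constructed_by": "project team, by hand, from title pages and reviews",
    "categories": {
        "GOTH": "Gothic fiction",
        "SENS": "Sensation fiction",
        "OTHER": "Catch-all; condenses several minor genres (information loss)",
    },
}

# Serializing the codebook as JSON next to the dataset lets users
# audit the labels without consulting the data essay itself.
codebook_json = json.dumps(codebook, indent=2)
```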

Potential Harms

Were there any possible negative impacts or harms that resulted from collecting or curating this data?

What possible negative impacts or harms might result from the publication of your data?

Does the dataset contain data that might be considered confidential (e.g. data that includes the content of individuals’ non-public communications)? If so, please describe.

Does the dataset contain data that might be considered sensitive (e.g. data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please describe.

Were any ethical review processes conducted (e.g. by an institutional review board)? If so, please describe these review processes, including the outcomes, as well as a link or other access point to supporting documentation.

Can you anticipate any way that this data could be misused?


Accessibility

Check the PDF for common accessibility issues. Among the available resources for checking and improving accessibility in PDFs, Adobe Acrobat includes a built-in accessibility checker if you are already working with those proprietary products.

Statement of Collaboration

Is there documentation (on Github or elsewhere) of the collaborative labor that went into making (and/or maintaining) this dataset? Briefly describe and provide links.


Maintenance

Will the data be updated (e.g. to correct errors, add new instances, delete instances)? If so, please describe how often and by whom.


Sources

Provide a list of sources consulted or drawn from to produce the dataset.

Licensing & Rights

If applicable, the data must be deposited under an open license that permits unrestricted access (e.g. CC0, CC-BY).

Data Citation

Is there any reason why the citation should not conform to the C19 Data Collective's standard citational practice? If so, please explain.

The language for these criteria was drawn from the Post45 website, which cites Katherine Bode, Jennifer Doty, Lauren F. Klein, Melanie Walsh, Cultural Analytics, the Journal of Open Humanities Data, and “Datasheets for Datasets” by Timnit Gebru et al. Additional language and criteria come from the Center for Digital Humanities at Princeton University, particularly the criteria devised by Grant Wythoff.