Data Cleaning / Editing

  • Systematic interpretation
    • Humanities-oriented tools
      • TEI
      • Tropy
      • Recogito
        • collaborative data annotation platform for text and images
        • includes syntax for places, people, events
        • multiple export formats
        • runs automatic named-entity recognition (NER)
    • General tools
      • Airtable
      • Google Sheets/Forms
      • Atom
        • free and open-source text and code editor
        • powerful search across files
        • supports regular expressions
        • robust community of developers who contribute “packages” that extend Atom’s functionality, including data transforms, specialized syntax highlighting, easy GitHub integration, etc.
    • Qualitative analysis tools: NVivo, ATLAS.ti, MAXQDA
  • Semi-automated cleaning tools
    • OpenRefine
      • power tool for cleaning tabular data and some XML
      • supports regular expressions
    • tidyr (part of the tidyverse suite of tools for R)
    • Breve
      • web-based visual tool for seeing data errors in tabular data
      • NEH-funded project under development at Stanford’s Center for Spatial and Textual Analysis
    • WTFcsv
      • web-based visual tool for a quick snapshot of the data in a CSV file
  • String pattern manipulation
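String pattern manipulation with regular expressions, which Atom and OpenRefine both support, can be sketched in a few lines of Python. This is a minimal illustration and not a recommended workflow: the sample values, the year pattern, and the function names `extract_year` and `normalize_whitespace` are all assumptions invented for the example.

```python
import re

# Hypothetical messy values, as they might appear in a hand-entered "date" column.
raw_dates = ["  1887 ", "c. 1902", "[1911?]", "1925-26", "n.d."]

# Matches a standalone four-digit year between 1000 and 2999.
YEAR = re.compile(r"\b([12]\d{3})\b")


def extract_year(value):
    """Return the first four-digit year found in value as an int, or None."""
    match = YEAR.search(value)
    return int(match.group(1)) if match else None


def normalize_whitespace(value):
    """Collapse runs of whitespace to a single space and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


cleaned = [extract_year(v) for v in raw_dates]
print(cleaned)
print(normalize_whitespace("  Smith,   John \n"))
```

Note that the pattern keeps only the first year of a range such as "1925-26" and discards markers of uncertainty such as "c." or "?". Whether that loss is acceptable is an interpretive decision, so it is worth keeping the original column alongside the cleaned one.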


  • Airtable: collaborative database platform
    • allows you to embed a browsable copy of your database in a webpage
    • super user friendly, with tutorials that explain features like pivot tables
    • free account allows for 2 GB of server space and two weeks of revision history; further features require a paid plan
  • Mukurtu: content management system supporting Indigenous knowledge systems and values
    • grassroots platform currently used by six hundred different groups to “curate their own Web sites and regulate access in accordance with custom”
    • multiple records can be generated for single digital heritage items, allowing for overlapping cultural narratives
    • “There is rarely just one story, one set of information, or one way of knowing cultural heritage materials.”
  • Omeka: open-source web publishing platform for sharing digital collections and creating media-rich online exhibits
    • designed by the Roy Rosenzweig Center for History and New Media at George Mason University, which also developed Zotero
    • a go-to choice for many digital humanists and museums looking for a user-friendly, sustainable system for creating online collections/exhibits


Discipline-specific repositories

  • Tighter communities with richer standards
  • May have more restrictions, and perhaps a cost
  • Example: Humanities Commons CORE

Generalist repositories

  • Loose communities with boilerplate standards
  • Often unmediated (fast, but no quality assurance)
  • Example: Zenodo, open access repository maintained by CERN
    • automatically assigns DOIs to all files
    • If you publish software or data on GitHub, you can create a citable archived version whenever you choose through Zenodo
    • possible to sign up directly through your GitHub account
    • because Zenodo accepts image, video, and PDF files in addition to numerical, tabular, and textual data, many scholars use it as an alternative to for-profit platforms when sharing copies of their articles or creating public research profiles; it offers many metadata categories (journal name, pages, etc.) that Figshare and Dataverse don’t have
    • Social collections: tag datasets with “community collections,” curated by individual Zenodo users. Example: a collection of datasets, papers, presentations and source code on Digital Historical Linguistics created by one user
    • “your research output is stored safely for the future in the same cloud infrastructure as CERN’s own LHC research data.”
    • 50GB per dataset limit
  • Example: Dataverse, open access repository hosted by Harvard’s Institute for Quantitative Social Science (IQSS)
    • A “dataverse” is a container for all your datasets, files, and metadata.
    • Tag datasets with pre-set categories, fewer than are available on Figshare
    • Allows users to customize the look of their “dataverse” or collection
    • Allows for tiered access
    • Includes some integrated data analysis tools, and a useful “data explorer” web interface that lists the variables in a tabular data file and allows users to search, chart, and conduct cross tabulation analysis
    • Used by Cultural Analytics journal
    • 2.5 GB per file, 10 GB per dataset limit
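The size limits quoted above can be checked before choosing a repository. The following is a rough sketch only: it hard-codes the figures listed in this section (which may have changed since), assumes decimal gigabytes, and uses hypothetical names (`LIMITS`, `folder_sizes`, `check_sizes`) that do not belong to any Zenodo or Dataverse tool.

```python
from pathlib import Path

GB = 1000 ** 3  # assumption: limits are stated in decimal gigabytes

# Figures copied from the list above; None means no per-file limit is listed here.
LIMITS = {
    "Zenodo": {"per_file": None, "per_dataset": 50 * GB},
    "Dataverse": {"per_file": 2.5 * GB, "per_dataset": 10 * GB},
}


def folder_sizes(folder):
    """Map each file under folder (recursively) to its size in bytes."""
    root = Path(folder)
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def check_sizes(file_sizes, limits):
    """Return a list of human-readable problems; an empty list means the data fits."""
    problems = []
    per_file = limits["per_file"]
    if per_file is not None:
        for name, size in sorted(file_sizes.items()):
            if size > per_file:
                problems.append(f"{name} exceeds the per-file limit")
    if sum(file_sizes.values()) > limits["per_dataset"]:
        problems.append("dataset exceeds the per-dataset limit")
    return problems


# Hypothetical dataset: one large archive of scans plus a metadata table.
sizes = {"scans.zip": 3 * GB, "metadata.csv": 1 * GB}
for repo, limits in LIMITS.items():
    print(repo, check_sizes(sizes, limits))
```

For a real deposit, `folder_sizes("path/to/dataset")` would replace the hand-written `sizes` dictionary; the path here is a placeholder.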

Institutional repositories

  • Often curated, and can accept many sizes and types of data
  • Restricted to affiliates, but open to all disciplines
  • Example: Princeton University’s institutional data repository, Princeton Data Commons
    • assigns DOIs to datasets
    • offers data curation advice and assistance on deposits, with focus on metadata and tagging for preservation and discovery, and open formats for re-use
    • accepts all forms of research data (including research code)
    • takes a community approach, with a DH community forthcoming
    • infrastructure supported by library expertise in long-term digital preservation and archival practice

Further comparisons of repository features compiled by:

Project Management Platforms

  • Asana
    • online project management platform with shared to-do lists
  • Trello
    • team task-tracking app organized as boards of lists and cards
  • Slack
    • group communications with topic-based channels


“Cleaning Data with OpenRefine,” The Programming Historian

“Cleaning Data with OpenRefine for Ecologists” and “OpenRefine for Social Science Data”, Data Carpentry: Building Communities Teaching Universal Data Literacy

Checklist for Digital Humanities Projects, La Red de Humanidades Digitales (RedHD), English and Spanish versions available

Programming Historian: Preserving Your Research Data: “This lesson will suggest ways in which historians can document and structure their research data so as to ensure it remains useful in the future.”

Library Carpentry: Tidy Data for Librarians

Library Carpentry: OpenRefine

Library Carpentry: Top 10 FAIR Data & Software Things, a list of field-specific FAIR principles/techniques

Black Living Data Booklet section on "3 Steps to Download and Decode Data" PDF

Data Literacies: DH Institutes on tidy data, CSV, stages of data analysis, etc.

NEH’s Office of Digital Humanities Guide to Data Management Plans

Methods & Best Practices

Example Datasets

  • browse projects featured in Journal of Open Humanities Data
    • “features peer reviewed publications describing humanities data or techniques with high potential for reuse”
  • Our extensive, curated list, organized by field and topic, available at:
  • To Be Continued…
    • developed by Katherine Bode alongside her book A World of Fiction: Digital Collections and the Future of Literary History (2018)
    • identified and analyzed over 21,000 novels, novellas, and short stories published in 19th- and early 20th-century Australian newspapers
  • Data Refuge
    • “a community-driven, collaborative project to preserve public climate and environmental data”
    • currently building a “Storybank”, or map of data use cases and “life stories”
    • includes a number of toolkits for the rescue and protection of public data
    • spearheaded by UPenn’s Program in Environmental Humanities (PPEH) Lab
  • Early African American Film
    • wonderful example of thorough documentation
    • networks of producers/actors/directors in early twentieth century “race film”
  • Collections as Data: Part to Whole
    • Mellon grant led by UNLV, the University of Iowa, and UPenn; supports a number of project applicants
    • “Collections as data produced by project activity will exhibit high research value, demonstrate the capacity to serve underrepresented communities, represent a diversity of content types, languages, and descriptive practices, and arise from a range of institutional contexts.”
  • NYPL’s “What’s on the Menu?”
    • crowdsourced project that has garnered lots of public interest
    • interesting method of organically generating their data model
  • Black Anthology Project
    • “information related to over 600 African American short stories that appeared in 100 African American and American anthologies published between 1925 and 2017.”
    • tabular data on underrepresented authors and circulation histories
  • British Library Digital Scholarship
    • Extensive resource featuring digital collections and datasets drawn from the British Library collections, including digitized printed books, datasets for image analysis, datasets about the BL collections, datasets for content mining, digital mapping, and an archive of UK web content.
    • Example: CM Taylor Keylogging Data, captured from the author C M Taylor between 17 October 2014 and 5 March 2018 during the writing of the novel Staying On (2018)
  • ToposText
    • “an indexed collection of ancient texts and mapped places relevant to the history and mythology of the ancient Greeks from the Neolithic period up through the 2nd century CE”
  • Quill Project
    • marking up “negotiated texts” written/decided by committee: constitutions, legislative proceedings, statements, etc.
    • “legibility to the general public only of secondary concern” – an archive primarily for scholars
    • example:

Further Readings

Data & Method

Tanya E. Clement, “Where Is Methodology in Digital Humanities?”, Debates in the Digital Humanities 2016

Ryan Cordell, “Teaching Humanistic Data Analysis” (2019)

Luke Stark and Anna Lauren Hoffmann, “Data Is the New What? Popular Metaphors & Professional Ethics in Emerging Data Culture,” Cultural Analytics (2019)

Daniel Rosenberg, “Data Before the Fact,” in “Raw Data” Is an Oxymoron, ed. Lisa Gitelman (MIT Press, 2013)

Johanna Drucker, “HTML and Structured Data” (2013)

Michael Hancher, “Re: Search and Close Reading,” Debates in the Digital Humanities 2016

Ricardo L. Punzalan, Diana E. Marsh, Kyla Cools, “Beyond Clicks, Likes, and Downloads: Identifying Meaningful Impacts for Digitized Ethnographic Archives,” Archivaria 84 (Fall 2017)

Klein, Lauren F. “The Image of Absence: Archival Silence, Data Visualization, and James Hemings.” American Literature 1 December 2013; 85 (4): 661–688.

Data Cleaning

Katie Rawson & Trevor Muñoz, “Against Cleaning” (2016)

Mia Ridge, “Mia Ridge explores the shape of Cooper-Hewitt collections”, Cooper-Hewitt Labs (2012)

Lauren F. Klein, “The Image of Absence: Archival Silence, Data Visualization, and James Hemings,” American Literature 85, no. 4 (2013)

Garfinkel, Simson L. “De-Identification of Personal Information.” National Institute of Standards and Technology NISTIR 8053, October 2015.

Lincoln, Matthew D. “Tidy Data for the Humanities.” Matthew Lincoln, PhD (blog), 26 May 2020, (Accessed January 31, 2022.)

Sperberg-McQueen, C.M. and David Dubin. “Data Representation.” DH Curation Guide, (no date) (Accessed January 31, 2022.)

Wickham, Hadley. “Tidy Data.” Journal of Statistical Software 2014; 49 (10): 1-23.

Babeu, Alison. “Classics, ‘Digital Classics’ and Issues for Data Curation.” DH Curation Guide, (no date) (Accessed February 14, 2022)

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. “Datasheets for Datasets.” Communications of the ACM 64(12): pp. 86-92, 2021.

Levine, Melissa. “Policy, Practice, and Law.” DH Curation Guide, (no date) (Accessed February 14, 2022)

Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. “Model Cards for Model Reporting.” FAT* ’19: Proceedings of the Conference on Fairness, Accountability, and Transparency: pp. 220-229, 2019.

Van den Eynden, Veerle, Louise Corti, Matthew Woollard, Libby Bishop, and Laurence Horton. “Managing and Sharing Data: Best Practice for Researchers.” 3rd Ed. Essex: UK Data Archive, 2011. (Accessed February 28, 2022)