Resources

Tools

Data Cleaning / Editing

Systematic interpretation
- Humanities-oriented tools
  - TEI
  - Tropy
  - Recogito
    - collaborative data annotation platform for text and images
    - includes syntax for places, people, events
    - multiple export formats
    - runs automatic named-entity recognition
- General tools
  - Airtable
  - Google Sheets/Forms
  - Atom
    - free and open-source text and code editor
    - powerful search across files
    - supports regular expressions
    - robust community of developers who contribute “packages” that extend Atom’s functionality, including data transforms, specialized syntax highlighting, and easy GitHub integration
- Qualitative analysis tools: NVivo, ATLAS.ti, MAXQDA
Semi-automated cleaning tools
- OpenRefine
  - power tool for cleaning tabular data and some XML
  - supports regular expressions
- tidyr (part of the tidyverse suite of tools for R)
- Breve
  - web-based visual tool for spotting errors in tabular data
  - NEH-funded project under development at Stanford’s Center for Spatial and Textual Analysis
- WTFcsv
  - web-based visual tool for a quick snapshot of the data in a CSV file
String pattern manipulation
- Regular expressions
  - RegEx 101 tool
  - Programming Historian introduction to regular expressions
- stringr (also part of the tidyverse)
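
As a rough illustration of the kind of string pattern work these tools support, here is a minimal Python sketch that pulls a four-digit year out of inconsistently transcribed date strings. The sample values, the pattern, and the function name `extract_year` are hypothetical and chosen only for illustration; real data will almost certainly need a different pattern, which a tester such as RegEx 101 can help refine.

```python
import re

# Hypothetical messy date strings, as they might appear in a transcribed catalog.
raw_dates = ["c. 1887", "1901?", "[1923]", "ca 1850", "n.d."]

# Optional "circa" marker, optional square brackets, a four-digit year,
# and an optional trailing question mark.
YEAR_PATTERN = re.compile(r"^(?:c\.|ca\.?|circa)?\s*\[?(\d{4})\]?\??$")


def extract_year(value):
    """Return the four-digit year as an int, or None if no year is found."""
    match = YEAR_PATTERN.match(value.strip())
    return int(match.group(1)) if match else None


years = [extract_year(d) for d in raw_dates]
print(years)
```

Values that do not match (such as "n.d.") come back as `None` rather than being silently dropped, so they can be reviewed by hand.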

Databases

Airtable: collaborative database platform
- allows you to embed a browsable copy of your database in a webpage
- super user friendly, with tutorials that explain features like pivot tables
- free account allows for 2 GB of server space and two weeks of revision history; further features require a paid plan
Mukurtu: content management system supporting Indigenous knowledge systems and values
- grassroots platform currently used by six hundred different groups to “curate their own Web sites and regulate access in accordance with custom”
- multiple records can be generated for single digital heritage items, allowing for overlapping cultural narratives
- “There is rarely just one story, one set of information, or one way of knowing cultural heritage materials.”
Omeka: open-source web publishing platform for sharing digital collections and creating media-rich online exhibits
- designed by the Roy Rosenzweig Center for History and New Media at George Mason University, which also developed Zotero
- a go-to choice for many digital humanists and museums looking for a user-friendly, sustainable system for creating online collections/exhibits

Repositories

Discipline-specific repositories

Tighter communities with richer standards
May have more restrictions, and perhaps a cost
Example: Humanities Commons CORE

Generalist repositories

Loose communities with boilerplate standards
Often unmediated (fast, but no quality assurance)
Example: Zenodo, open access repository maintained by CERN
- automatically assigns DOIs to all files
- If you publish software or data on GitHub, you can create a citable archived version whenever you choose through Zenodo
  - This feature is used by CDH for:
  - Derrida’s Margins codebase: https://doi.org/10.5281/zenodo.1453447
  - PPA codebase: https://doi.org/10.5281/zenodo.2400705
- possible to sign up directly through your GitHub account
- because Zenodo accepts image, video, and PDF files in addition to numerical, tabular, and textual data, many scholars use it as an alternative to the for-profit academia.edu for sharing copies of their articles or creating public research profiles; it offers many metadata categories (journal name, pages, etc.) that Figshare and Dataverse lack
- Social collections: tag datasets with “community collections,” curated by individual Zenodo users. Example: a collection of datasets, papers, presentations and source code on Digital Historical Linguistics created by one user
- “your research output is stored safely for the future in the same cloud infrastructure as CERN’s own LHC research data.”
- 50GB per dataset limit
Example: Dataverse, open access repository hosted by Harvard’s Institute for Quantitative Social Science (IQSS)
- A “dataverse” is a container for all your datasets, files, and metadata.
- Tag datasets with pre-set categories, though fewer than are available on Figshare
- Allows user to customize the look of their "Dataverse" or collection
- Allows for tiered access
- Includes some integrated data analysis tools, and a useful “data explorer” web interface that lists the variables in a tabular data file and allows users to search, chart, and conduct cross tabulation analysis
- Used by Cultural Analytics journal
- 2.5 GB per file, 10 GB per dataset limit

Institutional repositories

Often curated, and can accept many sizes and types of data
Restricted to affiliates, but open to all disciplines
Example: Princeton University’s institutional data repository, Princeton Data Commons
- assigns DOIs to datasets
- offers data curation advice and assistance on deposits, with focus on metadata and tagging for preservation and discovery, and open formats for re-use
- accepts all forms of research data (including research code)
- has community approach, with upcoming DH community
- infrastructure supported by library expertise in long-term digital preservation and archival practice

Further comparisons of repository features compiled by:

Project Management Platforms

Asana
- online project management platform with shared to-do lists
Trello
- team project management app organized as boards of lists and cards
Slack
- group communications with topic-based channels

Tutorials

“Cleaning Data with OpenRefine,” The Programming Historian

“Cleaning Data with OpenRefine for Ecologists” and “OpenRefine for Social Science Data”, Data Carpentry: Building Communities Teaching Universal Data Literacy

Checklist for Digital Humanities Projects, La Red de Humanidades Digitales (RedHD), English and Spanish versions available

Programming Historian: Preserving Your Research Data: “This lesson will suggest ways in which historians can document and structure their research data so as to ensure it remains useful in the future.”

Library Carpentry: Tidy Data for Librarians

Library Carpentry: OpenRefine

Library Carpentry: Top 10 FAIR Data & Software Things, a list of field-specific FAIR principles/techniques

Black Living Data Booklet section on “3 Steps to Download and Decode Data” [PDF]

Data Literacies: DH Institutes on tidy data, CSV, stages of data analysis, etc.

NEH’s Office of Digital Humanities Guide to Data Management Plans

Methods & Best Practices

Arts & Humanities Standards Directory from the Research Data Alliance
Frictionless Data
- an open-source framework to reduce friction in data workflows
- multiple standards developing for data scientists and researchers
Annotation for Transparent Inquiry (ATI)
The CARE Principles for Indigenous Data Governance
- Complementing the FAIR Principles
- Emphasizing Collective benefit, Authority to control, Responsibility, and Ethics
Traditional Knowledge Labels
- Complementing licenses and permissions for use
- Emphasizing relationships and engagement with Indigenous communities
DH Curation Guide
- Asks, “How do we align the care for digital materials with the methods/goals of traditional humanities disciplines?”
- Introductory essays on different aspects of data curation in digital humanities, with links to relevant readings
- produced by NEH-funded workshops in 2014 at Maryland Institute for Technology in the Humanities and University of Illinois Center for Informatics Research in Science and Scholarship
UCLA Library: Data Management for the Humanities
- extensive research guide
PM4DH | Project Management for the Digital Humanities
- developed by Emory Center for Digital Scholarship
- “curriculum for managing digital projects in academic libraries and other settings”
Data Nutrition Project
- “nutrition labels” graphically designed to resemble those on food packaging
- still in prototype stages
- “aims to create a standard for interrogating datasets for measures that will ultimately drive the creation of better, more inclusive machine learning models”
- “aims to highlight the key ingredients in a dataset such as meta-data and populations, as well as unique or anomalous features regarding distributions, missing data, and comparisons to other ‘ground truth’ datasets.”
Digital Humanities Data Curation Guide (UMD)
Resources from Humanities at Scale (DARIAH)
Managing and Sharing Data: Best Practices for Researchers [PDF]
- Created by the UK Data Archive, “the UK’s largest collection of digital research data in the social sciences and humanities.”
- produced in 2011, a slightly outdated but thorough rundown of best practices for sharing, management, documenting, formatting, storing, and ethics
Kristin Briney, Data Management for Researchers: Organize, Maintain and Share Your Data for Research Success (Exeter, UK: Pelagic Publishing, 2015).
PRDS Guide on Data Documentation
README Guide from Cornell’s Research Data Management Service
Data Paper Template from Princeton’s Center for Digital Humanities
Best Practices for Data Description from DRYAD
ICPSR Guide to Codebooks
“Managing Qualitative Data” Module on Documentation
Open Science Framework How-To for Data Dictionaries
Gebru, et al. 2021. “Datasheets for Datasets.” DOI: 10.1145/3458723.
Mitchell, et al. 2019. “Model Cards for Model Reporting.” DOI: 10.1145/3287560.3287596.
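
Several of the documentation guides listed above (README guides, codebooks, data dictionaries) ask for a per-variable description of a dataset. As a starting point, a first draft can be generated from the data itself and then completed by hand. The following is a minimal sketch using only Python's standard library; the sample CSV content, the function name `draft_data_dictionary`, and the output fields are all hypothetical, not a format prescribed by any of these guides.

```python
import csv
import io

# Hypothetical sample; in practice this text would be read from a CSV file.
SAMPLE_CSV = """title,author,year
Example Title A,Example Author,1892
Example Title B,Another Author,1923
Untitled fragment,,
"""


def draft_data_dictionary(csv_text):
    """Summarize each column: name, filled count, missing count, example value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for column in reader.fieldnames or []:
        # Keep only non-empty values for this column.
        values = [row[column] for row in rows if row[column]]
        entries.append({
            "variable": column,
            "filled": len(values),
            "missing": len(rows) - len(values),
            "example": values[0] if values else "",
            "description": "",  # left blank: to be written by a person
        })
    return entries


for entry in draft_data_dictionary(SAMPLE_CSV):
    print(entry)
```

The `description` field is deliberately left empty: counts and examples can be automated, but what a variable means, and how it was collected, has to come from the people who made the dataset.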

Example Datasets

Browse projects featured in the Journal of Open Humanities Data
- “features peer reviewed publications describing humanities data or techniques with high potential for reuse”
Our extensive, curated list, organized by field and topic, available at: https://cdh.princeton.edu/research/resources/humanities-datasets/
To Be Continued…
- developed by Katherine Bode alongside her book A World of Fiction: Digital Collections and the Future of Literary History (2018)
- identified and analyzed over 21,000 novels, novellas, and short stories published in 19th- and early 20th-century Australian newspapers
Data Refuge
- “a community-driven, collaborative project to preserve public climate and environmental data”
- currently building a “Storybank”, or map of data use cases and “life stories”
- includes a number of toolkits for the rescue and protection of public data
- spearheaded by UPenn’s Program in Environmental Humanities Lab
Early African American Film
- wonderful example of thorough documentation
- networks of producers, actors, and directors in early-twentieth-century “race film”
Collections as Data: Part to Whole
- Mellon-funded grant led by UNLV, the University of Iowa, and UPenn; supports a number of project applicants
- “Collections as data produced by project activity will exhibit high research value, demonstrate the capacity to serve underrepresented communities, represent a diversity of content types, languages, and descriptive practices, and arise from a range of institutional contexts.”
NYPL’s “What’s on the Menu?”
- crowdsourced project that has garnered lots of public interest
- interesting method of organically generating their data model
Black Anthology Project
- “information related to over 600 African American short stories that appeared in 100 African American and American anthologies published between 1925 and 2017.”
- tabular data on underrepresented authors and circulation histories
British Library Digital Scholarship
- Extensive resource featuring digital collections and datasets drawn from the British Library collections, including digitized printed books, datasets for image analysis, datasets about the BL collections, datasets for content mining, digital mapping, and an archive of UK web content.
- Example: CM Taylor Keylogging Data, captured from the author C M Taylor between 17 October 2014 and 5 March 2018 during the writing of the novel Staying On (2018)
ToposText
- “an indexed collection of ancient texts and mapped places relevant to the history and mythology of the ancient Greeks from the Neolithic period up through the 2nd century CE”
Quill Project
- marking up “negotiated texts” written/decided by committee: constitutions, legislative proceedings, statements, etc.
- “legibility to the general public only of secondary concern” – an archive primarily for scholars
- example: https://www.quillproject.net/event_visualize/493