Tools
Data Cleaning / Editing
- Systematic interpretation
- Humanities-oriented tools
- General tools
- Airtable
- Google Sheets/Forms
- Atom
- free and open-source text and code editor
- powerful search across files
- supports regular expressions
- robust community of developers who contribute “packages” that extend Atom’s functionality, including data transforms, specialized syntax highlighting, and easy GitHub integration
- Qualitative analysis tools: NVivo, ATLAS.ti, MAXQDA
- Semi-automated cleaning tools
- OpenRefine
- power tool for cleaning tabular data and some XML
- supports regular expressions
- tidyr (part of the tidyverse suite of tools for R)
- Breve
- web-based visual tool for spotting errors in tabular data
- NEH-funded project under development at Stanford’s Center for Spatial and Textual Analysis
- WTFcsv
- web-based visual tool that gives a quick snapshot of the data in a CSV file
- String pattern manipulation
- Regular expressions (a minimal Python sketch follows this list)
- RegEx 101 tool
- Programming Historian introduction to regular expressions
- stringr (also part of the tidyverse)
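Regular expressions behave the same way across OpenRefine, Atom, and stringr. The following minimal sketch, in Python with invented sample values, shows the kind of pattern matching involved in cleaning inconsistently transcribed date fields.

```python
import re

# Hypothetical, inconsistently transcribed date fields from a metadata column.
raw_dates = ["17 October 2014", "Oct. 5, 1918", "1923?", "circa 1890"]

# Capture any four-digit year beginning with 1 or 2.
year_pattern = re.compile(r"\b([12]\d{3})\b")

for value in raw_dates:
    match = year_pattern.search(value)
    print(value, "->", match.group(1) if match else "no year found")
```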
Databases
- Airtable: collaborative database platform
- allows you to embed a browsable copy of your database in a webpage
- super user friendly, with tutorials that explain features like pivot tables
- free account includes 2GB of storage and two weeks of revision history; further features require a paid plan
- Mukurtu: content management system supporting Indigenous knowledge systems and values
- grassroots platform currently used by six hundred different groups to “curate their own Web sites and regulate access in accordance with custom”
- multiple records can be generated for single digital heritage items, allowing for overlapping cultural narratives
- “There is rarely just one story, one set of information, or one way of knowing cultural heritage materials.”
- Omeka: open-source web publishing platform for sharing digital collections and creating media-rich online exhibits
- designed by the Roy Rosenzweig Center for History and New Media at George Mason University, which also developed Zotero
- a go-to choice for many digital humanists and museums looking for a user-friendly, sustainable system for creating online collections/exhibits
Repositories
Discipline-specific repositories
- Tighter communities with richer standards
- May have more restrictions, and perhaps a cost
- Example: Humanities Commons CORE
Generalist repositories
- Loose communities with boilerplate standards
- Often unmediated (fast, but no quality assurance)
- Example: Zenodo, open access repository maintained by CERN
- automatically assigns DOIs to all files
- If you publish software or data on GitHub, you can create a citable, archived version through Zenodo whenever you choose
- This feature is used by the CDH for:
- Derrida’s Margins codebase: https://doi.org/10.5281/zenodo.1453447
- PPA codebase: https://doi.org/10.5281/zenodo.2400705
- possible to sign up directly through your GitHub account
- because Zenodo accepts image, video, and PDF files in addition to numerical, tabular, and textual data, many scholars use it as an alternative to the for-profit academia.edu for sharing copies of their articles or creating public research profiles; it also offers many metadata categories (journal name, pages, etc.) that Figshare and Dataverse don’t have
- Social collections: datasets can be tagged into “community collections” curated by individual Zenodo users. Example: a collection of datasets, papers, presentations, and source code on Digital Historical Linguistics created by one user
- “your research output is stored safely for the future in the same cloud infrastructure as CERN’s own LHC research data.”
- 50GB per dataset limit
- Example: Dataverse, open-access repository hosted by Harvard’s Institute for Quantitative Social Science (IQSS)
- A “dataverse” is a container for all your datasets, files, and metadata.
- Tag datasets with pre-set categories, though fewer than are available on Figshare
- Allows users to customize the look of their “Dataverse” or collection
- Allows for tiered access
- Includes some integrated data analysis tools, and a useful “data explorer” web interface that lists the variables in a tabular data file and allows users to search, chart, and conduct cross tabulation analysis
- Used by Cultural Analytics journal
- 2.5 GB per file, 10 GB per dataset limit
Institutional repositories
- Often curated, and can accept many sizes and types of data
- Restricted to affiliates, but open to all disciplines
- Example: Princeton University’s institutional data repository, Princeton Data Commons
- assigns DOIs to datasets
- offers data curation advice and assistance on deposits, with focus on metadata and tagging for preservation and discovery, and open formats for re-use
- accepts all forms of research data (including research code)
- takes a community approach, with a DH community forthcoming
- infrastructure supported by library expertise in long-term digital preservation and archival practice
Further comparisons of repository features compiled by:
Project Management Platforms
- Asana
- online project management platform with shared to-do lists
- Trello
- project management app that organizes tasks as cards on kanban-style boards
- Slack
- group communications with topic-based channels
Tutorials
“Cleaning Data with OpenRefine,” The Programming Historian
“Cleaning Data with OpenRefine for Ecologists” and “OpenRefine for Social Science Data”, Data Carpentry: Building Communities Teaching Universal Data Literacy
Checklist for Digital Humanities Projects, La Red de Humanidades Digitales (RedHD), English and Spanish versions available
Programming Historian: Preserving Your Research Data: “This lesson will suggest ways in which historians can document and structure their research data so as to ensure it remains useful in the future.”
Library Carpentry: Tidy Data for Librarians (a short reshaping sketch in Python follows this list)
Library Carpentry: Top 10 FAIR Data & Software Things, a list of field-specific FAIR principles/techniques
Black Living Data Booklet section on "3 Steps to Download and Decode Data" PDF
Data Literacies: DH Institutes on tidy data, CSV, stages of data analysis, etc.
NEH’s Office of Digital Humanities Guide to Data Management Plans
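Several of these tutorials (Tidy Data for Librarians, the Data Literacies materials, and Wickham’s “Tidy Data,” cited under Further Readings) center on the same reshaping move: one variable per column, one observation per row. A minimal sketch in Python/pandas, using an invented authors-by-decade table, of going from a “wide” spreadsheet to tidy “long” form:

```python
import pandas as pd

# Hypothetical "wide" table: one row per author, one column per decade of publications.
wide = pd.DataFrame({
    "author": ["Hurston", "Hughes"],
    "1920s": [3, 5],
    "1930s": [7, 9],
})

# Tidy ("long") form: each variable (author, decade, count) becomes its own column,
# and each author-decade observation becomes its own row.
tidy = wide.melt(id_vars="author", var_name="decade", value_name="works")
print(tidy)
```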
Methods & Best Practices
- Arts & Humanities Standards Directory from the Research Data Alliance
- Frictionless Data
- an open-source framework to reduce friction in data workflows
- multiple standards in development for data scientists and researchers (a minimal validation sketch follows at the end of this list)
- Annotation for Transparent Inquiry (ATI)
- The CARE Principles for Indigenous Data Governance
- Complementing the FAIR Principles
- Emphasizing Collective benefit, Authority to control, Responsibility, and Ethics
- Traditional Knowledge Labels
- Complementing licenses and permissions for use
- Emphasizing relationships and engagement with Indigenous communities
- DH Curation Guide
- Asks, “How do we align the care for digital materials with the methods/goals of traditional humanities disciplines?”
- Introductory essays on different aspects of data curation in digital humanities, with links to relevant readings
- produced by NEH-funded workshops in 2014 at Maryland Institute for Technology in the Humanities and University of Illinois Center for Informatics Research in Science and Scholarship
- UCLA Library: Data Management for the Humanities
- extensive research guide
- PM4DH | Project Management for the Digital Humanities
- developed by Emory Center for Digital Scholarship
- “curriculum for managing digital projects in academic libraries and other settings”
- Data Nutrition Project
- “nutrition labels” graphically designed to resemble those on food packaging
- still in prototype stages
- “aims to create a standard for interrogating datasets for measures that will ultimately drive the creation of better, more inclusive machine learning models”
- “aims to highlight the key ingredients in a dataset such as meta-data and populations, as well as unique or anomalous features regarding distributions, missing data, and comparisons to other ‘ground truth’ datasets.”
- Digital Humanities Data Curation Guide (UMD)
- Resources from Humanities at Scale (DARIAH)
- Managing and Sharing Data: Best Practices for Researchers [PDF]
- Created by the UK Data Archive, “the UK’s largest collection of digital research data in the social sciences and humanities.”
- produced in 2011, a slightly outdated but thorough rundown of best practices for sharing, management, documenting, formatting, storing, and ethics
- Kristin Briney, Data Management for Researchers: Organize, Maintain and Share Your Data for Research Success (Exeter, UK: Pelagic Publishing, 2015).
- PRDS Guide on Data Documentation
- README Guide from Cornell’s Research Data Management Service
- Data Paper Template from Princeton’s Center for Digital Humanities
- Best Practices for Data Description from DRYAD
- ICPSR Guide to Codebooks
- “Managing Qualitative Data” Module on Documentation
- Open Science Framework How-To for Data Dictionaries
- Gebru, et al. 2021. “Datasheets for Datasets.” DOI: 10.1145/3458723.
- Mitchell, et al. 2019. “Model Cards for Model Reporting.” DOI: 10.1145/3287560.3287596.
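Frictionless Data (listed above) also ships a Python implementation of its standards. The sketch below, which assumes the frictionless package is installed and uses a hypothetical file name, shows how a researcher might infer a schema for a tabular file and then validate it.

```python
# A minimal sketch; assumes `pip install frictionless`, and letters.csv is a hypothetical file.
from frictionless import describe, validate

# Infer a schema (column names and types) for the tabular file.
resource = describe("letters.csv")
print(resource.schema)

# Check the file against that schema and basic structural rules
# (blank rows, ragged columns, type mismatches, and so on).
report = validate("letters.csv")
print(report.valid)
```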
Example Datasets
- browse projects featured in the Journal of Open Humanities Data
- “features peer reviewed publications describing humanities data or techniques with high potential for reuse”
- Our extensive, curated list, organized by field and topic, available at: https://cdh.princeton.edu/research/resources/humanities-datasets/
- Katherine Bode’s curated database of Australian newspaper fiction, developed alongside her book A World of Fiction: Digital Collections and the Future of Literary History (2018)
- identified and analyzed over 21,000 novels, novellas, and short stories in 19th- and early 20th-century Australian newspapers
- Data Refuge: “a community-driven, collaborative project to preserve public climate and environmental data”
- currently building a “Storybank”, or map of data use cases and “life stories”
- includes a number of toolkits for the rescue and protection of public data
- spearheaded by the Penn Program in Environmental Humanities (PPEH) at UPenn
- networks of producers, actors, and directors in early twentieth-century “race film”
- a wonderful example of thorough documentation
- Collections as Data: Part to Whole
- Mellon grant led by UNLV, the University of Iowa, and UPenn that supports a number of project teams
- “Collections as data produced by project activity will exhibit high research value, demonstrate the capacity to serve underrepresented communities, represent a diversity of content types, languages, and descriptive practices, and arise from a range of institutional contexts.”
- NYPL’s “What’s on the Menu?”
- crowdsourced project that has garnered lots of public interest
- interesting method of organically generating their data model
- “information related to over 600 African American short stories that appeared in 100 African American and American anthologies published between 1925 and 2017”
- tabular data on underrepresented authors and circulation histories
- British Library Digital Scholarship
- Extensive resource featuring digital collections and datasets drawn from the British Library collections, including digitized printed books, datasets for image analysis, datasets about the BL collections, datasets for content mining, digital mapping, and an archive of UK web content.
- Example: CM Taylor Keylogging Data, captured from the author C M Taylor between 17 October 2014 and 5 March 2018 during the writing of the novel Staying On (2018)
- ToposText: “an indexed collection of ancient texts and mapped places relevant to the history and mythology of the ancient Greeks from the Neolithic period up through the 2nd century CE”
- Quill Project
- marking up “negotiated texts” written/decided by committee: constitutions, legislative proceedings, statements, etc.
- “legibility to the general public only of secondary concern” – an archive primarily for scholars
- example: https://www.quillproject.net/event_visualize/493
Further Readings
Data & Method
Tanya E. Clement, “Where Is Methodology in Digital Humanities?”, Debates in the Digital Humanities 2016
Ryan Cordell, “Teaching Humanistic Data Analysis” (2019)
Luke Stark and Anna Lauren Hoffmann, “Data Is the New What? Popular Metaphors & Professional Ethics in Emerging Data Culture,” Cultural Analytics (2019)
Daniel Rosenberg, “Data Before the Fact,” in “Raw Data” Is an Oxymoron, ed. Lisa Gitelman (MIT Press, 2013)
Johanna Drucker, “HTML and Structured Data” (2013)
Michael Hancher, “Re: Search and Close Reading,” Debates in the Digital Humanities 2016
Ricardo L. Punzalan, Diana E. Marsh, Kyla Cools, “Beyond Clicks, Likes, and Downloads: Identifying Meaningful Impacts for Digitized Ethnographic Archives,” Archivaria 84 (Fall 2017)
Klein, Lauren F. “The Image of Absence: Archival Silence, Data Visualization, and James Hemings.” American Literature 1 December 2013; 85 (4): 661–688. https://doi.org/10.1215/00029831-2367310.
Data Cleaning
Katie Rawson & Trevor Muñoz, “Against Cleaning” (2016)
Mia Ridge, “Mia Ridge explores the shape of Cooper-Hewitt collections”, Cooper-Hewitt Labs (2012)
Lauren F. Klein, “The Image of Absence: Archival Silence, Data Visualization, and James Hemings,” American Literature 85, no. 4 (2013)
Garfinkel, Simson L. “De-Identification of Personal Information.” National Institute of Standards and Technology NISTIR 8053, October 2015. http://dx.doi.org/10.6028/NIST.IR.8053.
Lincoln, Matthew D. “Tidy Data for the Humanities.” Matthew Lincoln, PhD (blog), 26 May 2020, https://matthewlincoln.net/2020/05/26/tidy-data-for-humanities.html. (Accessed January 31, 2022.)
Sperberg-McQueen, C.M. and David Dubin. “Data Representation.” DH Curation Guide, (no date) https://archive.mith.umd.edu/dhcuration-guide/guide.dhcuration.org/index.html%3Fp=63.html (Accessed January 31, 2022.)
Wickham, Hadley. “Tidy Data.” Journal of Statistical Software 2014; 49 (10): 1-23. https://doi.org/10.18637/jss.v059.i10.
Babeu, Alison. “Classics, ‘Digital Classics’ and Issues for Data Curation.” DH Curation Guide, (no date) https://archive.mith.umd.edu/dhcuration-guide/humanitiesdatacurationguide.wordpress.com/contents/classics-digital-classics-and-issues-for-data-curation/index.html (Accessed February 14, 2022)
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. “Datasheets for Datasets.” Communications of the ACM 64(12): pp. 86-92, 2021. https://doi.org/10.1145/3458723
Levine, Melissa. “Policy, Practice, and Law.” DH Curation Guide, (no date) https://archive.mith.umd.edu/dhcuration-guide/guide.dhcuration.org/legal/policy.html (Accessed February 14, 2022)
Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. “Model Cards for Model Reporting.” FAT* ’19: Proceedings of the Conference on Fairness, Accountability, and Transparency: pp. 220-229, 2019. https://doi.org/10.1145/3287560.3287596
Van den Eynden, Veerle, Louise Corti, Matthew Woollard, Libby Bishop, and Laurence Horton. “Managing and Sharing Data: Best Practices for Researchers.” 3rd Ed. Essex: UK Data Archive, 2011. https://dam.ukdataservice.ac.uk/media/622417/managingsharing.pdf (Accessed February 28, 2022)