Dataset Preparation

Prepare your dataset by inventorying and reviewing its contents so that you provide the data in the most useful and sustainable form possible:

Review to ensure data files are in a preferred format (i.e., non-proprietary)
Create an inventory of the contents of your data files (e.g., data file inventory)
Describe your naming conventions and ensure they are followed in your folder and file names
Make sure the cleanest possible dataset is ready for publication
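A naming convention is easier to follow when it is written down as a pattern that can be checked automatically. The sketch below assumes a hypothetical convention of the form project_YYYYMMDD_description_vNN.ext; the pattern and the function name `nonconforming_names` are illustrative only and should be replaced with your own documented rules.

```python
import re

# Hypothetical convention: <project>_<YYYYMMDD>_<description>_v<NN>.<ext>
# e.g. "soilsurvey_20230415_site-a_v02.csv". Adjust the pattern to your own rules.
NAME_PATTERN = re.compile(r"^[a-z0-9]+_\d{8}_[a-z0-9-]+_v\d{2}\.[a-z0-9]+$")


def nonconforming_names(file_names):
    """Return the names that do not match the documented convention."""
    return [name for name in file_names if not NAME_PATTERN.match(name)]


if __name__ == "__main__":
    names = ["soilsurvey_20230415_site-a_v02.csv", "Final Data (2).xlsx"]
    for name in nonconforming_names(names):
        print(f"Does not follow the convention: {name}")
```

Running a check like this before publication gives you a short list of files to rename rather than relying on a visual scan.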

File Organization

File Naming Conventions
Folder Structure or Hierarchy
Version Control Strategies

See U Gent’s short (5:42) video, “Knowledge Clip: Keeping research data organized”.

Key Ethical Considerations

Review any data use agreements and examine potential impacts of sharing this data. Consider:
- Individuals and communities represented
- Representativeness of diverse human populations
- Geographic locations (e.g., contested boundaries, historical and current political situations)
Is it possible that the dataset may impact a specific group?
- Use the CARE Principles for Indigenous Data
Does this dataset comply with any known institutional policies?

Essential Tasks

Survey your Data

Gather a working copy of the files for formal inventory and publication
Inventory your dataset
- Identify file formats (transform to preferred format as needed)
- Record file organization, hierarchy, and naming convention(s)
- Extract zip files when possible
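The inventory steps above can be sketched with Python's standard library. This is a minimal example, not a prescribed tool: the function names and the three recorded columns (relative path, extension, size) are assumptions chosen for illustration, and a real inventory may also need checksums or descriptions.

```python
import csv
import os
from pathlib import Path


def build_inventory(root):
    """Walk `root` and return one record per file: relative path, extension, size."""
    root = Path(root)
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            records.append({
                # The relative path records the folder hierarchy and the file name.
                "relative_path": path.relative_to(root).as_posix(),
                "extension": path.suffix.lower().lstrip("."),
                "size_bytes": path.stat().st_size,
            })
    return sorted(records, key=lambda record: record["relative_path"])


def write_inventory(records, out_csv):
    """Save the inventory as a CSV file that can accompany the dataset."""
    fields = ["relative_path", "extension", "size_bytes"]
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
```

Calling `write_inventory(build_inventory("working_copy"), "inventory.csv")` on a hypothetical `working_copy` folder would produce a file listing that also shows which formats are present.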

Tidy your Data¹

Examine the data (and/or code) for obvious errors, missing components, etc.
- Make sure missing values and blank values are not conflated
What is the nature of your data?
- Original/pure, and needing to be preserved?
- Crude/nascent, and not valuable until processed?
- Erroneous/tainted, and problematic if not cleaned?
What is the dataset trying to do?
- What’s the picture it is providing?
- What were the steps to get to your destination?
- What were the blockers and friction points?
Given your source materials and intentions, what does tidy look like?
- Good, working order (can people use it to investigate?)
- Durable usability
- Linked to Taxonomies and/or Controlled Vocabularies (e.g., Getty Vocabularies)
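One way to check that missing and blank values are not conflated is to count them separately for each column. The sketch below assumes a CSV file in which an empty cell means "blank" and a small set of codes means "missing"; the codes in `MISSING_CODES` are hypothetical and should match whatever your own documentation declares.

```python
import csv
import io
from collections import Counter

# Assumption: these codes mark "missing" in this hypothetical dataset;
# an empty cell is treated as "blank".
MISSING_CODES = {"NA", "N/A", "-999"}


def count_blank_and_missing(csv_text):
    """Count, per column, empty cells and cells holding an explicit missing code."""
    blank, missing = Counter(), Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column is None:  # extra cells beyond the header; ignored in this sketch
                continue
            cell = (value or "").strip()
            if cell == "":
                blank[column] += 1
            elif cell in MISSING_CODES:
                missing[column] += 1
    return dict(blank), dict(missing)
```

A column that shows both blank cells and coded missing values is a prompt to ask whether the two mean different things, and to say so in the documentation.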

Strengthen your Data

Examine files, organization, and documentation:

Are there changes that could enhance the dataset?
- Are there missing data?
- Could a user with similar qualifications to the author’s understand and reuse these data and reproduce the results?
- Are the data, documentation, and/or metadata presented in a way that aids interpretation? (e.g., readme example)
Ensure all files are in non-proprietary file formats. Recommended formats:
- txt, xml, and html for textual data;
- csv for tabular data;
- csv and xml for databases;
- tiff, png, or jpg for images;
- mp3, wav, or flac for audio files.
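The recommended formats above can be turned into a simple check that flags files needing conversion. This is a sketch under stated assumptions: it matches extensions exactly as listed, so common variants such as "tif" or "jpeg" would be flagged unless you add them, and the function name is illustrative.

```python
from pathlib import Path

# Recommended extensions taken from the list above, grouped by data type.
RECOMMENDED = {
    "text": {"txt", "xml", "html"},
    "tabular": {"csv"},
    "database": {"csv", "xml"},
    "image": {"tiff", "png", "jpg"},
    "audio": {"mp3", "wav", "flac"},
}
ALLOWED = set().union(*RECOMMENDED.values())


def files_needing_conversion(paths):
    """Return the paths whose extension is not on the recommended list."""
    return [
        str(path) for path in paths
        if Path(path).suffix.lower().lstrip(".") not in ALLOWED
    ]
```

An extension check only shows what a file claims to be; opening a sample of files is still needed to confirm that they are readable.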

Tasks based on Format

Tasks vary based on file formats and subject domain. Sample tasks based on format:

Tabular Data (e.g., Microsoft Excel) Questions

Check the organization of the data: is it well-structured?
Are headers/codes clearly defined?
Is quality control clearly defined?
Is methodology clear and sufficient?
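Whether headers are clearly defined can be partly checked by comparing the column names in a table against the codebook or data dictionary supplied with it. The sketch below assumes the table has been exported to CSV and that the codebook is available as a simple mapping from header to definition; both the function name and that structure are hypothetical.

```python
import csv
import io


def undefined_headers(csv_text, codebook):
    """Return column headers that have no entry (or an empty one) in the codebook."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name for name in header if not codebook.get(name, "").strip()]


if __name__ == "__main__":
    table = "site_id,temp_c,flag\n1,20.5,Q\n"
    codebook = {
        "site_id": "Site identifier",
        "temp_c": "Air temperature in degrees Celsius",
    }
    print(undefined_headers(table, codebook))
```

Any header returned by the check is one that a reuser would have to guess at, so it is worth raising with the depositor.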

Database(s) Questions

Is there documentation on tables, relationships, queries, etc.?
Can the data be exported (to CSV(s), TXT or other) easily?
Which tables or queries (if any) are used in a publication?
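To test whether a database exports easily, it can help to try the export. The sketch below assumes the database is a SQLite file, which the guidance above does not specify; other systems such as Microsoft Access need their own export tools. The function name `export_tables_to_csv` is illustrative.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables_to_csv(db_path, out_dir):
    """Write every user table in a SQLite database to <out_dir>/<table>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    connection = sqlite3.connect(db_path)
    try:
        # List user tables, skipping SQLite's internal "sqlite_" tables.
        tables = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite\\_%' ESCAPE '\\' "
            "ORDER BY name"
        )]
        for table in tables:
            # Quote the table name so unusual names do not break the query.
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = connection.execute(f"SELECT * FROM {quoted}")
            target = out_dir / f"{table}.csv"
            with open(target, "w", newline="", encoding="utf-8") as handle:
                writer = csv.writer(handle)
                writer.writerow([column[0] for column in cursor.description])
                writer.writerows(cursor)
            written.append(target.name)
    finally:
        connection.close()
    return written
```

A flat export like this preserves the table contents but not relationships or saved queries, which is why documentation of those remains necessary.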

Code Questions

Does the provided code execute without errors?
Is the code commented, i.e., did the author provide descriptive information on sections of code?
Is data for input missing? Are environmental conditions and parameters noted? Is it clear which language(s) and version(s) are used?
Does the code use absolute paths or relative paths? If absolute paths, is this documented in the readme?
Are packages or additional libraries used? If so, is this noted with clear use instructions?
Are data organized consistently for access by code?
Is there an indication of whether the depositor intends reusers to be able to run the code and reproduce results, or just see the process used?
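The question about absolute paths can be answered quickly by scanning the code for string literals that look like them. The sketch below is a heuristic only: it assumes paths appear as quoted strings beginning with "/", "~/", or a Windows drive letter, and it can produce false positives (for example, URL fragments or a lone "/" separator). The pattern and function name are illustrative.

```python
import re

# Heuristic: a quoted string starting with "/", "~/", or a drive letter such as "C:\".
ABSOLUTE_PATH = re.compile(r"""["'](?:/|~/|[A-Za-z]:[\\/])[^"']*["']""")


def find_absolute_paths(source_text):
    """Return (line_number, matched_literal) pairs for likely absolute paths."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        for match in ABSOLUTE_PATH.finditer(line):
            hits.append((number, match.group(0)))
    return hits


if __name__ == "__main__":
    code = 'df = read("/home/ana/data.csv")\nout = open("results/out.csv")\n'
    for line_number, literal in find_absolute_paths(code):
        print(f"line {line_number}: {literal}")
```

Each hit is a place where the code would fail on another machine unless the path is documented in the readme or replaced with a relative one.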

Data Visualization Questions

Review alt-text and visualization descriptions. Ensure these describe, but do not interpret, associated visualizations.
Check that data visualizations follow accessible color contrast guidelines.
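Color contrast can be checked numerically. The sketch below assumes the guideline in question is the Web Content Accessibility Guidelines (WCAG) 2.x, which the text above does not name; it implements the WCAG relative luminance and contrast ratio formulas for two hex colors. WCAG asks for a ratio of at least 4.5:1 for normal text and 3:1 for large text and graphical objects.

```python
def _channel(value):
    """Convert an 8-bit sRGB channel to its linear value (WCAG 2.x definition)."""
    c = value / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a color written as '#RRGGBB'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)


def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(color_a), relative_luminance(color_b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)
```

For a chart, the pairs worth testing are text against background and each adjacent pair of data colors.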

Others

For additional steps based on format, see the following primers created by the Data Curation Network (DCN):

Acrobat PDF Primer
ATLAS.ti Primer
Geodatabase Primer
GeoJSON Primer
Jupyter Notebook Primer
Microsoft Access Primer
Microsoft Excel Primer
netCDF Primer and Tutorial using NCAR dataset
SPSS Primer
STL Primer
R Primer
Tableau Primer
WordPress.com Primer

Tidy Data varies depending on sources and other aspects of inquiry. Tidy Data is not meant to erase or flatten historically complex data; rather, it is the process by which the data become more usable for investigation and understanding through research.