Module 4: Data Curation

This page is currently under construction

The training curriculum is currently undergoing final revisions and quality checks. All materials will be released shortly. Until the official release, please refrain from using, distributing, or implementing any part of these resources.

Learning Objectives

Learning Objective 1 (LO1): Recognise the importance of FAIR data, "CARE-ful" data treatment, documentation, and organisation from a data curation perspective.
Learning Objective 2 (LO2): Apply CoreTrustSeal's requirements and the CURATE(D) checklist for datasets to certified repositories.
Learning Objective 3 (LO3): Summarise the means to and benefits of organising and cleaning data.
Learning Objective 4 (LO4): Recognise the difference between anonymisation and pseudonymisation and the legal and ethical differences between the two.

Total Module Duration

Approx. 5 hours 30 minutes

Learning Objective 1

LO1: Recognise the importance of FAIR data, "CARE-ful" data treatment, documentation, and organisation from a data curation perspective.

Learning Activities

Lecture (45 mins): Lecture on FAIR and CARE principles in data curation (Resources 1, 2).
Discussion (45 mins): Debate about data sovereignty, CARE, and data curation.

Materials to Prepare

Slides for the lecture; review materials from the other RDM modules, particularly FAIR data.

Instructor Notes

General Remark:

The instructor can draw on other modules in the Research Data Management section, such as those on FAIR data, Data Documentation and Storage, and Data Sharing and Publishing.

Lecture and Discussion:

The instructor can reiterate key points from the FAIR data module if required: FAIR is not a binary "is" or "is not" concept but a spectrum. Flexibility is key, and the principles should be applied on a case-by-case basis.
Introduce the CARE principles through a case study (Resource 3).
The instructor can facilitate a discussion (Resources 1–3) around the following questions: How can the CARE principles be upheld in data that is to be published or shared? How do these principles affect researchers in different countries and in different research domains? The underlying idea is that researchers working on or in post-colonial countries will be affected by these ethical considerations.

Resources

Carroll, Stephanie Russo, et al. "Operationalizing the CARE and FAIR Principles for Indigenous Data Futures." Scientific Data, vol. 8, no. 1, Apr. 2021, p. 108. DOI.org (Crossref), https://doi.org/10.1038/s41597-021-00892-0.
"CARE Principles for Indigenous Data Governance." Global Indigenous Data Alliance, 23 Jan. 2023, https://www.gida-global.org/care.
Carroll, Stephanie Russo, et al. "The CARE Principles for Indigenous Data Governance." Data Science Journal, vol. 19, Nov. 2020, p. 43. DOI.org (Crossref), https://doi.org/10.5334/dsj-2020-043.

Learning Objective 2

LO2: Apply CoreTrustSeal's requirements and the CURATE(D) checklist for datasets to certified repositories.

Learning Activities

Lecture (20 mins): The instructor can introduce CoreTrustSeal as a set of criteria for repositories, along with its advantages. The learners can be introduced to cataloguing data and to the CURATE(D) checklist for curating data prior to publication.
Exercise (45 mins): The goal of the exercise is to apply the CoreTrustSeal requirements as a checklist for individual datasets. Query the NASA dataset repository: Ask participants to analyse which curatorial decisions NASA has made to curate their studies on the planet Venus. The exercise can be done in pairs or small groups so that there is discussion during the activity.

Materials to Prepare

Slides for lecture on the CoreTrustSeal.
A general familiarity with the NASA website and its curation of a dataset. A general familiarity with the CoreTrustSeal requirements for data repositories as well as the CURATE(D) checklist for curating data prior to publication.

Instructor Notes

Lecture:

The instructor can link to materials in the ontologies and metadata modules for the lecture.
The learners can be introduced to CoreTrustSeal, with the instructor going through its requirements and addressing two questions: How do we prepare a dataset for a repository that holds the CoreTrustSeal? How do we administer a local repository to meet these requirements?
Discuss the advantages of obtaining the CoreTrustSeal for organisations and data stewards, such as increased credibility, improved data sharing and collaboration opportunities, and enhanced user trust. Highlight case studies or examples of organisations that have successfully achieved certification and the positive impacts it had on their data management practices (Resource 7).
Explain the steps involved in the assessment and certification process for obtaining the CoreTrustSeal. Discuss the self-assessment tools, documentation requirements, and the role of external audits. Provide insights into how data stewards can prepare their repositories for evaluation.
Go through the CURATE(D) checklist: Check, Understand, Request, Augment, Transform, Evaluate, Document.
This can be done either as a theoretical discussion of each step a data steward would take when reviewing a dataset, or practically, by going through a sample dataset together.
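
For the practical variant, the instructor could keep notes against each step in a small structure like the one below. This is only a sketch: the step names come from the checklist above, while the class, its methods, the dataset name, and the notes are hypothetical and invented for illustration.

```python
from dataclasses import dataclass, field

# Step names as listed in the CURATE(D) checklist above.
CURATED_STEPS = (
    "Check", "Understand", "Request", "Augment",
    "Transform", "Evaluate", "Document",
)


@dataclass
class CurationRecord:
    """Hypothetical helper for noting what was done at each step."""
    dataset: str
    notes: dict = field(default_factory=dict)

    def record(self, step: str, note: str) -> None:
        # Reject anything that is not one of the seven checklist steps.
        if step not in CURATED_STEPS:
            raise ValueError(f"Unknown CURATE(D) step: {step}")
        self.notes.setdefault(step, []).append(note)

    def open_steps(self) -> list:
        """Return the steps that have no note yet, in checklist order."""
        return [s for s in CURATED_STEPS if s not in self.notes]


review = CurationRecord(dataset="sample-dataset")  # invented dataset name
review.record("Check", "All files open; README present.")
review.record("Request", "Asked the depositor for a codebook.")
print(review.open_steps())
```

Listing the open steps at the end of the session gives the group a simple prompt for what still needs to be discussed.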

Exercise:

For the learning activity, search for and review NASA's curatorial choices and how they curate their data. How would you go through this dataset as a data steward so that you are checking both the requirements for CoreTrustSeal repositories and the CURATE(D) checklist for curating datasets? A way to get the conversation going could be to look at CoreTrustSeal's guidance with respect to digital object management (p. 13 of the 2023–2025 requirements document). Does NASA note any changes to the data and metadata (versioning)? How does the repository handle provenance? With respect to preservation, does NASA include documentation on how these files will be preserved in the long term? Some items of the checklist concern internal workflows at NASA that are not completely transparent, but try to identify the elements of CoreTrustSeal and CURATE(D) that can be assessed from what is openly available.

Resources

Overall:

CoreTrustSeal Standards And Certification Board. CoreTrustSeal Requirements 2023-2025. Zenodo, 5 Sept. 2022. DOI.org (Datacite), https://doi.org/10.5281/ZENODO.7051012.
CoreTrustSeal Standards And Certification Board. CoreTrustSeal Trustworthy Data Repositories Requirements: Glossary 2023-2025. Zenodo, 5 Sept. 2022. DOI.org (Datacite), https://doi.org/10.5281/ZENODO.7051125.
CoreTrustSeal Standards And Certification Board. CoreTrustSeal Trustworthy Digital Repositories Requirements 2023-2025 Extended Guidance. Zenodo, 5 Sept. 2022. DOI.org (Datacite), https://doi.org/10.5281/ZENODO.7051096.
NASA dataset query. https://nssdc.gsfc.nasa.gov/nmc/DatasetQuery.jsp.
"Data-Primers/Curated.Md at Main -- DataCurationNetwork/Data-Primers." GitHub, https://github.com/DataCurationNetwork/data-primers/blob/main/curated.md. Accessed 28 Mar. 2025.
Data Curation CURATE(D) checklist: https://github.com/DataCurationNetwork/data-primers/blob/main/curated.md#check-step.
CoreTrustSeal-AMT. https://amt.coretrustseal.org/certificates.

Learning Objective 3

LO3: Summarise the means to and benefits of organising and cleaning data.

Learning Activities

Lecture (60 mins): Introduce the concept of tidy data and the difficulties presented by untidy data.

Materials to Prepare

Slides for lecture on tidy data (defining tidy data and realising the F and A in FAIR).
Optional Activity.

Instructor Notes

Lecture:

Findability and Accessibility can be impaired by sloppy data practices, which can in turn have knock-on effects on the data's reusability.
Introduce the concept of tidy data as defined by Hadley Wickham, which states that each variable should have its own column, each observation should have its own row, and each type of observational unit should have its own table. Discuss the importance of tidy data for data analysis, visualisation, and reproducibility.
Summarise common difficulties caused by untidy or incomplete data and similar issues. For instance, missing entries can cause problems in code, so part of curating a dataset may be filling missing entries with dummy values. Using uniform formats enables sorting, filtering, and further analysis. Deleting duplicate entries is another common cleaning step.
Highlight the importance of documenting changes to datasets. A lack of documentation can limit the dataset's use in reproducibility tests and lower overall transparency. Changes can be documented through versioning as well as other forms of logging within data repositories.
Point towards further automation that coding enables, such as retrieving data and checking it programmatically.
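
To make the tidy data definition concrete in the lecture, the instructor could show a minimal pandas sketch such as the following. The table, its column names, and its values are invented for illustration; the reshaping uses `DataFrame.melt`, and the final lines show one very simple automated check.

```python
import pandas as pd

# Hypothetical untidy table: one row per site, one column per year.
# "Year" is a variable, but here it is hidden in the column headers.
untidy = pd.DataFrame(
    {
        "site": ["A", "B"],
        "2021": [10, 7],
        "2022": [12, None],
    }
)

# Reshape so that each variable (site, year, count) has its own column
# and each observation (one site in one year) has its own row.
tidy = untidy.melt(id_vars="site", var_name="year", value_name="count")
tidy["year"] = tidy["year"].astype(int)

# A simple automated check: count the observations that lack a value.
missing = int(tidy["count"].isna().sum())

print(tidy)
print(f"Observations with a missing count: {missing}")
```

In the tidy form, sorting, filtering, and counting missing values each take a single call, which is the practical argument for the layout.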

Optional Activity

The activity can focus on automating boring tasks with Python and Pandas: loops and conditionals, Pandas and DataFrames, aggregating data, dealing with empty values and incomplete data, type conversion, unique and standard values, and best practices and documentation.

The instructor can use Python as an example of a tool for data cleaning. A recommended way to work through this concretely:

Go through the basics of Python automation; loops and conditionals.
Go through the Pandas library for Python, covering crucial functions for tidy data such as pd.melt(), pd.pivot_table(), pd.concat(), dropna(), fillna(), astype(), and drop_duplicates().
Make clear that this is not only work that the data steward may be called on to do, but might also be something that they teach.
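
A possible starting point for the optional activity is sketched below. It combines several of the functions listed above with a loop, a conditional, and a plain-text change log. The two survey tables, the `log` helper, and the plausibility range for ages are assumptions made up for this example, not part of any referenced material.

```python
import pandas as pd

# Hypothetical survey exports from two collection rounds (invented values).
round_1 = pd.DataFrame(
    {"id": [1, 2, 3], "country": ["DK", "dk", "SE"], "age": ["34", "41", None]}
)
round_2 = pd.DataFrame(
    {"id": [3, 4], "country": ["SE", "NO "], "age": [None, "29"]}
)

change_log = []  # plain-text record of every curation step


def log(step, before, after):
    """Record what a step did so that the cleaning stays documented."""
    change_log.append(f"{step}: {before} rows -> {after} rows")


# Combine both rounds into one table.
data = pd.concat([round_1, round_2], ignore_index=True)
log("concat", len(round_1) + len(round_2), len(data))

# Standard values: trim whitespace and use upper-case country codes.
data["country"] = data["country"].str.strip().str.upper()

# Remove exact duplicate rows (id 3 appears in both rounds).
n = len(data)
data = data.drop_duplicates()
log("drop_duplicates", n, len(data))

# Drop rows without an age, then convert the column from text to integers.
n = len(data)
data = data.dropna(subset=["age"]).astype({"age": int})
log("dropna(age)", n, len(data))

# Loop and conditional: flag implausible ages instead of silently editing.
for row in data.itertuples():
    if not 0 <= row.age <= 120:
        change_log.append(f"check: id {row.id} has implausible age {row.age}")

# Aggregate: mean age per country.
summary = data.pivot_table(index="country", values="age", aggfunc="mean")

print(data)
print(summary)
print("\n".join(change_log))
```

Whether rows with missing values are dropped, as here, or filled with `fillna()` is a curatorial decision; either way, the change log keeps that decision visible to later users of the dataset.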

Resources

Lecture:

Wickham, Hadley. "Tidy Data." Journal of Statistical Software, vol. 59, no. 10, 2014. DOI.org (Crossref), https://doi.org/10.18637/jss.v059.i10.

Recommendations:

For engaging directly with cleaning data: Tidy data: https://vita.had.co.nz/papers/tidy-data.pdf.
Copenhagen University Library Datalab intros to Python Pandas. https://kubdatalab.github.io/python/docs/intro.html.
Melanie Walsh's introduction to Cultural Analytics and Python. https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html.
Library Carpentry: Tidy data with Pandas. https://librarycarpentry.github.io/lc-python-intro/tidy.html.

Learning Objective 4

LO4: Recognise the difference between anonymisation and pseudonymisation and the legal and ethical differences between the two.

Learning Activities

Lecture (45 mins): The instructor can introduce anonymisation and pseudonymisation.
Fictitious dataset activity (75 mins): Look at an example dataset, evaluate it in class for privacy concerns, and work with the anonymisation tool.

Materials to Prepare

Slides for lecture on definition and differentiation between anonymisation and pseudonymisation.
Dataset activity: access to the UKDS anonymisation tool (Resource 1).

Instructor Notes

Lecture:

Define and differentiate between anonymisation and pseudonymisation. Explain that anonymisation irreversibly removes identifiable information, making it impossible to trace the data back to an individual, while pseudonymisation replaces identifiable information with pseudonyms or tokens, allowing for potential re-identification under controlled circumstances. Anonymisation can be applied to text data, audio data, and other data types, and tools such as the UKDS anonymisation tool can support it. Discuss the implications of each method for data privacy and compliance with GDPR.
Discuss the legal frameworks and ethical considerations surrounding anonymisation and pseudonymisation. Highlight the importance of understanding data protection regulations, such as GDPR and HIPAA, which govern the use of personal data. Emphasise the need for data stewards to ensure compliance while balancing data utility and privacy.
Introduce various techniques for anonymisation (such as data masking, aggregation, and noise addition) and pseudonymisation (such as hashing, tokenisation). Provide examples of how these techniques can be applied to different types of data, such as structured databases and unstructured text. Discuss the trade-offs between data utility and privacy.
Cover best practices: conducting data audits to identify sensitive information, applying anonymisation and pseudonymisation techniques during data collection, and reviewing and updating methods to comply with regulations.
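
The difference between the two families of techniques can be shown with a short sketch using Python's standard `hashlib` and `hmac` modules. The secret key, the helper functions, and the records are hypothetical. Direct identifiers are replaced with a keyed hash (pseudonymisation), and exact ages are coarsened into ten-year bands, a simple generalisation step in the spirit of the aggregation technique mentioned above.

```python
import hashlib
import hmac

# Hypothetical secret key. In practice it would be stored separately from
# the data, because whoever holds it can re-create the link to individuals.
SECRET_KEY = b"replace-with-a-randomly-generated-key"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash that acts as a stable pseudonym."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability


def age_band(age: int, width: int = 10) -> str:
    """Generalise an exact age into a band, for example 34 -> '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Invented records for illustration only.
records = [
    {"name": "Alex Example", "age": 34, "answer": "yes"},
    {"name": "Sam Sample", "age": 58, "answer": "no"},
]

shared = [
    {
        "pseudonym": pseudonymise(r["name"]),
        "age_band": age_band(r["age"]),
        "answer": r["answer"],
    }
    for r in records
]

for row in shared:
    print(row)
```

The output is pseudonymised rather than anonymised: anyone holding the key can recompute the pseudonym for a known name, which is why pseudonymised data is still treated as personal data under GDPR. This makes the sketch a useful prompt for discussing how far additional steps would have to go before the data could count as anonymous.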

Activity:

Find a discipline-relevant dataset that includes some form of personal data. You can also use the dataset on support for childhood vaccination in Resource 8.
Instruct the learners to apply the anonymisation tool. Does it work? Given the content of the lecture, what concerns do the learners have about applying this kind of tool? What ethical issues might be of concern prior to sharing this data openly?

Resources

Input xxx:

UK Data Service. "Anonymising Qualitative Data." https://ukdataservice.ac.uk/learning-hub/research-data-management/anonymisation/anonymising-qualitative-data/.
Data with personal information in DORIS | Swedish National Data Service. https://snd.se/en/doris-researchers/describe-and-share-data-doris/data-personal-information-doris.
AEPD-EDPS Joint Paper on 10 Misunderstandings Related to Anonymisation | European Data Protection Supervisor. https://www.edps.europa.eu/data-protection/our-work/publications/papers/aepd-edps-joint-paper-10-misunderstandings-related_en.
ARX -- Data Anonymization Tool -- A Comprehensive Software for Privacy-Preserving Microdata Publishing. https://arx.deidentifier.org/.
Welcome to Faker's documentation! --- Faker 37.1.0 documentation. https://faker.readthedocs.io/en/master/.
"Hashlib --- Secure Hashes and Message Digests". Python Documentation. https://docs.python.org/3/library/hashlib.html.
Terrovitis, Manolis, Dimitris Tsitsigkos, and Nikolaos Dimakopoulos. Amnesia Anonymization Tool - Data anonymization made easy. https://amnesia.openaire.eu/.
Chiavenna, Chiara. Replication Data for: Personal Risk or Societal Benefit? Investigating Adults' Support for COVID-19 Childhood Vaccination. Harvard Dataverse, 2022. DOI.org (Datacite), https://doi.org/10.7910/DVN/Y3WAJL.