Module 5: Data Preservation and Archiving

This page is currently under construction

The training curriculum is currently undergoing final revisions and quality checks. All materials will be released shortly. Until the official release, please refrain from using, distributing, or implementing any part of these resources.

Learning Objectives

  • Learning Objective 1 (LO1): Evaluate data in order to set up its preservation.
  • Learning Objective 2 (LO2): Provide practical advice on the technical aspects of data preservation and data archiving.
  • Learning Objective 3 (LO3): Identify information about local and national data infrastructures.

Total Module Duration

Approx. 3 hours

Learning Objective 1

LO1: Evaluate data in order to set up its preservation.

Learning Activities

  • Lecture (15 mins): Lecture on the definitions, stakes, challenges, and selection criteria of data preservation.
  • Group Work (30 mins): In small groups, provide learners with fictional datasets and their documentation of different types (simulation data, experimental data, survey data). Each group must classify these data into three categories: to preserve indefinitely, to evaluate periodically, or no need for long-term preservation/suppression. The group work can be followed by a reflection exercise where each group shares its decisions and justifications with all participants, followed by a discussion to compare the choices made by different groups.

Materials to Prepare

  • Slide presentation on long-term preservation and archiving, including the main challenges.
  • A set of fictional, or real, and varied datasets accompanied by their documentation from different disciplines (depending on the target audience) and with differing preservation criteria.

Instructor Notes

Lecture:

  • The instructor can define long-term preservation and archiving, explain what is at stake, and give an overview of the challenges related to research data preservation (data selection, obsolete formats, obsolescence of hardware or software, lack of documentation).
  • Address the main selection criteria for data preservation (scientific value, legal requirements, preservation costs) along with other factors that may impact data selection (data volume, environmental impact, data quality, reusability).
  • The instructor can show how to identify and apply data repository standards to datasets, how to select an appropriate data repository, and how to classify data according to its content (for example, sensitive or commercial data). Learners should be able to weigh the costs of archiving against its benefits and identify which parts of the data need to be preserved for preservation to remain sustainable.

Group work:

  • The trainer is encouraged to prepare a variety of datasets, which will encourage the learners to ask themselves the right questions. For example, choose a dataset that objectively lacks documentation. This encourages the learners to think about the relevance of recommending a dataset when little or nothing is known about how it was produced.
  • The instructor can also use non-fictional datasets that are licensed for re-use for educational purposes. Data can be found in data repositories (such as Zenodo, or thematic repositories depending on the target audience for the course, or OpenAIRE Explore which is the European platform for accessing publications and datasets). Resources 1 and 2 provide some links.
  • For the activity, the trainer can use these criteria to help the learners classify the data:
    • Preserve indefinitely: Data with indisputable scientific, legal, or operational value (crucial for future research, irreplaceable or very costly to reproduce, or legally required to be preserved).
    • Evaluate periodically: Data with significant value at the time of creation but with uncertain long-term importance, whose relevance depends on the evolution of research, the cost of preservation, and so on.
    • No need for long-term preservation/suppression: Data with no meaningful preservation value that are perishable or subject to eventual destruction, such as personal data with legal destruction requirements.
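
The three categories above can be sketched as a small decision rule that the trainer could show after the group discussion. This is a minimal illustration, not a validated appraisal method: the field names, the `classify` function, and the order in which the rules are applied (a legal destruction requirement is checked first) are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PRESERVE_INDEFINITELY = "preserve indefinitely"
    EVALUATE_PERIODICALLY = "evaluate periodically"
    NO_LONG_TERM_PRESERVATION = "no need for long-term preservation/suppression"


@dataclass
class DatasetProfile:
    """Hypothetical appraisal answers for one dataset (all names are assumptions)."""
    must_be_destroyed: bool = False            # e.g. personal data with a legal destruction requirement
    legally_required: bool = False             # preservation is required by law or contract
    irreplaceable: bool = False                # cannot be reproduced, or only at very high cost
    crucial_for_future_research: bool = False
    significant_current_value: bool = False    # valuable now, long-term importance uncertain


def classify(profile: DatasetProfile) -> Decision:
    # Assumption: a destruction requirement overrides every other criterion.
    if profile.must_be_destroyed:
        return Decision.NO_LONG_TERM_PRESERVATION
    # Indisputable scientific, legal, or operational value.
    if (profile.legally_required
            or profile.irreplaceable
            or profile.crucial_for_future_research):
        return Decision.PRESERVE_INDEFINITELY
    # Valuable today, but the decision should be revisited later.
    if profile.significant_current_value:
        return Decision.EVALUATE_PERIODICALLY
    return Decision.NO_LONG_TERM_PRESERVATION


if __name__ == "__main__":
    survey = DatasetProfile(significant_current_value=True)
    print(classify(survey).value)
```

Real appraisal involves judgment that a handful of booleans cannot capture, so the sketch is best used to make the groups' implicit rules explicit and to compare them.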

Resources

Inspiration for instructor:

  1. Google Dataset Search. https://datasetsearch.research.google.com/. Accessed 10 Apr. 2025.
  2. "OpenAIRE | Search for Research Products." OpenAIRE - Explore. https://explore.openaire.eu/search/find/research-outcomes?type=datasets. Accessed 10 Apr. 2025.

Learning Objective 2

LO2: Provide practical advice on the technical aspects of data preservation and data archiving.

Learning Activities

  • Group work (30 mins): Learners are given several cards representing different elements to take into account in the creation of a data preservation workflow. The learners look at their card (for example, "formatting data in a durable format") and then they have to construct a short argument about this element. They should try to explain why they think this particular element is important to consider when setting up a workflow for data management.
  • Peer Review process (15 mins): The learners are encouraged to take notes while each person presents the element on their card. The learners then discuss the presentations, adding to them and commenting on them.
  • Conclusion lecture (25 mins): The trainer goes over the content of all the cards and gives key takeaways on each of the elements that constitute a workflow for preserving research data.

Materials to Prepare

  • Cards preparation: The instructor prepares the cards for this activity before the session.
  • Lecture on steps to take into account during data preservation.

Instructor Notes

Group work:

  • To facilitate the group work, the trainer can prepare a set of prompt cards. Each card should represent one clear, distinct element or concept in the research data preservation process, with a concise description that is easy to understand.
  • A few examples of cards include:
    • Anticipate data preservation in the DMP
    • Data value assessment
    • Consider preservation constraints/risks assessment
    • Convert files into durable formats and generate long-term metadata
    • Control the quality and readability of converted files and define access restrictions
    • Plan periodic reviews to ensure data integrity
    • Have an expert validate the data preservation plan
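
For the card on periodic integrity review, one common technique is a fixity check: record a checksum for every file at deposit time, then recompute and compare at each review. The sketch below uses only the Python standard library; the function names and the shape of the report are assumptions made for illustration, and a real archive would typically rely on the fixity features of its repository software.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the current state of root with a previously stored manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

In practice the manifest would be stored separately from the data (for example as a JSON file), so that a failure affecting the data does not also affect the reference checksums.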

Lecture:

  • The trainer should emphasise that data preservation steps vary with data type and legal requirements, which is why data stewardship requires flexibility and adaptability.
  • The trainer may recommend additional reading, such as case studies (for example, the University of Glasgow case study listed under Resources).
  • Other areas to cover include:
    • practical advice on the technical aspects of data preservation and data archiving,
    • best practices for organising and structuring data before archiving (file naming and folder hierarchy),
    • file formats for long-term preservation,
    • risks while migrating data,
    • risk assessment/audit and certification,
    • generating long-term metadata (PREMIS, METS),
    • access and security control,
    • benefits of Data Management Plans (DMPs), and
    • deposit in a trustworthy repository or electronic archiving system.
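
To make the points on file naming and file formats concrete, the trainer could show a small pre-archiving check like the one below. It is only a sketch: the list of extensions and the naming rule are assumptions chosen for illustration, not an official list of preservation formats, and the target repository's own format policy should take precedence. An extension alone also cannot confirm the archival profile of a file (for example, whether a `.pdf` is actually PDF/A).

```python
import re
from pathlib import PurePosixPath

# Assumption: an illustrative set of open, widely supported formats.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".xml", ".json", ".pdf", ".tif", ".tiff", ".png"}

# Assumption: names use only letters, digits, dots, underscores, and hyphens,
# and start with a letter or digit (no spaces or special characters).
SAFE_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")


def check_path(relative_path: str) -> list[str]:
    """Return a list of issues to review before archiving; an empty list means none found."""
    path = PurePosixPath(relative_path)
    issues = []
    # Check every folder and the file name itself against the naming rule.
    for part in path.parts:
        if not SAFE_NAME.match(part):
            issues.append(f"unsafe name: {part}")
    # Flag formats outside the illustrative list for manual review or conversion.
    if path.suffix.lower() not in PREFERRED_EXTENSIONS:
        issues.append(f"format to review: {path.suffix or '(none)'}")
    return issues


if __name__ == "__main__":
    for candidate in ["survey_2024/raw/responses_v01.csv", "final data (2).xlsx"]:
        print(candidate, "->", check_path(candidate) or "ok")
```

A flagged file is not necessarily unsuitable; the check simply lists where a conversion or renaming decision is needed.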

Resources

Input for lecture about long term metadata:

  1. "Long Term Preservation Concept - CINES." https://www.cines.fr/preservation/what-is-digital-archiving/long-term-preservation-concept/. Accessed 15 Apr. 2025.
  2. "Long Term Metadata - CINES." https://www.cines.fr/preservation/what-is-digital-archiving/long-term-metadata/. Accessed 15 Apr. 2025.

Formats for long term preservation:

  1. Course: Essentials 4 Data Support (English) - Public | DANS. https://danstraining.moodlecloud.com/course/view.php?id=11.

Case study: the University of Glasgow's digital preservation journey:

  1. Spence, Alison, et al. "Case Study: The University of Glasgow's Digital Preservation Journey 2017-2019." Insights: The UKSG Journal, vol. 32, no. 1, Mar. 2019. doaj.org, https://doi.org/10.1629/uksg.461.

What are the risks of digital preservation and archiving:

  1. Risks - Digital Preservation Coalition. https://www.dpconline.org/digipres/implement-digipres/dpeg-home/dpeg-risks.

Learning Objective 3

LO3: Identify information about local and national data infrastructures.

Learning Activities

  • Lecture (20 mins): The trainer presents one or more infrastructures and organisations in their country and/or internationally, showing what each infrastructure or service offers for data preservation. The trainer also presents certification bodies such as CoreTrustSeal and discusses the certification criteria for repositories with the learners.
  • Research and Reflective Activity (30 mins): Learners select a data repository from their country or a repository recognised in their discipline. Using information from the presentation and their own research (internet searches, tools such as FAIRsharing), they prepare a short presentation on the chosen repository, addressing two questions:
    1. What elements make this repository trustworthy?
    2. Why might a data steward recommend this repository to a scientist?
  • Presentation and Debate (20 mins): Volunteers can present the results of their synthesis, while other learners are encouraged to highlight key elements for effective data preservation, as well as any potential "missing" elements.

Materials to Prepare

  • Slide presentation.

Instructor Notes

Lecture:

  • The instructor can introduce national and international infrastructures dedicated to data archiving (CINES, BnF, Huma-Num and public archiving structures in France, DANS in the Netherlands, the Library of Congress in the USA, CESSDA, etc.).
  • The instructor can also present certification bodies (CoreTrustSeal, nestor Seal for Trustworthy Digital Archives, etc.) and discuss repositories and archival systems, using services such as re3data and FAIRsharing to select trustworthy repositories.

Research and reflective activity:

  • In the case where the repository chosen by the learners holds the CoreTrustSeal certification, the instructor can encourage learners to review the repository's certification report on the CoreTrustSeal website.

Resources

Overall:

  1. Best Practice: Identify suitable repositories for the data. https://dataoneorg.github.io/Education/bestpractices/identify-suitable-repositories.
  2. Lin, D., Crabtree, J., Dillo, I. et al. The TRUST Principles for digital repositories. Sci Data 7, 144 (2020). https://doi.org/10.1038/s41597-020-0486-7.
  3. CoreTrustSeal Standards and Certification Board. (2022). CoreTrustSeal Requirements 2023-2025 (V01.00). Zenodo. https://doi.org/10.5281/zenodo.7051012.
  4. Witt, M., Cannon, M., Lister, A., Segundo, W., Shearer, K., Yamaji, K., & Research Data Alliance Data Repository Attributes Working Group. (2024). RDA Common Descriptive Attributes of Research Data Repositories (1.0). Zenodo. https://doi.org/10.15497/RDA00103.