Model Data Archiving Guidelines

README

These guidelines were informed by input from the U.S. Department of Energy's (DOE) Environmental System Science (ESS) land modeling community and are associated with the manuscript by Simmonds et al. (2022). The terrestrial modeling community faces unique data archiving challenges because its models and simulations span scientific domains and address a diversity of research questions. Working with DOE terrestrial modelers, we created these guidelines to help users determine which components of model simulations to archive and how to bundle files for data publication. We also discuss data repository tools that can facilitate future archiving of model data. We envision the recommendations being applied to other types of models beyond the terrestrial modeling community. Following the guidelines will give modelers a clear understanding of which components of their model to archive and will enable model data reuse and integration.

These guidelines are the culmination of the aforementioned efforts. They will evolve over time based on ongoing community engagement and feedback received on the material in this GitHub repository.

Getting started

To begin using the model data archiving guidelines, visit the instructions page. There you will find the step-by-step procedure for making decisions about which files to archive.

Updates in v1.1.0

Updates in v1.1.0, as of November 18, 2021, include:

Removing figure of model data archiving guidelines and replacing the figure with a text-based description of the guidelines based on feedback.
Updating citation for our In Revision manuscript and adding two co-authors.
Updating figure of file-level metadata guidance with new recommendations from v1.0.0 of the file-level metadata reporting format.

How to contribute

If you would like to suggest a change to the model data archiving guidelines, please submit a GitHub issue using one of our issue templates.

If you would prefer to submit feedback by email, or have any other inquiries, contact us at ess-dive-support [at] lbl.gov.

Usage license

The content in this repository is free to use under the CC BY 4.0 license, and we ask that you cite the paper below to attribute credit.

How to cite these guidelines

Simmonds, M.B., Riley, W.J., Agarwal, D.A., Chen, X., Cholia, S., Crystal-Ornelas, R., Coon, E.T., Dwivedi, D., Hendrix, V.C., Huang, M., Jan, A., Kakalia, Z., Kumar, J., Koven, C.D., Li, L., Melara, M., Ramakrishnan, L., Ricciuto, D.M., Walker, A.P., Zhi, W., Zhu, Q. and Varadharajan, C., 2022. Guidelines for Publicly Archiving Terrestrial Model Data to Enhance Usability, Intercomparison, and Synthesis. Data Science Journal, 21(1), p.3. DOI: http://doi.org/10.5334/dsj-2022-003
Simmonds M.B., Crystal-Ornelas, R., Riley, W.J., Agarwal, D.A., Chen, X., Cholia, S., Coon, E.T., Dwivedi, D., Hendrix, V.C., Huang, M., Jan, A., Kakalia, Z., Kumar, J., Li, L., Melara, M., Ramakrishnan, L., Ricciuto, D.M., Walker, A.P., Zhi, W., Zhu, Q., Varadharajan, C. (2021). ESS-DIVE guidelines for archiving terrestrial model data. Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE), ESS-DIVE repository. https://doi.org/10.15485/1813868

Funding and acknowledgements

ESS-DIVE is funded by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Earth and Environmental Sciences Division, Data Management program under contract number DE-AC02-05CH11231. ESS-DIVE uses resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility operated under Contract No. DE-AC02-05CH11231. ORNL is managed by UT-Battelle, LLC, for the DOE under contract DE-AC05-00OR22725. We thank two anonymous reviewers who provided feedback on our associated manuscript, and William Collins (LBNL) for his thoughtful insights on model data archiving.

Related references

Simmonds, M B, W J Riley, M Melara, S Cholia, C Varadharajan (2020, Dec 8). Addressing Model Data Archiving Needs for the Department of Energy’s Environmental Systems Science Community, AGU Fall Meeting, Poster.

Instructions

We compiled this set of model data archiving guidelines from a review of existing model archiving practices and input from land modelers. The guidelines are intended to help modelers decide how to organize and archive data from their land model simulations.

We have organized the instructions into three sections.

Model data archiving guidelines: Guidelines for organizing model data files
Deciding how to bundle files: A decision tree to help users decide how to group files for archive
File-level metadata: A deeper look at one component of the model data guidelines

1. Model data archiving guidelines

Metadata – This refers to pertinent information about the archived data and code (e.g., abstract, geographical and temporal extents), as well as a description of the files being archived with links to other DOI-issued publications within the entire simulation workflow, as applicable.
Required Data Files – Archived datasets should specify or include model inputs, outputs, code, and scripts, depending on whether the data are published elsewhere or exceed repository dataset size limits. File names should be unique and can follow an intuitive naming convention to help with discoverability. File names should contain only letters, numbers, hyphens, and underscores, should not contain spaces, and should not rely on case-sensitive file systems.
1. Model Inputs – Input files should be included unless they are publicly available elsewhere, in which case a hyperlink to the specific input files (e.g., climate forcings, meshes, soil parameterizations) should be provided in the metadata and user guide. Use open formats such as comma-separated values (.csv) or NetCDF (.nc) where possible.
2. Model Outputs – Archive all model outputs if the size of the data files is within the repository storage limitations. This output should include the raw and post-processed data and, if associated with a scientific publication, the data that support the main findings, tables, and figures. If the size of the model output exceeds repository storage limitations, use the decision tree (Figure 2 in the associated manuscript and pictured below) to decide which data to publish. Use open formats such as comma-separated values (.csv) or NetCDF (.nc) where possible.
3. Model Code – Include the source code used to generate the results in the paper unless the code is publicly available elsewhere (e.g., GitHub or Zenodo), in which case include the specific version, hash information, or a citation allowing the exact source code to be recovered. Include links to any external model codes in the metadata and user guide. If published on GitHub, provide the commit hash associated with the specific version. If available, include a reference (with DOI) to the tagged release in an established data repository.
4. Scripts – Include run scripts if they are necessary for running the model to generate published results. Optionally also include scripts necessary for reproducing the parameters and model configuration for the simulations and input files, for post-processing model outputs to produce the results (e.g., tables and figures in a publication), and for executing the entire workflow used to generate the model results.
Optional Files –
1. File-level metadata (FLMD) – Include descriptions of all the data files as one file catalog (e.g., by using the file-level metadata reporting format). Optionally also include one data dictionary for each file type within the data publication describing columns and variables.
2. Model Testing Data – Include data files of observations from each location simulated to produce the results in the paper, in an open format (e.g., CSV). If the data are publicly available in another repository, include a reference (with DOI) in the metadata and user guide.
3. Documentation or user guide – Include a readme file (e.g., pdf) for each site-specific or large-scale simulation, with details on the model name and version number and on required data or code dependencies. Also include a citation for the model code and licensing information if applicable.
Use in publications – If publishing model results, cite and include links to the data and code publication(s) in the Data or Code Availability section. Include the citations of the dataset and code publication(s) with DOI(s) in the references section. Examples of Data or Code Availability statements associated with the journal articles researched in this study are provided in Supplemental Table 1 of the associated manuscript.
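
The file naming rules under Required Data Files can be checked automatically before submission. The following is a minimal Python sketch, not an official ESS-DIVE tool. The function name `check_file_names` is hypothetical, and allowing dot-separated extensions (such as `.csv` or `.tar.gz`) is an assumption of this sketch, since the guidelines do not say how extensions are treated.

```python
import re
from collections import Counter

# Letters, numbers, hyphens, and underscores only. Permitting dot-separated
# extensions after the base name is an assumption of this sketch.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)*")


def check_file_names(names):
    """Return a dict mapping each problematic file name to a list of reasons."""
    problems = {}
    exact_counts = Counter(names)
    # Count distinct names per lowercase form; more than one means the names
    # would collide on a case-insensitive file system.
    case_counts = Counter(name.lower() for name in set(names))
    for name in names:
        reasons = []
        if " " in name:
            reasons.append("contains a space")
        elif not ALLOWED_NAME.fullmatch(name):
            reasons.append("contains a disallowed character")
        if exact_counts[name] > 1:
            reasons.append("is not unique")
        if case_counts[name.lower()] > 1:
            reasons.append("differs from another name only by case")
        if reasons:
            problems[name] = reasons
    return problems
```

Calling `check_file_names` on the list of file names in a planned data package returns an empty dict when every name passes these checks.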

2. Deciding how to bundle files

The decision tree below suggests which files to archive and when to submit data to a public archive, based on the following considerations:

Repository storage limitations
Authorship
Downstream value
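
The first consideration, repository storage limitations, is the only one of the three that can be measured directly. The sketch below, in Python, totals the size of a candidate data package and compares it with a size limit. It does not reproduce the decision tree and says nothing about authorship or downstream value. The function names and the 10 GiB default are hypothetical placeholders, so check your repository's actual dataset size limit.

```python
from pathlib import Path

# Hypothetical limit for illustration only; this is not a documented
# repository limit.
ASSUMED_LIMIT_BYTES = 10 * 1024**3


def total_size_bytes(root):
    """Sum the sizes of all regular files under the directory root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def fits_in_repository(root, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return (fits, total_bytes) for the files under root."""
    total = total_size_bytes(root)
    return total <= limit_bytes, total
```

If the package does not fit, the remaining considerations in the decision tree determine which subset of files to publish.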

3. File-Level Metadata

We suggest that researchers archiving model data include two types of file-level metadata. For more details see ESS-DIVE File-level Metadata. One option is to use ncdump to create a metadata CSV file for NetCDF and HDF5 files.

¹ For more details on how to provide file-level metadata see here
² Details for the CSV reporting format
³ Report the Local Standard Time offset (+/- #hours) or time zone (abbreviations allowed). Do not report time using Daylight Savings Time
⁴ yyyy-mm-dd
⁵ If providing a non-point location (WGS84 decimal degrees)
⁶ If providing a single point location (WGS84 decimal degrees)
⁷ For columns containing numeric data, use "-9999" as the missing value code (or modify to match significant figures given the data). For columns containing character data, use "N/A" as the missing value code.
⁸ Orientation of the "Field Name" within the data matrix of the data file: 1) Horizontal with field names at the top of columns (i.e., column name) or 2) Vertical with field names starting rows (i.e., row name).
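
A file catalog of the kind described in this section can be drafted with a short script and then completed by hand. The following Python sketch writes one CSV row per file in a data package. The column names and the function `write_file_catalog` are hypothetical, chosen for illustration, so take the actual field names from the file-level metadata reporting format. Descriptions that are not supplied are filled with "N/A", the missing value code for character data given in the notes above.

```python
import csv
from pathlib import Path

# Hypothetical column names; replace them with the fields defined in the
# file-level metadata reporting format.
COLUMNS = ["file_name", "file_description", "file_path", "size_bytes"]
MISSING_TEXT = "N/A"  # missing value code for character columns


def write_file_catalog(root, catalog_path, descriptions=None):
    """Write one CSV row per file under root and return the number of rows."""
    root = Path(root)
    catalog = Path(catalog_path).resolve()
    descriptions = descriptions or {}
    rows = []
    for path in sorted(root.rglob("*")):
        # Skip directories and the catalog itself (relevant when re-running).
        if not path.is_file() or path.resolve() == catalog:
            continue
        rows.append({
            "file_name": path.name,
            "file_description": descriptions.get(path.name, MISSING_TEXT),
            "file_path": path.parent.relative_to(root).as_posix(),
            "size_bytes": path.stat().st_size,
        })
    with open(catalog, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The `descriptions` argument maps file names to short descriptions, so the catalog can be regenerated after files change without losing text that was already written.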
