DataverseNO Digital Assets Report

This report describes the digital assets held in the DataverseNO repository as of the date specified in the Document History and Version Control Table at the end of this document. The report is used as a basis for applying the DataverseNO Preservation Plan.

Digital Asset Overview

The DataverseNO repository contains a total of 640 Datasets containing digital research data, including documentation of the data. Of these, 601 Datasets are registered with a DOI (since 2016) and 39 with a Handle (until 2016). In their latest published versions, these 640 Datasets together contain a total of 4 763 files.

The Datasets come from virtually all major research disciplines, as shown in the table below. Because some Datasets are cross- or multidisciplinary and are counted once under each of their disciplines, the totals shown in the table exceed the numbers of unique Datasets and files.

Discipline                          Datasets     Files
Agricultural Sciences                      2        11
Arts and Humanities                       54     1 393
Astronomy and Astrophysics                 1        13
Business and Management                    4        26
Chemistry                                  5       637
Computer and Information Science           5        25
Earth and Environmental Sciences         517     2 496
Engineering                                2         5
Mathematical Sciences                      1         5
Medicine, Health and Life Sciences        38       568
Physics                                  383     1 097
Social Sciences                           17       100
Total                                  1 029     6 376
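
The double counting behind these totals can be illustrated with a minimal Python sketch. The PIDs and discipline assignments are hypothetical sample data, not records from the repository, and `count_by_discipline` is an illustrative helper name. A Dataset tagged with two disciplines contributes to two rows, so the column total exceeds the number of unique Datasets, just as the table total of 1 029 exceeds the 640 unique Datasets.

```python
from collections import Counter

# Hypothetical sample data: each Dataset (keyed by a made-up PID)
# lists one or more disciplines.
datasets = {
    "doi:10.0000/EXAMPLE-A": ["Physics"],
    "doi:10.0000/EXAMPLE-B": ["Physics", "Earth and Environmental Sciences"],
    "doi:10.0000/EXAMPLE-C": ["Chemistry"],
}


def count_by_discipline(datasets):
    """Count Datasets per discipline; a Dataset counts once per discipline."""
    counts = Counter()
    for disciplines in datasets.values():
        counts.update(set(disciplines))
    return counts


per_discipline = count_by_discipline(datasets)
column_total = sum(per_discipline.values())  # sum of the table rows
unique_total = len(datasets)                 # number of unique Datasets

print(per_discipline)
print("column total:", column_total, "unique Datasets:", unique_total)
```

In this sample the column total is larger than the number of unique Datasets because the second Dataset appears under two disciplines.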

Across these disciplines, the Datasets are also heterogeneous in terms of file category, as shown in the following table. The file categories used in this analysis are based on the classification used in the file format registry FileInfo, except that the category Compressed Files has been renamed Container Files because not all files in this group use compression.

File Category         Datasets     Files
Audio Files                  3     1 004
Container Files            440     1 000
Data Files                  61       566
Database Files               1         2
Developer Files             35        63
GIS Files                    5         7
Page Layout Files           52       113
Raster Image Files           5        58
Settings Files               3         5
Spreadsheet Files           41        71
Text Files                 785     1 846
Video Files                  2        27
Web Files                    1         1
Total                    1 434     4 763

As a result of file normalization before the initial publication of Datasets, many of these files are stored in a file format that is preferred for long-term preservation. If the original file format is not preferred for long-term preservation, the DataverseNO Deposit Guidelines request that research data items be archived in both the original file format and a preferred file format. If data cannot be stored in a preferred file format, they can still be published in their original format, but in that case DataverseNO does not commit to preserving the data in the long term. Another mandatory element of each Dataset is a ReadMe file that explains and describes the Dataset to support reuse of the data.

Digital Asset Groups

With regard to the DataverseNO preservation program, the digital assets in DataverseNO are divided into the following five groups:

Name of Digital Asset Group   Brief Description of Asset Group                          Number of Digital Assets in the Group
Group 1                       Items with only non-preferred file format(s)              92 files
Group 2                       Datasets without ReadMe file                              17 Datasets
Group 3                       Container files (.zip or .tar)                            998 files
Group 4                       Files in file formats with unclear preferability status   378 files
Group 5                       All other assets

Asset Group 1

There are 92 research data items in the DataverseNO repository that are stored only in file formats that the DataverseNO deposit guidelines do not consider preferred. These cases of non-compliance are due to the lack of provisions in previous guidelines, isolated occurrences of insufficient curation, or the fact that the data could not be saved in or converted into a preferred file format at the time of initial publication. The table below gives an overview of these files with non-preferred file formats, grouped by file category and file extension.

File Category           Number of Files
Container Files                       7
   .7z                                3
   .gz                                2
   .tgz                               2
Data Files                           10
   .bag                               1
   .bin                               2
   .binary                            2
   .ppt                               1
   .rda                               3
   .RData                             1
Database Files                        2
   .dbf                               2
Settings Files                        1
   .DS_Store                          1
Spreadsheet Files                    43
   .123                               1
   .ods                               2
   .xls                              18
   .xlsx                             22
Text Files                            2
   .docx                              1
   .rtf                               1
Video Files                          26
   .avi                              26
Web Files                             1
   .html                              1
Total                                92

The assets in group 1 are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, name of curator, file name, file extension, and file category.

Asset Group 2

There are 17 Datasets in the DataverseNO repository that lack a ReadMe file. These cases of non-compliance with the DataverseNO deposit guidelines are due to the lack of provisions in previous guidelines or isolated occurrences of insufficient curation.

The assets in group 2 are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, Dataset title, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, and name of curator.
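
A check for Datasets that lack a ReadMe file could be sketched as follows. The sketch assumes, for illustration only, that a ReadMe file can be recognised by the string "readme" in its file name; the actual naming rule is set by the DataverseNO deposit guidelines and is not restated in this report. The PIDs, file names, and function names are hypothetical.

```python
def lacks_readme(file_names):
    """Return True if no file name contains 'readme' (case-insensitive).

    Assumption: a ReadMe file is identified by its name alone.
    """
    return not any("readme" in name.lower() for name in file_names)


def datasets_without_readme(datasets):
    """Return the sorted PIDs of Datasets whose file list has no ReadMe."""
    return sorted(pid for pid, files in datasets.items() if lacks_readme(files))


# Hypothetical sample data: PID -> file names in the latest published version.
sample = {
    "doi:10.0000/EXAMPLE-A": ["00_ReadMe.txt", "measurements.csv"],
    "doi:10.0000/EXAMPLE-B": ["recordings.zip"],
    "hdl:0000/EXAMPLE-C": ["README.md", "model.txt"],
}

print(datasets_without_readme(sample))
```

A name-based check of this kind only yields candidates for group 2; whether a given file actually serves as a ReadMe would still need to be confirmed by a curator.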

Asset Group 3

The DataverseNO deposit guidelines do not recommend container files. In previous versions of the repository software, it was not possible to maintain the folder structure of ingested files. In cases where the folder structure was important, DataverseNO has therefore accepted container files, preferably of the type .zip or .tar. Since the repository software now supports retention of folder structure, DataverseNO is considering unpacking these container files.

There are 998 container files in the DataverseNO repository of the type .zip (993 files) and .tar (5 files). These files are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, name of curator, file name, file extension, and file category.

Asset Group 4

There are 378 research data items in the DataverseNO repository that are stored only in file formats whose preferability status the repository management considers unclear. The table below gives an overview of these files, grouped by file category and file extension.

File Category           Number of Files
Data Files                          365
   .mat                              48
   .out                             107
   .pcm                             102
   .pdb                               2
   .prj                               2
   .sbx                               1
   .shx                               2
   .xyz                             101
Developer Files                       3
   .gms                               2
   .pro                               1
GIS Files                             6
   .sbn                               1
   .segy                              1
   .sgy                               2
   .shp                               2
Settings Files                        3
   .mco                               3
Video Files                           1
   .gxf                               1
Total                               378

The assets in group 4 are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, name of curator, file name, file extension, and file category.

Asset Group 5

Assets not contained in any of the groups 1 to 4 above belong to the residual group 5. These assets comply with the DataverseNO deposit guidelines. The assets in group 5 are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, name of curator, file name, file extension, and file category.
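
The file-level groups described above can be approximated by a lookup on file extension, as in the following minimal Python sketch. The extension sets are copied from the group 1, group 3, and group 4 descriptions in this report; the function names and sample file names are hypothetical. The sketch is a simplification: groups 1 and 4 cover only items stored solely in the listed formats, so a file with one of these extensions that is accompanied by a copy in a preferred format would belong to group 5. The result is therefore a candidate group, not a final assignment. Group 2 is defined at Dataset level and is not covered here.

```python
import os
from collections import Counter

# Extensions taken from the tables in this report, lower-cased.
NON_PREFERRED = {  # group 1
    ".7z", ".gz", ".tgz", ".bag", ".bin", ".binary", ".ppt", ".rda",
    ".rdata", ".dbf", ".ds_store", ".123", ".ods", ".xls", ".xlsx",
    ".docx", ".rtf", ".avi", ".html",
}
CONTAINER = {".zip", ".tar"}  # group 3
UNCLEAR = {  # group 4
    ".mat", ".out", ".pcm", ".pdb", ".prj", ".sbx", ".shx", ".xyz",
    ".gms", ".pro", ".sbn", ".segy", ".sgy", ".shp", ".mco", ".gxf",
}


def file_extension(file_name):
    """Return the lower-cased extension of a file name.

    For dot-files such as '.DS_Store', os.path.splitext reports no
    extension, so the whole name is used instead.
    """
    ext = os.path.splitext(file_name)[1]
    return (ext or file_name).lower()


def candidate_group(file_name):
    """Return the candidate asset group (1, 3, 4, or 5) for one file."""
    ext = file_extension(file_name)
    if ext in NON_PREFERRED:
        return 1
    if ext in CONTAINER:
        return 3
    if ext in UNCLEAR:
        return 4
    return 5


# Hypothetical file names used only to exercise the lookup.
sample = ["survey.zip", "results.XLSX", "trace.pcm", "notes.txt", ".DS_Store"]
tally = Counter(candidate_group(name) for name in sample)
print(dict(tally))
```

Matching on extension alone keeps the lookup simple but cannot tell whether a preferred-format copy exists alongside the file, which is why each result would need to be checked against the full file list of the Dataset.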

Acknowledgements and References

The references below point to documents and resources which the present document is adapted from and inspired by, or which are otherwise referred to in the present document.

Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., & Hofman, H. (2009). Systematic planning for Digital Preservation: evaluating potential strategies and building preservation plans. International Journal on Digital Libraries, 10(4), 133–157. https://doi.org/10.1007/s00799-009-0057-1

Data Archiving and Networked Services (DANS). File formats. https://dans.knaw.nl/en/about/services/easy/information-about-depositing-data/before-depositing/file-formats

FileInfo file format registry. https://fileinfo.com/

Sustainability of Digital Formats: Planning for Library of Congress Collections. https://www.loc.gov/preservation/digital/formats/fdd/descriptions.shtml

The National Archives. The technical registry PRONOM. http://www.nationalarchives.gov.uk/PRONOM/Default.aspx

Contact support@dataverse.no with questions or to request an addition or revision to this report.

Report Document History and Version Control Table

Version   Action           Approved By                         Action Date
1.0       Report issued.   DataverseNO Repository Management   2019-07-04