This report describes the digital assets held in the DataverseNO repository as of the date specified in the Document History and Version Control Table at the end of this document. It serves as the basis for applying the DataverseNO Preservation Plan.
Digital Asset Overview
The DataverseNO repository contains a total of 640 Datasets comprising digital research data and documentation of the data. Of these, 601 Datasets are registered with a DOI (since 2016) and 39 with a Handle (until 2016). In their latest published versions, these 640 Datasets together contain a total of 4 763 files.
The Datasets come from virtually all major research disciplines, as shown in the table below. Because some Datasets are cross- or multi-disciplinary and are counted under each of their disciplines, the totals for Datasets and files shown in the table exceed the numbers of unique Datasets and files.
Discipline | Datasets | Files |
Agricultural Sciences | 2 | 11 |
Arts and Humanities | 54 | 1 393 |
Astronomy and Astrophysics | 1 | 13 |
Business and Management | 4 | 26 |
Chemistry | 5 | 637 |
Computer and Information Science | 5 | 25 |
Earth and Environmental Sciences | 517 | 2 496 |
Engineering | 2 | 5 |
Mathematical Sciences | 1 | 5 |
Medicine, Health and Life Sciences | 38 | 568 |
Physics | 383 | 1 097 |
Social Sciences | 17 | 100 |
Total | 1 029 | 6 376 |
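The gap between the table totals (1 029 Datasets, 6 376 files) and the unique counts (640 Datasets, 4 763 files) follows from counting a multi-disciplinary Dataset once per discipline. A minimal sketch of that counting, using hypothetical toy records that are not taken from the repository, might look like this:

```python
from collections import Counter

# Hypothetical toy records: (dataset id, disciplines, number of files).
# Illustrative only; these are not actual DataverseNO Datasets.
datasets = [
    ("ds-1", ["Physics", "Earth and Environmental Sciences"], 3),
    ("ds-2", ["Chemistry"], 5),
    ("ds-3", ["Physics"], 2),
]

per_discipline_datasets = Counter()
per_discipline_files = Counter()
for _dataset_id, disciplines, n_files in datasets:
    # A Dataset (and its files) is counted once for every discipline it has.
    for discipline in disciplines:
        per_discipline_datasets[discipline] += 1
        per_discipline_files[discipline] += n_files

unique_datasets = len(datasets)
unique_files = sum(n_files for _, _, n_files in datasets)
table_total_datasets = sum(per_discipline_datasets.values())
table_total_files = sum(per_discipline_files.values())

print(unique_datasets, table_total_datasets)  # 3 4
print(unique_files, table_total_files)        # 10 13
```

Here ds-1 appears under two disciplines, so the per-discipline totals are one Dataset and three files higher than the unique counts.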
Across these disciplines, the Datasets are also heterogeneous in terms of file category, as shown in the following table. The file categories used in this analysis are based on the classification used in the file format registry FileInfo, except for the category Compressed Files, which has been renamed to Container Files because not all files in this group use compression.
File Category | Datasets | Files |
Audio Files | 3 | 1 004 |
Container Files | 440 | 1 000 |
Data Files | 61 | 566 |
Database Files | 1 | 2 |
Developer Files | 35 | 63 |
GIS Files | 5 | 7 |
Page Layout Files | 52 | 113 |
Raster Image Files | 5 | 58 |
Settings Files | 3 | 5 |
Spreadsheet Files | 41 | 71 |
Text Files | 785 | 1 846 |
Video Files | 2 | 27 |
Web Files | 1 | 1 |
Total | 1 434 | 4 763 |
As a result of file normalization before initial publication of Datasets, many of these files are represented in a file format that is preferred for long-term preservation. If the original file format is not preferred for long-term preservation, the DataverseNO Deposit Guidelines request that research data items be archived in both the original file format and a preferred file format. If data cannot be stored in a preferred file format, they can still be published in their original format, but in that case DataverseNO does not commit to preserving the data in the long term. Another mandatory element of each Dataset is a ReadMe file that explains and describes the Dataset to support reuse of the data.
Digital Asset Groups
With regard to the DataverseNO preservation program, the digital assets in DataverseNO are divided into the following five groups:
Name of Digital Asset Group | Brief Description of Asset Group | Number of Digital Assets in the Group |
Group 1 | Items with only non-preferred file format(s) | 92 files |
Group 2 | Datasets without ReadMe file | 17 Datasets |
Group 3 | Container files (.zip or .tar) | 998 files |
Group 4 | Files in file formats with unclear preferability status | 378 files |
Group 5 | All other assets | |
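As an illustration of how this grouping could be applied to a file listing, the following sketch assigns files to groups by file extension. The extension sets are copied from the overview tables in this report. The function names, the `has_preferred_copy` parameter, and the rule that a ReadMe file has "readme" in its file name are assumptions made for the example, not the procedure used by the repository management.

```python
from pathlib import PurePosixPath

# Extensions listed in this report's overview tables, lower-cased for matching.
GROUP_1_NON_PREFERRED = {
    ".7z", ".gz", ".tgz",                                 # Container Files
    ".bag", ".bin", ".binary", ".ppt", ".rda", ".rdata",  # Data Files
    ".dbf",                                               # Database Files
    ".ds_store",                                          # Settings Files
    ".123", ".ods", ".xls", ".xlsx",                      # Spreadsheet Files
    ".docx", ".rtf",                                      # Text Files
    ".avi",                                               # Video Files
    ".html",                                              # Web Files
}
GROUP_3_CONTAINERS = {".zip", ".tar"}
GROUP_4_UNCLEAR = {
    ".mat", ".out", ".pcm", ".pdb", ".prj", ".sbx", ".shx", ".xyz",  # Data Files
    ".gms", ".pro",                                                  # Developer Files
    ".sbn", ".segy", ".sgy", ".shp",                                 # GIS Files
    ".mco",                                                          # Settings Files
    ".gxf",                                                          # Video Files
}


def extension(file_name: str) -> str:
    """Return the lower-cased final extension; dotfiles count as extensions."""
    name = PurePosixPath(file_name).name
    dot = name.rfind(".")
    return name[dot:].lower() if dot != -1 else ""


def asset_group(file_name: str, has_preferred_copy: bool = False) -> int:
    """Assign a file to group 1, 3, 4, or 5 (group 2 is decided per Dataset)."""
    ext = extension(file_name)
    if ext in GROUP_3_CONTAINERS:
        return 3
    if has_preferred_copy:
        # Assumption: a copy in a preferred format makes the item compliant.
        return 5
    if ext in GROUP_1_NON_PREFERRED:
        return 1
    if ext in GROUP_4_UNCLEAR:
        return 4
    return 5


def lacks_readme(file_names: list[str]) -> bool:
    """Group 2 check. Assumes ReadMe files carry 'readme' in their file name."""
    return not any("readme" in PurePosixPath(n).name.lower() for n in file_names)
```

Groups 1 and 4 are defined for items stored only in a non-preferred or unclear format, which an extension alone cannot tell; the hypothetical `has_preferred_copy` flag stands in for that item-level check.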
Asset Group 1
There are 92 research data items in the DataverseNO repository that are stored only in file formats that the DataverseNO deposit guidelines do not consider preferred. These cases of non-compliance are due to the lack of provisions in previous guidelines, isolated cases of insufficient curation, or the fact that the data could not be saved in or converted into a preferred file format at the time of initial publication. The table below gives an overview of these files with non-preferred file formats, grouped by file category and file extension.
File Category or Extension | Number of Files |
Container Files | 7 |
.7z | 3 |
.gz | 2 |
.tgz | 2 |
Data Files | 10 |
.bag | 1 |
.bin | 2 |
.binary | 2 |
.ppt | 1 |
.rda | 3 |
.RData | 1 |
Database Files | 2 |
.dbf | 2 |
Settings Files | 1 |
.DS_Store | 1 |
Spreadsheet Files | 43 |
.123 | 1 |
.ods | 2 |
.xls | 18 |
.xlsx | 22 |
Text Files | 2 |
.docx | 1 |
.rtf | 1 |
Video Files | 26 |
.avi | 26 |
Web Files | 1 |
.html | 1 |
Total | 92 |
The assets in group 1 are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, name of curator, file name, file extension, and file category.
Asset Group 2
There are 17 Datasets in the DataverseNO repository that lack a ReadMe file. These cases of non-compliance with the DataverseNO deposit guidelines are due to the lack of provisions in previous guidelines or isolated cases of insufficient curation.
The assets in group 2 are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, Dataset title, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, and name of curator.
Asset Group 3
The DataverseNO deposit guidelines do not recommend container files. Previous versions of the repository software could not maintain the folder structure of ingested files. In cases where the folder structure was important, DataverseNO therefore accepted container files, preferably of the type .zip or .tar. Since the repository software now supports retention of folder structure, DataverseNO is considering unpacking these container files.
There are 998 container files in the DataverseNO repository of the type .zip (993 files) and .tar (5 files). These files are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, name of curator, file name, file extension, and file category.
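Unpacking a .zip or .tar file while keeping its folder structure could be sketched as follows with Python's standard library. The function name `unpack_container` and the example file names are hypothetical, and the sketch assumes that unpacking happens locally on a trusted archive; it does not describe the repository's actual procedure.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def unpack_container(archive: Path, target_dir: Path) -> list[str]:
    """Unpack a .zip or .tar archive and list the extracted relative paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    # unpack_archive picks the format from the file extension.
    shutil.unpack_archive(str(archive), str(target_dir))
    return sorted(
        p.relative_to(target_dir).as_posix()
        for p in target_dir.rglob("*")
        if p.is_file()
    )


# Usage example with a small, made-up archive.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive = tmp_path / "example.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("raw/site_a/readings.csv", "t,value\n0,1\n")
        zf.writestr("ReadMe.txt", "demo\n")
    extracted = unpack_container(archive, tmp_path / "unpacked")

print(extracted)  # ['ReadMe.txt', 'raw/site_a/readings.csv']
```

The returned relative paths keep the folder hierarchy, so they could be compared with the file paths shown in the repository after the unpacked files are ingested.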
Asset Group 4
There are 378 research data items in the DataverseNO repository that are stored only in file formats whose preferability status the repository management considers unclear. The table below gives an overview of these files, grouped by file category and file extension.
File Category or Extension | Number of Files |
Data Files | 365 |
.mat | 48 |
.out | 107 |
.pcm | 102 |
.pdb | 2 |
.prj | 2 |
.sbx | 1 |
.shx | 2 |
.xyz | 101 |
Developer Files | 3 |
.gms | 2 |
.pro | 1 |
GIS Files | 6 |
.sbn | 1 |
.segy | 1 |
.sgy | 2 |
.shp | 2 |
Settings Files | 3 |
.mco | 3 |
Video Files | 1 |
.gxf | 1 |
Total | 378 |
The assets in group 4 are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, name of curator, file name, file extension, and file category.
Asset Group 5
All assets not contained in groups 1 to 4 above form the residual group 5. These assets comply with the DataverseNO deposit guidelines. The assets in group 5 are described in a detailed asset list (internal document) including the following information about the items: PID of the Dataset, Dataset version, publication date, date of last update, name of collection where the Dataset is published, name of Depositor, name of curator, file name, file extension, and file category.
Acknowledgements and References
The references below point to documents and resources that the present document is adapted from, inspired by, or otherwise refers to.
Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., & Hofman, H. (2009). Systematic planning for Digital Preservation: evaluating potential strategies and building preservation plans. International Journal on Digital Libraries, 10(4), 133–157. https://doi.org/10.1007/s00799-009-0057-1
Data Archiving and Networked Services (DANS). File formats. https://dans.knaw.nl/en/about/services/easy/information-about-depositing-data/before-depositing/file-formats
FileInfo file format registry. https://fileinfo.com/
Sustainability of Digital Formats: Planning for Library of Congress Collections. https://www.loc.gov/preservation/digital/formats/fdd/descriptions.shtml
The National Archives. The technical registry PRONOM. http://www.nationalarchives.gov.uk/PRONOM/Default.aspx
Contact support@dataverse.no with questions or to request an addition or revision to this report.
Report Document History and Version Control Table
Version | Action | Approved By | Action Date |
1.0 | Report issued. | DataverseNO Repository Management | 2019-07-04 |