Curator Guide

Below you will find guidelines on how research data support staff at DataverseNO partner institutions should curate their archives in DataverseNO. If you need additional help or have questions about curation, please contact the support services at your institution.

Curation of datasets

When a user has created a dataset and submitted it for review, the curator(s) of the (sub)dataverse in question are automatically notified by email. As a curator, log into DataverseNO, click on your username in the upper right corner, and then click on Notifications:

In the Notifications tab, search for the correct message, i.e. the message stating that the dataset in question has been submitted for review. Click on the dataset link:

The link takes you to the landing page of the dataset, and you are now ready to curate it.

The first thing you should check is whether the contents and the author(s) of the dataset meet the requirements in the DataverseNO Accession Policy. You should also check whether the dataset has been created in the right archive (= (sub)dataverse). For a ‘common’ user from a DataverseNO partner institution (e.g. UiT), this will be the institutional archive for that institution (e.g. UiT Open Research Data). For other users, the right archive may be a subject-based archive; for a linguist, for example, it will usually be TROLLing. If a dataset has been created in the wrong archive, curate it in the usual way (see below), but after the dataset has been published, inform research-data@support.uit.no about which archive the dataset has to be moved to.

The curation of a dataset is essentially about ensuring that the dataset is structured and documented according to best practice, as described in the Deposit Guidelines (see the menu item Deposit). There are four main areas to be reviewed: files, metadata, terms, and versions. You get to these areas by clicking the corresponding tabs on the landing page, i.e. Files, Metadata, Terms, and Versions:

As of today, the notification telling the curator(s) that a dataset has been submitted for review does not say whether this is a new dataset or a new version of a previously published dataset. Therefore, it is a good idea to start curating by looking at the Versions tab. Here you will immediately see whether the dataset is entirely new, or whether it is a new version of an existing dataset. In the latter case, the Versions tab will also tell you what changes have been made since the previous version, and you then only have to review these changes. Read more about versioning in the section New version of a published dataset below.

If you are dealing with a new dataset, you should start curating by having a look at the metadata. They will give you an overview of what the dataset is about.

Metadata


Select the Metadata tab, and click the Add + Edit Metadata button.

Curation checklist for metadata:

  • Are the fields filled in correctly? E.g.:
    • Title: If the dataset is the basis for a publication, is “Replication data for:” added to the title of the dataset?
    • Author:
      • Is/are the name(s) of the author(s) inverted (family name, given name)?
      • Is affiliation provided (e.g. UiT The Arctic University of Norway)?
      • Authors should be encouraged to create an ORCID, and add it to the Identifier Scheme field.
    • Contact:
      • If the contact is one or several persons, the name(s) have to be inverted (family name, given name). If the contact is an institution, the usual form of the name may be used.
      • Are affiliation and correct email address provided?
    • Description:
      • A brief presentation of the dataset should be given. One may use (parts of) the abstract of the related publication.
      • The Date field should be filled in in the format YYYY-MM-DD.
    • Keyword:
      • Are there reasonable keyword terms added? If there are more commonly used keywords already applied in the (sub)archive, you should make the user aware of this.
      • Has each keyword been assigned its own field? If not, ask the user to replicate the field for each keyword.
    • Related publication:
      • If the dataset is the basis for a publication, is the reference to the publication provided here?
      • If the dataset is used in an article or book manuscript that has been submitted for review but has not (yet) been accepted or published, the name of the journal or publisher should not be mentioned anywhere in the dataset. Rather, one should use the expression “Submitted for review” or the like.
    • Language: This field is most relevant for datasets about language and linguistics (e.g. in TROLLing). The field Language is not about the investigated language, but about the language of analyses. In a dataset about French nouns described/analysed in English, this field should not contain “French”, but “English” or be left empty.
    • Producer: Has a correct Producer of the dataset been entered? In institutional archives, this field is pre-populated. But in subject-based archives of the TROLLing type, you should check whether the user has filled in the right institution (e.g. UiT The Arctic University of Norway or the name of another institution or funder).
    • Distributor: As a rule, this field is pre-filled with the name of the archive (e.g. UiT Open Research Data or TROLLing). Make sure the content has not been edited.
    • Distribution Date: This field is used for specifying a file embargo. If the author has put any access restrictions on (some of) the files in the dataset, the field Distribution Date must contain the date (YYYY-MM-DD) when the file(s) will be available. In all other cases, this field must be empty. Read more about embargo here. One month before the end of the embargo period, a reminder is sent to research-data@support.uit.no, and the curator will then remind the author about this.
    • Geographic Coverage: Many datasets are related to one or more geographical places or areas. If the user has not entered any metadata about this in the metadata section Geospatial Metadata, you should suggest that he/she do so.
  • Generally:
    • Are the provided metadata sufficient to make the dataset findable in search engines?
    • Are the metadata provided in English or another commonly used communication language in the scientific field in question?
    • Note! Metadata fields must not contain HTML tags or other special characters (e.g. [ and ]). This applies in particular to the Description field.
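
Some of the formal checks above (inverted names, the YYYY-MM-DD date format, the Distribution Date rule, and the ban on HTML tags and square brackets) can be pre-screened mechanically before the manual review. The following is only a minimal sketch under assumptions: the function names are hypothetical, nothing here calls Dataverse, and the patterns are deliberately simple, so a curator still has to judge the content.

```python
import re
from datetime import datetime

# Hypothetical helpers; they only mirror the formal rules in the checklist.
INVERTED_NAME = re.compile(r"^[^,]+,\s*\S.*$")
HTML_TAG = re.compile(r"<[^>]+>")


def is_inverted_name(name: str) -> bool:
    """True if the name looks like 'Family name, Given name'."""
    return bool(INVERTED_NAME.match(name.strip()))


def is_iso_date(value: str) -> bool:
    """True if the value is a valid date written exactly as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # Round-trip to reject non-padded variants such as 2021-3-15.
    return parsed.strftime("%Y-%m-%d") == value


def has_markup(text: str) -> bool:
    """True if the text contains an HTML tag or square brackets."""
    return bool(HTML_TAG.search(text)) or "[" in text or "]" in text


def distribution_date_ok(has_restricted_files: bool, distribution_date: str) -> bool:
    """Distribution Date must be a date if files are restricted, else empty."""
    if has_restricted_files:
        return is_iso_date(distribution_date)
    return distribution_date == ""
```

For instance, is_inverted_name("Kari Hansen") is False, which would prompt the curator to ask the author to invert the name.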

Files


Curation checklist for files:

  • Are the data files documented in a ReadMe file?
  • Is forced numbering applied to the ReadMe file (e.g. “00_ReadMe.txt”), so that it appears at the top of the file overview?
  • Can the files be opened?
  • Are the file names consistent and understandable?
  • Are the data provided in (a) preferred file format(s), in addition to the original file format?
  • Do all files have a file extension, e.g. .txt, .pdf?
  • File size:
    • The standard size limit for uploads is 20 GB in total.
    • For datasets with file(s) bigger than 20 GB but less than 50 GB in total, we can set a time window for file upload (normally 24 hours). This has to be agreed on and scheduled in advance with Obi/Karl Magnus (UiT external partners should contact us at research-data@support.uit.no). After the time window has lapsed, the system is reset to the standard 20 GB file size limit.
    • If the files are more than 50 GB in total, they have to be imported (not uploaded). This also has to be agreed on and scheduled with us in advance. The curator clarifies the details with the researcher and with Obi/Karl Magnus (UiT external partners should contact us at research-data@support.uit.no). Obi creates a script (program) for importing the large files. After the import, the researcher must be granted ownership/rights to the dataset (by an admin), and metadata (plus any smaller files) must be added (by the curator). In such cases, allow for some waiting time to get the job done.
  • Note! Individual researchers not associated with any DataverseNO partner institution are granted up to 10 GB storage space for free. Check if the total storage space used by the researcher is within this limit by summing up the file sizes.
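
The size thresholds and some of the file-name checks above can be expressed as a small pre-check. This is a sketch under stated assumptions, not a DataverseNO tool: all function names are hypothetical, GB is taken as decimal gigabytes (the guide does not define the unit), a total of exactly 50 GB is treated as an import, and the file overview is assumed to sort names case-insensitively.

```python
from pathlib import PurePath

GB = 10**9  # assumption: decimal gigabytes


def deposit_route(total_bytes: int) -> str:
    """Map the total dataset size to the deposit procedure in the checklist."""
    if total_bytes <= 20 * GB:
        return "standard upload"
    if total_bytes < 50 * GB:
        return "scheduled upload window"
    # Assumption: exactly 50 GB is handled like "more than 50 GB".
    return "scheduled import"


def missing_extension(filenames):
    """Return the file names that have no extension such as .txt or .pdf."""
    return [name for name in filenames if PurePath(name).suffix == ""]


def readme_sorts_first(filenames) -> bool:
    """True if a ReadMe file comes first in a case-insensitive sort."""
    names = sorted(filenames, key=str.lower)
    return bool(names) and "readme" in names[0].lower()


def within_free_quota(file_sizes) -> bool:
    """True if an unaffiliated researcher stays within the 10 GB free storage."""
    return sum(file_sizes) <= 10 * GB
```

With these definitions, deposit_route(30 * GB) returns "scheduled upload window", which signals that the upload has to be scheduled in advance.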

You can find detailed information on file naming conventions, preferred file formats, and documentation of research data in the section Prepare your data for depositing in the menu item Deposit. There, you will also find detailed guidance on how to save/convert different document types into preferred file formats. If you have questions about this, please contact the support services of your home institution.

Best practice implies saving tabular data as tab-separated plain text files, encoded in Unicode UTF-8 without a so-called BOM (‘Byte Order Mark’). If this is not possible within the spreadsheet software, you may do this in Notepad++ as described here:

  1. Open the (converted) text file (.txt) in Notepad++. Notepad++ is open source software and may be downloaded from https://notepad-plus-plus.org/. Ask IT support at your institution for help installing the software on your computer.
  2. Click Encoding in the top menu, and select Convert to UTF-8 without BOM:
  3. Save the file.
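
If many files need the same treatment, the BOM can also be removed with a short script instead of Notepad++. The sketch below is an alternative under assumptions, not part of the DataverseNO workflow: the function name is hypothetical, and the file is assumed to be UTF-8 encoded already, so only the three BOM bytes at the start are removed.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path) -> bool:
    """Remove a leading UTF-8 BOM from the file.

    Returns True if a BOM was found and removed, False if the file
    had no BOM and was left untouched.
    """
    file_path = Path(path)
    raw = file_path.read_bytes()
    if not raw.startswith(codecs.BOM_UTF8):
        return False
    # Write back everything after the BOM bytes (EF BB BF).
    file_path.write_bytes(raw[len(codecs.BOM_UTF8):])
    return True
```

Work on a copy of the data file, since the script overwrites the file in place.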

 

Statistics data (e.g. R and SPSS)

A useful overview of file formats that are used in various statistics programs is available here. As for R, their conclusion is as follows:

In conclusion, if you are working with R you should provide a .csv* file which includes your data and separate .R- or .Rmd-files which include your syntax to ensure long-term availability. Additionally, you may add Rdata-files for easier access to the same information.

(* We recommend tab-separated Unicode UTF-8 .txt.)

The script is stored in the .R files, which are plain text files (the .R extension may be replaced by .txt). As for .Rmd files, the overview says:

Rmd: RMarkdown files are a great way to combine data documentation, data visualization and data analysis in one single file.

In other words, we want this:

  • Basic data as tab-separated Unicode UTF-8 text files (.txt) = preferred file format
  • The R code as Unicode UTF-8 (.R) = preferred file format
  • .rda = non-preferred file format, but works in R, which is open source and openly documented software
  • Possibly .Rmd

Terms

Make sure the author has NOT changed the Terms of Use (= CC0) in the Terms tab. Any changes of the default terms must be discussed in the research data support group. (When the author selects the option of not accepting the CC0 terms, he/she is provided with a Sample Data Usage Agreement by the system.)


Return dataset to author

If a submitted dataset has not been appropriately structured and documented, the curator returns the dataset to the author:

Note! In addition, the curator sends an email to the author specifying the changes that have to be made before the dataset can be published. The author should also be referred to (the relevant sections in) the Deposit Guide (https://site.uit.no/dataverseno/deposit/) on the DataverseNO info page (https://info.dataverse.no). It is possible to provide links to specific sections in the Deposit Guide. To get the right link address, hover your mouse over the link icon at the beginning of the section in question, right-click it, and select “Copy Link Location”:

 

You should also ask the author to click Submit for review once again after having made the necessary changes. The email to the author may be sent in two ways:

  1. When you are in the dataset in question, click the button with the letter symbol, and write your message in the window that pops up:
  2. When you are in the dataset in question, click Edit > Metadata, copy the email address in the field Contact > Email, and send the message from your email program (e.g. Outlook):

In a future version of Dataverse, it will be possible to communicate with the author within the application.


Publish a dataset

When everything is OK with the dataset, the curator publishes it by clicking the Publish button:

The author receives an automatic confirmation by email stating that the dataset has been published.

 

Promotion in social media

Some archives are promoted by posting information about new datasets in social media. For TROLLing, the UiT Library posts messages on Twitter and in the TROLLing Facebook group announcing that a new dataset has been published. They also send an email to the author with the following message:

I have now published your dataset. Thanks for sharing your data! You can find an announcement of the upload on our Facebook and Twitter page, and we encourage you to like this in order to get updates about the archive: https://www.facebook.com/TromsoRepositoryofLanguageandLinguistics/.


New version of a published dataset (also when removing embargo)

When an author makes changes in a published dataset, a new draft is created. This draft must be submitted for review in order for the new version to be published. The curator(s) will then be notified that a dataset is waiting for curation. Note! As of today, it is not apparent from this message whether the submitted dataset is an entirely new dataset or a new version of a previously published dataset. Often, a long time may have passed since the previous version was published, and you may not recall that a previous version of this dataset has already been published. It is therefore advisable to start the curation process by checking whether the dataset has more than one version. To see this, click the Versions tab:

By clicking View Details, you get an overview of all changes that have been made between the different versions. As a next step, you should then have a closer look at the changes made in the metadata and/or files. To do this, follow the guidance in previous sections above. When you publish the dataset after having curated it, you are asked to specify the new version number:

As a general rule, the option Minor Release should be selected when only the metadata have been changed. If there have been changes in the data files, the option Major Release should be selected. Note! When publishing a new version after removing an embargo / locks on file(s), the option Minor Release should be chosen, since we do not want the version number in the dataset reference to be changed.
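
The release rule above can be summarised as a small decision function. This is only an illustrative sketch; the function and parameter names are hypothetical and not part of Dataverse.

```python
def release_type(files_changed: bool, embargo_removal_only: bool = False) -> str:
    """Pick the release option according to the general rule in this guide."""
    if embargo_removal_only:
        # Removing an embargo or file locks must not change the version
        # number in the dataset reference, so it is a Minor Release.
        return "Minor Release"
    return "Major Release" if files_changed else "Minor Release"
```

The embargo case is checked first because removing a lock may be listed as a file change even though the data themselves are unchanged.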

From time to time, the curator(s) should check whether there are unpublished datasets (drafts) that have not been submitted for review. If a dataset has had the status Unpublished for more than three months, the curator(s) should contact the author and remind him/her to click Submit for review in order for the (new version of the) dataset to be published.
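
If you keep your own list of drafts (for example, copied from the archive overview), the three-month rule can be applied with a few lines of code. The sketch below rests on assumptions: the tuple layout and names are hypothetical, "three months" is approximated as 90 days, and the list has to be compiled by hand since nothing here queries DataverseNO.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=90)  # assumption: "three months" taken as 90 days


def stale_drafts(drafts, today):
    """Return titles of drafts that are overdue for a reminder.

    drafts: iterable of (title, last_modified, submitted_for_review) tuples,
    where last_modified is a datetime.date and submitted_for_review a bool.
    """
    return [
        title
        for title, last_modified, submitted in drafts
        if not submitted and today - last_modified > STALE_AFTER
    ]
```

Drafts that have already been submitted for review are skipped, since they are waiting for the curator rather than the author.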


Reading access to unpublished dataset


Scenario: An author wishes to grant access to a dataset to a collaborator, a peer reviewer, a journal editor or the like before the dataset is published.

Solution:

  • Log into DataverseNO, and go to the unpublished dataset.
  • Click the Edit button to the right, and select Private URL:
  • Copy the private URL, and send it to the author or, if agreed on, to the person who needs to access the dataset.

Reading access to locked file(s) in published dataset


Scenarios:

  • An author wishes to grant access to (a) locked file(s) (= file with embargo) in a published dataset to a collaborator, peer reviewer or the like.
  • A researcher requests access to (a) locked file(s) by clicking on the Request Access button.

Solution:

  • The person to be granted access to a locked file must have a DataverseNO user account. If he/she does not have one, refer him/her to the section Step 1: Create a user account / Log in in the deposit guide. Once the user account is created:
  • Log into DataverseNO, and go to the dataset in question.
  • Click the Edit button to the right, and select Permissions and then File:
  • Click Grant Access to Users/Groups:
  • Search for and add the user who should have access to the file(s) in the field Users/Groups, select the file(s) the user needs to have access to, and click Grant:

(The contact for a dataset or dataverse is where email is sent when you click the Contact button. When access to restricted files is requested, email does not go to the contact. Rather, email is sent to the people who have the ability to grant access, which are the people who have a role that contains ManageDatasetPermissions. In DataverseNO, these people are usually the curators of the dataverse in question.)


Edit access to a dataset

When creating a dataset in Dataverse, the depositor is automatically granted edit access to that dataset. However, in some cases it may be appropriate to manually assign edit access. Consider the following scenarios:

  • Scenario 1: A curator has created a dataset on behalf of an author (cf., e.g. the pilot project on research data management at UiT in 2016). After the dataset has been created, the author wants to have a look at the dataset and possibly make some changes before it is published.
  • Scenario 2: An author has created a dataset and wants other members of the research group to be able to edit the dataset.
  • Scenario 3: An author has created one or several datasets in an institutional archive (e.g. UiT Open Research Data), but has not yet published them. The author quits the institution, and consequently cannot log into DataverseNO through Feide anymore. Therefore, we have to create a new user account for the author based on local authentication. When the new user account is created, the author must be granted access to his/her “old” datasets.

Please contact the Dataverse administrator at your institution to have access rights changed or assigned at dataset level.


Moving datasets

As of today, it is not possible to move a dataset between archives via the graphical user interface. If, e.g., a linguist from UiT has created a dataset on linguistics in UiT Open Research Data instead of TROLLing, this dataset should first be curated and published in the archive where it is created, and after the publication, the curator gives notice to research-data@support.uit.no about where the dataset should be moved.


Deleting published datasets

When a dataset has been published, its DOI has been activated. Through the DOI Agreement and the DataverseNO Preservation Policy, the archive is committed to providing enduring access to the dataset for at least 10 years after its publication. If, after its publication, it turns out that a dataset for ethical, legal or other reasons should not have been published, we may remove access to the data files in the dataset. However, the metadata entry will still be findable and accessible. Contact research-data@support.uit.no to have file access in a dataset removed.