Import Data Using the RUB Importer

⚠️ Note: This page is in the system namespace/directory because, as of 2024-09-20, the RUB importer in ReSeeD only supports importing from a subdirectory of the same S3 share ("share" meaning "parent of an S3 bucket") that ReSeeD also uses to store the imported data (configured with the S3_ENDPOINT, S3_ACCESS_KEY, S3_SECRET_KEY, and S3_REGION variables in the .env file). The name of the S3 bucket from which the RUB importer imports data is configured in the S3_FILE_UPLOAD_BUCKET variable in the .env file. Because access to this S3 share requires sysadmin access anyway, this page belongs in the system directory of the wiki for the time being.


About this process

This process uses the Bulkrax CSV from S3 parser to run imports. Metadata is prepared in CSV files, and the data for each dataset is provided in a distinct folder (see below).

Prepare the data

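The data to be imported needs the following file structure: a metadata CSV file (metadata.csv) and one folder per dataset. Each row of the CSV file points to its folder through the dataset_path column (see The metadata CSV format below).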
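
Before uploading, it can help to check locally that every dataset folder named in the metadata file exists. The following is a minimal sketch, not part of ReSeeD. It assumes that metadata.csv sits at the top level of the import directory next to the dataset folders (that placement is an assumption), and the function name missing_dataset_folders is hypothetical.

```python
import csv
from pathlib import Path


def missing_dataset_folders(import_dir, metadata_name="metadata.csv"):
    """Return the dataset_path values that have no matching folder.

    Assumption: the metadata file is stored directly inside import_dir,
    and every dataset_path is relative to import_dir.
    """
    root = Path(import_dir)
    missing = []
    with open(root / metadata_name, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            folder = (row.get("dataset_path") or "").strip()
            # An empty dataset_path is reported as missing as well.
            if not folder or not (root / folder).is_dir():
                missing.append(folder)
    return missing
```

An empty result means that every row of the metadata file points to an existing folder.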

Steps to run an import

  1. Upload the data you want to import (for example, the unzipped data in Example_RUB_import_data.zip) into the S3 bucket that ReSeeD has access to.

    For example: cl-reseed_import

    Enter this bucket name in the Specify a bucket name with prefix field of the importer form.

  2. Log into ReSeeD as an administrator.

  3. On the dashboard you should see the options Importers and Exporters. Click on Importers.

  4. On the Importers page, click New in the top left corner. This opens the importer form.

  5. Fill in the importer form with the values listed below, then click Create and import.

    | Field name | Value | Note |
    |---|---|---|
    | Name | Any identifiable name for the import | |
    | Administrative Set | RUB publication workflow | Applies this workflow to all imported datasets |
    | Frequency | Once | We are running a one-off import |
    | Limit | 0 or leave blank | Imports all records in the metadata.csv file |
    | Parser | CSV from S3 - ReSeed CSV parser for work (Datasets) from local S3 | Selects the parser for ReSeeD |
    | Visibility | Private | The workflow needs all datasets to be private until published |
    | Rights statement | Leave blank | The rights statement is picked up from the CSV file |
    | Specify a bucket name with prefix | cl-reseed_import | The bucket name with the prefix. You can also add a path within the bucket, for example: cl-reseed_import/set1 |
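
Step 1 and the Specify a bucket name with prefix field can be sketched as follows. This is only an illustration of how a value such as cl-reseed_import/set1 could map local files to object keys. How ReSeeD itself splits the value is an assumption, the function names split_bucket_and_prefix and plan_upload are hypothetical, and nothing is uploaded: the resulting pairs would be handed to whichever S3 client you use.

```python
from pathlib import Path, PurePosixPath


def split_bucket_and_prefix(value):
    """Split 'bucket/some/path' into ('bucket', 'some/path')."""
    bucket, _, prefix = value.strip().strip("/").partition("/")
    return bucket, prefix


def plan_upload(local_dir, bucket_with_prefix):
    """Return (bucket, [(local_file, object_key), ...]) without uploading."""
    bucket, prefix = split_bucket_and_prefix(bucket_with_prefix)
    root = Path(local_dir)
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Object keys always use forward slashes, whatever the local OS.
            relative = path.relative_to(root).as_posix()
            key = str(PurePosixPath(prefix) / relative) if prefix else relative
            pairs.append((path, key))
    return bucket, pairs
```

For example, split_bucket_and_prefix("cl-reseed_import/set1") returns ("cl-reseed_import", "set1"), and a local file dataset1/a.txt would be planned under the key set1/dataset1/a.txt.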

The metadata CSV format

| Column header | Cardinality | Format | Example 1 | Example 2 |
|---|---|---|---|---|
| title | One | String. The title of the dataset. | Test dataset 1 for import | Test dataset 2 for import |
| dataset_path | One | String. Folder path within the bucket. | dataset1 | dataset2 |
| alternative_title | Zero or more | String. The alternative title(s) of the dataset. Multiple values should be separated with a semicolon. | The rhythms of old men who hit things with sticks | The rhythms of old men who hit things with sticks; Huh? |
| description | Zero or one | String. Description of the dataset. | A collection of rhythms from veteran rock drummers | A collection of rhythms from veteran rock drummers |
| contributor | Zero or more | Names should be entered in the format LAST_NAME, FORENAME(S). Multiple contributors should be separated with a semicolon. The order of names is significant in relating them to contributor_orcid and contributor_affiliation. | Starr, Ringo; Bonham, John; Densmore, John; Moon, Keith | Starr, Ringo; Bonham, John; Densmore, John; Moon, Keith |
| contributor_orcid | Zero or more | ORCIDs should be entered in their full https format and separated with a semicolon. The order of ORCIDs is significant in relating them to contributor. The value should ideally have the same number of semicolons as contributor. | ;;https://orcid.org/0000-0001-5109-3700; | https://orcid.org/0000-0001-0001-3700;;; |
| contributor_affiliation | Zero or more | String. Affiliations should be separated with a semicolon. The order of affiliations is significant in relating them to contributor. The value should ideally have the same number of semicolons as contributor. | The Beatles; Led Zeppelin; The Doors; The Who | The Beatles;;The Doors; |
| creator | One or more | Names should be entered in the format LAST_NAME, FORENAME(S). Multiple creators should be separated with a semicolon. The order of names is significant in relating them to creator_orcid and creator_affiliation. | Lennon, John | Lennon, John; McCartney, Paul |
| creator_orcid | One or more | ORCIDs should be entered in their full https format and separated with a semicolon. The order of ORCIDs is significant in relating them to creator. The value should ideally have the same number of semicolons as creator. | https://orcid.org/0000-0001-5109-3700 | https://orcid.org/0000-0001-5109-3700;https://orcid.org/0000-0001-5109-3701 |
| creator_affiliation | One or more | String. Affiliations should be separated with a semicolon. The order of affiliations is significant in relating them to creator. The value should ideally have the same number of semicolons as creator. | The Beatles | The Beatles;The Beatles |
| keyword | One or more | String. Multiple keywords should be separated with a semicolon. | drumming | drumming; pop stars |
| resource_type | One or more | Must be one or more of: Book, BookChapter, Collection, ComputationalNotebook, ConferencePaper, DataPaper, Dataset, Dissertation, Event, Image, InteractiveResource, Journal, JournalArticle, Model, OutputManagementPlan, PeerReview, PhysicalObject, Preprint, Report, Service, Software, Sound, Standard, Text, Workflow, Other. If the value is not one of the allowed values, it is set to Dataset. | Dataset | Dataset |
| license | One | Must be one of: http://rightsstatements.org/vocab/InC/1.0/, https://creativecommons.org/licenses/by/4.0/, https://creativecommons.org/licenses/by-sa/4.0/, https://creativecommons.org/licenses/by-nd/4.0/, https://creativecommons.org/licenses/by-nc/4.0/, https://creativecommons.org/licenses/by-nc-nd/4.0/, https://creativecommons.org/licenses/by-nc-sa/4.0/, http://creativecommons.org/publicdomain/zero/1.0/, http://creativecommons.org/publicdomain/mark/1.0/, http://www.apache.org/licenses/LICENSE-2.0, http://www.gnu.org/licenses/gpl.html, http://opensource.org/licenses/MIT. If the license URI is not one of the allowed values, it is ignored. | http://creativecommons.org/publicdomain/mark/1.0/ | http://opensource.org/licenses/MIT |
| date | Zero or more | Dates should be entered in the format YYYY-MM-DD followed by a date type, for example 2024-05-29 Created. Multiple dates should be separated with a semicolon. Each date must have a date type, which must be one of: Accepted, Available, Copyrighted, Collected, Created, Deposited, Published, Recorded, Registered, Submitted, Updated, Archived. If the date type is not one of the allowed values, the date and the type are ignored. The dates entered here are all metadata dates; the system dates are saved in create_date, date_modified, modified_date, and date_uploaded. A Published date entered here will be overwritten when you go through the submission and review workflow. | 2024-05-29 Created; 2024-06-10 Published | 2024-05-29 Created; 2024-06-10 Published |
| subject | Zero or more | String. Multiple subjects should be separated with a semicolon. | drumming | Drumming; music |
| language | Zero or more | String. Multiple languages should be separated with a semicolon. | English | English |
| location | Zero or more | String. Multiple locations should be separated with a semicolon. | London | |
| software_version | Zero or more | String. Multiple software versions should be separated with a semicolon. | | |
| funder_identifier | Zero or more | Identifiers should be entered as full URIs. Multiple funder identifiers should be separated with a semicolon. The order of identifiers is significant in relating them to funder_name, award_number, award_uri, and award_title. | http://dx.doi.org/10.13039/501100001659 | http://dx.doi.org/10.13039/501100001659;http://dx.doi.org/10.13039/50110000165999 |
| funder_name | Zero or more | Multiple funder names should be separated with a semicolon. The value should ideally have the same number of semicolons as funder_identifier. The order of funder names is significant in relating them to funder_identifier, award_number, award_uri, and award_title. | DFG | DFG;RUB |
| award_number | Zero or more | Multiple award numbers should be separated with a semicolon. The value should ideally have the same number of semicolons as funder_identifier. The order of award numbers is significant in relating them to funder_identifier, funder_name, award_uri, and award_title. | A0001 | A0001;W3asxa3 |
| award_uri | Zero or more | Multiple award URIs should be separated with a semicolon. The value should ideally have the same number of semicolons as funder_identifier. The order of award URIs is significant in relating them to funder_identifier, funder_name, award_number, and award_title. | | |
| award_title | Zero or more | Multiple award titles should be separated with a semicolon. The value should ideally have the same number of semicolons as funder_identifier. The order of award titles is significant in relating them to funder_identifier, funder_name, award_number, and award_uri. | | |
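
The positional rules in this table (semicolon-separated values whose order links, for example, contributor to contributor_orcid) and the date format can be checked before an import. The sketch below is a hypothetical pre-check, not the parser's own code. The helper names split_values, align, and parse_dates are made up for illustration, and padding a short related cell with empty values is an assumption about how missing positions should be read.

```python
from datetime import datetime

# Date types copied from the date row of the table above.
DATE_TYPES = {
    "Accepted", "Available", "Copyrighted", "Collected", "Created",
    "Deposited", "Published", "Recorded", "Registered", "Submitted",
    "Updated", "Archived",
}


def split_values(cell):
    """Split a semicolon-separated cell, keeping empty positions."""
    if cell is None or not cell.strip():
        return []
    return [part.strip() for part in cell.split(";")]


def align(primary_cell, related_cell):
    """Pair each primary value with the related value at the same position."""
    primary = split_values(primary_cell)
    related = split_values(related_cell)
    # Pad so that a short related cell still yields one pair per primary value.
    related += [""] * (len(primary) - len(related))
    return list(zip(primary, related))


def parse_dates(cell):
    """Return (valid, ignored) for a 'YYYY-MM-DD Type; ...' cell."""
    valid, ignored = [], []
    for entry in split_values(cell):
        if not entry:
            continue
        day, _, date_type = entry.partition(" ")
        date_type = date_type.strip()
        try:
            # The length check rejects unpadded dates such as 2024-5-29.
            well_formed = len(day) == 10 and bool(datetime.strptime(day, "%Y-%m-%d"))
        except ValueError:
            well_formed = False
        if well_formed and date_type in DATE_TYPES:
            valid.append((day, date_type))
        else:
            ignored.append(entry)
    return valid, ignored
```

With the contributor values from Example 1, align pairs "Densmore, John" with https://orcid.org/0000-0001-5109-3700 and leaves the other three contributors without an ORCID, which matches the empty positions in the contributor_orcid cell.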