Import Data Using the RUB Importer

⚠️ Note: This page is in the system namespace/directory because, as of 2024-09-20, the RUB importer in ReSeeD can only import from a subdirectory of the same S3 share ("share" meaning "parent of an S3 bucket") that ReSeeD also uses to store the imported data. That share is configured using the S3_ENDPOINT, S3_ACCESS_KEY, S3_SECRET_KEY, and S3_REGION variables in the .env file. The name of the S3 bucket from which the RUB importer imports the data is configured in the S3_FILE_UPLOAD_BUCKET variable in the .env file. Because access to this S3 share requires sysadmin access anyway, this page belongs in the system directory of the wiki for the time being.

About this process

This process uses the Bulkrax "CSV from S3" parser to run imports. Metadata is prepared in CSV files, and the data for each dataset is provided in a separate folder (see below).

Prepare the data

The data to be imported needs to have the following file structure:

There should be a file called metadata.csv.
- The format of the columns in the metadata.csv file is explained in the section "The metadata CSV format" (below).
- General CSV remarks
  - Commas are used to separate columns, and semi-colons are used (in some cases) to separate values within a single column.
  - Any text containing commas or semi-colons that are not meant to be interpreted as separators (e.g. in description, or when listing contributors as "LAST_NAME, FORENAME(S)") needs to be wrapped in quotation marks.
  - Encoding: UTF-8 without BOM is advised.
- The CSV file should contain one row for each dataset to be imported.
  - Each row should give the path to the dataset, relative to the directory containing metadata.csv, in the column dataset_path.
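The CSV remarks above can be sketched with Python's standard csv module. This is only an illustration, not part of the importer: the row values are made up, and quoting every field (csv.QUOTE_ALL) is one simple way to make sure that text containing commas or semi-colons is always wrapped in quotation marks.

```python
import csv
import io

# Column names come from the metadata CSV format table; the values are
# illustrative only.
rows = [
    {
        "title": "Test dataset 1 for import",
        "dataset_path": "dataset1",
        "description": "Rhythms, fills, and grooves; a collection",
        "creator": "Lennon, John; McCartney, Paul",
        "keyword": "drumming; pop stars",
    },
]


def write_metadata(rows, stream):
    """Write rows as comma-separated CSV with every field quoted."""
    writer = csv.DictWriter(
        stream,
        fieldnames=list(rows[0].keys()),
        quoting=csv.QUOTE_ALL,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_metadata(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write the real file, open it with encoding="utf-8" (which writes no BOM)
# and newline="" as the csv module documentation recommends:
#   with open("metadata.csv", "w", encoding="utf-8", newline="") as f:
#       write_metadata(rows, f)
```

Reading the text back with csv.DictReader returns the original values, which is a quick way to confirm that the quoting is correct before uploading.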

Within each dataset path, there should be a directory named data where all the data for the dataset is placed.

An example directory structure for two datasets is shown below:

cl-reseed_import/set1/
├── dataset1
│   └── data
│       └── 1529
│           ├── folder_1
│           │   ├── another_file.exe
│           │   └── some_other_file.json
│           ├── my_software.exe
│           └── mydata.json
├── dataset2
│   └── data
│       ├── AV02CP07GI0
│       │   ├── anat
│       │   │   └── sub-AV02CP07GI0_T1w.nii
│       │   └── func
│       │       └── sub-AV02CP07GI0_task-rest_bold.nii
│       ├── CHANGES
│       ├── README
│       ├── dataset_description.json
│       ├── participants.json
│       └── participants.tsv
└── metadata.csv

The example zip file Example_RUB_import_data.zip has the datasets and the metadata.csv structured as needed.
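Before uploading, the layout described above could be checked locally. The following is a minimal sketch, assuming the data has been prepared in a local directory; the function name check_import_layout is hypothetical and not part of ReSeeD or Bulkrax. It only verifies that metadata.csv exists and that every dataset_path contains a data directory.

```python
import csv
import tempfile
from pathlib import Path


def check_import_layout(root):
    """Return a list of problems found under root (an empty list means OK)."""
    root = Path(root)
    metadata = root / "metadata.csv"
    if not metadata.is_file():
        return [f"missing {metadata}"]
    problems = []
    with metadata.open(encoding="utf-8", newline="") as handle:
        # The header is row 1, so the first dataset is row 2.
        for row_no, row in enumerate(csv.DictReader(handle), start=2):
            dataset_path = (row.get("dataset_path") or "").strip()
            if not dataset_path:
                problems.append(f"row {row_no}: dataset_path is empty")
                continue
            data_dir = root / dataset_path / "data"
            if not data_dir.is_dir():
                problems.append(f"row {row_no}: {data_dir} is not a directory")
    return problems


# Small demonstration with a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset1" / "data").mkdir(parents=True)
    (root / "dataset2").mkdir()  # deliberately lacks a data directory
    (root / "metadata.csv").write_text(
        "title,dataset_path\nFirst,dataset1\nSecond,dataset2\n",
        encoding="utf-8",
    )
    problems = check_import_layout(root)

print(problems)
```

In the demonstration only dataset2 is reported, because it has no data directory.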

Steps to run an import

Upload the data you want to import (for example, the unzipped contents of Example_RUB_import_data.zip) into the S3 bucket that ReSeeD has access to.

For example: cl-reseed_import

This bucket name needs to be entered in the form field "Specify a bucket name with prefix".
Log into ReSeeD as an administrator.
On the dashboard you should see the options Importers and Exporters. Click on Importers.
On the importers page, click on New in the top left corner. This opens the importer form.

Fill in the importer form with the values listed in the table below and click on Create and import.

Field name	Value	Note
Name	Any identifiable name for the import
Administrative Set	RUB publication workflow	This will apply this workflow to all imported datasets
Frequency	Once	We are running a one-off import
Limit	0 or leave blank	This will import all records in the metadata.csv file
Parser	CSV from S3 - ReSeed CSV parser for work (Datasets) from local S3	This will choose the parser for ReSeeD
Visibility	Private	The workflow will need all datasets to be private until published
Rights statement	Leave blank	It will pick up the rights statement from the csv file
Specify a bucket name with prefix	cl-reseed_import	The bucket name with the prefix. You could also add a path within the bucket, for example: `cl-reseed_import/set1`
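The upload in the first step can be scripted. The sketch below only works out which object key each local file would get under a prefix such as set1, so that the bucket ends up with the structure shown earlier. The helper name plan_upload is hypothetical, and the choice of S3 client is an assumption: the comment at the end mentions boto3 as one possibility, but any S3 tool would do.

```python
import tempfile
from pathlib import Path, PurePosixPath


def plan_upload(local_root, prefix=""):
    """Yield (local_file, object_key) pairs for every file under local_root."""
    local_root = Path(local_root)
    for path in sorted(local_root.rglob("*")):
        if path.is_file():
            # Object keys always use forward slashes, whatever the local OS.
            relative = path.relative_to(local_root).as_posix()
            key = str(PurePosixPath(prefix) / relative) if prefix else relative
            yield path, key


# Small demonstration with a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset1" / "data").mkdir(parents=True)
    (root / "dataset1" / "data" / "mydata.json").write_text("{}", encoding="utf-8")
    (root / "metadata.csv").write_text("title,dataset_path\n", encoding="utf-8")
    keys = [key for _, key in plan_upload(root, prefix="set1")]

print(keys)

# Assumption: with boto3 installed and credentials configured, each pair could
# be uploaded with the S3 client's upload_file(Filename, Bucket, Key) method,
# using the bucket name cl-reseed_import from the example above.
```

With this plan, the form field "Specify a bucket name with prefix" would be filled with cl-reseed_import/set1.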

The metadata CSV format

Column header	Cardinality	Format	Example 1	Example 2
title	One	String. The title of the dataset.	Test dataset 1 for import	Test dataset 2 for import
dataset_path	One	String. Path of the dataset folder, relative to the directory containing metadata.csv.	dataset1	dataset2
alternative_title	Zero or more	String. The alternative title(s) of the dataset. Multiple values should be separated with a semi-colon.	The rhythms of old men who hit things with sticks	The rhythms of old men who hit things with sticks; Huh?
description	Zero or one	String. Description of the dataset.	A collection of rhythms from veteran rock drummers	A collection of rhythms from veteran rock drummers
contributor	Zero or more	Names should be entered in the format LAST_NAME, FORENAME(S). Multiple contributors should be separated with a semi-colon. The order of names is significant in relating them to contributor_orcid and contributor_affiliation.	Starr, Ringo; Bonham, John; Densmore, John; Moon, Keith	Starr, Ringo; Bonham, John; Densmore, John; Moon, Keith
contributor_orcid	Zero or more	ORCIDs should be entered in their full https format. The order of ORCIDs is significant in relating them to contributor. ORCIDs should be separated with a semi-colon. The value should ideally have the same number of semi-colons as contributor.	;;https://orcid.org/0000-0001-5109-3700;	https://orcid.org/0000-0001-0001-3700;;;
contributor_affiliation	Zero or more	String. The order of affiliations is significant in relating them to contributor. Affiliations should be separated with a semi-colon. The value should ideally have the same number of semi-colons as contributor.	The Beatles; Led Zeppelin; The Doors; The Who	The Beatles;;The Doors;
creator	One or more	Names should be entered in the format LAST_NAME, FORENAME(S). Multiple creators should be separated with a semi-colon. The order of names is significant in relating them to creator_orcid and creator_affiliation.	Lennon, John	Lennon, John; McCartney, Paul
creator_orcid	One or more	ORCIDs should be entered in their full https format. The order of ORCIDs is significant in relating them to creator. ORCIDs should be separated with a semi-colon. The value should ideally have the same number of semi-colons as creator.	https://orcid.org/0000-0001-5109-3700	https://orcid.org/0000-0001-5109-3700;https://orcid.org/0000-0001-5109-3701
creator_affiliation	One or more	String. The order of affiliations is significant in relating them to creator. Affiliations should be separated with a semi-colon. The value should ideally have the same number of semi-colons as creator.	The Beatles	The Beatles;The Beatles
keyword	One or more	String. Multiple keywords should be separated with a semi-colon.	drumming	drumming; pop stars
resource_type	One or more	Must be one or more of: Book, BookChapter, Collection, ComputationalNotebook, ConferencePaper, DataPaper, Dataset, Dissertation, Event, Image, InteractiveResource, Journal, JournalArticle, Model, OutputManagementPlan, PeerReview, PhysicalObject, Preprint, Report, Service, Software, Sound, Standard, Text, Workflow, Other. If the value is not one of the allowed values, we will set it to Dataset.	Dataset	Dataset
license	One	Must be one of the following URIs (if the license URI is not one of the allowed values, we will ignore it): http://rightsstatements.org/vocab/InC/1.0/ https://creativecommons.org/licenses/by/4.0/ https://creativecommons.org/licenses/by-sa/4.0/ https://creativecommons.org/licenses/by-nd/4.0/ https://creativecommons.org/licenses/by-nc/4.0/ https://creativecommons.org/licenses/by-nc-nd/4.0/ https://creativecommons.org/licenses/by-nc-sa/4.0/ http://creativecommons.org/publicdomain/zero/1.0/ http://creativecommons.org/publicdomain/mark/1.0/ http://www.apache.org/licenses/LICENSE-2.0 http://www.gnu.org/licenses/gpl.html http://opensource.org/licenses/MIT	http://creativecommons.org/publicdomain/mark/1.0/	http://opensource.org/licenses/MIT
date	Zero or more	Dates should be entered in the format YYYY-MM-DD followed by a date type, for example: 2024-05-29 Created. Multiple dates should be separated with a semi-colon. Each date must have a date type, which must be one of the following: Accepted, Available, Copyrighted, Collected, Created, Deposited, Published *, Recorded, Registered, Submitted, Updated, Archived. If the date type is not one of the allowed values, we will ignore the date and the type. The dates entered here are all metadata dates. The system dates are saved in create_date, date_modified, modified_date, date_uploaded. The published date, if entered above, will be overwritten when you go through the submission and review workflow.	2024-05-29 Created; 2024-06-10 Published	2024-05-29 Created; 2024-06-10 Published
subject	Zero or more	String. Multiple subjects should be separated with a semi-colon.	drumming	Drumming; music
language	Zero or more	String. Multiple languages should be separated with a semi-colon.	English	English
location	Zero or more	String. Multiple locations should be separated with a semi-colon.	London
software_version	Zero or more	String. Multiple software versions should be separated with a semi-colon.
funder_identifier	Zero or more	Identifiers should be entered as full URIs. Multiple funder identifiers should be separated with a semi-colon. The order of identifiers is significant in relating them to funder_name, award_number, award_uri, and award_title.	http://dx.doi.org/10.13039/501100001659	http://dx.doi.org/10.13039/501100001659;http://dx.doi.org/10.13039/50110000165999
funder_name	Zero or more	Multiple funder names should be separated with a semi-colon. The value should ideally have the same number of semi-colons as funder_identifier. The order of funder names is significant in relating them to funder_identifier, award_number, award_uri, and award_title.	DFG	DFG;RUB
award_number	Zero or more	Multiple award numbers should be separated with a semi-colon. The value should ideally have the same number of semi-colons as funder_identifier. The order of award numbers is significant in relating them to funder_identifier, funder_name, award_uri, and award_title.	A0001	A0001;W3asxa3
award_uri	Zero or more	Multiple award URIs should be separated with a semi-colon. The value should ideally have the same number of semi-colons as funder_identifier. The order of award URIs is significant in relating them to funder_identifier, funder_name, award_number, and award_title.
award_title	Zero or more	Multiple award titles should be separated with a semi-colon. The value should ideally have the same number of semi-colons as funder_identifier. The order of award titles is significant in relating them to funder_identifier, funder_name, award_number, and award_uri.
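Two rules from the table above are easy to get wrong by hand: keeping related multi-valued columns aligned (same number of semi-colons) and giving every date an allowed date type. The sketch below shows how a row could be checked before import. It is not the importer's own logic; the helper names split_values, misaligned, and parse_dates are hypothetical, and skipping dates that do not parse as YYYY-MM-DD is an assumption, since the table only states what happens to unknown date types.

```python
from datetime import date, datetime

# Allowed date types, copied from the date row of the table.
DATE_TYPES = {
    "Accepted", "Available", "Copyrighted", "Collected", "Created",
    "Deposited", "Published", "Recorded", "Registered", "Submitted",
    "Updated", "Archived",
}


def split_values(cell):
    """Split a multi-valued cell on semi-colons, keeping empty positions."""
    return [part.strip() for part in cell.split(";")] if cell.strip() else []


def misaligned(row, primary, dependents):
    """Return the dependent columns whose value count differs from primary."""
    expected = len(split_values(row.get(primary, "")))
    return [
        name for name in dependents
        if row.get(name, "").strip()
        and len(split_values(row[name])) != expected
    ]


def parse_dates(cell):
    """Return (date, type) pairs, skipping entries that are not valid."""
    parsed = []
    for entry in split_values(cell):
        parts = entry.split()
        if len(parts) != 2 or parts[1] not in DATE_TYPES:
            continue  # unknown or missing date type: date and type are ignored
        try:
            day = datetime.strptime(parts[0], "%Y-%m-%d").date()
        except ValueError:
            continue  # assumption: malformed dates are skipped as well
        parsed.append((day, parts[1]))
    return parsed


row = {
    "contributor": "Starr, Ringo; Bonham, John; Densmore, John; Moon, Keith",
    "contributor_orcid": ";;https://orcid.org/0000-0001-5109-3700;",
    "contributor_affiliation": "The Beatles; Led Zeppelin; The Doors",
    "date": "2024-05-29 Created; 2024-06-10 Published; 2024-07-01 Invented",
}

bad_columns = misaligned(
    row, "contributor", ["contributor_orcid", "contributor_affiliation"]
)
dates = parse_dates(row["date"])
print(bad_columns)  # ['contributor_affiliation']
print(dates)
```

Here contributor_affiliation is flagged because it lists three values for four contributors, and the date with the type Invented is dropped. Since the table says the counts should "ideally" match, a mismatch is best treated as a warning to review rather than a hard error.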