Dataset specification

The dataset specification describes a set of data that can be downloaded, along with information about that data, for example, which outbreak code it belongs to.

The spreadsheet for the dataset is also meant to be read by the script GenFSGopher.pl so that the whole dataset can be downloaded automatically.

Detailed fields

The spreadsheet is divided into two parts:

the information describing the whole dataset and
the information describing each sample.

Whole dataset information

The first part describes the dataset as a whole, in a two-column key/value format. The keys are case-insensitive, but the values are case-sensitive. The order of rows is unimportant.

Field	Description	Example
Organism	Usually genus and species, but there is no hard rule at this time.	SARS-CoV-2
Outbreak	This is usually an outbreak code but can be some other descriptor of the dataset.	1408MLGX6-3WGS
pmid	Any publications associated with this dataset should be listed as pubmed IDs.
tree	This is a URL to the newick-formatted tree. This tree serves as a guide to future analyses.	`https://...`
source	Where did this dataset come from?	Cheryl Tarr
intendedUsage	How do you think others will use this dataset?	cluster analysis
dataType	A description of the data	Outbreak clade and one outgroup with Illumina only
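As an illustration of the key/value rules above, here is a minimal Python sketch that reads the first part of a tab-separated export of the spreadsheet. The function name `parse_dataset_info` is hypothetical and is not part of this repo; the sketch assumes the section ends at the first blank row and that any cells after the second column can be ignored.

```python
def parse_dataset_info(lines):
    """Collect key/value rows until the first blank row; keys are lowercased."""
    info = {}
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        if not any(cell.strip() for cell in cells):
            break  # a blank row ends the whole-dataset section
        key = cells[0].strip().lower()  # keys are case-insensitive
        value = cells[1].strip() if len(cells) > 1 else ""
        info[key] = value  # values keep their case
    return info


example = [
    "Organism\tSARS-CoV-2\n",
    "OUTBREAK\t1408MLGX6-3WGS\n",
    "\t\t\n",
    "biosample_acc\tstrain\n",
]
info = parse_dataset_info(example)
print(info)
```

Lowercasing the keys on read means `Outbreak`, `OUTBREAK`, and `outbreak` all refer to the same entry, while the values are stored unchanged.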

blank row

A blank row in the spreadsheet separates the whole dataset information from the header row.

Header row

The header row contains the field names described in the following section, such as biosample_acc. Not all fields are required. The field names are case-insensitive and can be in any order.

Extra fields that are not named in this specification are discouraged because other fields might be added to it later, but they will not prevent GenFSGopher.pl from working.

Sample information

Required fields are marked with ✔️. Fields marked with ✳️ are optional but must contain `-` if the information is not present. Fields marked with ❎ are optional. Use `-` to indicate absent data: previous versions of this repo allowed `NA`, but the current version requires `-`.

Field	Required?	Description	Example
`biosample_acc`	✔️	The BioSample accession	SAMN012345
`strain`	✔️	The name of the genome or strain
`genbankAssembly`	✳️	GenBank accession number	GCA_027920385.1
`SRArun_acc`	✳️	SRR accession number	SRR012345
`outbreak`	❎	The name of the outbreak clade, usually named after an outbreak code. If the sample is not part of an important clade, the field can be filled in with `outgroup`.
`dataSetName`	❎	This should be redundant with the Outbreak field in the first part of the spreadsheet.
`suggestedReference`	❎	Whether this sample is the suggested reference genome for analysis, e.g., SNP analysis.	`TRUE` or `FALSE`
`sha256sumAssembly`	✳️	A checksum for the GenBank file
`sha256sumRead1`	✳️	A checksum for the first read from the SRR accession
`sha256sumRead2`	✳️	A checksum for the second read from the SRR accession	`e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855` is the sha256 checksum of empty input, a special case used when the second read is blank.
`nucleotide`	✳️	A single nucleotide accession. This is sometimes an alternative to an assembly, especially for one-contig genomes.
`sha256sumnucleotide`	✳️	A checksum for the single nucleotide accession
`amplicon_strategy`	❎	Which amplicon strategy was used?	`ARTIC V3`
`AMR_genotype`	❎	The antimicrobial resistance genotype, comma separated	`mdsB,mdsA,golT`
`Plasmids`	❎	Plasmids present, comma separated	`IncFIB(S),IncFII(S),IncX4`
`organism`	❎	The scientific name of the sample, or more taxonomic information as needed	Acinetobacter baumannii
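The marker rules in the table can be checked mechanically. The following Python sketch is one possible reading of them and is not how GenFSGopher.pl validates rows: the function name `check_sample` is hypothetical, and it assumes that a required field must hold a real value and that a ✳️ field, when its column is present, must not be empty or `NA`.

```python
# Field names are lowercased because header names are case-insensitive.
REQUIRED = {"biosample_acc", "strain"}
DASH_IF_ABSENT = {
    "genbankassembly", "srarun_acc", "sha256sumassembly",
    "sha256sumread1", "sha256sumread2", "nucleotide", "sha256sumnucleotide",
}


def check_sample(row):
    """Return a list of problems found in one sample row (a dict of field -> value)."""
    row = {key.strip().lower(): value.strip() for key, value in row.items()}
    problems = []
    for field in sorted(REQUIRED):
        if row.get(field, "") in ("", "-"):
            problems.append(f"{field}: required value is missing")
    for field in sorted(DASH_IF_ABSENT):
        if field in row and row[field] in ("", "NA"):
            problems.append(f"{field}: use '-' for absent data")
    return problems


good = {"biosample_acc": "SAMN32538168", "strain": "C308",
        "genBankAssembly": "GCA_027920385.1", "SRArun_acc": "-"}
bad = {"biosample_acc": "SAMN32538168", "strain": "", "SRArun_acc": "NA"}
print(check_sample(good))
print(check_sample(bad))
```

The first row passes with no problems; the second is flagged for the empty `strain` and for using `NA` instead of `-`.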

Creating hashsums

This specification uses sha256 to calculate hashsums. To create a hashsum for a file, e.g., file.fastq, run the following:

sha256sum file.fastq
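If `sha256sum` is not available, the same value can be computed with Python's standard `hashlib` module. This is only a sketch: the helper name `sha256_of_file` is hypothetical, and the file is read in chunks so that large read files do not need to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex sha256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# The digest of empty input is the value listed for a blank second read.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()
print(EMPTY_SHA256)
```

Calling `sha256_of_file("file.fastq")` returns the same hexadecimal string that `sha256sum file.fastq` prints before the file name.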

We include a script adjustHashsums.pl to help create hashsums automatically in the spreadsheet. Here are the suggested steps:

Create the spreadsheet as described above in the detailed fields. Do not include hashsum values in the relevant fields.
Run GenFSGopher.pl using your new spreadsheet. It will fail because the hashsums are incorrect.
A file in.tsv, identical to the input file, should now be in the output directory.
Run adjustHashsums.pl on in.tsv to create a file out.tsv.
out.tsv will have the correct hashsums.

Example

intendedUsage	Fast assembly of ONT data									
Organism	Staphylococcus aureus									
source	George Bouras									
pmid	-									
dataType	toy dataset for ONT assembly and AMR									
Outbreak	-									
tree	-									
										
SRArun_acc	biosample_acc	genBankAssembly	nucleotide	outbreak	sha256sumAssembly	sha256sumRead1	sha256sumRead2	sha256sumnucleotide	strain	
-	SAMN32538168	GCA_027920385.1	-	-	-	-	-	-	C308	
SRR22859991	SAMN32360857	-	-	-	-	02d46259b402e83c62b143e96e2dc6761f86b1ac9bd7dfccf9c27f60492afc85	e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855	-	C113	
SRR22859768	SAMN32360972	-	-	-	-	fe7b008a59b3aadfccbfe5f8325bf79e9933fe6d44e0956d68e74eba6230ad2f	e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855	-	C347
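To show how the parts of the example fit together, here is a minimal Python sketch that reads the sample rows that follow the blank row. The function name `parse_samples` is hypothetical, the input is an abridged copy of the example above (fewer columns and rows), and the sketch assumes a tab-separated export in which trailing tabs produce empty columns that can be dropped.

```python
import csv
import io


def parse_samples(text):
    """Return sample rows (dicts keyed by lowercased field name) found after the blank row."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    # The first row whose cells are all empty separates the two parts.
    blank = next(i for i, row in enumerate(rows) if not any(c.strip() for c in row))
    header = [name.strip().lower() for name in rows[blank + 1]]
    samples = []
    for row in rows[blank + 2:]:
        if not any(cell.strip() for cell in row):
            continue  # ignore stray empty rows after the samples
        # Columns with an empty header name (trailing tabs) are skipped.
        samples.append({h: v.strip() for h, v in zip(header, row) if h})
    return samples


text = (
    "Organism\tStaphylococcus aureus\t\t\t\n"
    "source\tGeorge Bouras\t\t\t\n"
    "\t\t\t\t\n"
    "SRArun_acc\tbiosample_acc\tgenBankAssembly\tstrain\t\n"
    "-\tSAMN32538168\tGCA_027920385.1\tC308\t\n"
    "SRR22859991\tSAMN32360857\t-\tC113\t\n"
)
samples = parse_samples(text)
print(samples[0])
```

Because the header names are lowercased on read, `genBankAssembly` in this example and `genbankAssembly` in the field table resolve to the same key.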

Create your own

(future instructions for when Makefile is in place)

To create your own, create a spreadsheet with at least the required fields as defined above, i.e., biosample_acc and strain. If you include a data field such as genbankAssembly, then you must also include its accompanying sha256 field, such as sha256sumAssembly.

For the values of the sha256 fields, use 1 as a placeholder.

Next, run make all (this will fail because of the placeholder sha256sums), followed by make dataset.tsv.
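Before running anything, it can help to confirm that each data field in the header has its accompanying sha256 field. The Python sketch below is an assumption-laden illustration: the function name `missing_checksum_columns` is hypothetical, and the pairing of `SRArun_acc` with the two read checksums and of `nucleotide` with `sha256sumnucleotide` is inferred from the field descriptions rather than stated as a rule in this specification.

```python
# Data field -> checksum fields that should accompany it (names lowercased).
CHECKSUM_FIELDS = {
    "genbankassembly": ["sha256sumassembly"],
    "srarun_acc": ["sha256sumread1", "sha256sumread2"],
    "nucleotide": ["sha256sumnucleotide"],
}


def missing_checksum_columns(header):
    """Return checksum field names that the header needs but does not contain."""
    names = {name.strip().lower() for name in header}
    missing = []
    for data_field, checksum_fields in CHECKSUM_FIELDS.items():
        if data_field in names:
            missing.extend(f for f in checksum_fields if f not in names)
    return missing


header = ["biosample_acc", "strain", "genbankAssembly", "SRArun_acc", "sha256sumRead1"]
print(missing_checksum_columns(header))
```

For this header the sketch reports `sha256sumassembly` and `sha256sumread2` as missing; an empty list means every data field has its checksum column.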
