structure format #58

wbsimey · 2020-07-13T20:49:39Z

Hello,
I am running fastStructure on Ubuntu 18. The test data provided (bed files) works in my environment.
I have a large SNP dataset from a complete chromosome. I converted the VCF to Structure format using PGDSpider, and the file works in Structure, but I cannot get it to work in fastStructure.

I use the following command:
python structure.py -K 3 --format=str --input=Tse_43samples_scaff_22_SNP-INDEL_variants_filt_rnd2_TRUNC2 --output=Tse43_Chr24 --full --seed=100 --prior=logistic

and get the error:

Traceback (most recent call last):

  File "structure.py", line 172, in <module>
    G = parse_str.load(params['inputfile'])
  File "parse_str.pyx", line 10, in parse_str.load
    L = loci.shape[1]
IndexError: tuple index out of range

The file looks like the following for 43 taxa, so 86 rows + Header:
SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10
AHP1168 2 1 3 3 4 2 2 1 2 2
AHP1168 2 1 3 3 4 2 2 1 2 2
AHP2709 -9 1 3 3 -9 -9 2 3 2
AHP2709 -9 1 3 3 -9 -9 2 3 2
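
The traceback fails at `loci.shape[1]`, which happens when the loaded array has only one dimension. One possible cause (an assumption, not confirmed in this thread) is rows with differing numbers of fields; in the excerpt above, the AHP2709 rows show 9 genotype values while the AHP1168 rows show 10. A minimal sketch for checking this, where `field_counts` and `is_rectangular` are hypothetical helper names and not part of fastStructure:

```python
from collections import Counter


def field_counts(path, skip_header=True):
    """Count how many whitespace-separated fields each data row has."""
    counts = Counter()
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            if skip_header and line_no == 0:
                continue  # the header row has no sample ID column
            if line.strip():  # ignore blank lines
                counts[len(line.split())] += 1
    return counts


def is_rectangular(path, skip_header=True):
    """True if every data row has the same number of fields."""
    return len(field_counts(path, skip_header)) <= 1
```

If `field_counts` returns more than one key, the rows are ragged and the offending lines need to be repaired before the file can be loaded as a table.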

NOTE: My Structure input file has the .str extension, but fastStructure appends another .str to the name and then says it cannot find the file. So on the command line I omit the .str extension, and it gets past file loading.

Thanks in advance.


wbsimey · 2020-07-15T16:23:22Z

SOLUTION: I was able to convert the GATK4-generated VCF file into a bed file plus the .fam and .bim files required by fastStructure using plink2.

plink2 --vcf 41samples_scaff_22.vcf --allow-extra-chr --make-bed --out 41samples_scaff_22

janxkoci · 2020-12-04T14:43:31Z

I had the same issue while using the populations module of Stacks. I used it to convert VCF to STRUCTURE format, but I had to fix the file a little afterwards to make it acceptable to fastStructure. I used Stacks 2.1, as later versions (2.3 and 2.4 in particular) gave me errors while converting files. You can install both Stacks and fastStructure using the conda package manager and the Bioconda channel.

populations -V filtered.vcf --structure

mv filtered.p.structure filtered.p.str # fix file extension

The file had an incorrect header, with a comment on line 1 and missing column names for the first two columns on line 2:

$ awk '{print NF}' filtered.p.str | head -3
8
217474
217476

$ head -2 filtered.p.str | cut -f 1-8
# Stacks v2.1;  Structure v2.3; December 04, 2020
		1_0	2_0	4_0	5_0	7_0	9_0

Note the two leading tabs at the beginning of line 2!

So I had to remove the comment and add the two missing column names at the beginning of the header line:

awk 'NR>1' filtered.p.str | sed 's/\t\t1_0/id\tpop\t1_0/' > filtered.p.fixed.str

After this fix, the file had the correct format and fastStructure no longer showed the error.

$ awk '{print NF}' filtered.p.fixed.str | head -3
217476
217476
217476

$ head -1 filtered.p.fixed.str | cut -f 1-8
id	pop	1_0	2_0	4_0	5_0	7_0	9_0
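
The awk/sed fix above depends on the first locus being named `1_0`. A minimal Python sketch of the same repair that does not depend on the locus name is shown here; `fix_stacks_header` is a hypothetical helper, and it assumes the file is tab-separated with comment lines starting with `#`, as in the excerpt:

```python
def fix_stacks_header(lines, names=("id", "pop")):
    """Drop '#' comment lines and name the empty leading header columns."""
    body = [ln for ln in lines if not ln.startswith("#")]
    if not body:
        return body
    header = body[0].rstrip("\n").split("\t")
    for i, name in enumerate(names):
        # Two leading tabs produce two empty strings at the front.
        if i < len(header) and header[i] == "":
            header[i] = name
    return ["\t".join(header) + "\n"] + body[1:]
```

It could be applied by reading the file with `readlines()`, passing the list through `fix_stacks_header`, and writing the result with `writelines()`.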

janxkoci · 2020-12-04T20:21:16Z

However, the format is still wrong, as Stacks currently doesn't convert missing values to -9. I have reported the issue to the Stacks developers.
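
Until that is resolved upstream, the missing values can be recoded by hand. The sketch below assumes that missing genotypes are written as `0` and that the first two tab-separated columns are the sample ID and population; both are assumptions to check against your own file, and `recode_missing` is a hypothetical helper name:

```python
def recode_missing(row, missing="0", n_meta=2, sep="\t"):
    """Replace the missing-genotype code with -9, leaving metadata columns alone."""
    fields = row.rstrip("\n").split(sep)
    meta, genotypes = fields[:n_meta], fields[n_meta:]
    genotypes = ["-9" if g == missing else g for g in genotypes]
    return sep.join(meta + genotypes)
```

Apply it to the data rows only, not the header line; the returned string has no trailing newline, so add one when writing the file back out.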

KSteffen · 2021-09-02T19:55:48Z

Hi,
regarding parsing the output of Stacks 2 populations:
I pulled apart the VCF file, masked the loci/chromosome names by prefixing them with 'X_', and recombined the VCF file. Then I created the bed files (*.bed, *.bim, *.fam) with plink.
This is not a pretty solution, but it worked for me.

$ grep -v "^#" populations.snps.vcf | sed -E 's/^/X_/' > body.snps.vcf
$ grep "^#" populations.snps.vcf > header
$ cat header body.snps.vcf > recombined.snps.vcf
$ plink --vcf recombined.snps.vcf --double-id -aec --make-bed

plink produces plink.bed, plink.bim, and plink.fam (plus two more files), and you can then run fastStructure like so:
$ structure.py -K 9 --input=plink --output=populations_k9 --seed=2021 --full --format=bed
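
The split, prefix, and recombine steps above can also be done in one pass without intermediate files. This is only a sketch of the same idea; `mask_chrom` is a hypothetical helper, and it assumes every VCF header line starts with `#`:

```python
def mask_chrom(lines, prefix="X_"):
    """Prefix the first field (CHROM) of every non-header VCF line."""
    return [ln if ln.startswith("#") else prefix + ln for ln in lines]
```

Reading `populations.snps.vcf` with `readlines()`, passing the list through `mask_chrom`, and writing it to `recombined.snps.vcf` gives a file that can be handed to the plink command above.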
