structure format #58

Open
wbsimey opened this issue Jul 13, 2020 · 4 comments
Comments

@wbsimey

wbsimey commented Jul 13, 2020

Hello,
I am running fastStructure on Ubuntu 18. The provided test data (bed files) works in my environment.
I have a large SNP dataset from a complete chromosome. I converted the VCF to STRUCTURE format using PGDSpider, and the file works in Structure, but I cannot get it to work in fastStructure.

I use the following command:
python structure.py -K 3 --format=str --input=Tse_43samples_scaff_22_SNP-INDEL_variants_filt_rnd2_TRUNC2 --output=Tse43_Chr24 --full --seed=100 --prior=logistic

and get the error:

Traceback (most recent call last):

  File "structure.py", line 172, in <module>
    G = parse_str.load(params['inputfile'])
  File "parse_str.pyx", line 10, in parse_str.load
    L = loci.shape[1]
IndexError: tuple index out of range

The file looks like the following for 43 taxa, so 86 rows + Header:
SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10
AHP1168 2 1 3 3 4 2 2 1 2 2
AHP1168 2 1 3 3 4 2 2 1 2 2
AHP2709 -9 1 3 3 -9 -9 2 3 2
AHP2709 -9 1 3 3 -9 -9 2 3 2
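
The `IndexError` on `loci.shape[1]` means the parsed array has fewer than two dimensions. One plausible cause (an assumption, not confirmed in this thread) is that the rows do not all have the same number of fields, so the parser cannot build a rectangular matrix. A minimal Python sketch of a per-row field count, similar in spirit to `awk '{print NF}'`; `field_counts` is a hypothetical helper name, and the sample is the excerpt exactly as posted above:

```python
from collections import Counter


def field_counts(lines):
    """Count how many non-empty lines have each number of whitespace-separated fields."""
    return Counter(len(line.split()) for line in lines if line.strip())


# The excerpt from the post above, copied verbatim.
sample = [
    "SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10",
    "AHP1168 2 1 3 3 4 2 2 1 2 2",
    "AHP1168 2 1 3 3 4 2 2 1 2 2",
    "AHP2709 -9 1 3 3 -9 -9 2 3 2",
    "AHP2709 -9 1 3 3 -9 -9 2 3 2",
]

counts = field_counts(sample)
print(dict(counts))  # {10: 3, 11: 2}
```

In this excerpt the header has 10 fields while the data rows have 10 or 11, so the lines are not rectangular. On a real file, pass the open file handle instead of `sample`; more than one key in the result points to a ragged file.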

NOTE: My Structure input file has the .str extension, but fastStructure appends another .str to the name and then reports that it cannot find the file. So on the command line I omit the .str extension, and it gets past file loading.

Thanks in advance.

@wbsimey
Author

wbsimey commented Jul 15, 2020

SOLUTION: I was able to convert the GATK4-generated VCF file into a .bed file, plus the .fam and .bim files that fastStructure requires, using plink2.

plink2 --vcf 41samples_scaff_22.vcf --allow-extra-chr --out 41samples_scaff_22

@janxkoci

janxkoci commented Dec 4, 2020

I had the same issue while using the populations module from Stacks. I used it to convert VCF to STRUCTURE format, but I had to fix the file a little afterwards to make it acceptable to fastStructure. I used Stacks 2.1, as later versions (2.3 and 2.4 in particular) gave me some errors while converting files. You can install both Stacks and fastStructure using the conda package manager and the Bioconda channel.

populations -V filtered.vcf --structure

mv filtered.p.structure filtered.p.str # fix file extension

The file had an incorrect header, with a comment on line 1 and missing column names for the first two columns on line 2:

$ awk '{print NF}' filtered.p.str | head -3
8
217474
217476

$ head -2 filtered.p.str | cut -f 1-8
# Stacks v2.1;  Structure v2.3; December 04, 2020
		1_0	2_0	4_0	5_0	7_0	9_0

Note the two leading tabs at the beginning of line 2!

So I had to remove the comment and add the two missing column names at the beginning of the header line:

awk 'NR>1' filtered.p.str | sed 's/\t\t1_0/id\tpop\t1_0/' > filtered.p.fixed.str

After this fix, the file had the correct format and fastStructure no longer showed the error.

$ awk '{print NF}' filtered.p.fixed.str | head -3
217476
217476
217476

$ head -1 filtered.p.fixed.str | cut -f 1-8
id	pop	1_0	2_0	4_0	5_0	7_0	9_0
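
For anyone who prefers to do the header fix in Python, here is a minimal sketch of the same idea as the awk/sed one-liner. `fix_stacks_header` is a hypothetical helper, not part of Stacks or fastStructure, and the column names `id` and `pop` are just the placeholders used above. Unlike `awk 'NR>1'`, it drops line 1 only if that line really is a `#` comment:

```python
def fix_stacks_header(lines, first_cols=("id", "pop")):
    """Drop a leading '#' comment line and name the two blank header columns."""
    lines = list(lines)
    # Remove the Stacks comment line, if present.
    if lines and lines[0].startswith("#"):
        lines = lines[1:]
    # The header line starts with two tabs (two empty column names); fill them in.
    if lines and lines[0].startswith("\t\t"):
        lines[0] = "\t".join(first_cols) + "\t" + lines[0][2:]
    return lines


raw = [
    "# Stacks v2.1;  Structure v2.3; December 04, 2020\n",
    "\t\t1_0\t2_0\t4_0\n",
    "sample1\tpopA\t1\t2\t4\n",
]
fixed = fix_stacks_header(raw)
print(fixed[0], end="")
```

To apply it to a file, read the lines with `open(path).readlines()`, pass them through the function, and write the result with `writelines`.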

@janxkoci

janxkoci commented Dec 4, 2020

However, the format is still wrong, as Stacks currently doesn't convert missing values to -9. I reported the issue to the Stacks developers.
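
Until that is fixed upstream, the missing values could be recoded by hand. A rough sketch, under assumptions you should check against your own file: that missing genotypes are written as `0`, that the file is tab-separated, and that the first two columns are the sample id and population. `recode_missing` and its parameters are hypothetical names:

```python
def recode_missing(line, missing="0", n_meta=2):
    """Replace the missing-genotype token with -9, leaving the metadata columns alone."""
    fields = line.rstrip("\n").split("\t")
    meta, genotypes = fields[:n_meta], fields[n_meta:]
    genotypes = ["-9" if g == missing else g for g in genotypes]
    return "\t".join(meta + genotypes)


# Data rows only: skip the header line before applying this.
row = "sample1\tpopA\t1\t0\t4\n"
print(recode_missing(row))
```

The `n_meta` split keeps an id or population label that happens to be `0` from being rewritten.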

@KSteffen

KSteffen commented Sep 2, 2021

Hi,
regarding parsing the output of the Stacks 2 populations module:
I split the VCF file into header and body, prefix the locus/chromosome name of each record with 'X_', and recombine the two parts. Then I create the bed files (*.bed, *.bim, *.fam) with plink.
This is not a pretty solution but it worked for me.

$ grep -v "^#" populations.snps.vcf | sed -E 's/^/X_/' > body.snps.vcf
$ grep "^#" populations.snps.vcf > header
$ cat header body.snps.vcf > recombined.snps.vcf
$ plink --vcf recombined.snps.vcf --double-id -aec --make-bed

plink produces plink.bed, plink.bim, plink.fam (and two more files), and you can then run fastStructure like so:
$ structure.py -K 9 --input=plink --output=populations_k9 --seed=2021 --full --format=bed
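
The VCF rewriting step above can also be done in one pass in Python. This is only a sketch of the same idea: `prefix_chrom` is a hypothetical helper, and it assumes header lines are exactly those that start with `#`:

```python
def prefix_chrom(lines, prefix="X_"):
    """Prefix the first column (CHROM) of every VCF record; header lines pass through unchanged."""
    out = []
    for line in lines:
        if line.startswith("#"):
            out.append(line)           # meta-information and column header
        else:
            out.append(prefix + line)  # record line: CHROM is the first field
    return out


vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t25\t1:25:+\tA\tG\n",
]
for line in prefix_chrom(vcf):
    print(line, end="")
```

Because the header lines stay in place, there is no need to split and recombine the file.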
