structure format #58
Comments
SOLUTION: I was able to convert the GATK4 generated vcf file into a bed file plus the .fam and .bim files required by fastSTRUCTURE using plink2.
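The exact command is not given in the comment above, so the following is only a minimal sketch of what such a conversion could look like. The helper name `vcf_to_bed` and the `PLINK2` override variable are assumptions made for illustration; `--vcf`, `--make-bed` and `--out` are standard plink2 options.

```shell
# vcf_to_bed VCF PREFIX
# Hypothetical helper: writes PREFIX.bed, PREFIX.bim and PREFIX.fam from a VCF.
# PLINK2 can be set to the plink2 executable to use (defaults to "plink2").
vcf_to_bed() {
    # $1: input VCF, $2: output prefix (no extension)
    "${PLINK2:-plink2}" --vcf "$1" --make-bed --out "$2"
}
```

For example, `vcf_to_bed filtered.vcf mydata` (both names are placeholders) should leave mydata.bed, mydata.bim and mydata.fam next to each other, and the prefix could then be passed to structure.py with `--input=mydata --format=bed`. Datasets with scaffold-style chromosome names may additionally need plink2's `--allow-extra-chr`.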
I had the same issue while using the populations module of Stacks. I used it to convert the VCF to STRUCTURE format, but I had to fix the file a little afterwards to make it acceptable to fastStructure:
populations -V filtered.vcf --structure
mv filtered.p.structure filtered.p.str # fix file extension
The file had an incorrect header, with a comment on line 1 and missing column names for the first two columns on line 2:
$ awk '{print NF}' filtered.p.str | head -3
8
217474
217476
$ head -2 filtered.p.str | cut -f 1-8
# Stacks v2.1; Structure v2.3; December 04, 2020
1_0 2_0 4_0 5_0 7_0 9_0
Note the two leading tabs at the beginning of line 2! So I had to remove the comment and add the two missing column names at the beginning of the header line:
$ awk 'NR>1' filtered.p.str | sed 's/\t\t1_0/id\tpop\t1_0/' > filtered.p.fixed.str
After this fix, the file had the correct format and a consistent column count:
$ awk '{print NF}' filtered.p.fixed.str | head -3
217476
217476
217476
$ head -1 filtered.p.fixed.str | cut -f 1-8
id pop 1_0 2_0 4_0 5_0 7_0 9_0
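The header fix above can also be written as a single awk pass. This is a sketch rather than the commenter's method: the function names `fix_str_header` and `count_columns` are made up for illustration, and unlike `awk 'NR>1'` it drops every line starting with `#` instead of only the first line.

```shell
# fix_str_header IN OUT
# Hypothetical helper: removes comment lines and names the two blank
# leading header columns "id" and "pop" in a tab-separated STRUCTURE file.
fix_str_header() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                       # skip comment lines
         !seen++ && $1 == "" && $2 == "" {   # first remaining line with two empty leading fields
             $1 = "id"; $2 = "pop"
         }
         { print }' "$1" > "$2"
}

# count_columns FILE
# Prints each distinct whitespace-separated field count found in FILE,
# one per line; a consistent file prints exactly one number.
count_columns() {
    awk '{ print NF }' "$1" | sort -nu
}
```

Running `fix_str_header filtered.p.str filtered.p.fixed.str` followed by `count_columns filtered.p.fixed.str` should print a single number if every line has the same number of columns.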
However, the format is still wrong, as Stacks currently doesn't convert missing values to
Hi,
$ grep -v "#" populations.snps.vcf | sed -E 's/^/X_/g' > body.snps.vcf
plink produces plink.bed, plink.bim, plink.fam (and two more), and you could then run fastStructure like so
Hello,
I am running fastStructure on Ubuntu 18. The test data provided (bed files) works in my environment.
I have a large SNP dataset from a complete chromosome. I converted the VCF to Structure format using PGDSpider, and the file works in Structure, but I cannot get it to work in fastStructure.
I use the following command:
python structure.py -K 3 --format=str --input=Tse_43samples_scaff_22_SNP-INDEL_variants_filt_rnd2_TRUNC2 --output=Tse43_Chr24 --full --seed=100 --prior=logistic
and get the error:
Traceback (most recent call last):
The file looks like the following for 43 taxa, so 86 rows + Header:
SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10
AHP1168 2 1 3 3 4 2 2 1 2 2
AHP1168 2 1 3 3 4 2 2 1 2 2
AHP2709 -9 1 3 3 -9 -9 2 3 2
AHP2709 -9 1 3 3 -9 -9 2 3 2
NOTE: My Structure input file has the .str extension, but fastStructure appends another .str to the name and then says it cannot find the file. So on the command line I omit the .str extension, and it gets past file loading.
Thanks in advance.
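Two quick checks may help narrow this down before running structure.py. In the excerpt above the AHP2709 rows show nine values under ten SNP headers, which may only be a copy-and-paste artifact, but rows of uneven length are worth ruling out. The sketch below uses hypothetical helper names and assumes whitespace-separated fields, with line 2 taken as the reference row.

```shell
# report_ragged_rows FILE
# Hypothetical helper: prints one message for every line after line 2
# whose field count differs from that of line 2 (the first data row).
report_ragged_rows() {
    awk 'NR == 2 { expected = NF }
         NR > 2 && NF != expected {
             printf "line %d: %d fields, expected %d\n", NR, NF, expected
         }' "$1"
}

# strip_str_ext PATH
# Prints PATH without a trailing ".str", i.e. the form to pass to --input,
# since fastStructure appends the extension itself (as noted above).
strip_str_ext() {
    printf '%s\n' "${1%.str}"
}
```

`report_ragged_rows` prints nothing when all data rows have the same length, and `strip_str_ext` gives the prefix to use in `--input=...`.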