The identification of potentially paralogous sequences is critical to accurate phylogenies. If there is unknown paralogy in an otherwise single-copy gene, recovering a consensus sequence may lead to phylogenies that are incorrect (if different samples have consensus sequences representing different paralogs) or unresolved (if the consensus sequence is a mixture of bases from the paralogs).
In HybPiper, "paralog warnings" are a byproduct of the assembly process. If multiple paralogs exist, the graph-based assembler (SPAdes) will assemble multiple contigs, and HybPiper identifies the presence of two or more contigs that map to the reference sequence. With the overlap assembler we can't rely on this. My ideas:
Use the overlap assembler to identify positions with variable base calls (e.g. with samtools pileup).
Filter out PCR and sequencing errors by requiring a minimum number of reads to support an ambiguous base call.
Produce statistics about the within-sequence variability that users can compile across many samples to determine if the variability may reflect paralogy.
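As a rough illustration of these three steps, here is a minimal Python sketch. It assumes the pileup has already been parsed into one string of observed bases per reference position; that input format, the function names `variable_sites` and `summarize`, and the `min_reads` and `min_fraction` thresholds are all hypothetical choices for this example, not part of HybPiper or samtools.

```python
from collections import Counter


def variable_sites(pileup_columns, min_reads=3, min_fraction=0.2):
    """Return 0-based positions that carry a well-supported second base.

    pileup_columns: list of strings, one per reference position, holding the
    bases observed in the reads at that position (hypothetical input format).
    min_reads / min_fraction: illustrative thresholds meant to filter out
    PCR and sequencing errors.
    """
    sites = []
    for pos, column in enumerate(pileup_columns):
        counts = Counter(b for b in column.upper() if b in "ACGT")
        depth = sum(counts.values())
        if depth == 0 or len(counts) < 2:
            continue
        # The second most common base is the candidate alternative call.
        minor = counts.most_common(2)[1][1]
        if minor >= min_reads and minor / depth >= min_fraction:
            sites.append(pos)
    return sites


def summarize(pileup_columns, **thresholds):
    """Per-gene summary that could be compiled across many samples."""
    sites = variable_sites(pileup_columns, **thresholds)
    covered = sum(
        1 for c in pileup_columns if any(b in "ACGT" for b in c.upper())
    )
    return {
        "variable_sites": len(sites),
        "covered_sites": covered,
        "proportion_variable": len(sites) / covered if covered else 0.0,
    }
```

A table of `proportion_variable` per gene and sample would then let users look for loci that are consistently variable across samples, without the tool itself deciding between heterozygosity and paralogy.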
This approach is an improvement over HybPiper in a few areas:
HybPiper cannot identify putative paralogs if the contigs do not represent a large portion of the targeted gene. Here, we would be summarizing variability across the whole gene, so even targeted loci with long introns could be flagged.
HybPiper is making an arbitrary distinction between heterozygosity and paralogy that can't be supported without building gene trees from many samples. Here, a summary statistic would be presented instead, allowing the user to make a decision about whether paralogs exist.