Identification of putative paralogous sequences #1

Open
mossmatters opened this issue May 17, 2019 · 0 comments

Identifying potentially paralogous sequences is critical to accurate phylogenies. If there is unknown paralogy in an otherwise single-copy gene, recovering a consensus sequence may lead to phylogenies that are incorrect (if different samples have consensus sequences representing different paralogs) or unresolved (if the consensus sequence is a mixture of bases from the paralogs).

In HybPiper, "paralog warnings" are a byproduct of the assembly process. If multiple paralogs exist, the graph-based assembler (SPAdes) will assemble multiple contigs, and HybPiper identifies the presence of two or more contigs that map to the reference sequence. With the overlap assembler we can't rely on this. My ideas:

  1. Use the overlap assembler to identify positions with variable base calls (e.g. from a samtools pileup).
  2. Filter out PCR and sequencing errors by requiring a minimum number of reads to support an ambiguous base call.
  3. Produce statistics about the within-sequence variability that users can compile across many samples to determine if the variability may reflect paralogy.
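
A minimal sketch of how these three steps could fit together for one gene in one sample. Everything here is an assumption for illustration: the input is a simplified list of strings (one string of base calls per reference position) rather than real samtools output, the function names `variable_positions` and `variability_summary` are hypothetical, and the thresholds are arbitrary placeholders, not recommended values.

```python
from collections import Counter


def variable_positions(columns, min_minor_reads=3, min_minor_fraction=0.1):
    """Find positions where a second base is supported by enough reads.

    columns: list of strings, one per reference position, holding the base
    calls of the reads covering that position (hypothetical simplified input).
    Returns a list of (index, major_base, minor_base, minor_fraction).
    """
    hits = []
    for i, column in enumerate(columns):
        # Count only unambiguous base calls; ignore N, gaps, and other symbols.
        counts = Counter(b.upper() for b in column if b.upper() in "ACGT")
        if len(counts) < 2:
            continue  # no coverage, or every read agrees
        depth = sum(counts.values())
        (major, _), (minor, minor_n) = counts.most_common(2)
        # Requiring a minimum number of supporting reads is the filter for
        # PCR and sequencing errors; the fraction guards high-depth sites.
        if minor_n >= min_minor_reads and minor_n / depth >= min_minor_fraction:
            hits.append((i, major, minor, minor_n / depth))
    return hits


def variability_summary(columns, **thresholds):
    """Summarize within-sequence variability for one gene in one sample."""
    hits = variable_positions(columns, **thresholds)
    covered = sum(1 for c in columns if any(b.upper() in "ACGT" for b in c))
    return {
        "covered_positions": covered,
        "variable_positions": len(hits),
        "variable_fraction": len(hits) / covered if covered else 0.0,
    }


example = ["AAAAAAAA", "AAAACCCC", "GGGGGGGT", "TTTT"]
print(variable_positions(example))
print(variability_summary(example))
```

In the example, only the second position passes the filter: the single `T` in the third column is treated as a likely error because it is supported by one read. The summary does not try to say whether the variability is heterozygosity or paralogy; it only reports how much there is.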

This approach is an improvement over HybPiper in a few areas:

  1. HybPiper cannot identify putative paralogs if the contigs do not represent a large portion of the targeted gene. Here, we would be summarizing variability across the whole gene, so even targeted loci with long introns could be flagged.
  2. HybPiper makes an arbitrary distinction between heterozygosity and paralogy that can't be supported without building gene trees from many samples. Here, a summary statistic would be presented instead, leaving the user to decide whether paralogs exist.
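
As a rough illustration of compiling such a statistic across many samples, the sketch below takes a per-gene, per-sample fraction of variable positions and reports how many samples exceed a cutoff. The input layout, the function name `compile_across_samples`, and both thresholds are hypothetical choices for this example; the `flagged` field is only a convenience for sorting, not a call on whether paralogs exist.

```python
from statistics import median


def compile_across_samples(stats, min_fraction=0.01, min_sample_share=0.5):
    """Compile a per-sample variability statistic into a per-gene report.

    stats: {gene: {sample: variable_fraction}} (hypothetical layout).
    A gene is flagged when at least `min_sample_share` of its samples have a
    variable fraction of `min_fraction` or more. Both cutoffs are placeholders.
    """
    report = {}
    for gene, per_sample in stats.items():
        values = list(per_sample.values())
        high = sum(1 for v in values if v >= min_fraction)
        report[gene] = {
            "samples": len(values),
            "median_variable_fraction": median(values) if values else None,
            "samples_above_threshold": high,
            "flagged": bool(values) and high / len(values) >= min_sample_share,
        }
    return report


example = {
    "gene1": {"s1": 0.0, "s2": 0.001, "s3": 0.0},
    "gene2": {"s1": 0.03, "s2": 0.02, "s3": 0.0},
}
for gene, row in compile_across_samples(example).items():
    print(gene, row)
```

A gene with elevated variability in most samples is a different pattern from one that is variable in a single sample, and a table like this lets the user see that difference before deciding how to treat the locus.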