Identification of putative paralogous sequences #1

mossmatters · 2019-05-17T18:22:24Z

The identification of potentially paralogous sequences is critical to accurate phylogenies. If there is unknown paralogy in an otherwise single-copy gene, recovering a consensus sequence may lead to phylogenies that are incorrect (if different samples have consensus sequences representing different paralogs) or unresolved (if the consensus sequence is a mixture of bases from the paralogs).

In HybPiper, "paralog warnings" are a byproduct of the assembly process. If multiple paralogs exist, the graph-based assembler (SPAdes) will assemble multiple contigs, and HybPiper identifies the presence of two or more contigs that map to the reference sequence. With the overlap assembler we can't rely on this. My ideas:

Use the overlap assembler to identify positions with variable base calls (e.g. from a samtools mpileup of the reads against the assembled sequence).
Filter out PCR and sequencing errors by requiring a minimum number of reads to support an ambiguous base call.
Produce statistics about the within-sequence variability that users can compile across many samples to determine if the variability may reflect paralogy.
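
A minimal sketch of the first two ideas, not an implementation of anything that exists in the pipeline yet. It assumes the pileup has already been decoded into one string of observed base calls per position; the function names, the input format, and the `min_reads=3` default are all hypothetical. A position counts as variable only if at least two different bases are each supported by `min_reads` reads, which is one simple way to discount isolated PCR or sequencing errors.

```python
from collections import Counter


def variable_positions(pileup_columns, min_reads=3):
    """Return (index, supported_bases) for each variable position.

    pileup_columns: list of strings, one per position, holding the base
    calls observed at that position (hypothetical, pre-decoded input).
    A position is variable when two or more distinct bases each have at
    least min_reads supporting reads.
    """
    variable = []
    for index, column in enumerate(pileup_columns):
        # Count only unambiguous base calls, ignoring case.
        counts = Counter(b.upper() for b in column if b.upper() in "ACGT")
        supported = sorted(b for b, n in counts.items() if n >= min_reads)
        if len(supported) >= 2:
            variable.append((index, supported))
    return variable


def variability_summary(pileup_columns, min_reads=3):
    """Summarize within-sequence variability for one gene in one sample."""
    sites = variable_positions(pileup_columns, min_reads)
    length = len(pileup_columns)
    return {
        "length": length,
        "variable_sites": len(sites),
        "fraction_variable": len(sites) / length if length else 0.0,
    }


# Toy example: only the second position has two well-supported bases;
# the single A at the third position falls below min_reads.
columns = ["AAAAAAAA", "CCCCTTTT", "GGGGGGGA", "TTTTTT"]
summary = variability_summary(columns, min_reads=3)
```

A fixed read count is the simplest possible filter; a threshold based on the fraction of reads at each position might behave better when coverage is uneven.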

This approach is an improvement over HybPiper in a few areas:

HybPiper cannot identify putative paralogs if the contigs do not represent a large portion of the targeted gene. Here, we would be summarizing variability across the whole gene, so even targeted loci with long introns could be flagged.
HybPiper makes an arbitrary distinction between heterozygosity and paralogy that can't be supported without building gene trees from many samples. Here, a summary statistic would be presented instead, allowing the user to decide whether paralogs exist.
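
One way the per-sample statistic could be compiled across samples is sketched below. The nested-dictionary input, the function name, and the choice of median and maximum as summaries are assumptions made for illustration. No cutoff is applied, so the decision about paralogy stays with the user.

```python
from statistics import median


def compile_by_gene(per_sample):
    """Collect a per-gene summary of variability across samples.

    per_sample: {sample_name: {gene_name: fraction_variable}}
    (hypothetical format; fraction_variable is the share of positions
    in that gene with well-supported variable base calls).
    """
    by_gene = {}
    for genes in per_sample.values():
        for gene, fraction in genes.items():
            by_gene.setdefault(gene, []).append(fraction)
    return {
        gene: {
            "n_samples": len(values),
            "median": median(values),
            "max": max(values),
        }
        for gene, values in sorted(by_gene.items())
    }


# Toy values: gene2 is consistently more variable than gene1.
per_sample = {
    "sample1": {"gene1": 0.0, "gene2": 0.04},
    "sample2": {"gene1": 0.002, "gene2": 0.05},
    "sample3": {"gene1": 0.0, "gene2": 0.03},
}
table = compile_by_gene(per_sample)
```

A gene that is variable in most samples might be a candidate paralog, while variability confined to a few samples might be more consistent with heterozygosity, but the table alone cannot separate the two.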


mossmatters mentioned this issue May 17, 2019

Retrieval of phased sequences #2

Open
