Many genes in DNAm reference missing from DNAm matrix generated using constAvBetaTSS #7

Open
ghost opened this issue Nov 28, 2022 · 3 comments

ghost commented Nov 28, 2022

Hello,

Thank you for developing EpiSCORE. I have been using it to estimate cell-type proportions in healthy prostate and prostate tumour samples, and so have been using mrefProstate.m. I noticed that of the 163 genes in mrefProstate.m, 45 are missing from the output I get from constAvBetaTSS(). Why would almost one third of the genes be missing, and can I trust the resulting cell-type proportions calculated using wRPC()? As input to constAvBetaTSS(), I use a matrix derived from WGBS data in which I mapped CpG sites to the corresponding probes on the 450k array. A small minority of probes were missing, but not enough, I think, to explain why one third of the genes in the reference matrix are missing. I am using EpiSCORE 0.9.5 on R version 4.2.0.

Best wishes,

Richard

aet21 (Owner) commented Nov 28, 2022

Hi Richard,
Thanks for your enquiry.
You did not specify the depth of your WGBS data, but generally speaking I am not that surprised that you lose so many genes after QC and after requiring a match to the exact same CpGs. Because wRPC works at the level of genes, not CpGs, you should NOT demand an exact matching of CpGs. In this instance you should process your WGBS data so that DNAm is summarized at the level of gene promoters annotated to their genes, and then pass that matrix on to the wRPC function. In other words, if you have WGBS data, you don't need to (and you should not) use the constAvBetaTSS function! Unfortunately, we don't have an analogous function for WGBS data, so we expect users to generate the gene-promoter-level DNAm data matrix themselves. However, thanks for the feedback, as we will try to include such a function in a future release.
Let me also clarify that the DNAm reference for prostate has 122 usable genes, because 41 of the 163 have a weight of zero (for these we could not impute values confidently). When publishing the references we kept these zero-weight genes for future purposes although they are never used in the inference. It would be important to have at least 10 marker genes per cell-type after integration with your correctly processed WGBS data matrix.
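The usable-gene check described above can be sketched as follows. This is an illustration only, not EpiSCORE code: the `reference` dictionary, the gene names, the cell-type labels and the assignment of each gene to a single cell type are hypothetical stand-ins for whatever marker annotation you derive from the reference matrix. The sketch counts, per cell type, the genes that have a non-zero weight and are present in your data, and lists cell types that fall below a threshold.

```python
def usable_markers_per_cell_type(reference, genes_in_data):
    """Count genes per cell type that have weight > 0 and appear in the data.

    `reference` maps gene -> (cell_type, weight); this layout is an assumption
    made for the example, not the structure of mrefProstate.m.
    """
    counts = {}
    for gene, (cell_type, weight) in reference.items():
        counts.setdefault(cell_type, 0)  # keep cell types with zero usable genes
        if weight > 0 and gene in genes_in_data:
            counts[cell_type] += 1
    return counts


def under_represented(counts, min_markers=10):
    """Return the cell types with fewer than `min_markers` usable genes."""
    return sorted(ct for ct, n in counts.items() if n < min_markers)


# Toy, made-up data: GENE_B has weight zero, GENE_E is absent from the data.
reference = {
    "GENE_A": ("epithelial", 1.0),
    "GENE_B": ("epithelial", 0.0),
    "GENE_C": ("epithelial", 0.7),
    "GENE_D": ("fibroblast", 0.5),
    "GENE_E": ("fibroblast", 0.9),
}
genes_in_data = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}

counts = usable_markers_per_cell_type(reference, genes_in_data)
# A threshold of 2 is used only because the toy data set is tiny.
flagged = under_represented(counts, min_markers=2)
print(counts, flagged)
```

With real data the threshold would be the 10 marker genes per cell type mentioned above, applied after intersecting the reference with the gene-level WGBS matrix.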

Hope this helps,
A

ghost (Author) commented Nov 29, 2022

Hi Andrew,

The reason I went down the route of mapping CpG sites from WGBS data to 450K probes and using constAvBetaTSS() was that I wasn't sure exactly which TSS to use for the marker genes, and I thought that approach would ensure that the same sites were used to calculate promoter methylation values for both the input samples and mrefProstate.m. I usually work with TSS at transcript level rather than gene level, so could you please briefly explain how TSS were selected for each gene in constructing the reference matrices?

Anyway, I was left with 99 genes from mrefProstate.m with weights greater than 0. Would this be adequate to consider the results of wRPC trustworthy?

Thanks a lot for your help.

Richard

aet21 (Owner) commented Nov 29, 2022

Hi Richard,

Well, a common misconception is that we only use array data to find imputable genes. That is not so: we use WGBS+RNA-Seq data from the Epigenomics Roadmap as one source of inference, and then the array data from the SCM2 as another source, the overlap of imputable genes from both sources being quite substantial, as described in the paper.
For both databases we used a 200bp region upstream of the TSS, averaging DNAm over all CpGs in this region.
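A minimal sketch of that promoter summary for a single gene in one WGBS sample is given below. Several details are assumptions made for the example and are not confirmed in this thread: coordinates are 1-based, the TSS position itself is excluded from the window, "upstream" is resolved by strand, the mean is unweighted by read coverage, and the function names are hypothetical.

```python
def promoter_window(tss, strand, width=200):
    """Half-open interval [start, end) covering `width` bp upstream of the TSS."""
    if strand == "+":
        return tss - width, tss            # positions tss-width .. tss-1
    if strand == "-":
        return tss + 1, tss + width + 1    # positions tss+1 .. tss+width
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


def promoter_methylation(cpgs, tss, strand, width=200):
    """Unweighted mean methylation of the CpGs that fall in the upstream window.

    `cpgs` is an iterable of (position, beta) pairs on the gene's chromosome.
    Returns None when no CpG is covered, so the gene can be dropped or
    handled explicitly rather than silently set to zero.
    """
    start, end = promoter_window(tss, strand, width)
    values = [beta for pos, beta in cpgs if start <= pos < end]
    return sum(values) / len(values) if values else None


# Made-up CpG calls as (position, methylation fraction).
cpgs = [(900, 0.2), (950, 0.4), (1000, 0.9), (1100, 0.8)]

plus_strand = promoter_methylation(cpgs, tss=1000, strand="+")   # uses 900 and 950
minus_strand = promoter_methylation(cpgs, tss=1000, strand="-")  # uses 1100
uncovered = promoter_methylation(cpgs, tss=5000, strand="+")     # no CpGs in window
print(plus_strand, minus_strand, uncovered)
```

Applying this per gene and per sample yields a gene-by-sample matrix of promoter methylation values. Which TSS to use for genes with several transcripts is not specified here, so that choice remains with the user.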

Again, to answer your question about being trustworthy, that depends on the distribution of the 99 genes across the cell-types. If there are at least 10 marker genes per cell-type, it should be ok. A suggestion: try to validate the DNAm reference you got over those 99 genes in the TCGA prostate 450k set in terms of tumor purity and immune-cell infiltration to gain some confidence.

kr
A
