-
Notifications
You must be signed in to change notification settings - Fork 28
Sample QC
We have developed an R package called meffil to perform QC for data measured with Illumina 450k arrays. Please make sure you have installed version 1.0.0 or higher. You can check the version like this:
packageVersion("meffil")
It is using parallelization and reads .idat
files directly. It is therefore much faster than minfi
and uses much less memory.
Load meffil and set how many cores to use for parallelization
library(meffil)
options(mc.cores=16)
Generate a samplesheet with your samples. The samplesheet can be generated automatically from the idat
basenames by giving the directory with idat
files or it can be done manually. It should contain at least the following necessary columns: Sample_Name
, Sex
(possible values "M","F" or "NA") and Basename
.
Basename
is the path to the idat file without the _Red.idat and _Grn.idat extension. It tries to parse the basenames to guess if the Sentrix plate and positions are present.
samplesheet <- meffil.create.samplesheet(path_to_idat_files, recursive=TRUE)
Please note:
- If you have your idat files in multiple directories, then you can run
meffil.create.samplesheet
with the path set to the base directory and withrecursive=T
. This will search through all subdirectories to find idat files. - If this is not optimal for some reason (e.g. the base directory would include idat files from too many other projects), then you can create multiple samplesheets by calling
meffil.create.sampleseheet
for each directory and then userbind
to combine them all together into a single samplesheet. - If you retrieve idat files you do not want to analyse, then you can simply remove those corresponding rows from the sample sheet.
- At this point it is worthwhile to manually modify the samplesheet data frame to replace the actual sample IDs in the
Sample_Name
column if necessary, and to add the sex values to theSex
column. It is fine to add additional columns to the samplesheet, but do not remove existing columns as they are required for downstream analyses.
Alternatively, if you have already a samplesheet with batch variables you can read it in using:
samplesheet <- meffil.read.samplesheet(base=".",pattern="mysamplesheet.csv")
Next perform the background correction, dye bias correction, sex prediction and cell count estimates. The meffil.qc
function processes your idat
files and returns a qc.object for each sample. You need to specify which cell type reference you need. You can find cell type references with the meffil.list.cell.type.references()
function. Currently there are whole blood and cord blood references implemented. Cell count predictions are done using the Houseman et al.. meffil and minfi are all using the houseman algorithm for estimating cell counts. In meffil we do something slightly different by normalizing each sample individually to the cell type reference dataset rather than normalizing all samples in a dataset to the cell type reference in order to avoid having cell count estimates depend on the other samples being included in the normalization.
meffil.list.cell.type.references()
qc.objects <- meffil.qc(samplesheet, cell.type.reference="blood gse35069 complete", verbose=TRUE)
save(qc.objects,file="qc.objects.Robj")
You can check if your samples are there by checking the number of samples and extracting the sample names.
length(qc.objects)
Sample names can be checked like this
names(qc.objects)
You can check data for the first sample
names(qc.objects[[1]])
qc.objects[[1]]$sample.name
If you have genotype data available on the same individuals with methylation profiles you can check for ID mismatches. The methylation arrays have 65 SNPs which can be extracted from the methylation data. These 65 SNPs can be compared to genotypes measured with genotype arrays.
In R:
featureset <- qc.objects[[1]]$featureset
writeLines(meffil.snp.names(featureset), con="snp-names.txt")
'snp-names.txt' has the SNP ids from the SNPs on the methylation arrays. You now need to extract these SNPs from the genotype arrays. Using the UNIX command shell (you will need plink1.90
, available here) you can can extract these SNPs from the genotype array profiled genotype data. You will need to have your genotype data in bed/bim/fam format (https://www.cog-genomics.org/plink/2.0/input#bed)
plink --bfile dataset --extract snp-names.txt --recode A --out genotypes
In R:
genotypes <- meffil.extract.genotypes("genotypes.raw")
genotypes <- genotypes[,match(samplesheet$Sample_Name, colnames(genotypes))]
We can now set some parameters for the QC of the raw data.
qc.parameters <- meffil.qc.parameters(
beadnum.samples.threshold = 0.1,
detectionp.samples.threshold = 0.1,
detectionp.cpgs.threshold = 0.1,
beadnum.cpgs.threshold = 0.1,
sex.outlier.sd = 5,
snp.concordance.threshold = 0.95,
sample.genotype.concordance.threshold = 0.8
)
-
beadnum.samples.threshold
= maximum threshold on the fraction of probes with too few detected beads (minimum number of detected beads is defined by setting thebeads.threshold
parameter ofmeffil.qc()
ormeffil.normalize.dataset()
). Samples with probe fractions above this are excluded from the final dataset. -
detectionp.samples.threshold
= maximum threshold on the fraction of undetected probes (probe detection is defined by setting the maximum probe detection p-value threshold parameterdetection.threshold
ofmeffil.qc()
ormeffil.normalize.dataset()
). Samples with probe fractions above this are excluded from the final dataset. -
beadnum.cpgs.threshold
= same asbeadnum.samples.threshold
but used to identify poor quality probes in terms of the fraction of samples in which the probe has too few detected beads. -
detectionp.cpgs.threshold
= same asdetectionp.cpgs.threshold
but used to identify poor quality probes in terms of the fraction of samples in which the probe is undetected. -
sex.outlier.sd
= number of standard deviations to determine whether sample is sex outlier -
snp.concordance.threshold
= concordance threshold to include snps to calculate sample concordance -
sample.genotype.concordance.threshold
= concordance threshold to determine whether sample is outlier
We can now summarise the QC analysis of the raw data.
qc.summary <- meffil.qc.summary(
qc.objects,
parameters = qc.parameters,
genotypes=genotypes
)
save(qc.summary, file="qcsummary.Robj")
We can also make a qc.report
.
meffil.qc.report(qc.summary, output.file="qc-report.html")
This creates the file qc-report.html
in the current work directory. The file should open up in your web browser.
After checking the qc report you can select bad samples to remove. I would remove all outliers in qc.summary$bad.samples
except the outliers based on control probes as there are often a few probes for each control type only. For the control probe categories, I would remove slides with dye bias Control probe (dye.bias)
which are captured by the normalisation probes (normC + normG)/(normA + normT)
or bisulfite conversion control probes (Control probe (bisulfite1)
and Control probe (bisulfite2)
) .
outlier <- qc.summary$bad.samples
table(outlier$issue)
index <- outlier$issue %in% c("Control probe (dye.bias)",
"Methylated vs Unmethylated",
"X-Y ratio outlier",
"Low bead numbers",
"Detection p-value",
"Sex mismatch",
"Genotype mismatch",
"Control probe (bisulfite1)",
"Control probe (bisulfite2)")
outlier <- outlier[index,]
You can remove bad samples prior to performing quantile normalization. Bad probes can be removed later on.
length(qc.objects)
qc.objects <- meffil.remove.samples(qc.objects, outlier$sample.name)
length(qc.objects)
save(qc.objects,file="qc.objects.clean.Robj")
Rerun QC summary on clean dataset
qc.summary <- meffil.qc.summary(qc.objects,parameters=qc.parameters,genotypes=genotypes)
save(qc.summary, file="qcsummary.clean.Robj")
Rerun QC report on clean dataset
meffil.qc.report(qc.summary, output.file="qc-report.clean.html")
- Installation
- Sample QC
- Functional normalization
- Functional normalizing separate datasets
- Extracting structural variants
- Estimating cellular composition
- Removing chrX and chrY probes
- Running EWAS
- Extracting CpG annotations
- Extracting SNP annotations
- Extracting detection p-values
- Extracting methylated and unmethylated intensities
- Generate normalization report from normalised betas
- Full pipeline for analysing massive datasets
- Common problems
- Citation