Skip to content

Sample QC

Matthew Suderman edited this page Jan 17, 2023 · 8 revisions

We have developed an R package called meffil to perform QC for data measured with Illumina 450k arrays. Please make sure you have installed version 1.0.0 or higher. You can check the version like this:

packageVersion("meffil")

It is using parallelization and reads .idat files directly. It is therefore much faster than minfi and uses much less memory.

Generating QC objects

Load meffil and set how many cores to use for parallelization

library(meffil)
options(mc.cores=16)

Generate a samplesheet with your samples. The samplesheet can be generated automatically from the idat basenames by giving the directory with idat files or it can be done manually. It should contain at least the following necessary columns: Sample_Name, Sex (possible values "M","F" or "NA") and Basename. Basename is the path to the idat file without the _Red.idat and _Grn.idat extension. It tries to parse the basenames to guess if the Sentrix plate and positions are present.

samplesheet <- meffil.create.samplesheet(path_to_idat_files, recursive=TRUE)

Please note:

  • If you have your idat files in multiple directories, then you can run meffil.create.samplesheet with the path set to the base directory and with recursive=T. This will search through all subdirectories to find idat files.
  • If this is not optimal for some reason (e.g. the base directory would include idat files from too many other projects), then you can create multiple samplesheets by calling meffil.create.sampleseheet for each directory and then use rbind to combine them all together into a single samplesheet.
  • If you retrieve idat files you do not want to analyse, then you can simply remove those corresponding rows from the sample sheet.
  • At this point it is worthwhile to manually modify the samplesheet data frame to replace the actual sample IDs in the Sample_Name column if necessary, and to add the sex values to the Sex column. It is fine to add additional columns to the samplesheet, but do not remove existing columns as they are required for downstream analyses.

Alternatively, if you have already a samplesheet with batch variables you can read it in using:

samplesheet <- meffil.read.samplesheet(base=".",pattern="mysamplesheet.csv")

Next perform the background correction, dye bias correction, sex prediction and cell count estimates. The meffil.qc function processes your idat files and returns a qc.object for each sample. You need to specify which cell type reference you need. You can find cell type references with the meffil.list.cell.type.references() function. Currently there are whole blood and cord blood references implemented. Cell count predictions are done using the Houseman et al.. meffil and minfi are all using the houseman algorithm for estimating cell counts. In meffil we do something slightly different by normalizing each sample individually to the cell type reference dataset rather than normalizing all samples in a dataset to the cell type reference in order to avoid having cell count estimates depend on the other samples being included in the normalization.

meffil.list.cell.type.references()
qc.objects <- meffil.qc(samplesheet, cell.type.reference="blood gse35069 complete", verbose=TRUE)
save(qc.objects,file="qc.objects.Robj")

You can check if your samples are there by checking the number of samples and extracting the sample names.

length(qc.objects)

Sample names can be checked like this

names(qc.objects)

You can check data for the first sample

names(qc.objects[[1]])
qc.objects[[1]]$sample.name

Obtaining genotype data to check for ID mismatches (if you have genotype array data available)

If you have genotype data available on the same individuals with methylation profiles you can check for ID mismatches. The methylation arrays have 65 SNPs which can be extracted from the methylation data. These 65 SNPs can be compared to genotypes measured with genotype arrays.

In R:

featureset <- qc.objects[[1]]$featureset
writeLines(meffil.snp.names(featureset), con="snp-names.txt")

'snp-names.txt' has the SNP ids from the SNPs on the methylation arrays. You now need to extract these SNPs from the genotype arrays. Using the UNIX command shell (you will need plink1.90, available here) you can can extract these SNPs from the genotype array profiled genotype data. You will need to have your genotype data in bed/bim/fam format (https://www.cog-genomics.org/plink/2.0/input#bed)

plink --bfile dataset --extract snp-names.txt --recode A --out genotypes

In R:

genotypes <- meffil.extract.genotypes("genotypes.raw")
genotypes <- genotypes[,match(samplesheet$Sample_Name, colnames(genotypes))]

Generating a QC report

We can now set some parameters for the QC of the raw data.

        qc.parameters <- meffil.qc.parameters(
	beadnum.samples.threshold             = 0.1,
	detectionp.samples.threshold          = 0.1,
	detectionp.cpgs.threshold             = 0.1, 
	beadnum.cpgs.threshold                = 0.1,
	sex.outlier.sd                        = 5,
	snp.concordance.threshold             = 0.95,
	sample.genotype.concordance.threshold = 0.8
)
  • beadnum.samples.threshold = maximum threshold on the fraction of probes with too few detected beads (minimum number of detected beads is defined by setting the beads.threshold parameter of meffil.qc() or meffil.normalize.dataset()). Samples with probe fractions above this are excluded from the final dataset.
  • detectionp.samples.threshold = maximum threshold on the fraction of undetected probes (probe detection is defined by setting the maximum probe detection p-value threshold parameter detection.threshold of meffil.qc() or meffil.normalize.dataset()). Samples with probe fractions above this are excluded from the final dataset.
  • beadnum.cpgs.threshold = same as beadnum.samples.threshold but used to identify poor quality probes in terms of the fraction of samples in which the probe has too few detected beads.
  • detectionp.cpgs.threshold = same as detectionp.cpgs.threshold but used to identify poor quality probes in terms of the fraction of samples in which the probe is undetected.
  • sex.outlier.sd = number of standard deviations to determine whether sample is sex outlier
  • snp.concordance.threshold = concordance threshold to include snps to calculate sample concordance
  • sample.genotype.concordance.threshold = concordance threshold to determine whether sample is outlier

We can now summarise the QC analysis of the raw data.

qc.summary <- meffil.qc.summary(
	qc.objects,
	parameters = qc.parameters,
	genotypes=genotypes
)

save(qc.summary, file="qcsummary.Robj")

We can also make a qc.report.

meffil.qc.report(qc.summary, output.file="qc-report.html")

This creates the file qc-report.html in the current work directory. The file should open up in your web browser.

Removing bad samples

After checking the qc report you can select bad samples to remove. I would remove all outliers in qc.summary$bad.samples except the outliers based on control probes as there are often a few probes for each control type only. For the control probe categories, I would remove slides with dye bias Control probe (dye.bias) which are captured by the normalisation probes (normC + normG)/(normA + normT) or bisulfite conversion control probes (Control probe (bisulfite1) and Control probe (bisulfite2)) .

outlier <- qc.summary$bad.samples
table(outlier$issue)
index <- outlier$issue %in% c("Control probe (dye.bias)", 
                              "Methylated vs Unmethylated",
                              "X-Y ratio outlier",
                              "Low bead numbers",
                              "Detection p-value",
                              "Sex mismatch",
                              "Genotype mismatch",
                              "Control probe (bisulfite1)",
                              "Control probe (bisulfite2)")

outlier <- outlier[index,]

You can remove bad samples prior to performing quantile normalization. Bad probes can be removed later on.

length(qc.objects)
qc.objects <- meffil.remove.samples(qc.objects, outlier$sample.name)
length(qc.objects)
save(qc.objects,file="qc.objects.clean.Robj")

Rerun QC summary on clean dataset

qc.summary <- meffil.qc.summary(qc.objects,parameters=qc.parameters,genotypes=genotypes)
save(qc.summary, file="qcsummary.clean.Robj")

Rerun QC report on clean dataset

meffil.qc.report(qc.summary, output.file="qc-report.clean.html")