Extracting structural variants

Matthew Suderman edited this page Mar 26, 2019 · 7 revisions

To generate copy number variants from Illumina 450k methylation arrays, you will need the original .idat files for your samples, because copy number variation is estimated from the array intensities.

Load meffil and set how many cores to use for parallelization:

library(meffil)
options(mc.cores=16)
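The value 16 above suits a machine with at least that many cores. As a minimal sketch, assuming you would rather derive the setting from the machine itself, the core count can be detected with the base `parallel` package. Leaving one core free is a convention chosen here, not a meffil requirement.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1.
n_detected <- detectCores()
if (is.na(n_detected)) n_detected <- 1L

# Leave one core free for the rest of the system, but never go below 1.
n_cores <- max(1L, n_detected - 1L)
options(mc.cores = n_cores)
```

On a shared cluster the detected number may exceed what your job has been allocated, in which case setting the value by hand is safer.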

Generate a samplesheet for your samples. It can be generated automatically from the idat basenames by giving the directory that contains the idat files, or it can be created manually. It must contain at least the following columns: Sample_Name, Sex (possible values M, F or NA) and Basename. When generating the samplesheet automatically, meffil also tries to parse the basenames to guess the Sentrix plate and positions, if they are present.

samplesheet <- meffil.create.samplesheet("/path/to/idat/files")

At this point please ensure that the Sample_Name column contains the actual sample IDs that are being used for the other data types. Please also add the sex values to the Sex column. Don't change these column names though.
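One way to fill in these two columns is to match the samplesheet against a phenotype table. The sketch below uses toy data frames rather than real meffil output. It assumes that Sample_Name initially holds the array identifier, and the `pheno` table with its `array_id`, `study_id` and `sex` columns is hypothetical.

```r
# Toy stand-in for the output of meffil.create.samplesheet(); the real
# object has more columns (e.g. Basename), which are omitted here.
samplesheet <- data.frame(
  Sample_Name = c("200001_R01C01", "200001_R02C01", "200001_R03C01"),
  Sex = NA_character_,
  stringsAsFactors = FALSE
)

# Hypothetical phenotype table linking array IDs to study IDs and sex.
pheno <- data.frame(
  array_id = c("200001_R02C01", "200001_R01C01", "200001_R03C01"),
  study_id = c("S002", "S001", "S003"),
  sex = c("F", "M", NA),
  stringsAsFactors = FALSE
)

# match() preserves the row order of the samplesheet.
idx <- match(samplesheet$Sample_Name, pheno$array_id)
stopifnot(!anyNA(idx))  # every array must appear in the phenotype table

samplesheet$Sex <- pheno$sex[idx]
samplesheet$Sample_Name <- pheno$study_id[idx]

# Sanity checks: only the allowed sex codes, and no duplicated sample IDs.
stopifnot(all(samplesheet$Sex %in% c("M", "F", NA)))
stopifnot(!anyDuplicated(samplesheet$Sample_Name))
```

Only column values are modified here, so the column names that meffil expects stay intact.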

Copy number is estimated by comparison to a reference dataset. One is available from the Bioconductor package CopyNumber450kData. To use it, ensure that the following packages are installed:

BiocManager::install("IlluminaHumanMethylation450kmanifest")
BiocManager::install("IlluminaHumanMethylation450kanno.ilmn12.hg19")
BiocManager::install("CopyNumber450kData")

Note: use 'biocLite' instead of 'BiocManager::install' for older installations of Bioconductor.

CopyNumber450kData is not available in the most recent versions of Bioconductor. If the install fails, then you can install it from source. First download the source file:

https://bioc.ism.ac.jp/packages/3.3/data/experiment/src/contrib/CopyNumber450kData_1.8.0.tar.gz

Then install the package, either from the command line:

R CMD INSTALL CopyNumber450kData_1.8.0.tar.gz

or in R:

install.packages("CopyNumber450kData_1.8.0.tar.gz", repos = NULL, type="source")

(this assumes that the file was downloaded to your current working directory).
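Before moving on, it may help to confirm which of the packages are actually present. The following is a minimal sketch using base R only. It reports what is missing and does not install anything.

```r
# Packages named in the installation steps above.
pkgs <- c("IlluminaHumanMethylation450kmanifest",
          "IlluminaHumanMethylation450kanno.ilmn12.hg19",
          "CopyNumber450kData")

# requireNamespace() returns FALSE instead of raising an error when a
# package is absent. Note that it loads the namespace of any package it finds.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All reference packages are installed.")
}
```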

Once installed, make the data available to meffil:

library(CopyNumber450kData)
controls <- meffil.add.copynumber450k.references()

Now estimate the CNVs:

cnv_values <- meffil.calculate.cnv(samplesheet, cnv.reference="copynumber450k", verbose=TRUE)

A matrix of genetic copy number variation at each probe can now be generated:

cnv <- meffil.cnv.matrix(cnv_values)

Please save this object to the godmc/input_data folder:

save(cnv, file="/path/to/godmc/input_data/cnv.RData")

and make sure that the name of the object you are saving is cnv, as this is the name that the pipeline expects. For ARIES, comprising 5469 samples, it took 30 hours to extract CNVs using 6 cores. It takes about 30 seconds for each sample.
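To check that a saved file really contains an object called cnv, it can be loaded into a fresh environment. The sketch below uses a toy matrix and a temporary file in place of the real output. The values and the probe and sample names are made up for illustration.

```r
# Toy stand-in for the matrix returned by meffil.cnv.matrix();
# values and dimnames are invented.
cnv <- matrix(c(0.1, -0.2, 0.0, 0.3), nrow = 2,
              dimnames = list(c("probe1", "probe2"), c("S001", "S002")))

# A temporary file stands in for /path/to/godmc/input_data/cnv.RData.
out_file <- tempfile(fileext = ".RData")
save(cnv, file = out_file)

# load() returns the names of the objects it restored, so loading into
# a fresh environment shows exactly what the file contains.
env <- new.env()
loaded <- load(out_file, envir = env)
stopifnot(identical(loaded, "cnv"))
stopifnot(is.matrix(env$cnv), is.numeric(env$cnv))
```

If `loaded` contains any other name, re-save the matrix after assigning it to a variable called cnv.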