Pitfalls in PennCNV HMM

Many studies that use PennCNV to predict CNVs have reported a large number of false positives. Many factors can cause this; here we describe the most relevant ones, which account for the majority of false positives observed in the iPSYCH project. Variation in hybridization intensity can mislead methods into suggesting a CNV. Genomic regions with high GC content are more likely to show lower hybridization, because the energy necessary for DNA denaturation is higher. It is therefore common to observe a lower Log R ratio (LRR) in those regions than in the rest of the chromosome, and an LRR lower than the rest of the chromosome suggests a deletion.

However, for a deletion to be true, its B allele frequency (BAF) should be homozygous (BAF values of 0 or 1). Noise in BAF can create false heterozygous values (> 0 and < 1), but in high-quality samples the amount of noise is below 5%. Any method that predicts CNVs should therefore use the heterozygosity level to avoid false positives when LRR deviates from its expected value. Unfortunately, the PennCNV model does not make good use of BAF information and allows up to 25% heterozygosity in a deletion. This is often observed at telomeric and centromeric regions. To compensate for this weakness in the model, it is suggested to remove those regions from the analysis; see the PennCNV user guide.
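The BAF check described above can be sketched in a few lines of R. This is a minimal illustration, not part of PennCNV or iPsychCNV: the function names `HetFraction` and `IsPlausibleDeletion`, the 0.2 to 0.8 heterozygous band, and the LRR cutoff of -0.2 are assumptions chosen for the example. Only the 5% noise ceiling comes from the text.

```r
# Fraction of markers whose BAF falls in a heterozygous band.
# The band limits (0.2, 0.8) are an illustrative assumption.
HetFraction <- function(baf, lower = 0.2, upper = 0.8) {
  baf <- baf[!is.na(baf)]
  mean(baf > lower & baf < upper)
}

# A region is a plausible deletion only if LRR is low AND BAF is
# essentially homozygous. MaxHet = 0.05 follows the ~5% noise level
# quoted above; LrrCutoff = -0.2 is a hypothetical threshold.
IsPlausibleDeletion <- function(lrr, baf, LrrCutoff = -0.2, MaxHet = 0.05) {
  mean(lrr, na.rm = TRUE) < LrrCutoff && HetFraction(baf) <= MaxHet
}

set.seed(1)
n <- 500

# Simulated true deletion: low LRR, BAF only at 0 or 1.
TrueDel <- list(lrr = rnorm(n, mean = -0.45, sd = 0.1),
                baf = sample(c(0, 1), size = n, replace = TRUE))

# Simulated artifact: same low LRR, but 30% of markers are heterozygous.
FalseDel <- TrueDel
FalseDel$baf[1:150] <- 0.5

IsPlausibleDeletion(TrueDel$lrr, TrueDel$baf)    # TRUE
IsPlausibleDeletion(FalseDel$lrr, FalseDel$baf)  # FALSE
```

Requiring both conditions means that a drop in LRR alone, such as one caused by GC content, is not enough to accept a deletion call.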

Example 1: When the PennCNV model fails for a deletion using real data.

Example of a true and a false deletion predicted by PennCNV.

Example 2: When the PennCNV model fails for a duplication using real data.

Example of a true and a false duplication predicted by PennCNV.

Determining the maximum heterozygosity accepted by the PennCNV model.

library(iPsychCNV)

CNVMap = MakeLongMockSample(Heterozygosity=0.1, CNVDistance=1000, Type=c(1,2,3), Size=500, Mean=c(-0.45, 0, 0.3))

Sample = ReadSample('LongMockSample.tab')

CNVs = RunPennCNV(PathRawData='.', Pattern='LongMockSample.tab', Skip=0, HMM='/services/tools/PennCNV-1.0.3/lib/hh550.hmm',
Path2PennCNV='/services/tools/PennCNV-1.0.3/', Cores=1)

PlotLRRAndCNVs(CNV=CNVs, Sample=Sample, Name='PennCNV_Heterozygosity_10.png', Roi=CNVMap)

sum(Sample$B.Allele.Freq[2000:2500] > 0.4 & Sample$B.Allele.Freq[2000:2500] < 0.6)/500 # 0.14

Example of a false deletion predicted by PennCNV using simulated data.

# Repeating the same steps with higher heterozygosity removes the false positive prediction at the second CNV.

CNVMap = MakeLongMockSample(Heterozygosity=0.3, CNVDistance=1000, Type=c(1,2,3), Size=500, Mean=c(-0.45, 0, 0.3))

Sample = ReadSample('LongMockSample.tab')

CNVs = RunPennCNV(PathRawData='.', Pattern='LongMockSample.tab', Skip=0, HMM='/services/tools/PennCNV-1.0.3/lib/hh550.hmm',
Path2PennCNV='/services/tools/PennCNV-1.0.3/', Cores=1)

PlotLRRAndCNVs(CNV=CNVs, Sample=Sample, Name='PennCNV_Heterozygosity_30.png', Roi=CNVMap)

sum(Sample$B.Allele.Freq[2000:2500] > 0.4 & Sample$B.Allele.Freq[2000:2500] < 0.6)/500 # 0.33

Now let's make PennCNV call the first CNV (CN = 1) as CN = 2 by increasing the heterozygosity in its BAF.

# Define an index where BAF will be heterozygous.
Indx = seq(from=1000, to=1500, by=5) # This should give 20% heterozygosity.

Sample$B.Allele.Freq[Indx] = rnorm(n=length(Indx), mean=0.5, sd=0.03)
write.table(Sample, file='LongMockSample.tab', sep='\t', row.names=FALSE, quote=FALSE)

CNVs = RunPennCNV(PathRawData='.', Pattern='LongMockSample.tab', Skip=0, HMM='/services/tools/PennCNV-1.0.3/lib/hh550.hmm',
Path2PennCNV='/services/tools/PennCNV-1.0.3/', Cores=1)

PlotLRRAndCNVs(CNV=CNVs, Sample=Sample, Name='PennCNV_Heterozygosity_20.png', Roi=CNVMap)

# Let's increase it even more! But first we need to remove all heterozygosity from the BAF in this region.
Sample$B.Allele.Freq[1000:1500] = sample(x=c(0,1), size=501, replace=TRUE)

# Define an index where BAF will be heterozygous.
Indx = seq(from=1000, to=1500, by=4) # This should give 25% heterozygosity.

Sample$B.Allele.Freq[Indx] = rnorm(n=length(Indx), mean=0.5, sd=0.03)
write.table(Sample, file='LongMockSample.tab', sep='\t', row.names=FALSE, quote=FALSE)

CNVs = RunPennCNV(PathRawData='.', Pattern='LongMockSample.tab', Skip=0, HMM='/services/tools/PennCNV-1.0.3/lib/hh550.hmm',
Path2PennCNV='/services/tools/PennCNV-1.0.3/', Cores=1)

PlotLRRAndCNVs(CNV=CNVs, Sample=Sample, Name='PennCNV_Heterozygosity_25.png', Roi=CNVMap)
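The heterozygosity levels quoted in the comments above follow from the index arithmetic, which can be checked without running PennCNV. The sketch below is self-contained and uses a plain numeric vector in place of `Sample$B.Allele.Freq`. The helper `InjectHet` is a hypothetical name introduced here for illustration, and the 0.4 to 0.6 band mirrors the check used earlier on this page.

```r
# Replace every `by`-th marker in [from, to] with a heterozygous BAF value
# and return the modified vector. Hypothetical helper, not an iPsychCNV function.
InjectHet <- function(baf, from, to, by, sd = 0.03) {
  indx <- seq(from = from, to = to, by = by)
  baf[indx] <- rnorm(n = length(indx), mean = 0.5, sd = sd)
  baf
}

set.seed(42)
# Homozygous background standing in for Sample$B.Allele.Freq.
baf <- sample(x = c(0, 1), size = 3000, replace = TRUE)

region <- 1000:1500   # 501 markers, both ends included
baf20 <- InjectHet(baf, from = 1000, to = 1500, by = 5)
baf25 <- InjectHet(baf, from = 1000, to = 1500, by = 4)

# Expected fractions from the index counts alone.
length(seq(1000, 1500, by = 5)) / length(region)  # 101/501, about 0.20
length(seq(1000, 1500, by = 4)) / length(region)  # 126/501, about 0.25

# Observed fractions in the 0.4-0.6 band; these can fall slightly below
# the expected values if a simulated BAF lands outside the band.
mean(baf20[region] > 0.4 & baf20[region] < 0.6)
mean(baf25[region] > 0.4 & baf25[region] < 0.6)
```

Because `1000:1500` contains 501 markers rather than 500, the fractions are close to, but not exactly, 20% and 25%.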

Challenges

Amplified DNA from dried blood spots (DBS) poses a number of challenges for copy number variation detection. Here we describe some of the challenges one can encounter when working with DBS data.

Tools

The iPsychCNV package offers a series of tools that can be used in the CNV prediction pipeline, but also independently, together with other programs.

Classification

Evaluating CNV prediction performance is an important step when comparing methods. Here we describe how binary classification is used to evaluate a method's performance.

Methods

iPsychCNV uses many different methods to perform a series of functions. Here we describe in detail the methods used by iPsychCNV.

GitHub

iPsychCNV is an open source R package project. People are welcome to give suggestions, code new functions and/or improve existing ones. The source code is available on GitHub.

About iPsychCNV

iPsychCNV is a method to find copy number variation in amplified DNA from dried blood spots on Illumina SNP arrays. It is designed to handle large variation in Log R ratio, and uses B allele frequency to improve CNV calls. iPsychCNV is an open source project on GitHub.

About iPSYCH

The project will study five specific mental disorders: autism, ADHD, schizophrenia, bipolar disorder and depression. All of these disorders are associated with major human and societal costs all over the world. The iPSYCH project will study them from many different angles, ranging from genes and cells to population studies, from fetus to adult, and from cause to symptoms of the disorder, and this knowledge will be combined in new ways across scientific fields. For more information, visit iPSYCH.
