- Genotypes
- Allele frequencies and genotype frequencies
- Hardy-Weinberg equilibrium
- Linkage disequilibrium
- Genetic effects on phenotypes
- The additive model
- Dominance
- Epistasis
- Gene-Environment Interaction
- Gene-Environment Correlation
April 2016
A biallelic locus can lead to 3 genotypes
Haplotypes: A, C
Genotypes: AA, AC, CC
\(Pr(A) = p\)
\(Pr(C) = q = 1-p\)
\(Pr(AA) = p^2\)
\(Pr(AC) = 2*p*q\)
\(Pr(CC) = q^2\)
x = seq(0,1,len=1e3) plot(x, 2*x*(1-x), ylim=c(0,1), col='red', xlab='allele frequency', ylab='genotype frequency') points(x, x^2, col='blue') points(x, (1-x)^2, col='green')
set.seed(634) n = 1e3 # 1000 people p = 0.2 # MAF 20% haplo = sample(0:1, n, replace=T, prob=c(1-p, p)) head(haplo, 20)
## [1] 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0
table(haplo)
## haplo ## 0 1 ## 809 191
from_mum = sample(0:1, n, replace=T, prob=c(1-p, p)) from_dad = sample(0:1, n, replace=T, prob=c(1-p, p)) geno = from_mum + from_dad head(geno, 20)
## [1] 0 1 0 1 1 1 1 0 0 1 1 1 2 1 1 0 0 1 0 0
table(geno)
## geno ## 0 1 2 ## 629 327 44
# shorter way to calculate the same thing: geno = rbinom(n, 2, p) head(geno, 20)
## [1] 0 0 1 0 1 0 2 2 0 1 1 1 2 1 0 0 0 0 1 1
table(geno)
## geno ## 0 1 2 ## 644 308 48
(p_est = sum(geno)/(2*length(geno)))
## [1] 0.202
2*p*(1-p)
## [1] 0.32
2*p_est*(1-p_est)
## [1] 0.322392
var(geno)
## [1] 0.3371211
\[ \chi^2 = \sum_i \frac{(O_i-E_i)^2}{E_i} \]
Distribution of a sum of the squares of k independent standard normal random variables.
k = degrees of freedom
In our case k = 1 (# genotypes - # alleles)
hist(rnorm(1e4)^2, 200, freq=F, xlim=c(0,5), main='') plot(function(x) dchisq(x, df=1), add=T, xlim=c(0,5), col='red')
(observed = table(geno))
## geno ## 0 1 2 ## 644 308 48
expected_0 = n * (1-p_est)^2 expected_1 = n * 2 * p_est * (1-p_est) expected_2 = n * p_est^2 (expected = c(expected_0, expected_1, expected_2))
## [1] 636.804 322.392 40.804
chisq = sum( (observed - expected)^2/expected ) pval = 1-pchisq(chisq, 1) c(chisq, pval)
## [1] 1.9928462 0.1580436
seq1 = seq(1e-6,chisq,len=1e4) seq2 = seq(chisq,100,len=1e4) plot(function(x) dchisq(x, 1), xlim=c(0,4), ylim=c(0,4)) polygon(list(x=c(0, seq1, chisq), y=c(0, dchisq(seq1, 1), 0)), col='blue', border=NA) polygon(list(x=c(chisq, seq2), y=c(0, dchisq(seq2, 1))), col='red', border=NA)
set.seed(123) m = 1e4 mafs = runif(m, 0, 0.5) hist(mafs, 30, col='blue')
genotype = sapply(1:m, function(i) rbinom(n, 2, mafs[i])) genotype[1:10, 1:5]
## [,1] [,2] [,3] [,4] [,5] ## [1,] 0 0 0 1 0 ## [2,] 0 1 0 0 0 ## [3,] 1 0 1 1 1 ## [4,] 0 2 2 1 0 ## [5,] 0 0 0 2 2 ## [6,] 0 1 0 1 2 ## [7,] 1 1 0 2 1 ## [8,] 1 1 1 0 1 ## [9,] 1 0 1 0 2 ## [10,] 0 0 0 0 2
dim(genotype)
## [1] 1000 10000
There may be monomorphic SNPs!
head(sort(colSums(genotype)), 20)
## [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
mafs = runif(2*m, 0, 0.5) genotype = sapply(1:(2*m), function(i) rbinom(n, 2, mafs[i])) select = which(colSums(genotype) > 0)[1:m] genotype = genotype[,select] mafs = mafs[select]
map = data.frame(chr=1, snp=paste0('rs', 1:m), pos1=1:m, pos2=1:m) head(map)
## chr snp pos1 pos2 ## 1 1 rs1 1 1 ## 2 1 rs2 2 2 ## 3 1 rs3 3 3 ## 4 1 rs4 4 4 ## 5 1 rs5 5 5 ## 6 1 rs6 6 6
genoletters = c('A A', 'A B', 'B B') ped = data.frame(FID=1:n, IID=1:n, pat=0, mat=0, sex=1, pheno=-9, matrix(genoletters[genotype+1], n, m)) genotype[1:4, 1:9]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] ## [1,] 1 2 1 0 1 0 0 1 0 ## [2,] 2 2 0 0 0 0 0 1 1 ## [3,] 1 1 0 0 0 0 0 1 0 ## [4,] 1 0 1 0 1 0 0 1 0
ped[1:4, 1:15]
## FID IID pat mat sex pheno X1 X2 X3 X4 X5 X6 X7 X8 X9 ## 1 1 1 0 0 1 -9 A B B B A B A A A B A A A A A B A A ## 2 2 2 0 0 1 -9 B B B B A A A A A A A A A A A B A B ## 3 3 3 0 0 1 -9 A B A B A A A A A A A A A A A B A A ## 4 4 4 0 0 1 -9 A B A A A B A A A B A A A A A B A A
write.table(map, 'test.map', row.names=F, col.names=F, quote=F) write.table(ped, 'test.ped', row.names=F, col.names=F, quote=F)
#system('curl http://pngu.mgh.harvard.edu/~purcell/plink/dist/ # plink-1.07-mac-intel.zip -o plink-1.07-mac-intel.zip') #system('unzip -o plink-1.07-mac-intel.zip') system('plink-1.07-mac-intel/plink --file test --hardy2 --out test --noweb')
hwe = read.table('test.hwe', stringsAsFactors = F, header=T) hwe = hwe[hwe$TEST == 'ALL',] head(hwe)
## CHR SNP TEST A1 A2 GENO O.HET. E.HET. P ## 1 1 rs1 ALL B A 218/518/264 0.518 0.4989 0.2271 ## 4 1 rs2 ALL B A 170/479/351 0.479 0.4836 0.7626 ## 7 1 rs3 ALL B A 45/324/631 0.324 0.3283 0.6786 ## 10 1 rs4 ALL B A 28/259/713 0.259 0.2654 0.4466 ## 13 1 rs5 ALL B A 71/410/519 0.410 0.3996 0.4127 ## 16 1 rs6 ALL B A 12/158/830 0.158 0.1654 0.1551
p_est = apply(genotype, 2, function(x) sum(x/(2*n))) plot(mafs, p_est); abline(0,1)
e_0 = (1-p_est)^2 e_1 = 2 * p_est * (1-p_est) e_2 = p_est^2 e = n * cbind(e_0, e_1, e_2) head(e, 3)
## e_0 e_1 e_2 ## [1,] 273.5290 498.9420 227.5290 ## [2,] 348.6903 483.6195 167.6903 ## [3,] 628.8490 328.3020 42.8490
o = t(apply(genotype, 2, function(x) c(sum(x==0), sum(x==1), sum(x==2)))) head(o, 3)
## [,1] [,2] [,3] ## [1,] 264 518 218 ## [2,] 351 479 170 ## [3,] 631 324 45
chisq = sapply(1:m, function(i) sum( (o[i,] - e[i,])^2/e[i,] )) pval = 1-pchisq(chisq, 1) head(chisq)
## [1] 1.45899740 0.09123938 0.17170928 0.57929649 0.67095476 2.02135133
head(pval)
## [1] 0.2270897 0.7626074 0.6785972 0.4465882 0.4127189 0.1551011
plot(pval, hwe$P); abline(0,1)
hist(hwe$P)
qqPlot(hwe$P)
plot(hwe$O.HET., hwe$P)
BB = as.numeric(gsub('/.+', '', hwe$GENO)) plot(hwe$O.HET., hwe$P, col=BB+1)
plot3d(hwe$O.HET., BB, hwe$P)
You must enable Javascript to view this page properly.
\(p_A\) = MAF at locus 1
\(p_B\) = MAF at locus 2
\(p_{AB}\) = probablility of A and B occuring together
\(D_{AB} = p_{AB} - p_A p_B\)
\(D' = D/D_{max}\)
\(r = \frac{D}{\sqrt{p_A (1-p_A) p_B (1-p_B)}}\)
ldmat = cor(genotype[,1:100]) cols = c('black', 'green', 'blue', 'red', 'white') image(ldmat, col=colorRampPalette(colors=cols)(30))
hist(ldmat, 100)
jitter_geno = function(g, dummy, frac=0.1, permuteprob=0.3) { change = sample(-1:1, length(g), replace=T, prob=c(frac/2, (1-frac), frac/2)) g = g+change if(runif(1) < permuteprob) g = sample(g) g[g==3] = 0 g[g==-1] = 2 g } g = rbinom(n, 2, .1) g = do.call(cbind, Reduce(jitter_geno, replicate(100, g, simplify=F), accumulate=T)) ldmat = cor(g)
image(ldmat, col=colorRampPalette(colors=cols)(30))
hist(ldmat, 100)
\[ P = G + E \]
No dominance effects
No epistasis
No Gene-Environment Interaction
No dominance: effect of heterozygotes exactly halfway between homozygotes
Dominance: everything else
\[ \sigma^2_G = \sigma^2_A + \sigma^2_D \]
\[ P = g_1 + g_2 + \dots + g_1 \times g_2 + e_1 + e_2 + \dots \]
snps = 1e6 choose(snps, 2) # number of SNP pairs
## [1] 499999500000
choose(snps, 3) # number of SNP triplets
## [1] 1.666662e+17
\[ P = G + E + G \times E \]
Very much dependent on scale of phenotype!
Statistical interaction completely different from biological interaction.
Gene-Environment Interaction means the combination of G and E is more informative for the phenotype than just the sum of both
Gene-Environment Correlation is not concerned with the phenotype
Three types
h2 = 0.8 beta = rnorm(m, 0, sqrt(h2/m)) g = scale(genotype) bv = g %*% beta e = rnorm(n, 0, sqrt(1-h2)) pheno = bv + e var(pheno)
## [,1] ## [1,] 0.9907548
\[ \hat{\beta} = \frac{X^T y}{n} \]
beta_est = (t(genotype) %*% pheno) / n cor(beta, beta_est[,1])
## [1] 0.261639
\[ R^2 = \frac{h^2}{1+\frac{m}{n * h^2} (1-R^2) } \]
sqrt(h2/(1+m/(n*h2)))
## [1] 0.2434322
l = .07 plot(beta, beta_est, xlim=c(-l,l), ylim=c(-l,l)); abline(0,1) suppressMessages(library(MASS)) z = kde2d(beta, beta_est, n=50) contour(z, col=cols, add=T, drawlabels = F)