Biometrical models

April 2016

Biometrical models - topics

Genotypes
- Allele frequencies and genotype frequencies
- Hardy-Weinberg equilibrium
- Linkage disequilibrium
Genetic effects on phenotypes
- The additive model
- Dominance
- Epistasis
- Gene-Environment Interaction
- Gene-Environment Correlation

Genotypes

Allele frequencies and genotype frequencies

Humans are diploid
A biallelic locus can lead to 3 genotypes

Haplotypes: A, C

Genotypes: AA, AC, CC

\(Pr(A) = p\)

\(Pr(C) = q = 1-p\)

\(Pr(AA) = p^2\)

\(Pr(AC) = 2*p*q\)

\(Pr(CC) = q^2\)

Allele frequencies and genotype frequencies

x = seq(0,1,len=1e3)
plot(x, 2*x*(1-x), ylim=c(0,1), col='red', xlab='allele frequency',
     ylab='genotype frequency')
points(x, x^2, col='blue')
points(x, (1-x)^2, col='green')

Assumptions for Hardy-Weinberg-Equilibrium

organisms are diploid
only sexual reproduction occurs
generations are non overlapping
mating is random
population size is infinitely large
allele frequencies are equal in the sexes
there is no migration, mutation or selection

Simulating genotypes for one SNP

set.seed(634)
n = 1e3         # 1000 people
p = 0.2         # MAF 20%
haplo = sample(0:1, n, replace=T, prob=c(1-p, p))
head(haplo, 20)

##  [1] 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0

table(haplo)

## haplo
##   0   1 
## 809 191

Simulating genotypes for one SNP

from_mum = sample(0:1, n, replace=T, prob=c(1-p, p))
from_dad = sample(0:1, n, replace=T, prob=c(1-p, p))
geno = from_mum + from_dad
head(geno, 20)

##  [1] 0 1 0 1 1 1 1 0 0 1 1 1 2 1 1 0 0 1 0 0

table(geno)

## geno
##   0   1   2 
## 629 327  44

Simulating genotypes for one SNP

# shorter way to calculate the same thing:
geno = rbinom(n, 2, p)
head(geno, 20)

##  [1] 0 0 1 0 1 0 2 2 0 1 1 1 2 1 0 0 0 0 1 1

table(geno)

## geno
##   0   1   2 
## 644 308  48

Estimating allele frequency and genotype variance for one SNP

(p_est = sum(geno)/(2*length(geno)))

## [1] 0.202

2*p*(1-p)

## [1] 0.32

2*p_est*(1-p_est)

## [1] 0.322392

var(geno)

## [1] 0.3371211

Hardy-Weinberg Equilibrium test

\[ \chi^2 = \sum_i \frac{(O_i-E_i)^2}{E_i} \]

Distribution of a sum of the squares of k independent standard normal random variables.

k = degrees of freedom

In our case k = 1 (# genotypes - # alleles)

hist(rnorm(1e4)^2, 200, freq=F, xlim=c(0,5), main='')
plot(function(x) dchisq(x, df=1), add=T, xlim=c(0,5), col='red')

Hardy-Weinberg Equilibrium test

(observed = table(geno))

## geno
##   0   1   2 
## 644 308  48

expected_0 = n * (1-p_est)^2
expected_1 = n * 2 * p_est * (1-p_est)
expected_2 = n * p_est^2

(expected = c(expected_0, expected_1, expected_2))

## [1] 636.804 322.392  40.804

chisq = sum( (observed - expected)^2/expected )
pval = 1-pchisq(chisq, 1)
c(chisq, pval)

## [1] 1.9928462 0.1580436

Hardy-Weinberg Equilibrium test

seq1 = seq(1e-6,chisq,len=1e4)
seq2 = seq(chisq,100,len=1e4)
plot(function(x) dchisq(x, 1), xlim=c(0,4), ylim=c(0,4))
polygon(list(x=c(0, seq1, chisq),
             y=c(0, dchisq(seq1, 1), 0)), col='blue', border=NA)
polygon(list(x=c(chisq, seq2),
             y=c(0, dchisq(seq2, 1))), col='red', border=NA)

Simulating many SNPs

set.seed(123)
m = 1e4
mafs = runif(m, 0, 0.5)
hist(mafs, 30, col='blue')

Simulating many SNPs

genotype = sapply(1:m, function(i) rbinom(n, 2, mafs[i]))
genotype[1:10, 1:5]

##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    0    0    0    1    0
##  [2,]    0    1    0    0    0
##  [3,]    1    0    1    1    1
##  [4,]    0    2    2    1    0
##  [5,]    0    0    0    2    2
##  [6,]    0    1    0    1    2
##  [7,]    1    1    0    2    1
##  [8,]    1    1    1    0    1
##  [9,]    1    0    1    0    2
## [10,]    0    0    0    0    2

dim(genotype)

## [1]  1000 10000

Simulating many SNPs

There may be monomorphic SNPs!

head(sort(colSums(genotype)), 20)

##  [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1

mafs = runif(2*m, 0, 0.5)
genotype = sapply(1:(2*m), function(i) rbinom(n, 2, mafs[i]))
select = which(colSums(genotype) > 0)[1:m]
genotype = genotype[,select]
mafs = mafs[select]

Convert R genotypes to PLINK format

map = data.frame(chr=1, snp=paste0('rs', 1:m), pos1=1:m, pos2=1:m)

head(map)

##   chr snp pos1 pos2
## 1   1 rs1    1    1
## 2   1 rs2    2    2
## 3   1 rs3    3    3
## 4   1 rs4    4    4
## 5   1 rs5    5    5
## 6   1 rs6    6    6

Convert R genotypes to PLINK format

genoletters = c('A A', 'A B', 'B B')
ped = data.frame(FID=1:n, IID=1:n, pat=0, mat=0, sex=1, pheno=-9,
                 matrix(genoletters[genotype+1], n, m))

genotype[1:4, 1:9]

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,]    1    2    1    0    1    0    0    1    0
## [2,]    2    2    0    0    0    0    0    1    1
## [3,]    1    1    0    0    0    0    0    1    0
## [4,]    1    0    1    0    1    0    0    1    0

ped[1:4, 1:15]

##   FID IID pat mat sex pheno  X1  X2  X3  X4  X5  X6  X7  X8  X9
## 1   1   1   0   0   1    -9 A B B B A B A A A B A A A A A B A A
## 2   2   2   0   0   1    -9 B B B B A A A A A A A A A A A B A B
## 3   3   3   0   0   1    -9 A B A B A A A A A A A A A A A B A A
## 4   4   4   0   0   1    -9 A B A A A B A A A B A A A A A B A A

write.table(map, 'test.map', row.names=F, col.names=F, quote=F)
write.table(ped, 'test.ped', row.names=F, col.names=F, quote=F)

Run HWE test in PLINK

#system('curl http://pngu.mgh.harvard.edu/~purcell/plink/dist/
#       plink-1.07-mac-intel.zip -o plink-1.07-mac-intel.zip')
#system('unzip -o plink-1.07-mac-intel.zip')

system('plink-1.07-mac-intel/plink --file test --hardy2 --out test --noweb')

hwe = read.table('test.hwe', stringsAsFactors = F, header=T)
hwe = hwe[hwe$TEST == 'ALL',]
head(hwe)

##    CHR SNP TEST A1 A2        GENO O.HET. E.HET.      P
## 1    1 rs1  ALL  B  A 218/518/264  0.518 0.4989 0.2271
## 4    1 rs2  ALL  B  A 170/479/351  0.479 0.4836 0.7626
## 7    1 rs3  ALL  B  A  45/324/631  0.324 0.3283 0.6786
## 10   1 rs4  ALL  B  A  28/259/713  0.259 0.2654 0.4466
## 13   1 rs5  ALL  B  A  71/410/519  0.410 0.3996 0.4127
## 16   1 rs6  ALL  B  A  12/158/830  0.158 0.1654 0.1551

Run HWE test in R for many SNPs

p_est = apply(genotype, 2, function(x) sum(x/(2*n)))

plot(mafs, p_est); abline(0,1)

Run HWE test in R for many SNPs

e_0 = (1-p_est)^2
e_1 = 2 * p_est * (1-p_est)
e_2 = p_est^2
e = n * cbind(e_0, e_1, e_2)
head(e, 3)

##           e_0      e_1      e_2
## [1,] 273.5290 498.9420 227.5290
## [2,] 348.6903 483.6195 167.6903
## [3,] 628.8490 328.3020  42.8490

o = t(apply(genotype, 2, function(x) c(sum(x==0), sum(x==1), sum(x==2))))
head(o, 3)

##      [,1] [,2] [,3]
## [1,]  264  518  218
## [2,]  351  479  170
## [3,]  631  324   45

Run HWE test in R for many SNPs

chisq = sapply(1:m, function(i) sum( (o[i,] - e[i,])^2/e[i,] ))
pval = 1-pchisq(chisq, 1)

head(chisq)

## [1] 1.45899740 0.09123938 0.17170928 0.57929649 0.67095476 2.02135133

head(pval)

## [1] 0.2270897 0.7626074 0.6785972 0.4465882 0.4127189 0.1551011

Compare R calculations to PLINK calculations

plot(pval, hwe$P); abline(0,1)

Analyse p-values

hist(hwe$P)

Analyse p-values

qqPlot(hwe$P)

Analyse p-values

plot(hwe$O.HET., hwe$P)

Analyse p-values

BB = as.numeric(gsub('/.+', '', hwe$GENO))
plot(hwe$O.HET., hwe$P, col=BB+1)

Analyse p-values

plot3d(hwe$O.HET., BB,  hwe$P)

You must enable Javascript to view this page properly.

Linkage disequilibrium

\(p_A\) = MAF at locus 1

\(p_B\) = MAF at locus 2

\(p_{AB}\) = probablility of A and B occuring together

\(D_{AB} = p_{AB} - p_A p_B\)

\(D' = D/D_{max}\)

\(r = \frac{D}{\sqrt{p_A (1-p_A) p_B (1-p_B)}}\)

Linkage disequilibrium

ldmat = cor(genotype[,1:100])

cols = c('black', 'green', 'blue', 'red', 'white')
image(ldmat, col=colorRampPalette(colors=cols)(30))

Linkage disequilibrium

hist(ldmat, 100)

Linkage disequilibrium

jitter_geno = function(g, dummy, frac=0.1, permuteprob=0.3) {
  change = sample(-1:1, length(g), replace=T, prob=c(frac/2, (1-frac), frac/2))
  g = g+change
  if(runif(1) < permuteprob) g = sample(g)
  g[g==3] = 0
  g[g==-1] = 2
  g
}


g = rbinom(n, 2, .1)
g = do.call(cbind, Reduce(jitter_geno, replicate(100, g, simplify=F), accumulate=T))

ldmat = cor(g)

Linkage disequilibrium

image(ldmat, col=colorRampPalette(colors=cols)(30))

Linkage disequilibrium

hist(ldmat, 100)

Genetic effects on phenotypes

The additive model

\[ P = G + E \]

No dominance effects
No epistasis
No Gene-Environment Interaction

Dominance effects

No dominance: effect of heterozygotes exactly halfway between homozygotes
Dominance: everything else

\[ \sigma^2_G = \sigma^2_A + \sigma^2_D \]

Epistasis

\[ P = g_1 + g_2 + \dots + g_1 \times g_2 + e_1 + e_2 + \dots \]

snps = 1e6
choose(snps, 2) # number of SNP pairs

## [1] 499999500000

choose(snps, 3) # number of SNP triplets

## [1] 1.666662e+17

Gene-Environment Interaction

\[ P = G + E + G \times E \]

Very much dependent on scale of phenotype!
Statistical interaction completely different from biological interaction.

Gene-Environment Correlation

Gene-Environment Interaction means the combination of G and E is more informative for the phenotype than just the sum of both
Gene-Environment Correlation is not concerned with the phenotype
Three types
- passive
- evocative
- active

Simulating additive genetic effects

h2 = 0.8
beta = rnorm(m, 0, sqrt(h2/m))
g = scale(genotype)

bv = g %*% beta

e = rnorm(n, 0, sqrt(1-h2))

pheno = bv + e

var(pheno)

##           [,1]
## [1,] 0.9907548

Estimating betas

\[ \hat{\beta} = \frac{X^T y}{n} \]

beta_est = (t(genotype) %*% pheno) / n

cor(beta, beta_est[,1])

## [1] 0.261639

\[ R^2 = \frac{h^2}{1+\frac{m}{n * h^2} (1-R^2) } \]

sqrt(h2/(1+m/(n*h2)))

## [1] 0.2434322

Estimating betas

l = .07
plot(beta, beta_est, xlim=c(-l,l), ylim=c(-l,l)); abline(0,1)

suppressMessages(library(MASS))
z = kde2d(beta, beta_est, n=50)
contour(z, col=cols, add=T, drawlabels = F)