April 2016

Biometrical models - topics

  • Genotypes
    • Allele frequencies and genotype frequencies
    • Hardy-Weinberg equilibrium
    • Linkage disequilibrium
  • Genetic effects on phenotypes
    • The additive model
    • Dominance
    • Epistasis
    • Gene-Environment Interaction
    • Gene-Environment Correlation

Genotypes

Allele frequencies and genotype frequencies

  • Humans are diploid
  • A biallelic locus can lead to 3 genotypes

    Haplotypes: A, C

    Genotypes: AA, AC, CC

\(Pr(A) = p\)

\(Pr(C) = q = 1-p\)

\(Pr(AA) = p^2\)

\(Pr(AC) = 2*p*q\)

\(Pr(CC) = q^2\)

Allele frequencies and genotype frequencies

x = seq(0,1,len=1e3)
plot(x, 2*x*(1-x), ylim=c(0,1), col='red', xlab='allele frequency',
     ylab='genotype frequency')
points(x, x^2, col='blue')
points(x, (1-x)^2, col='green')

Assumptions for Hardy-Weinberg-Equilibrium

  • organisms are diploid
  • only sexual reproduction occurs
  • generations are non overlapping
  • mating is random
  • population size is infinitely large
  • allele frequencies are equal in the sexes
  • there is no migration, mutation or selection

Simulating genotypes for one SNP

set.seed(634)
n = 1e3         # 1000 people
p = 0.2         # MAF 20%
haplo = sample(0:1, n, replace=T, prob=c(1-p, p))
head(haplo, 20)
##  [1] 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0
table(haplo)
## haplo
##   0   1 
## 809 191

Simulating genotypes for one SNP

from_mum = sample(0:1, n, replace=T, prob=c(1-p, p))
from_dad = sample(0:1, n, replace=T, prob=c(1-p, p))
geno = from_mum + from_dad
head(geno, 20)
##  [1] 0 1 0 1 1 1 1 0 0 1 1 1 2 1 1 0 0 1 0 0
table(geno)
## geno
##   0   1   2 
## 629 327  44

Simulating genotypes for one SNP

# shorter way to calculate the same thing:
geno = rbinom(n, 2, p)
head(geno, 20)
##  [1] 0 0 1 0 1 0 2 2 0 1 1 1 2 1 0 0 0 0 1 1
table(geno)
## geno
##   0   1   2 
## 644 308  48

Estimating allele frequency and genotype variance for one SNP

(p_est = sum(geno)/(2*length(geno)))
## [1] 0.202
2*p*(1-p)
## [1] 0.32
2*p_est*(1-p_est)
## [1] 0.322392
var(geno)
## [1] 0.3371211

Hardy-Weinberg Equilibrium test

\[ \chi^2 = \sum_i \frac{(O_i-E_i)^2}{E_i} \]

Distribution of a sum of the squares of k independent standard normal random variables.

k = degrees of freedom

In our case k = 1 (# genotypes - # alleles)

hist(rnorm(1e4)^2, 200, freq=F, xlim=c(0,5), main='')
plot(function(x) dchisq(x, df=1), add=T, xlim=c(0,5), col='red')

Hardy-Weinberg Equilibrium test

(observed = table(geno))
## geno
##   0   1   2 
## 644 308  48
expected_0 = n * (1-p_est)^2
expected_1 = n * 2 * p_est * (1-p_est)
expected_2 = n * p_est^2

(expected = c(expected_0, expected_1, expected_2))
## [1] 636.804 322.392  40.804
chisq = sum( (observed - expected)^2/expected )
pval = 1-pchisq(chisq, 1)
c(chisq, pval)
## [1] 1.9928462 0.1580436

Hardy-Weinberg Equilibrium test

seq1 = seq(1e-6,chisq,len=1e4)
seq2 = seq(chisq,100,len=1e4)
plot(function(x) dchisq(x, 1), xlim=c(0,4), ylim=c(0,4))
polygon(list(x=c(0, seq1, chisq),
             y=c(0, dchisq(seq1, 1), 0)), col='blue', border=NA)
polygon(list(x=c(chisq, seq2),
             y=c(0, dchisq(seq2, 1))), col='red', border=NA)

Simulating many SNPs

set.seed(123)
m = 1e4
mafs = runif(m, 0, 0.5)
hist(mafs, 30, col='blue')

Simulating many SNPs

genotype = sapply(1:m, function(i) rbinom(n, 2, mafs[i]))
genotype[1:10, 1:5]
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    0    0    0    1    0
##  [2,]    0    1    0    0    0
##  [3,]    1    0    1    1    1
##  [4,]    0    2    2    1    0
##  [5,]    0    0    0    2    2
##  [6,]    0    1    0    1    2
##  [7,]    1    1    0    2    1
##  [8,]    1    1    1    0    1
##  [9,]    1    0    1    0    2
## [10,]    0    0    0    0    2
dim(genotype)
## [1]  1000 10000

Simulating many SNPs

There may be monomorphic SNPs!

head(sort(colSums(genotype)), 20)
##  [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
mafs = runif(2*m, 0, 0.5)
genotype = sapply(1:(2*m), function(i) rbinom(n, 2, mafs[i]))
select = which(colSums(genotype) > 0)[1:m]
genotype = genotype[,select]
mafs = mafs[select]

Convert R genotypes to PLINK format

Convert R genotypes to PLINK format

Run HWE test in PLINK

Run HWE test in R for many SNPs

p_est = apply(genotype, 2, function(x) sum(x/(2*n)))

plot(mafs, p_est); abline(0,1)

Run HWE test in R for many SNPs

e_0 = (1-p_est)^2
e_1 = 2 * p_est * (1-p_est)
e_2 = p_est^2
e = n * cbind(e_0, e_1, e_2)
head(e, 3)
##           e_0      e_1      e_2
## [1,] 273.5290 498.9420 227.5290
## [2,] 348.6903 483.6195 167.6903
## [3,] 628.8490 328.3020  42.8490
o = t(apply(genotype, 2, function(x) c(sum(x==0), sum(x==1), sum(x==2))))
head(o, 3)
##      [,1] [,2] [,3]
## [1,]  264  518  218
## [2,]  351  479  170
## [3,]  631  324   45

Run HWE test in R for many SNPs

chisq = sapply(1:m, function(i) sum( (o[i,] - e[i,])^2/e[i,] ))
pval = 1-pchisq(chisq, 1)

head(chisq)
## [1] 1.45899740 0.09123938 0.17170928 0.57929649 0.67095476 2.02135133
head(pval)
## [1] 0.2270897 0.7626074 0.6785972 0.4465882 0.4127189 0.1551011

Compare R calculations to PLINK calculations

Analyse p-values

hist(hwe$P)

Analyse p-values

qqPlot(hwe$P)

Analyse p-values

plot(hwe$O.HET., hwe$P)

Analyse p-values

BB = as.numeric(gsub('/.+', '', hwe$GENO))
plot(hwe$O.HET., hwe$P, col=BB+1)

Analyse p-values

plot3d(hwe$O.HET., BB,  hwe$P)

You must enable Javascript to view this page properly.

Linkage disequilibrium

\(p_A\) = MAF at locus 1

\(p_B\) = MAF at locus 2

\(p_{AB}\) = probablility of A and B occuring together

\(D_{AB} = p_{AB} - p_A p_B\)

\(D' = D/D_{max}\)

\(r = \frac{D}{\sqrt{p_A (1-p_A) p_B (1-p_B)}}\)

Linkage disequilibrium

ldmat = cor(genotype[,1:100])

cols = c('black', 'green', 'blue', 'red', 'white')
image(ldmat, col=colorRampPalette(colors=cols)(30))

Linkage disequilibrium

hist(ldmat, 100)

Linkage disequilibrium

jitter_geno = function(g, dummy, frac=0.1, permuteprob=0.3) {
  change = sample(-1:1, length(g), replace=T, prob=c(frac/2, (1-frac), frac/2))
  g = g+change
  if(runif(1) < permuteprob) g = sample(g)
  g[g==3] = 0
  g[g==-1] = 2
  g
}


g = rbinom(n, 2, .1)
g = do.call(cbind, Reduce(jitter_geno, replicate(100, g, simplify=F), accumulate=T))

ldmat = cor(g)

Linkage disequilibrium

image(ldmat, col=colorRampPalette(colors=cols)(30))

Linkage disequilibrium

hist(ldmat, 100)

Genetic effects on phenotypes

The additive model

\[ P = G + E \]

  • No dominance effects

  • No epistasis

  • No Gene-Environment Interaction

Dominance effects

  • No dominance: effect of heterozygotes exactly halfway between homozygotes

  • Dominance: everything else

\[ \sigma^2_G = \sigma^2_A + \sigma^2_D \]

Epistasis

\[ P = g_1 + g_2 + \dots + g_1 \times g_2 + e_1 + e_2 + \dots \]

snps = 1e6
choose(snps, 2) # number of SNP pairs
## [1] 499999500000
choose(snps, 3) # number of SNP triplets
## [1] 1.666662e+17

Gene-Environment Interaction

\[ P = G + E + G \times E \]

  • Very much dependent on scale of phenotype!

  • Statistical interaction completely different from biological interaction.

Gene-Environment Correlation

  • Gene-Environment Interaction means the combination of G and E is more informative for the phenotype than just the sum of both

  • Gene-Environment Correlation is not concerned with the phenotype

  • Three types

    • passive
    • evocative
    • active

Simulating additive genetic effects

h2 = 0.8
beta = rnorm(m, 0, sqrt(h2/m))
g = scale(genotype)

bv = g %*% beta

e = rnorm(n, 0, sqrt(1-h2))

pheno = bv + e

var(pheno)
##           [,1]
## [1,] 0.9907548

Estimating betas

\[ \hat{\beta} = \frac{X^T y}{n} \]

beta_est = (t(genotype) %*% pheno) / n

cor(beta, beta_est[,1])
## [1] 0.261639

\[ R^2 = \frac{h^2}{1+\frac{m}{n * h^2} (1-R^2) } \]

sqrt(h2/(1+m/(n*h2)))
## [1] 0.2434322

Estimating betas

l = .07
plot(beta, beta_est, xlim=c(-l,l), ylim=c(-l,l)); abline(0,1)

suppressMessages(library(MASS))
z = kde2d(beta, beta_est, n=50)
contour(z, col=cols, add=T, drawlabels = F)