library(ggplot2)
In this practical we will tackle power calculations for GWAS and conduct a small GWAS in R.
Part 1 will investigate the power to detect genome-wide significant loci for quantitative traits,
Part 2 will conduct a small GWAS
Part 3 will plot a qq-plot for the association study in Part 2.
The data for the prac can be found in the following directory:
/data/module1/4_statisticalTests/
dir="/data/module1/3_statisticalTests/"
Knowledge of the factors affecting statistical power to detect loci is essential for understanding and interpreting GWAS results. We will use and expand upon the simple power calculations for quantitative traits found here; please read it before continuing.
As outlined in the wiki, the factors affecting power are (a short code sketch of how they combine follows this list):
1. \(N\), the sample size
2. \(h^2\), the 'heritability' of the variant, i.e. the proportion of phenotypic variance explained by the locus. Below we will expand on this parameter to include allele frequency, effect size, and LD between the causal variant and the genotyped SNP.
3. \(\alpha\), the significance level
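These three quantities enter the calculation through a 1-df chi-square test whose non-centrality parameter is \(N h^2\), as used throughout this prac. As a convenience we could wrap the calculation in a small helper function; gwasPower is just an illustrative name and is not used in the code below.
gwasPower <- function(N, h2, alpha = 5e-8) {
  # power of a 1-df chi-square test with non-centrality parameter N * h2
  threshold <- qchisq(alpha, df = 1, lower.tail = FALSE)
  pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
}
gwasPower(N = 500, h2 = 0.01)  # should reproduce the first worked example below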
First, let us try a simple example. What is the power to detect a locus explaining 1% of the phenotypic variance (i.e. the heritability of the marker is 0.01) with a sample size of 500 at genome-wide significance (\(\alpha = 5 \times 10^{-8}\))?
N = 500 # sample size
alpha = 0.05/(1*10^6) # significance level: Bonferroni correction for ~1 million independent tests, i.e. 5e-8
h2 = 0.01 # heritability of the locus
threshold = qchisq(alpha, df = 1, lower.tail = FALSE)
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
power
## [1] 0.0006516719
Ouch, pretty poor! Less than 1% power to detect a locus with this sample size at \(P < 5 \times 10^{-8}\). How big does the sample size need to be to detect such a locus with 80% power?
N = c(500,seq(1000,5000,500)) #guess some sample sizes
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
power
## [1] 0.0006516719 0.0110387240 0.0572452426 0.1637468872 0.3258829115
## [6] 0.5103374868 0.6789516787 0.8087351921 0.8956038978 0.9473577969
{ plot(power ~ N, type="b", main = "sample size to detect loci explaining 1% of phenotypic variance")
abline(h=0.8, col="orange", lty = 2, lwd = 2)}
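Rather than guessing sample sizes, we could also solve for \(N\) directly with uniroot() (base R); a quick sketch under the same assumptions:
# solve for the sample size giving exactly 80% power at alpha = 5e-8, h2 = 0.01
# ('threshold' as computed above)
f <- function(N) pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * 0.01) - 0.8
uniroot(f, interval = c(100, 10000))$root  # a little under 4,000, consistent with the plot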
So that's about 4,000 (unrelated) samples with genotypes and phenotypes to detect a genotyped locus explaining 1% of the phenotypic variance. Now, what happens if you haven't genotyped the 'causal' locus and you need to consider LD between the locus and the genotyped SNP?
Given a sample size of 4,000, how is the power to detect this locus reduced when the LD between the causal variant and the genotyped SNP decreases?
N = 4000
r2 = seq(0.05,1,0.05) # r^2 between the causal variant and the genotyped SNP
h2 = 0.01*r2 # variance explained at the genotyped SNP is r^2 times that of the causal variant
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
{ plot(power ~ r2, type="b", main = "reduction in power due to incomplete LD with causal variant")
abline(h=0.8, col="orange", lty = 2, lwd = 2)}
In the lectures we said that the sample size needs to increase by \(1/r^2\) to maintain power; this follows because the non-centrality parameter is \(N r^2 h^2\), so multiplying \(N\) by \(1/r^2\) keeps it constant. Show this and plot the sample size required to maintain 80% power as a function of \(r^2\).
N = 4000/r2
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
power # check that it is 80%
## [1] 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352
## [8] 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352
## [15] 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352
plot(N ~ r2, log='y', type="b", main = "sample size required to maintain 80% power")
The variance explained by a locus is dependent on allele frequency, and it can be calculated as \(2p(1-p)a^2\), where \(p\) is the allele frequency and \(a\) is the effect size in trait units.
Use the power calculation above to show how power changes with allele frequency, for an allele whose effect size is fixed so that it explains 0.01 of \(\sigma^2_P\) at a frequency of \(p = 0.5\).
N = 4000
a = sqrt(0.01/(2*0.5^2)) # backsolve for the effect size
p = seq(0.01,0.99,0.01) # range of allele frequencies
h2 = 2*p*(1-p)*a^2 # variance explained by the locus
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
plot(power ~ p, type="b", main = "influence of allele frequency on power to detect a locus", las = 1)
_________________________________________
We will now conduct a small example GWAS in R. The genotypes are from the European-ancestry 1000 Genomes data, and the phenotypes are simulated. The three relevant files in the directory above are:
1. phenotypes: sim.phen
2. genotypes: 1kg_chr2v3_EUR.raw
3. map file: map.txt
First read in these files and investigate what they look like using (for example) the head() and dim() functions in R. How many individuals do you have?
# read in the genotypes, this is plink .raw format
# remove 1st 6 columns but save the IDs so we can match them with phenotype file
geno = read.table(paste0(dir,"1kg_chr2v3_EUR.raw"), header=T)
info = geno[,1:6] # save IDs
geno <- geno[,-1:-6] # remove columns from geno object
dim(geno) # there are 503 individuals, and 9,801 SNPs
## [1] 503 9801
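# (optional extra, our own addition) beyond head() and dim(), a quick look at the
# allele frequency spectrum; .raw genotypes are coded as 0/1/2 copies of the counted
# allele, and we assume here that there are no missing genotypes
freq <- colMeans(geno)/2
hist(freq, breaks = 50, main = "allele frequency spectrum", xlab = "allele frequency")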
# map file
map <- read.table(paste0(dir,"map.txt"),header=T)
dim(map) # map file matches
## [1] 9801 5
# phenotype file
pheno = read.table(paste0(dir,"sim.phen"))
dim(pheno)
## [1] 503 3
sum(pheno[,1]==info[,1]) # check that the phenotypes & genotypes IDs are in the same order
## [1] 503
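The IDs happen to be in the same order here. If they were not, one way to align them would be to reorder the phenotype rows with match(); a sketch, assuming the first column of each file holds the same individual ID:
# reorder phenotype rows so their IDs match the genotype IDs (not needed for this data)
pheno <- pheno[match(info[,1], pheno[,1]), ]
stopifnot(all(pheno[,1] == info[,1]))  # confirm the order now matches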
Let us start by running a linear model on a single SNP. Try SNP number 491.
snp = 491
lm0 <- lm(pheno[,3] ~ geno[,snp])
summary(lm0)
##
## Call:
## lm(formula = pheno[, 3] ~ geno[, snp])
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2858 -2.2242 -0.2176 2.1688 11.5730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5682 0.2227 -2.552 0.01101 *
## geno[, snp] 0.7544 0.2282 3.306 0.00101 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.47 on 501 degrees of freedom
## Multiple R-squared: 0.02135, Adjusted R-squared: 0.0194
## F-statistic: 10.93 on 1 and 501 DF, p-value: 0.001014
What do the results mean? Is the SNP significantly associated with the trait?
The SNP is significantly associated with the trait, P = 0.001.
The effect of the allele on the trait is 0.7544 units and, from the model, the trait variance explained by the marker (i.e. the model \(R^2\) or 'heritability' of the marker) is 0.021 \(\sigma^2_P\). This is a large effect.
We can also calculate the variance explained by the marker using \(2p(1-p)a^2\), where \(p\) is the allele frequency and \(a\) is the allele effect size. Thus,
p = sum(geno[,snp])/(dim(geno)[1]*2)
p
## [1] 0.3508946
varMarker = 2*p*(1-p)*lm0$coefficients[2]^2
h2 = varMarker / var(pheno[,3])
h2
## geno[, snp]
## 0.02111313
This agrees closely with the model \(R^2\) reported above. We will now plot the phenotypes by genotype at this locus, with the fitted regression line, to illustrate what the regression is estimating.
geno2 = geno[,snp]+rnorm(dim(geno)[1],mean=0,sd=0.05) #add a little jitter to look pretty!
{ plot(pheno[,3] ~ geno2, pch=20, xlab = "genotype", ylab = "phenotype")
abline(lm0, col="light green", lty=2, lwd=2) }
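As a cross-check on the regression, we can also compute the raw phenotype mean within each genotype class (our own addition, using the unjittered genotypes):
# mean phenotype for carriers of 0, 1 and 2 copies of the counted allele
tapply(pheno[,3], geno[,snp], mean)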
OK, it looks like we are ready to write a simple loop in R to test each marker one at a time for association with the trait. We will use a linear model and save the p-value at each marker for the Manhattan plot.
pVals <- rep(NA,ncol(geno)) # create a vector to save p-values into
for(i in 1:length(pVals)){
lm0 <- lm(pheno[,3] ~ geno[,i])
pVals[i] <- summary(lm0)$coefficients[2,4]
}
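Looping over lm() is fine for ~10,000 SNPs, but for larger data sets an equivalent and much faster approach works directly from the genotype-phenotype correlations. This is a sketch that assumes no covariates, no missing genotypes and no monomorphic SNPs, in which case it gives the same p-values as the loop above:
# correlation-based version of the single-SNP regression (same p-values as lm)
n <- nrow(geno)
r <- cor(as.matrix(geno), pheno[,3])        # correlation of each SNP with the trait
tStat <- r * sqrt(n - 2) / sqrt(1 - r^2)    # t statistic of the regression slope
pVals2 <- 2 * pt(-abs(tStat), df = n - 2)   # two-sided p-values
# max(abs(pVals2 - pVals))                  # should be ~0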
# basic manhattan plot
map$mbp = map$pos*10^(-6) # convert base-pair positions to megabases
{ plot(-log10(pVals)~map$mbp,xlab="Position (Mb)",pch=20)
abline (h = -log10(5*10^(-8)), col="orange", lwd=2 , lty = 2) }
The significance criterion (\(P < 5 \times 10^{-8}\)) is too conservative for this particular 'association study', as we did not really perform 1 million independent tests.
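A less stringent alternative (our own suggestion) is to Bonferroni-correct for the number of tests actually performed, though even this ignores the LD between nearby SNPs and so remains somewhat conservative:
alphaStudy = 0.05/ncol(geno)  # Bonferroni threshold for the ~9,800 tests we ran
-log10(alphaStudy)            # about 5.3 on the -log10 scale, versus 7.3 for 5e-8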
Because it is a simulated phenotype, we also know the loci which were simulated with causal effects. Let us do another plot highlighting these SNPs, what do you make of the results?
causal = read.table(paste0(dir,"causativeLoci.txt"))
causalSNP = which(map$rsID%in%causal[,1])
{ plot(-log10(pVals)~map$mbp,xlab="Position (Mb)",pch=20)
abline (h = -log10(5*10^(-8)), col="orange", lwd=2 , lty = 2)
points(-log10(pVals)[causalSNP]~map$mbp[causalSNP], col="red")
abline(v=map$mbp[causalSNP],col="red") }
_________________________________________
QQ-plots are often used as diagnostic tools in linear models to check the assumption of normality for the residuals. They are also often examined in GWAS to check for departures from the expected distribution of the test statistic. QQ-plots are somewhat subjective, but they are a quick visual check of the distribution.
QQ-plots involve plotting the observed standardized residuals or test statistics (y-axis) against the theoretical quantiles for the data points. The expectation (if the residuals or test statistics follow the distribution being tested) is that the points will lie on the straight \(y = x\) line.
We are going to make a qqplot for our test statistics from our small GWAS by hand. There are several R packages to help with making nice qq-plots, including those with 95% confidence intervals. I obtained some of my code for making qq-plots using the base packages in R here.
# Calculate expectations
exp.pvalues<-(rank(pVals, ties.method="first")+.5)/(length(pVals)+1)
#Make plot
{ plot(-log10(exp.pvalues), -log10(pVals),
pch = 20, xlab="Theoretical Quantiles", ylab = "Sample Quantiles")
abline(0,1, lwd=1.3, col="red") }
# or can use a package such as 'qqman'
# install.packages("qqman")
# library(qqman)
# qq(pVals)
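A common numeric companion to the QQ-plot (our own addition, not part of the prac files) is the genomic inflation factor \(\lambda\): the median observed chi-square statistic divided by its expected value under the null.
chisq = qchisq(pVals, df = 1, lower.tail = FALSE)  # convert p-values back to 1-df chi-square statistics
lambda = median(chisq)/qchisq(0.5, df = 1)         # expected median under the null is ~0.455
lambda  # values well above 1 suggest inflation (e.g. stratification or strong polygenic signal)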