In this practical we will tackle power calculations for GWAS and
conduct a small GWAS in R.
Part 1 will investigate the power to detect genome-wide
significant loci for quantitative traits,
Part 2a will conduct a small GWAS
Part 2b will plot a qq-plot for the association study in Part
2.
You may download the data files for Part 2 from the 1000G directly to your laptops if you wish.
The data for the prac can be found in the following directory:
/data/module1/downloads/3_statisticalTestsPrac/
Use the following commands in R on the cluster to define the path to
the data. If you have downloaded the data you will need to update the
path for dir
.
dir="/data/module1/downloads/3_statisticalTestsPrac/"
Again, if you are working on the cluster you will need to save the
plots below using the same commands in the first prac, i.e. if the plot
is made using plot()
then:
jpeg(file="myPlot.jpg")
# open the jpeg.
plot(...)
# call the plot.
dev.off()
# close the device.
This part of the practical doesn’t use any data, you may run it directly on your laptop or run it on the server and download any plots.
Knowledge of the factors affecting statistical power to detect loci is essential for understanding and interpreting GWAS results. We will use and expand upon the simple power calculations for quantitative traits found here. Please read.
As outlined in the wiki, the factors affecting power are:
1. \(N\), the sample size
2. \(h^2\), the ‘heritability’ of the
variant. This is the proportion of phenotypic variance explained by the
locus. We will expand on this parameter to include allele frequency and
the effect size, as well as LD between causal variants and genotyped SNP
below.
3. \(\alpha\), the significance
level
First, let us try a simple example. What is the power to detect a loci explaining 1% of the phenotypic variance (i.e. heritability of the marker is 0.01) with a sample size of 500 at genome-wide significance (\(\alpha = 5x10^{-8}\))?
N = 500 # sample size
alpha = 0.05/(1*10^6) # significance level
h2 = 0.01 # heritability of the locus
threshold = qchisq(alpha, df = 1, lower.tail = FALSE)
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
power
## [1] 0.0006516719
Ouch, pretty poor! Less than 1% power to detect a loci with this sample size at \(P<5x10^{-8}\). How big does the sample size need to be to detect such as locus with 80% power?
N = c(500,seq(1000,5000,500)) #guess some sample sizes
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
power
## [1] 0.0006516719 0.0110387240 0.0572452426 0.1637468872 0.3258829115
## [6] 0.5103374868 0.6789516787 0.8087351921 0.8956038978 0.9473577969
{ plot(power ~ N, type="b", main = "sample size to detect loci explaining 1% of phenotypic variance")
abline(h=0.8, col="orange", lty = 2, lwd = 2) }
So that’s about 4,000 (unrelated) samples with genotypes and phenotypes to detect a genotyped locus explaining 1% of the phenotypic variance. Now, what happens if you haven’t genotyped the ‘causal’ locus and you need to consider LD between the locus and the genotyped SNP?
Given a sample size of 4,000; how is the power to detect this locus reduced when the LD between the causal variant and the genotyped SNP decreases?
N = 4000
r2 = seq(0.05,1,0.05)
h2 = 0.01*r2
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
{ plot(power ~ r2, type="b", main = "reduction in power due to incomplete LD with causal variant")
abline(h=0.8, col="orange", lty = 2, lwd = 2)}
In the lectures we said that the sample size needed to increase by \(1/r^2\) to maintain power. Show this and plot the sample size required to maintain 80% power as a function of \(r^2\).
N = 4000/r2
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
power # check that it is 80%
## [1] 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352
## [8] 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352
## [15] 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352 0.8087352
plot(N ~ r2, log='y', type="b", main = "sample size required to maintain 80% power")
The variance explained by a locus is dependent on allele frequency, and it can be calculated as \(2p(1-p)a^2\), where \(p\) is the allele frequency and \(a\) is the effect size in trait units.
Use the power calculator to show the reduction in power by changing the allele frequency of an allele when the allele explains 0.01 of \(\sigma^2_P\) at a frequency of \(p = 0.5\)
N = 4000
a = sqrt(0.01/(2*0.5^2)) # backsolve for the effect size
p = seq(0.01,0.99,0.01) # range of allele frequencies
h2 = 2*p*(1-p)*a^2 # variance explained by the locus
power = pchisq(threshold, df = 1, lower.tail = FALSE, ncp = N * h2)
plot(power ~ p, type="b", main = "influence of allele frequency to detect loci", las = 1)
_________________________________________
We will now conduct a small example GWAS in R. The genotypes are from the European ancestry 1000 genomes data, and there are some simulated phenotypes.
There are three relevant files in the directory above are:
1. phenotypes: sim.phen
2. genotypes: 1kg_chr2v3_EUR.raw
3. map file: map.txt
First read in these files and investigate what they look like using
(for example) the head()
and dim()
functions
in R. How many individuals do you have?
# read in the genotypes, this is plink .raw format
# remove 1st 6 columns but save the IDs so we can match them with phenotype file
geno = read.table(paste0(dir,"1kg_chr2v3_EUR.raw"), header=T)
info = geno[,1:6] # save IDs
geno <- geno[,-1:-6] # remove columns from geno object
dim(geno) # there are 503 individuals, and 9,801 SNPs
## [1] 503 9801
# map file
map <- read.table(paste0(dir,"map.txt"),header=T)
dim(map) # map file matches
## [1] 9801 5
# phenotype file
pheno = read.table(paste0(dir,"sim.phen"))
dim(pheno)
## [1] 503 3
sum(pheno[,1]==info[,1]) # check that the phenotypes & genotypes IDs are in the same order
## [1] 503
Let us start with a running a linear model on a single SNP. Try SNP number 491.
snp = 491
lm0 <- lm(pheno[,3] ~ geno[,snp])
summary(lm0)
##
## Call:
## lm(formula = pheno[, 3] ~ geno[, snp])
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2858 -2.2242 -0.2176 2.1688 11.5730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5682 0.2227 -2.552 0.01101 *
## geno[, snp] 0.7544 0.2282 3.306 0.00101 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.47 on 501 degrees of freedom
## Multiple R-squared: 0.02135, Adjusted R-squared: 0.0194
## F-statistic: 10.93 on 1 and 501 DF, p-value: 0.001014
What do the results mean? Is the SNP significantly associated with the trait?
The SNP is significantly associated with the trait, P = 0.001.
The effect of the allele on the trait is 0.7544 units and, from the model, the trait variance explained by the marker (i.e. the model \(R^2\) or ‘heritability’ of the marker) is 0.021 \(\sigma^2_P\). This is a large effect.
We can also calculate the variance explained by the marker using \(2p(1-p)a^2\), where \(p\) is the allele frequency and \(a\) is the allele effect size. Thus,
p = sum(geno[,snp])/(dim(geno)[1]*2)
p
## [1] 0.3508946
varMarker = 2*p*(1-p)*lm0$coefficients[2]^2
h2 = varMarker / var(pheno[,3])
h2
## geno[, snp]
## 0.02111313
We will plot the genotypic means for the locus, to illustrate what our regression is estimating.
geno2 = geno[,snp]+rnorm(dim(geno)[1],mean=0,sd=0.05) #add a little jitter to look pretty!
{ plot(pheno[,3] ~ geno2, pch=20, xlab = "genotype", ylab = "phenotype")
abline(lm0, col="light green", lty=2, lwd=2) }
OK, so we look like we are ready to write a simple loop in R to test each marker one-at-at-time for an association with the trait. We will use a linear model and save the corresponding p-value at each marker for the Manhattan plot.
pVals <- rep(NA,ncol(geno)) # create a vector to save p-values into
for(i in 1:length(pVals)){
lm0 <- lm(pheno[,3] ~ geno[,i])
pVals[i] <- summary(lm0)$coefficients[2,4]
}
# basic Manhattan plot
map$mbp = map$pos*10^(-6)
{ plot(-log10(pVals)~map$mbp,xlab="Base pairs",pch=20)
abline (h = -log10(5*10^(-8)), col="orange", lwd=2 , lty = 2) }
The test statistic criteria (\(P<5x10^{-8}\)) is too conservative for this particular ‘association study’ as we didn’t really do 1M independent tests.
Because it is a simulated phenotype, we also know the loci which were simulated with causal effects. Let us do another plot highlighting these SNPs, what do you make of the results?
causal = read.table(paste0(dir,"causativeLoci.txt"))
causalSNP = which(map$rsID%in%causal[,1])
{ plot(-log10(pVals)~map$mbp,xlab="Base pairs",pch=20)
abline (h = -log10(5*10^(-8)), col="orange", lwd=2 , lty = 2)
points(-log10(pVals)[causalSNP]~map$mbp[causalSNP], col="red")
abline(v=map$mbp[causalSNP],col="red") }
_________________________________________
QQ-plots are often used as diagnostic tools in linear models to check the assumption of normality for the residuals. They are often examined in GWAS to check for departures from the expected distribution of the test statistic. QQ-plots are somewhat subjective but are a quick visual of the distribution.
QQ-plots involved plotting observed standardized residuals or test statistic (y − axis) on the theoretical quartiles for the data points. The expectation (if the residuals or test statistic are distributed according to the distribution tested) is that the data points will lie on a straight y = x line.
We are going to make a qqplot for our test statistics from our small GWAS by hand. There are several R packages to help with making nice qq-plots, including those with 95% confidence intervals. I obtained some of my code for making qq-plots using the base packages in R here.
# Calculate expectations
exp.pvalues<-(rank(pVals, ties.method="first")+.5)/(length(pVals)+1)
#Make plot
{ plot(-log10(exp.pvalues), -log10(pVals),
pch = 20, xlab="Theoretical Quartiles", ylab = "Sample Quartiles")
abline(0,1, lwd=1.3, col="red") }
# or can use a package such as 'qqman'
# install.packages("qqman")
library(qqman)
##
## For example usage please run: vignette('qqman')
##
## Citation appreciated but not required:
## Turner, (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software, 3(25), 731, https://doi.org/10.21105/joss.00731.
##
qq(pVals)