In this practical you will learn how to perform polygenic prediction using Bayesian methods with individual genotypes and phenotypes. We will use the same data sets in the BLUP session, i.e., a small data set with 10 SNPs and simulated phenotypes for 325 individuals , and a larger data set of 277,719 SNPs and 3,000 GWAS individuals from the UK Biobank (UKB). As discussed in the lecture, BLUP can be regarded as a Bayesian model assuming a normal prior distribution for the SNP effects. With different prior assumptions, especially with a mixture prior, Bayesian methods are flexible in accounting for the genetic architecture of the trait. We will explore the principle of various Bayesian methods using the toy example and R script, and then will employ GCTB to perform the UKB data analysis.
Note that the small data set is free to download to your own laptop if you want. However, the UKB data set is not allowed to download because it contains real individual genotypes from the UK Biobank (although individual IDs has been masked).
This is a toy example data set consists of a training population (aka GWAS population, discovery population, or reference population) of 325 individuals. Each of these individuals had been genotyped for 10 SNPs. The trait was simulated in a way that the first SNP has an effect size of 2 and the 5th SNP has an effect size of 1 on the phenotype. The trait heritability is 0.5. The validation population (aka target population) is a hold-out sample of 31 individuals.
Path to Data set 1:
ls /data/module5/toy/
This is a larger data set based on real genotypes at 273,604 SNPs from the UK Biobank. We simulated a trait with 100 causal variants with heritability of 0.5. The entire data set was partitioned into three independent populations:
Path to Data set 2:
ls /data/module5/ukb/
This section is designed to help you understand the R code used in the method. You may skip it for now and return to it later if you prefer.
The following code is highlighted for explanatory purpose as part of the method. No need to run the code in this section.
Different Bayesian methods differ only in the prior specification for
SNP effects. Here, we will focus on two Bayesian methods, BayesC and
BayesR, where BayesC is a special case of BayesR with two-component
mixture. The algorithm for BayesR (i.e., Markov chain Monte Carlo
sampling scheme) has been implemented in a R script,
bayesr.R
. Let’s load it in R and have a
look.
source("/data/module5/scr/bayesr.R")
The script includes a function called bayesr
. BayesR is
a general model where each SNP effects is assumed to follow a mixture of
normal distributions. The input parameters are
# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2){
...
}
where niter
is the number of iterations for MCMC
sampling, X
is the genotype matrix, y
is the
vector of phenotypes, gamma
is a vector of scaling factor
for the variance of a mixture distribution, startPi
is a
vector of starting values for the \(\pi\) parameter, startH2
is
the starting value of SNP-based heritability. Note that the number of
elements in gamma
and startPi
define the
number of mixture components, and therefore need to be matched.
Let’s look into more details what the code is doing. Do not run the code below in your R.
The first step is to declar and initialise variables. In addition,
adjusting y for population mean and all SNP effects, so that
ycorr
is like the residuals.
# Step 1. Declaration and initialisation of variables
n = nrow(X) # number of observations
m = ncol(X) # number of SNPs
ndist = length(startPi) # number of mixture distributions
pi = startPi # starting value for pi
h2 = startH2 # starting value for heritability
vary = var(y) # phenotypic variance
varg = vary*h2 # starting value for genetic variance
vare = vary*(1-h2) # starting value for residual variance
sigmaSq = varg/(m*sum(gamma*pi)) # common factor of SNP effect variance
nub = 4 # prior degrees of freedom for SNP effect variance
nue = 4 # prior degrees of freedom for residual variance
scaleb = (nub-2)/nub*sigmaSq # prior scale parameter for SNP effect variance
scalee = (nue-2)/nue*vare # prior scale parameter for residual variance
beta = array(0,m) # vector of SNP effects
beta_mcmc = matrix(0,niter,m) # MCMC samples of SNP effects
mu = mean(y) # overall mean
xpx = apply(X, 2, crossprod) ## diagonal elements of X'X
probDelta = vector("numeric", ndist)
logDelta = array(0,2)
keptIter = NULL # keep MCMC samples
# adjust/correct y for population mean and all SNP effects (which all have zero as initial value)
ycorr = y - mu
Then the Gibbs sampling begins. The principle of Gibbs sampling is that we sample each parameter from its full conditional distribution (posterior distribution conditional on the values of other parameters). By doing this iteratively, we obtain samples of all parameters from a stationary distribution that converges to the joint posterior distribution. Within each iteration, the first step is to sample the population mean (fixed effect) from its full conditional distribution, given a constant prior. As you can see from the code, the full conditional distribution is centered at the BLUE solution, with the variance equal to the inverse of the coefficient matrix in the “mini” mixed model equation for the mean.
# MCMC begins
for (iter in 1:niter){
# Step 2. Sample the mean from a normal posterior
ycorr = ycorr + mu # unadjust y with the old sample
rhs = sum(ycorr) # right hand side of the mixed model equation
invLhs = 1/n # inverse of left hand side of the mixed model equation
muHat = invLhs*rhs # BLUE solution
mu = rnorm(1, muHat, sqrt(invLhs*vare)) # posterior is a normal distribution
ycorr = ycorr - mu # adjust y with the new sample
The next step is to sample the distribution membership and effect
size for each SNPs, conditional on the effects of all the other SNPs. We
first sample the indicator variable delta
for the mixture
distribution membership, and then sample the SNP effect
beta
from the normal distribution or a point mass at zero.
The likelihood is a normal distribution, so the full conditional
distribution given the membership is also a normal distribution. The
full conditional distribution for the membership is a multinomial
distribution.
# Step 3. Sample each SNP effect from a multi-normal mixture distribution
logPi = log(pi)
logPiComp = log(1-pi)
invSigmaSq = 1/(gamma*c(sigmaSq))
logSigmaSq = log(gamma*c(sigmaSq))
nsnpDist = rep(0, ndist) # number of SNPs asigned to each mixture component
ssq = 0 # weighted sum of squares of SNP effects
nnz = 0 # number of nonzero effects
ghat = array(0,n) # individual genetic value
# loop over SNPs
for (j in 1:m){
oldSample = beta[j]
rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample # right hand side of the mixed model equation
rhs = rhs/vare
invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq) # inverse of left hand side of the mixed model equation
betaHat = invLhs*c(rhs) # BLUP solution
# sample the mixture distribution membership
logDelta = 0.5*(log(invLhs) - logSigmaSq + betaHat*c(rhs)) + logPi # log likelihood + prior for nonzero effects
logDelta[1] = logPi[1] # log likelihood + prior for zero effect
for (k in 1:ndist) {
probDelta[k] = 1.0/sum(exp(logDelta - logDelta[k])) # posterior probability for each distribution membership
}
delta = sample(1:ndist, 1, prob = probDelta) # indicator variable for the distribution membership
nsnpDist[delta] = nsnpDist[delta] + 1
if (delta > 1) {
beta[j] = rnorm(1, betaHat[delta], sqrt(invLhs[delta])) # given the distribution membership, the posterior is a normal distribution
ycorr = ycorr + X[,j]*(oldSample - beta[j]) # update ycorr with the new sample
ghat = ghat + X[,j]*beta[j]
ssq = ssq + beta[j]^2 / gamma[delta]
nnz = nnz + 1
} else {
if (oldSample) ycorr = ycorr + X[,j]*oldSample # update ycorr with the new sample which is zero
beta[j] = 0
}
}
beta_mcmc[iter,] = beta
Next, we sample the other parameters in the model, including mixing probabilities (\(\pi\)), SNP effect variance (\(\sigma^2_{\beta}\)), and residual variance (\(\sigma^2_e\)). The full conditional distribution for \(\pi\) is a beta distribution. For \(\sigma^2_{\beta}\) and \(\sigma^2_e\), they are scaled inverse chi-square distributions.
# Step 4. Sample the distribution membership probabilities from a Dirichlet distribution
pi = rdirichlet(1, nsnpDist + 1)
# Step 5. Sample the SNP effect variance from a scaled inverse chi-square distribution
sigmaSq = (ssq + nub*scaleb)/rchisq(1, nnz+nub)
# Step 6. Sample the residual variance from a scaled inverse chi-square distribution
vare = (crossprod(ycorr) + nue*scalee)/rchisq(1, n+nue)
Given the genetic values and residual variance, we can compute the SNP-based heritability by definition. This is an appealling property of the MCMC approach that given the sampled values of SNP effects, you can easily compute any statistics of interest.
# Step 7. Compute the SNP-based heritability
varg = var(ghat)
h2 = varg/(varg + vare)
The final step in each iteration is to store the MCMC samples for the parameter values
keptIter <- rbind(keptIter,c(pi, nnz, sigmaSq, h2, vare, varg))
After MCMC sampling, we can calculate the point estimate and its variance from the MCMC samples.
colnames(keptIter) <- c(paste0("Pi", 1:length(pi)),"Nnz","SigmaSq","h2","Vare","Varg")
postMean = apply(keptIter, 2, mean) # posterior mean of MCMC samples is used as the point estimate for each parameter
cat("\nPosterior mean:\n")
print(postMean)
return(list(par=keptIter, beta=beta_mcmc))
Read in the data by R.
# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))
# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))
For the first exercise, we will analyse this small data set using BayesC.
\[\beta_j \begin{cases} \sim N(0, \sigma^2_\beta) & \text{ with probability } \pi \\ \\ = 0 & \text{ with probability } 1 - \pi \end{cases} \] where parameter \(\pi\) is estimated from the data. It can be seen that when \(\pi = 0\), the prior collapses to a normal distribution, i.e., \(\beta_j \sim N(0, \sigma^2_\beta)\), which is the assumption in BLUP.
We will analyse the data with a script written in R,
BayesR.R
. Let’s load it in R.
source("/data/module5/scr/bayesr.R")
The script includes a function called bayesr
. BayesR is
a general model where each SNP effects is assumed to follow a mixture
distribution of normal distributions. BayesC is a special case of
BayesR. The input parameters are
# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2)
where niter
is the number of iterations for MCMC
sampling, X
is the genotype matrix, y
is the
vector of phenotypes, gamma
is a vector of scaling factor
for the variance of a mixture distribution, startPi
is a
vector of starting values for the \(\pi\) parameter, startH2
is
the starting value of SNP-based heritability. Note that the number of
elements in gamma
and startPi
define the
number of mixture components, and therefore need to be matched.
We can use this program to run BayesC by specifying the mixture model as
gamma = c(0, 1)
startPi = c(0.8, 0.2)
This means the SNP effects are, a priori, assumed to be normally distributed with a probability of 0.2 and have a point mass at zero with a probability of 0.8. Now we can run the analysis using command
res_bayesc = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
##
## iter 100, nnz = 7, sigmaSq = 1.850, h2 = 0.484, vare = 3.974, varg = 3.735
##
## iter 200, nnz = 4, sigmaSq = 1.039, h2 = 0.561, vare = 3.261, varg = 4.168
##
## iter 300, nnz = 2, sigmaSq = 2.065, h2 = 0.470, vare = 4.010, varg = 3.560
##
## iter 400, nnz = 2, sigmaSq = 1.212, h2 = 0.498, vare = 3.722, varg = 3.696
##
## iter 500, nnz = 3, sigmaSq = 1.599, h2 = 0.498, vare = 3.568, varg = 3.534
##
## iter 600, nnz = 7, sigmaSq = 0.655, h2 = 0.475, vare = 4.254, varg = 3.849
##
## iter 700, nnz = 6, sigmaSq = 1.393, h2 = 0.496, vare = 4.016, varg = 3.958
##
## iter 800, nnz = 5, sigmaSq = 1.830, h2 = 0.489, vare = 4.622, varg = 4.423
##
## iter 900, nnz = 6, sigmaSq = 0.675, h2 = 0.529, vare = 3.375, varg = 3.793
##
## iter 1000, nnz = 4, sigmaSq = 2.855, h2 = 0.465, vare = 4.327, varg = 3.763
##
## Posterior mean:
## Pi1 Pi2 Nnz SigmaSq h2 Vare Varg
## 0.5441817 0.4558183 4.4730000 1.7018875 0.4968014 3.8873645 3.8516138
Here it outputs the sampled values for some of the key model
parameters in every 100 iterations. For example, Nnz
is the
number of nonzero effects fitted in the model. As you can see, in
contrast to BLUP which assumes all SNPs contribute equally to the trait,
BayesC knocks out some of the SNPs by allowing their effect size to be
exactly zero. For this reason, Nnz
is less than the number
of SNPs. Of course, there may be different SNPs fitted as nonzero across
iterations, due to sampling variation.
After the MCMC is finished, you can find the sampled values for the model parameters and SNP effects for each iteration in the result list. For example, you can look at the posterior mean and standard deviation for each parameter by
colMeans(res_bayesc$par)
## Pi1 Pi2 Nnz SigmaSq h2 Vare Varg
## 0.5441817 0.4558183 4.4730000 1.7018875 0.4968014 3.8873645 3.8516138
apply(res_bayesc$par, 2, sd)
## Pi1 Pi2 Nnz SigmaSq h2 Vare Varg
## 0.21126203 0.21126203 1.98070548 1.27101871 0.03563397 0.30474265 0.44320081
The posterior mean is often used as point estimate of the parameter, and the posterior standard error can be approximated by the standard deviation of MCMC samples.
You can use the plotting facilities in R to investigate changes in the parameters over iterations. For example, to look at the effect of the first marker across iterations, you would enter the command
# Trace plot for SNP 1, which is the causal variant with effect size of 2
#png("beta1_vs_iter.png")
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,1], xlab="Iteration", ylab="beta[,1]")
abline(h = 2, col="red")
#dev.off()
Question: Use this command to investigate each of the parameters in turn. Do they appear to be fluctuating around the correct values?
# Trace plot for SNP 2, which is a null SNP
#png("beta2_vs_iter.png")
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,2], xlab="Iteration", ylab="beta[,1]")
abline(h = 0, col="red")
#dev.off()
# Trace plot for SNP 5, which is a causal variant with a smaller effect of 1 and in LD with other SNPs
#png("beta5_vs_iter.png")
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,5], xlab="Iteration", ylab="beta[,1]")
abline(h = 1, col="red")
#dev.off()
We can also plot the posterior distribution, for example for the effect of the first SNP. We would discard the first 100 iterations of the program as “burn in”:
# Posterior distribution of SNP 1 effect
#png("dist_beta1.png")
plot(density(res_bayesc$beta[100:1000,1]))
#dev.off()
Question: Does the distribution appear to be normal? What about the distributions of the other parameters?
# Posterior distribution of SNP 2 effect
#png("dist_beta2.png")
plot(density(res_bayesc$beta[100:1000,2]))
#dev.off()
Question: The posterior distribution of the SNP effect can be summarised to quantify the evidence of the SNP associated with the trait. For example, write a R code to cacluate the probability of SNP 1 having a nonzero effect, which is known as PIP (posterior inclusion probability).
mean(res_bayesc$beta[100:1000,1] != 0)
## [1] 1
From the posterior samples, we can also calculate the Bayesian 90% credible interval, which represents 95% probability that the true parameter would lie within the interval, given the evidence from the data. What can we conclude from the Bayesian credible interval?
quantile(res_bayesc$beta[100:1000,1], probs=c(0.05, 00.95))
## 5% 95%
## 1.716366 2.469117
With MCMC samples, it’s straightforward to compute the posterior inclusion probability (PIP). PIP is a Bayesian measurement for SNP-trait association, that is, the posterior probability that the SNP is associated with trait. It has been widely used in the fine-mapping analysis. It can be calculated as the frequency of the SNP being fitted as a non-zero effect in the model across MCMC samples.
Question: Can you tell which SNPs are likely to be the causal variants based on their PIP?
delta = (res_bayesc$beta != 0) # indicator variable for each SNP in each MCMC cycle
pip = colMeans(delta) # frequency of the indicator variable being one across MCMC cycles
plot(pip, type="h", xlab="SNP", ylab="Posterior inclusion probability")
To get the SNP effect estimates, we would calculate the posterior mean of SNP effects:
betaMean = colMeans(res_bayesc$beta)
Question: What’s the prediction accuracy using BayesC? How does it compare to that from C+PT or BLUP? What could cause the difference?
ghat_bayesc=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesc))$r.squared
## [1] 0.5319675
BayesR prior assumes that some SNPs have zero effect, some have small effects, and some have large effects, by assuming a mixture of multiple normal distributions, including a point mass at zero.
\[ \beta_j \sim \sum_k \pi_k N(0, \gamma_k \sigma^2_\beta) \]
For example, we can set a 4-component mixture model:
gamma = c(0, 0.01, 0.1, 1)
startPi = c(0.5, 0.3, 0.15, 0.05)
bayesr.res = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
##
## iter 100, nnz = 8, sigmaSq = 1.234, h2 = 0.490, vare = 3.394, varg = 3.266
##
## iter 200, nnz = 10, sigmaSq = 1.631, h2 = 0.547, vare = 3.795, varg = 4.574
##
## iter 300, nnz = 8, sigmaSq = 1.551, h2 = 0.510, vare = 3.834, varg = 3.989
##
## iter 400, nnz = 8, sigmaSq = 2.449, h2 = 0.465, vare = 4.246, varg = 3.689
##
## iter 500, nnz = 2, sigmaSq = 2.763, h2 = 0.552, vare = 3.957, varg = 4.875
##
## iter 600, nnz = 3, sigmaSq = 5.617, h2 = 0.566, vare = 3.620, varg = 4.727
##
## iter 700, nnz = 10, sigmaSq = 1.563, h2 = 0.544, vare = 3.233, varg = 3.852
##
## iter 800, nnz = 5, sigmaSq = 3.690, h2 = 0.546, vare = 3.669, varg = 4.421
##
## iter 900, nnz = 10, sigmaSq = 2.164, h2 = 0.495, vare = 3.720, varg = 3.648
##
## iter 1000, nnz = 5, sigmaSq = 10.054, h2 = 0.516, vare = 3.758, varg = 4.007
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.2836194 0.2907440 0.2293404 0.1962962 7.0450000 3.9312372 0.4993304 3.9175322
## Varg
## 3.9214288
Question: Is the prediction accuracy improved? Why or why not?
betaMean = colMeans(bayesr.res$beta)
ghat_bayesr=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesr))$r.squared
## [1] 0.5302406
Take a moment to think about it before continuing.
Bayesian methods differ to each other mainly in the assumed prior distribution for SNP effects. The best method is the one that assumes a distribution closest to the true distribution of causal effects. In our simulation, we only two causal effects so fitting a multi-component model is sort of a “overkill”.
However, you may want to play around with the number of mixture components and see how sensitive the results are.
GCTB (https://cnsgenomics.com/software/gctb/#Overview) is a
c++ software for performing Bayesian analysis of large-scale genomic
data. It takes the PLINK bed file as input and generate estimates for
model parameters and SNP effects for polygenic prediction. You can use
the following code to run BayesR analysis. It will take about 5 mins or
more to finish, depending on how busy the cluster is, and about 4GB RAM.
If you are short of time, you can check the results that produced from
the same code at /data/module5/ukb/bayesr.parRes
and
/data/module5/ukb/bayesr.snpRes
. You can also run with
chromosome 1 only by adding a flag --chr 1
for a quick
run.
The default MCMC chain length is 3000 with the first 1000 iterations as burn-in. Here, for the interest of time, we set a shorter chain length. In your own real trait analysis, it is recommended to use the default setting or even longer chain.
bfile="/data/module5/ukb/gwas"
pheno="/data/module5/ukb/gwas.phen"
covar="/data/module5/ukb/gwas.cov"
gctb --bayes R \
--bfile $bfile \
--pheno $pheno \
--covar $covar \
--gamma 0,0.01,0.1,1 \
--pi 0.994,0.003,0.002,0.001 \
--chain-length 1000 \
--burn-in 200 \
--out bayesr
Result file bayesr.parRes
shows the estimates of model
parameters. Result file bayesr.snpRes
shows the estimates
of SNP effects. If you didn’t make it to run the genome-wide analysis,
copy /data/module5/sim/bayesr.snpRes
to your current
folder.
cat bayesr.parRes
NumSnp1-NumSnp4
are the numbers of SNPs in the mixture
components 1 to 4 (component 1: zero effect; component 2: small effects
whose variance = 0.01 \(\times
\sigma^2_\beta\); component 3: medium effects whose variance =
0.1 \(\times \sigma^2_\beta\);
component 4: large effects whose variance = 1 \(\times \sigma^2_\beta\)). hsq
is the SNP-based heritability estimate. We set chain length to be 1000
here for demonstration purpose. In practice, we recommend to run at
least 3,000 iterations. It is common to discard the first 20-40% of the
samples (e.g., 1,000) as burn-in.
Question: Are the estimates from the genome-wide BayesR analysis consistent with the true values in simulation? Note that we simulated 100 causal variants with a trait heritability of 0.5.
Let’s also have a look at the SNP effect results:
head bayesr.snpRes
We then use PLINK to perform prediction:
target="/data/module5/ukb/target"
snpRes="bayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.bayesr
Use the R script to evaluate the prediction accuracy
phenFile="/data/module5/ukb/target.phen"
covFile="/data/module5/ukb/target.cov"
prsFile="target.bayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question 6: How does the prediction accuracy from BayesR compare to that from C+PT?