In this practical, we will learn how to perform polygenic prediction using Bayesian methods with individual-level genotypes and phenotypes. As discussed in the lecture, BLUP can be regarded as a Bayesian model assuming a normal prior distribution for the SNP effects. With different prior assumptions, especially a mixture prior, Bayesian methods are flexible in accommodating the genetic architecture of the trait. We will explore the principles of Bayesian methods using a small data set and an R script, and then employ GCTB (https://cnsgenomics.com/software/gctb/#Overview) to perform the UKB data analysis.
Note: the small data set can be freely downloaded. The UK Biobank data set cannot be downloaded because it contains real individual genotypes, although individual IDs are masked.
This toy data set includes genotypes and phenotypes for a training population and a validation population. The trait was simulated such that SNP 1 has an effect size of 2 and SNP 5 has an effect size of 1. The trait heritability is set at 0.5.
Data location:
ls /data/module5/toy/
This data set is based on real genotypes at 273,604 SNPs from the UK Biobank, with a simulated trait affected by 100 causal variants and heritability 0.5. The samples are split into three independent cohorts:
Data location:
ls /data/module5/ukb/
This section helps you understand the R code used in the method.
Different Bayesian methods vary only in the prior specification for SNP effects. Here, we focus on two Bayesian methods: BayesC and BayesR. BayesC is a special case of BayesR with a two-component mixture. The algorithm for BayesR (i.e., the Markov chain Monte Carlo sampling scheme) is implemented in the R script bayesr.R.
cat /data/module5/scr/bayesr.R
Do not run the R code below in this section.
BayesR assumes each SNP effect follows a mixture of normal distributions. The input parameters include:
# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2){
...
}
- niter is the number of iterations for MCMC sampling
- X is the genotype matrix
- y is the vector of phenotypes
- gamma is a vector of scaling factors for the variances of the mixture components
- startPi is a vector of starting values for the mixture proportions \(\pi\)
- startH2 is the starting value of the SNP-based heritability

The number of elements in gamma and startPi determines the number of mixture components, and the two vectors must have the same length, as the sketch below illustrates.
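For instance, a minimal sanity check on these two arguments (the values below are illustrative assumptions, not recommendations):
# Illustrative argument sets for a three-component mixture (assumed values)
gamma = c(0, 0.1, 1) # scaling factors: zero, medium and large effects
startPi = c(0.90, 0.08, 0.02) # starting mixture proportions; same length as gamma
stopifnot(length(gamma) == length(startPi)) # lengths must match
stopifnot(abs(sum(startPi) - 1) < 1e-8) # proportions must sum to one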
Let’s dive into more details to understand what the code does.
The first step is to declare and initialise variables. This step also adjusts y for the population mean and the SNP effects (which are initially set to zero), so that ycorr represents the residuals.
# Step 1. Declaration and initialisation of variables
n = nrow(X) # number of observations
m = ncol(X) # number of SNPs
ndist = length(startPi) # number of mixture distributions
pi = startPi # starting value for pi
h2 = startH2 # starting value for heritability
vary = var(y) # phenotypic variance
varg = vary*h2 # starting value for genetic variance
vare = vary*(1-h2) # starting value for residual variance
sigmaSq = varg/(m*sum(gamma*pi)) # common factor of SNP effect variance
nub = 4 # prior degrees of freedom for SNP effect variance
nue = 4 # prior degrees of freedom for residual variance
scaleb = (nub-2)/nub*sigmaSq # prior scale parameter for SNP effect variance
scalee = (nue-2)/nue*vare # prior scale parameter for residual variance
beta = array(0,m) # vector of SNP effects
beta_mcmc = matrix(0,niter,m) # MCMC samples of SNP effects
mu = mean(y) # overall mean
xpx = apply(X, 2, crossprod) # diagonal elements of X'X
probDelta = vector("numeric", ndist)
logDelta = array(0,2)
keptIter = NULL # keep MCMC samples
# adjust/correct y for population mean and all SNP effects (which all have zero as initial value)
ycorr = y - mu
Then the Gibbs sampling begins. Gibbs sampling iteratively draws a sample of each parameter from its full conditional distribution (i.e., the posterior distribution conditional on the current values of all other parameters). Over iterations, the samples converge to draws from the joint posterior distribution.
The first step within each iteration is to sample the population mean (fixed effect) from its full conditional distribution, assuming a flat prior. This distribution is centered at the BLUE (best linear unbiased estimator) solution, with variance equal to the inverse of the coefficient matrix in the “mini” mixed model equation for the mean.
# MCMC begins
for (iter in 1:niter){
# Step 2. Sample the mean from a normal posterior
ycorr = ycorr + mu # unadjust y with the old sample
rhs = sum(ycorr) # right hand side of the mixed model equation
invLhs = 1/n # inverse of left hand side of the mixed model equation
muHat = invLhs*rhs # BLUE solution
mu = rnorm(1, muHat, sqrt(invLhs*vare)) # posterior is a normal distribution
ycorr = ycorr - mu # adjust y with the new sample
The next step is to sample the mixture component indicator variable and the effect size for each SNP, conditional on the effects of all the other SNPs. We first sample the indicator variable delta for the mixture distribution membership (which has a multinomial posterior), and then sample the SNP effect beta from the corresponding normal distribution or a point mass at zero.
# Step 3. Sample each SNP effect from a multi-normal mixture distribution
logPi = log(pi)
logPiComp = log(1-pi)
invSigmaSq = 1/(gamma*c(sigmaSq))
logSigmaSq = log(gamma*c(sigmaSq))
nsnpDist = rep(0, ndist) # number of SNPs assigned to each mixture component
ssq = 0 # weighted sum of squares of SNP effects
nnz = 0 # number of nonzero effects
ghat = array(0,n) # individual genetic value
# loop over SNPs
for (j in 1:m){
oldSample = beta[j]
rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample # right hand side of the mixed model equation
rhs = rhs/vare
invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq) # inverse of left hand side of the mixed model equation
betaHat = invLhs*c(rhs) # BLUP solution
# sample the mixture distribution membership
logDelta = 0.5*(log(invLhs) - logSigmaSq + betaHat*c(rhs)) + logPi # log likelihood + prior for nonzero effects
logDelta[1] = logPi[1] # log likelihood + prior for zero effect
for (k in 1:ndist) {
probDelta[k] = 1.0/sum(exp(logDelta - logDelta[k])) # posterior probability for each distribution membership
}
delta = sample(1:ndist, 1, prob = probDelta) # indicator variable for the distribution membership
nsnpDist[delta] = nsnpDist[delta] + 1
if (delta > 1) {
beta[j] = rnorm(1, betaHat[delta], sqrt(invLhs[delta])) # given the distribution membership, the posterior is a normal distribution
ycorr = ycorr + X[,j]*(oldSample - beta[j]) # update ycorr with the new sample
ghat = ghat + X[,j]*beta[j]
ssq = ssq + beta[j]^2 / gamma[delta]
nnz = nnz + 1
} else {
if (oldSample) ycorr = ycorr + X[,j]*oldSample # update ycorr with the new sample which is zero
beta[j] = 0
}
}
beta_mcmc[iter,] = beta
Next, we sample the other parameters in the model, including the mixing probabilities (\(\pi\)), the SNP effect variance (\(\sigma^2_{\beta}\)), and the residual variance (\(\sigma^2_e\)). The full conditional distribution for \(\pi\) is a Dirichlet distribution (which reduces to a beta distribution in the two-component case), while those for \(\sigma^2_{\beta}\) and \(\sigma^2_e\) are scaled inverse chi-squared distributions.
# Step 4. Sample the distribution membership probabilities from a Dirichlet distribution
pi = rdirichlet(1, nsnpDist + 1)
# Step 5. Sample the SNP effect variance from a scaled inverse chi-square distribution
sigmaSq = (ssq + nub*scaleb)/rchisq(1, nnz+nub)
# Step 6. Sample the residual variance from a scaled inverse chi-square distribution
vare = (crossprod(ycorr) + nue*scalee)/rchisq(1, n+nue)
Given the genetic values and the residual variance, we can compute the SNP-based heritability. This is an appealing property of the MCMC approach: given the sampled values of the SNP effects, you can easily compute any statistic of interest.
# Step 7. Compute the SNP-based heritability
varg = var(ghat)
h2 = varg/(varg + vare)
The final step in each iteration is to store the MCMC samples of the parameter values:
keptIter <- rbind(keptIter,c(pi, nnz, sigmaSq, h2, vare, varg))
After completing all MCMC iterations, we compute posterior means as point estimates.
colnames(keptIter) <- c(paste0("Pi", 1:length(pi)),"Nnz","SigmaSq","h2","Vare","Varg")
postMean = apply(keptIter, 2, mean) # posterior mean of MCMC samples is used as the point estimate for each parameter
cat("\nPosterior mean:\n")
print(postMean)
return(list(par=keptIter, beta=beta_mcmc))
Load the data in R.
# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))
# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))
We will first analyse this small data set using BayesC, which assumes a spike-and-slab prior for each SNP effect:
\[\beta_j \begin{cases} \sim N(0, \sigma^2_\beta) & \text{ with probability } \pi \\ \\ = 0 & \text{ with probability } 1 - \pi \end{cases} \] where parameter \(\pi\) is estimated from the data. When \(\pi = 1\), the prior collapses to a normal distribution, i.e., \(\beta_j \sim N(0, \sigma^2_\beta)\), which matches the assumption in BLUP.
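To build intuition, here is a minimal sketch that draws SNP effects directly from this prior; the values of m, \(\pi\) and \(\sigma^2_\beta\) are assumptions chosen for illustration:
# Simulate draws from the BayesC spike-and-slab prior (illustrative values)
set.seed(1)
m = 1000 # number of SNPs (assumed)
pi = 0.2 # probability of a nonzero effect (assumed)
sigmaSq_beta = 1 # variance of the normal component (assumed)
nonzero = rbinom(m, 1, pi) == 1 # which SNPs fall in the normal component
beta = ifelse(nonzero, rnorm(m, 0, sqrt(sigmaSq_beta)), 0)
hist(beta, breaks = 50, main = "Draws from a BayesC prior")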
We will run the BayesC model using bayesr.R, as BayesC is a special case of BayesR. Let's load the script in R.
source("/data/module5/scr/bayesr.R")
## Warning: package 'MCMCpack' was built under R version 4.4.1
Specify the model parameters:
gamma = c(0, 1)
startPi = c(0.8, 0.2)
This assumes that 20% of SNPs have nonzero effects drawn from a normal distribution, while 80% have zero effect. Now run the MCMC:
res_bayesc = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
##
## iter 100, nnz = 4, sigmaSq = 2.867, h2 = 0.502, vare = 4.368, varg = 4.397
##
## iter 200, nnz = 2, sigmaSq = 2.053, h2 = 0.508, vare = 3.978, varg = 4.107
##
## iter 300, nnz = 4, sigmaSq = 0.410, h2 = 0.536, vare = 3.832, varg = 4.429
##
## iter 400, nnz = 3, sigmaSq = 2.669, h2 = 0.456, vare = 4.224, varg = 3.547
##
## iter 500, nnz = 4, sigmaSq = 3.752, h2 = 0.538, vare = 3.921, varg = 4.569
##
## iter 600, nnz = 4, sigmaSq = 2.695, h2 = 0.547, vare = 3.543, varg = 4.277
##
## iter 700, nnz = 3, sigmaSq = 1.248, h2 = 0.505, vare = 4.146, varg = 4.237
##
## iter 800, nnz = 2, sigmaSq = 1.438, h2 = 0.501, vare = 3.611, varg = 3.631
##
## iter 900, nnz = 7, sigmaSq = 1.439, h2 = 0.520, vare = 3.716, varg = 4.025
##
## iter 1000, nnz = 2, sigmaSq = 2.237, h2 = 0.461, vare = 3.980, varg = 3.406
##
## Posterior mean:
## Pi1 Pi2 Nnz SigmaSq h2 Vare Varg
## 0.5642392 0.4357608 4.1890000 1.7819829 0.4954345 3.9057432 3.8466146
The output includes sampled values of key parameters every 100 iterations. For example, Nnz is the number of nonzero effects fitted in the model. Unlike BLUP, which assumes all SNPs contribute equally to the trait, BayesC allows some SNPs to have zero effect.
After MCMC, you can find the sampled values for the model parameters and SNP effects for each iteration in the result list. For example, you can summarise the posterior mean and standard deviation for each parameter by
colMeans(res_bayesc$par)
## Pi1 Pi2 Nnz SigmaSq h2 Vare Varg
## 0.5642392 0.4357608 4.1890000 1.7819829 0.4954345 3.9057432 3.8466146
apply(res_bayesc$par, 2, sd)
## Pi1 Pi2 Nnz SigmaSq h2 Vare Varg
## 0.20763892 0.20763892 2.00132064 1.52726282 0.03479936 0.31138084 0.42901071
The posterior mean gives a point estimate, and the standard deviation gives an estimate of uncertainty (posterior standard error).
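Because the full set of MCMC samples is returned, any other posterior summary is just as easy to compute. For example (a sketch using the column names defined in the script):
# Posterior probability that the SNP-based heritability exceeds 0.4
mean(res_bayesc$par[, "h2"] > 0.4)
# Posterior distribution of the number of nonzero effects in the model
hist(res_bayesc$par[, "Nnz"], main = "Posterior of Nnz", xlab = "Nnz")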
You can plot parameter traces over iterations to check convergence. For example:
# Trace plot for SNP 1, which is the causal variant with effect size of 2
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,1], xlab="Iteration", ylab="beta[,1]")
abline(h = 2, col="red")
Question: Use this code to investigate each parameter. Do they fluctuate around the true value?
# Trace plot for SNP 2, which is a null SNP
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,2], xlab="Iteration", ylab="beta[,2]")
abline(h = 0, col="red")
# Trace plot for SNP 5, which is a causal variant with a smaller effect of 1 and in LD with other SNPs
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,5], xlab="Iteration", ylab="beta[,5]")
abline(h = 1, col="red")
Answer: The sampled values fluctuate around the true effect sizes (2 for SNP 1, 0 for SNP 2, and 1 for SNP 5). This indicates good mixing and convergence.
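For a more quantitative convergence check, you can compute the effective sample size of a parameter chain; the sketch below uses the coda package (assumed to be available, as MCMCpack depends on it):
# Effective sample size of the heritability chain; values far below the
# number of iterations indicate strong autocorrelation (poor mixing)
library(coda)
effectiveSize(as.mcmc(res_bayesc$par[, "h2"]))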
We can also plot the posterior distribution of each SNP effect. We discard the first 100 iterations as burn-in:
# Posterior distribution of SNP 1 effect
plot(density(res_bayesc$beta[100:1000,1]))
Question: Does the distribution appear to be normal? What about the distributions of the other parameters?
# Posterior distribution of SNP 2 effect
plot(density(res_bayesc$beta[100:1000,2]))
Take a moment to think about it before continuing.
Answer: For SNP 1, the posterior distribution is approximately normal, centred near its true effect size. For a null SNP such as SNP 2, the posterior is a mixture of a point mass at zero and a normal component, so it does not look normal.
The posterior distribution of a SNP effect can be summarised to quantify the evidence that the SNP is associated with the trait. From the posterior samples, we can calculate the Bayesian 90% credible interval, which represents a 90% probability that the true parameter lies within the interval, given the evidence from the data.
quantile(res_bayesc$beta[100:1000,1], probs=c(0.05, 0.95))
## 5% 95%
## 1.744243 2.456893
Question: What can we conclude from the Bayesian credible interval?
Take a moment to think about it before continuing.
Answer: The 90% credible interval shows the range of plausible effect sizes for a SNP, given the data and prior. If the interval excludes 0, there’s strong evidence the SNP is associated with the trait. For null SNPs, the interval will typically include 0 or be sharply peaked at zero.
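As a sketch, credible intervals for all SNPs can be computed in one call, making it easy to scan for intervals that exclude zero:
# 90% credible intervals for every SNP effect (discarding burn-in)
ci = apply(res_bayesc$beta[100:1000, ], 2, quantile, probs = c(0.05, 0.95))
ci[, 1:5] # intervals for the first five SNPs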
Another way to measure SNP-trait association is to compute the posterior inclusion probability (PIP), i.e., the probability that the SNP is fitted with a nonzero effect in the model across MCMC samples. It is widely used in fine-mapping analyses.
For example, PIP for SNP 1 can be calculated as
mean(res_bayesc$beta[100:1000,1] != 0)
## [1] 1
This could be generalised for all SNPs:
pip = colMeans(res_bayesc$beta[100:1000, ] != 0)
plot(pip, type="h", xlab="SNP", ylab="Posterior inclusion probability")
Question: Which SNPs are likely to be the causal variants based on PIP?
Take a moment to think about it before continuing.
Answer: SNPs with high PIP (close to 1) are likely to be causal (e.g., SNP 1 and SNP 5 in your simulation). A plot of PIP across SNPs will highlight those with strong support. SNPs with PIP near 0 are unlikely to be associated.
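For example, a one-line helper (a sketch, assuming pip has been computed as above) to flag SNPs with strong posterior support:
# Indices of SNPs exceeding an arbitrary PIP threshold of 0.9 (assumed cutoff)
which(pip > 0.9)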
To get the SNP effect estimates, we calculate the posterior means of SNP effects:
betaMean = colMeans(res_bayesc$beta)
Question: How do we predict the PGS using the BayesC effect estimates? How does the prediction accuracy compare to that from C+PT or BLUP? What could cause the difference?
Take a moment to think about it before continuing.
Answer:
ghat_bayesc=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesc))$r.squared
## [1] 0.5328998
The \(R^2\) from BayesC is based on the posterior mean effects, and BayesC typically outperforms BLUP or C+PT under sparse genetic architectures (few large-effect SNPs). The difference arises because BayesC allows effect sizes to be exactly zero, avoiding overfitting noise.
The BayesR prior assumes that some SNPs have zero effect, some have small effects, and some have large effects, through a mixture of multiple normal distributions that includes a point mass at zero.
\[ \beta_j \sim \sum_k \pi_k N(0, \gamma_k \sigma^2_\beta) \]
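To see what this prior implies, the sketch below draws SNP effects from such a mixture; the gamma, \(\pi_k\) and \(\sigma^2_\beta\) values are assumptions for illustration:
# Simulate draws from a 4-component BayesR mixture prior (illustrative values)
set.seed(1)
gamma = c(0, 0.01, 0.1, 1) # component-specific variance scaling factors
pi = c(0.5, 0.3, 0.15, 0.05) # mixture proportions (assumed)
sigmaSq_beta = 1 # base effect variance (assumed)
m = 1000 # number of SNPs (assumed)
k = sample(1:4, m, replace = TRUE, prob = pi) # component membership per SNP
beta = rnorm(m, 0, sqrt(gamma[k] * sigmaSq_beta)) # exactly zero when gamma[k] == 0
hist(beta, breaks = 100, main = "Draws from a BayesR prior")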
For example, we can set a 4-component mixture model:
gamma = c(0, 0.01, 0.1, 1)
startPi = c(0.5, 0.3, 0.15, 0.05)
bayesr.res = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
##
## iter 100, nnz = 8, sigmaSq = 1.501, h2 = 0.507, vare = 3.753, varg = 3.856
##
## iter 200, nnz = 9, sigmaSq = 2.367, h2 = 0.512, vare = 3.468, varg = 3.632
##
## iter 300, nnz = 4, sigmaSq = 6.756, h2 = 0.499, vare = 3.714, varg = 3.701
##
## iter 400, nnz = 10, sigmaSq = 5.577, h2 = 0.518, vare = 3.338, varg = 3.590
##
## iter 500, nnz = 9, sigmaSq = 2.954, h2 = 0.454, vare = 3.999, varg = 3.320
##
## iter 600, nnz = 7, sigmaSq = 4.409, h2 = 0.510, vare = 3.872, varg = 4.026
##
## iter 700, nnz = 7, sigmaSq = 1.771, h2 = 0.525, vare = 3.382, varg = 3.732
##
## iter 800, nnz = 8, sigmaSq = 3.197, h2 = 0.528, vare = 3.541, varg = 3.969
##
## iter 900, nnz = 10, sigmaSq = 4.457, h2 = 0.521, vare = 3.881, varg = 4.221
##
## iter 1000, nnz = 7, sigmaSq = 10.287, h2 = 0.512, vare = 3.477, varg = 3.654
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.2691704 0.2966679 0.2031840 0.2309778 7.1430000 3.7617463 0.4966031 3.8988346
## Varg
## 3.8590990
betaMean = colMeans(bayesr.res$beta)
ghat_bayesr=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesr))$r.squared
## [1] 0.5350486
Question: Is the prediction accuracy improved? Why or why not?
Take a moment to think about it before continuing.
Answer: Not necessarily, as over-parameterization can hurt performance. In this simulation, there are only two causal SNPs, so a 4-component mixture is unnecessary (overkill). In real traits with a polygenic architecture, BayesR is often superior.
GCTB (https://cnsgenomics.com/software/gctb/#Overview) is a C++ software tool for performing Bayesian analysis of large-scale genomic data. It takes a PLINK bed file as input and generates estimates of model parameters and SNP effects for polygenic prediction. You can use the following code to run the BayesR analysis. It will take about 5 minutes to finish. If you are short of time, you can check the results produced by the same code at /data/module5/ukb/bayesr.parRes and /data/module5/ukb/bayesr.snpRes. You can also run chromosome 1 only by adding the flag --chr 1 for a quick run.
The default MCMC chain length is 3,000 iterations, with the first 1,000 discarded as burn-in. Here, in the interest of time, we set a shorter chain length. In your own real-trait analysis, it is recommended to use the default setting or an even longer chain.
bfile="/data/module5/ukb/gwas"
pheno="/data/module5/ukb/gwas.phen"
covar="/data/module5/ukb/gwas.cov"
gctb --bayes R \
--bfile $bfile \
--pheno $pheno \
--covar $covar \
--gamma 0,0.01,0.1,1 \
--pi 0.994,0.003,0.002,0.001 \
--chain-length 500 \
--burn-in 200 \
--out bayesr
Result file bayesr.parRes shows the estimates of model parameters, and result file bayesr.snpRes shows the estimates of SNP effects. If you didn't manage to run the genome-wide analysis, copy /data/module5/sim/bayesr.snpRes to your current folder.
cat bayesr.parRes
NumSnp1-NumSnp4 are the numbers of SNPs in mixture components 1 to 4 (component 1: zero effect; component 2: small effects with variance \(0.01 \times \sigma^2_\beta\); component 3: medium effects with variance \(0.1 \times \sigma^2_\beta\); component 4: large effects with variance \(1 \times \sigma^2_\beta\)). hsq is the SNP-based heritability estimate. We set the chain length to 500 here for demonstration purposes. In practice, we recommend running at least 3,000 iterations, and it is common to discard the first 20-40% of the samples (e.g., 1,000) as burn-in.
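You can also load the parameter estimates into R for further inspection; a sketch, assuming bayesr.parRes is whitespace-delimited with a header row:
# Read the GCTB parameter estimates (file layout assumed; adjust as needed)
parRes = read.table("bayesr.parRes", header = TRUE)
parRes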
Question: Are the estimates from the genome-wide BayesR analysis consistent with the true values in simulation? Note that we simulated 100 causal variants with a trait heritability of 0.5.
Let’s check the SNP effect results:
head bayesr.snpRes
We use PLINK to predict the PGS, reading the SNP identifier, effect allele, and effect size from columns 2, 5, and 8 of the snpRes file:
target="/data/module5/ukb/target"
snpRes="bayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.bayesr
Evaluate prediction accuracy:
phenFile="/data/module5/ukb/target.phen"
covFile="/data/module5/ukb/target.cov"
prsFile="target.bayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
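To see what such an evaluation involves, here is a rough R sketch of the idea; the file layouts (a PLINK-style .phen file with FID, IID and phenotype columns; a SCORESUM column in the .profile file) are assumptions, and get_pred_r2.R may additionally adjust for covariates:
# A rough sketch of evaluating prediction R2 (file formats assumed)
phen = read.table("/data/module5/ukb/target.phen") # columns: FID, IID, phenotype
prs = read.table("target.bayesr.profile", header = TRUE) # PLINK --score output
dat = merge(phen, prs, by.x = "V2", by.y = "IID") # match individuals by ID
summary(lm(dat$V3 ~ dat$SCORESUM))$r.squared # variance explained by the PGS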
Question: How does the prediction accuracy from BayesR compare to that from C+PT?