1 Objectives

In this practical, we will learn how to perform polygenic prediction using Bayesian methods with individual genotypes and phenotypes. As discussed in the lecture, BLUP can be regarded as a Bayesian model assuming a normal prior distribution for the SNP effects. With different prior assumptions, especially with a mixture prior, Bayesian methods are flexible in accounting for the genetic architecture of the trait. We will explore the principle of Bayesian methods using a small data set and R script, and then will employ GCTB (https://cnsgenomics.com/software/gctb/#Overview) to perform the UKB data analysis.

Note: the small data set can be freely downloaded. The UK Biobank data set cannot be downloaded because it contains real individual genotypes, although individual IDs are masked.

2 Data

2.1 Small data set

This toy data set includes:

  • A training population (discovery GWAS) of 325 individuals, genotyped for 10 SNPs.
  • A validation population of 31 individuals (hold-out sample).

The trait was simulated such that SNP 1 has an effect size of 2 and SNP 5 has an effect size of 1. The trait heritability is set at 0.5.

Data location:

ls /data/module5/toy/

2.2 Large data set

This data set is based on real genotypes at 273,604 SNPs from the UK Biobank, with a simulated trait affected by 100 causal variants and heritability 0.5. The samples are split into three independent cohorts:

  • Discovery population for conducting GWAS (n = 3,000).
  • Testing population for finding the best prediction model (n = 500).
  • Target population for prediction (n = 100).

Data location:

ls /data/module5/ukb/

3 Understanding the algorithm

This section helps you understand the R code used in the method.

Different Bayesian methods vary only in the prior specification for SNP effects. Here, we focus on two Bayesian methods: BayesC and BayesR. BayesC is a special case of BayesR with a two-component mixture. The algorithm for BayesR (i.e., Markov chain Monte Carlo sampling scheme) is implemented in the R script bayesr.R.

cat /data/module5/scr/bayesr.R

Do not run the R code below in this section.

BayesR assumes that each SNP effect follows a mixture of normal distributions. The input parameters include:

# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2){
  ...
}
  • X is the genotype matrix.
  • y is the vector of phenotypes.
  • niter is the number of iterations for MCMC sampling.
  • gamma is a vector of scaling factors for the variances of the mixture components.
  • startPi is a vector of starting values for the mixture proportions \(\pi\).
  • startH2 is the starting value for the SNP-based heritability.

The number of elements in gamma and startPi determines the number of mixture components, and they must match.
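
For example, the number of mixture components is set implicitly by the lengths of these two vectors. The sketch below (using the same values we will adopt later in this practical) specifies a two-component BayesC prior and a four-component BayesR prior:

# Example prior specifications (values used later in this practical)
gamma_bayesc   = c(0, 1)                  # BayesC: point mass at zero + one normal
startPi_bayesc = c(0.8, 0.2)              # must have the same length as gamma_bayesc

gamma_bayesr   = c(0, 0.01, 0.1, 1)       # BayesR: four components of increasing variance
startPi_bayesr = c(0.5, 0.3, 0.15, 0.05)  # must have the same length as gamma_bayesr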

Let’s dive into more details to understand what the code does.

The first step is to declare and initialise variables. This step also adjusts y for the population mean and SNP effects (which are initially set to zero), so that ycorr represents the residuals.

  # Step 1. Declaration and initialisation of variables
  n = nrow(X)          # number of observations
  m = ncol(X)          # number of SNPs
  ndist = length(startPi)  # number of mixture distributions
  pi = startPi         # starting value for pi
  h2 = startH2         # starting value for heritability
  vary = var(y)        # phenotypic variance
  varg = vary*h2       # starting value for genetic variance
  vare = vary*(1-h2)   # starting value for residual variance
  sigmaSq = varg/(m*sum(gamma*pi))    # common factor of SNP effect variance
  nub = 4              # prior degrees of freedom for SNP effect variance
  nue = 4              # prior degrees of freedom for residual variance
  scaleb = (nub-2)/nub*sigmaSq  # prior scale parameter for SNP effect variance
  scalee = (nue-2)/nue*vare     # prior scale parameter for residual variance
  beta = array(0,m)    # vector of SNP effects
  beta_mcmc = matrix(0,niter,m) # MCMC samples of SNP effects
  mu = mean(y)         # overall mean
  xpx = apply(X, 2, crossprod)  ## diagonal elements of X'X
  probDelta = vector("numeric", ndist)
  logDelta = array(0,2)
  keptIter = NULL      # keep MCMC samples
  
  # adjust/correct y for population mean and all SNP effects (which all have zero as initial value)
  ycorr = y - mu 

Then the Gibbs sampling begins. Gibbs sampling iteratively draws each parameter from its full conditional distribution, that is, its posterior distribution conditional on the current values of all the other parameters. Over iterations, the draws converge to samples from the joint posterior distribution.

The first step within each iteration is to sample the population mean (fixed effect) from its full conditional distribution, assuming a flat prior. This distribution is centered at the BLUE (best linear unbiased estimator) solution, with variance equal to the inverse of the coefficient matrix in the “mini” mixed model equation for the mean.

  # MCMC begins
  for (iter in 1:niter){
    # Step 2. Sample the mean from a normal posterior
    ycorr = ycorr + mu   # unadjust y with the old sample
    rhs = sum(ycorr)     # right hand side of the mixed model equation
    invLhs = 1/n         # inverse of left hand side of the mixed model equation
    muHat = invLhs*rhs   # BLUE solution
    mu = rnorm(1, muHat, sqrt(invLhs*vare))   # posterior is a normal distribution
    ycorr = ycorr - mu   # adjust y with the new sample

The next step is to sample the mixture component indicator variable and effect size for each SNP, conditional on the effects of all the other SNPs. We first sample the indicator variable delta for the mixture distribution membership (multinomial posterior), and then sample the SNP effect beta from the corresponding normal distribution or a point mass at zero.

    # Step 3. Sample each SNP effect from a multi-normal mixture distribution
    logPi = log(pi)
    logPiComp = log(1-pi)
    invSigmaSq = 1/(gamma*c(sigmaSq))
    logSigmaSq = log(gamma*c(sigmaSq))
    nsnpDist = rep(0, ndist)  # number of SNPs assigned to each mixture component
    ssq = 0  # weighted sum of squares of SNP effects 
    nnz = 0  # number of nonzero effects
    ghat = array(0,n)  # individual genetic value
    # loop over SNPs
    for (j in 1:m){
      oldSample = beta[j]
      rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample  # right hand side of the mixed model equation
      rhs = rhs/vare
      invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq)        # inverse of left hand side of the mixed model equation
      betaHat = invLhs*c(rhs)                           # BLUP solution
      
      # sample the mixture distribution membership
      logDelta = 0.5*(log(invLhs) - logSigmaSq + betaHat*c(rhs)) + logPi  # log likelihood + prior for nonzero effects
      logDelta[1] = logPi[1]                                              # log likelihood + prior for zero effect
      for (k in 1:ndist) {
        probDelta[k] = 1.0/sum(exp(logDelta - logDelta[k]))   # posterior probability for each distribution membership
      }
      
      delta = sample(1:ndist, 1, prob = probDelta)   # indicator variable for the distribution membership
      nsnpDist[delta] = nsnpDist[delta] + 1
      
      if (delta > 1) {
        beta[j] = rnorm(1, betaHat[delta], sqrt(invLhs[delta]))  # given the distribution membership, the posterior is a normal distribution
        ycorr = ycorr + X[,j]*(oldSample - beta[j])              # update ycorr with the new sample
        ghat = ghat + X[,j]*beta[j]                              
        ssq = ssq + beta[j]^2 / gamma[delta]
        nnz = nnz + 1
      } else {
        if (oldSample) ycorr = ycorr + X[,j]*oldSample           # update ycorr with the new sample which is zero
        beta[j] = 0
      }
    }   
    beta_mcmc[iter,] = beta

Next, we sample the other parameters in the model, including the mixture proportions (\(\pi\)), the SNP effect variance (\(\sigma^2_{\beta}\)), and the residual variance (\(\sigma^2_e\)). The full conditional distribution for \(\pi\) is a Dirichlet distribution (a beta distribution in the two-component case). For \(\sigma^2_{\beta}\) and \(\sigma^2_e\), the full conditionals are scaled inverse chi-square distributions.

    # Step 4. Sample the distribution membership probabilities from a Dirichlet distribution
    pi = rdirichlet(1, nsnpDist + 1)
    
    # Step 5. Sample the SNP effect variance from a scaled inverse chi-square distribution
    sigmaSq = (ssq + nub*scaleb)/rchisq(1, nnz+nub)
    
    # Step 6. Sample the residual variance from a scaled inverse chi-square distribution
    vare = (crossprod(ycorr) + nue*scalee)/rchisq(1, n+nue)

Given the genetic values and the residual variance, we can compute the SNP-based heritability. This is an appealing property of the MCMC approach: given the sampled values of the SNP effects, you can easily compute any statistic of interest.

    # Step 7. Compute the SNP-based heritability
    varg = var(ghat)
    h2  = varg/(varg + vare)

The final step in each iteration is to store the MCMC samples of the parameter values:

    keptIter <- rbind(keptIter,c(pi, nnz, sigmaSq, h2, vare, varg))

After completing all MCMC iterations, we compute posterior means as point estimates.

  colnames(keptIter) <- c(paste0("Pi", 1:length(pi)),"Nnz","SigmaSq","h2","Vare","Varg")
  postMean = apply(keptIter, 2, mean)  # posterior mean of MCMC samples is used as the point estimate for each parameter
  cat("\nPosterior mean:\n")
  print(postMean)
  return(list(par=keptIter, beta=beta_mcmc))

4 Illustration of the principle

Load the data in R.

# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))

# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))

4.1 A Bayesian model with a point-normal prior (BayesC)

We will first analyse this small data set using BayesC.

\[\beta_j \begin{cases} \sim N(0, \sigma^2_\beta) & \text{ with probability } \pi \\ \\ = 0 & \text{ with probability } 1 - \pi \end{cases} \] where the parameter \(\pi\) is estimated from the data. When \(\pi = 1\), the prior collapses to a normal distribution, i.e., \(\beta_j \sim N(0, \sigma^2_\beta)\), which matches the assumption in BLUP.
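
To build intuition about this point-normal prior, here is a minimal sketch (not part of bayesr.R; the values of m, pi0 and sigma2b are illustrative) that simulates SNP effects from it:

# Simulate draws from a point-normal prior (illustrative values)
set.seed(1)
m = 1000                  # number of hypothetical SNPs
pi0 = 0.2                 # probability of a nonzero effect
sigma2b = 1               # variance of the normal component
nonzero = runif(m) < pi0  # component membership
b = ifelse(nonzero, rnorm(m, 0, sqrt(sigma2b)), 0)
hist(b, breaks = 50, main = "Draws from a point-normal prior")

Most draws are exactly zero, with the remainder spread around zero according to the normal component.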

We will run the BayesC model using bayesr.R, as BayesC is a special case of BayesR. Let’s load the script in R.

source("/data/module5/scr/bayesr.R")

Specify the model parameters:

gamma = c(0, 1)
startPi = c(0.8, 0.2)

This assumes that 20% of SNPs have nonzero effects drawn from a normal distribution, while 80% have zero effect. Now run the MCMC (niter and startH2 are left at the default values defined in bayesr.R):

res_bayesc = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
## 
##  iter  100, nnz =    4, sigmaSq =  2.867, h2 =  0.502, vare =  4.368, varg =  4.397 
## 
##  iter  200, nnz =    2, sigmaSq =  2.053, h2 =  0.508, vare =  3.978, varg =  4.107 
## 
##  iter  300, nnz =    4, sigmaSq =  0.410, h2 =  0.536, vare =  3.832, varg =  4.429 
## 
##  iter  400, nnz =    3, sigmaSq =  2.669, h2 =  0.456, vare =  4.224, varg =  3.547 
## 
##  iter  500, nnz =    4, sigmaSq =  3.752, h2 =  0.538, vare =  3.921, varg =  4.569 
## 
##  iter  600, nnz =    4, sigmaSq =  2.695, h2 =  0.547, vare =  3.543, varg =  4.277 
## 
##  iter  700, nnz =    3, sigmaSq =  1.248, h2 =  0.505, vare =  4.146, varg =  4.237 
## 
##  iter  800, nnz =    2, sigmaSq =  1.438, h2 =  0.501, vare =  3.611, varg =  3.631 
## 
##  iter  900, nnz =    7, sigmaSq =  1.439, h2 =  0.520, vare =  3.716, varg =  4.025 
## 
##  iter 1000, nnz =    2, sigmaSq =  2.237, h2 =  0.461, vare =  3.980, varg =  3.406 
## 
## Posterior mean:
##       Pi1       Pi2       Nnz   SigmaSq        h2      Vare      Varg 
## 0.5642392 0.4357608 4.1890000 1.7819829 0.4954345 3.9057432 3.8466146

The output includes the sampled values of key parameters every 100 iterations. For example, Nnz is the number of nonzero effects fitted in the model. Unlike BLUP, which fits every SNP with an effect drawn from the same normal distribution, BayesC allows some SNPs to have exactly zero effect.

After MCMC, you can find the sampled values for the model parameters and SNP effects for each iteration in the result list. For example, you can summarise the posterior mean and standard deviation for each parameter by

colMeans(res_bayesc$par)
##       Pi1       Pi2       Nnz   SigmaSq        h2      Vare      Varg 
## 0.5642392 0.4357608 4.1890000 1.7819829 0.4954345 3.9057432 3.8466146
apply(res_bayesc$par, 2, sd)
##        Pi1        Pi2        Nnz    SigmaSq         h2       Vare       Varg 
## 0.20763892 0.20763892 2.00132064 1.52726282 0.03479936 0.31138084 0.42901071

The posterior mean gives a point estimate, and the standard deviation gives an estimate of uncertainty (the posterior standard error).
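
Because every iteration is stored in res_bayesc$par, any posterior summary can be computed directly from the samples. For example, a sketch of a 90% credible interval for the SNP-based heritability (using the burn-in convention applied later in this section):

# 90% credible interval for h2, discarding the first 100 iterations as burn-in
quantile(res_bayesc$par[100:1000, "h2"], probs = c(0.05, 0.95))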

You can plot parameter traces over iterations to check convergence. For example:

# Trace plot for SNP 1, which is the causal variant with effect size of 2
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,1], xlab="Iteration", ylab="beta[,1]") 
abline(h = 2, col="red")

Question: Use this code to investigate each parameter. Do they fluctuate around the true value?

# Trace plot for SNP 2, which is a null SNP
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,2], xlab="Iteration", ylab="beta[,2]") 
abline(h = 0, col="red")

# Trace plot for SNP 5, which is a causal variant with a smaller effect of 1 and in LD with other SNPs
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,5], xlab="Iteration", ylab="beta[,5]") 
abline(h = 1, col="red")




Answer:

  • For SNP 1, the trace plot fluctuates around 2, consistent with its true effect.
  • For SNP 5, the plot fluctuates around 1, with more variability due to LD with nearby SNPs.
  • For SNP 2 (null SNP), the plot fluctuates around 0.

This indicates good mixing and convergence.
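
Beyond visual inspection, formal diagnostics can also be computed. Here is a sketch using the coda package (attached when MCMCpack is loaded; install it separately if needed):

# Convergence diagnostics with coda (a sketch)
library(coda)
chain = mcmc(res_bayesc$par)     # convert the samples to an mcmc object
effectiveSize(chain)             # effective sample size per parameter
autocorr.plot(chain[, "h2"])     # autocorrelation of the heritability samples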



We can also plot the posterior distribution for each SNP effect. We discard the first 100 iterations as burn-in:

# Posterior distribution of SNP 1 effect
plot(density(res_bayesc$beta[100:1000,1]))

Question: Does the distribution appear to be normal? What about the distributions of the other parameters?

# Posterior distribution of SNP 2 effect
plot(density(res_bayesc$beta[100:1000,2]))


Take a moment to think about it before continuing.




Answer:

  • For causal SNPs, the posterior distribution of effects is approximately normal but may be slightly skewed or truncated due to the mixture prior.
  • For null SNPs, the posterior is often a spike at zero, reflecting the point-mass component of the prior.
  • For other parameters like heritability and variance components, the posteriors are not necessarily symmetric, especially with small sample sizes.


The posterior distribution of a SNP effect can be summarised to quantify the evidence that the SNP is associated with the trait. From the posterior samples, we can calculate the Bayesian 90% credible interval, which represents a 90% probability that the true parameter lies within the interval, given the evidence from the data.

quantile(res_bayesc$beta[100:1000,1], probs=c(0.05, 0.95))
##       5%      95% 
## 1.744243 2.456893

Question: What can we conclude from the Bayesian credible interval?


Take a moment to think about it before continuing.




Answer: The 90% credible interval shows the range of plausible effect sizes for a SNP, given the data and prior. If the interval excludes 0, there’s strong evidence the SNP is associated with the trait. For null SNPs, the interval will typically include 0 or be sharply peaked at zero.


Another way to measure SNP-trait association is to compute the posterior inclusion probability (PIP), the proportion of MCMC samples in which the SNP is fitted with a nonzero effect. PIP is widely used in fine-mapping analyses.

For example, PIP for SNP 1 can be calculated as

mean(res_bayesc$beta[100:1000,1] != 0)
## [1] 1

This could be generalised for all SNPs:

pip = colMeans(res_bayesc$beta[100:1000, ] != 0)
plot(pip, type="h", xlab="SNP", ylab="Posterior inclusion probability")

Question: Which SNPs are likely to be the causal variants based on PIP?


Take a moment to think about it before continuing.




Answer: SNPs with high PIP (close to 1) are likely to be causal (e.g., SNP 1 and SNP 5 in your simulation). A plot of PIP across SNPs will highlight those with strong support. SNPs with PIP near 0 are unlikely to be associated.
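
For example, a one-line sketch to list the SNPs with strong posterior support:

which(pip > 0.9)   # indices of SNPs with PIP above 0.9 (threshold is illustrative)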


To get the SNP effect estimates, we calculate the posterior means of SNP effects:

betaMean = colMeans(res_bayesc$beta)


Question: How do we predict the PGS using the BayesC effect estimates? How does the prediction accuracy compare to that from C+PT or BLUP? What could cause the difference?


Take a moment to think about it before continuing.




Answer:

ghat_bayesc=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesc))$r.squared
## [1] 0.5328998

The \(R^2\) from BayesC reflects the posterior mean effects and typically outperforms BLUP or C+PT under sparse genetic architectures (few large-effect SNPs). The difference arises because BayesC allows effect sizes to be exactly zero, avoiding overfitting to noise.
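
If you want a direct comparison, here is a minimal SNP-BLUP sketch (assuming the simulated heritability of 0.5 and an equal share of the genetic variance per SNP; a ridge-regression approximation, not the BLUP implementation from the lecture):

# Minimal SNP-BLUP sketch: ridge regression with lambda = vare/varb (assumed h2 = 0.5)
h2_assumed = 0.5
m = ncol(X)
vary = c(var(y))                                        # phenotypic variance
lambda = (vary*(1 - h2_assumed)) / (vary*h2_assumed/m)  # residual variance / per-SNP effect variance
Xc = scale(X, center = TRUE, scale = FALSE)             # centre genotypes
yc = y - mean(y)                                        # centre phenotypes
betaBLUP = solve(crossprod(Xc) + diag(lambda, m), crossprod(Xc, yc))
ghat_blup = Xval %*% betaBLUP
summary(lm(yval ~ ghat_blup))$r.squared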

4.2 Bayesian approach using multi-component mixture prior (BayesR)

The BayesR prior assumes that some SNPs have zero effect, some have small effects, and some have large effects, using a mixture of multiple normal distributions that includes a point mass at zero.

\[ \beta_j \sim \sum_k \pi_k N(0, \gamma_k \sigma^2_\beta) \]

For example, we can set a 4-component mixture model:

gamma = c(0, 0.01, 0.1, 1)
startPi = c(0.5, 0.3, 0.15, 0.05)
bayesr.res = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
## 
##  iter  100, nnz =    8, sigmaSq =  1.501, h2 =  0.507, vare =  3.753, varg =  3.856 
## 
##  iter  200, nnz =    9, sigmaSq =  2.367, h2 =  0.512, vare =  3.468, varg =  3.632 
## 
##  iter  300, nnz =    4, sigmaSq =  6.756, h2 =  0.499, vare =  3.714, varg =  3.701 
## 
##  iter  400, nnz =   10, sigmaSq =  5.577, h2 =  0.518, vare =  3.338, varg =  3.590 
## 
##  iter  500, nnz =    9, sigmaSq =  2.954, h2 =  0.454, vare =  3.999, varg =  3.320 
## 
##  iter  600, nnz =    7, sigmaSq =  4.409, h2 =  0.510, vare =  3.872, varg =  4.026 
## 
##  iter  700, nnz =    7, sigmaSq =  1.771, h2 =  0.525, vare =  3.382, varg =  3.732 
## 
##  iter  800, nnz =    8, sigmaSq =  3.197, h2 =  0.528, vare =  3.541, varg =  3.969 
## 
##  iter  900, nnz =   10, sigmaSq =  4.457, h2 =  0.521, vare =  3.881, varg =  4.221 
## 
##  iter 1000, nnz =    7, sigmaSq = 10.287, h2 =  0.512, vare =  3.477, varg =  3.654 
## 
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.2691704 0.2966679 0.2031840 0.2309778 7.1430000 3.7617463 0.4966031 3.8988346 
##      Varg 
## 3.8590990
betaMean = colMeans(bayesr.res$beta)
ghat_bayesr=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesr))$r.squared
## [1] 0.5350486

Question: Is the prediction accuracy improved? Why or why not?


Take a moment to think about it before continuing.




Answer: Not necessarily, as over-parameterization can hurt performance. In this simulation there are only two causal SNPs, so a 4-component mixture offers more flexibility than the data require. For real traits with a polygenic architecture, BayesR is often superior.

5 UKB data analysis using GCTB

GCTB (https://cnsgenomics.com/software/gctb/#Overview) is a C++ software tool for performing Bayesian analyses of large-scale genomic data. It takes PLINK bed files as input and generates estimates of the model parameters and SNP effects for polygenic prediction. You can use the following code to run the BayesR analysis. It will take about 5 minutes to finish. If you are short of time, you can check the results produced by the same code at /data/module5/ukb/bayesr.parRes and /data/module5/ukb/bayesr.snpRes. You can also restrict the analysis to chromosome 1 by adding the flag --chr 1 for a quicker run.

The default MCMC chain length is 3,000 iterations, with the first 1,000 discarded as burn-in. Here, in the interest of time, we set a shorter chain length. For your own real-trait analyses, it is recommended to use the default setting or an even longer chain.

bfile="/data/module5/ukb/gwas"
pheno="/data/module5/ukb/gwas.phen"
covar="/data/module5/ukb/gwas.cov"
gctb --bayes R \
     --bfile $bfile \
     --pheno $pheno \
     --covar $covar \
     --gamma 0,0.01,0.1,1 \
     --pi 0.994,0.003,0.002,0.001 \
     --chain-length 500 \
     --burn-in 200 \
     --out bayesr

Result file bayesr.parRes shows the estimates of the model parameters, and result file bayesr.snpRes shows the estimates of the SNP effects. If you didn’t manage to run the genome-wide analysis, copy /data/module5/ukb/bayesr.snpRes to your current folder.

cat bayesr.parRes

NumSnp1-NumSnp4 are the numbers of SNPs in mixture components 1 to 4 (component 1: zero effect; component 2: small effects with variance 0.01 \(\times \sigma^2_\beta\); component 3: medium effects with variance 0.1 \(\times \sigma^2_\beta\); component 4: large effects with variance 1 \(\times \sigma^2_\beta\)). hsq is the SNP-based heritability estimate. We set the chain length to 500 here for demonstration purposes; in practice, we recommend running at least 3,000 iterations and discarding the first 20-40% of the samples (e.g., 1,000) as burn-in.
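
If you prefer to inspect these estimates in R, here is a minimal sketch (assuming bayesr.parRes is a whitespace-delimited table with a header row; the exact layout may vary across GCTB versions):

parRes = read.table("bayesr.parRes", header = TRUE)   # layout assumed; check the file first
parRes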

Question: Are the estimates from the genome-wide BayesR analysis consistent with the true values in simulation? Note that we simulated 100 causal variants with a trait heritability of 0.5.


Let’s check the SNP effect results:

head bayesr.snpRes

We use PLINK to predict the PGS. Columns 2, 5 and 8 of the snpRes file hold the SNP name, the effect allele, and the effect size estimate, respectively:

target="/data/module5/ukb/target"
snpRes="bayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.bayesr

Evaluate prediction accuracy:

phenFile="/data/module5/ukb/target.phen" 
covFile="/data/module5/ukb/target.cov"
prsFile="target.bayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
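
For reference, here is a hedged sketch of what such an evaluation typically involves (this illustrates the general approach only, not the contents of get_pred_r2.R, which also adjusts for covariates; the file layouts are assumed to follow PLINK conventions):

# Sketch of computing prediction R2 (assumed file layouts; not the actual get_pred_r2.R)
phen = read.table("/data/module5/ukb/target.phen")         # assumed columns: FID IID phenotype
prs  = read.table("target.bayesr.profile", header = TRUE)  # PLINK profile with a SCORESUM column
dat  = merge(phen, prs, by.x = "V2", by.y = "IID")         # match individuals by ID
summary(lm(dat$V3 ~ dat$SCORESUM))$r.squared               # variance explained by the PGS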

Question: How does the prediction accuracy from BayesR compare to that from C+PT?