1 Objectives

In this practical you will learn how to perform polygenic prediction using Bayesian methods with individual genotypes and phenotypes. We will use the same data sets as in the BLUP session, i.e., a small data set with 10 SNPs and simulated phenotypes for 325 individuals, and a larger data set of 277,719 SNPs and 3,000 GWAS individuals from the UK Biobank (UKB). As discussed in the lecture, BLUP can be regarded as a Bayesian model assuming a normal prior distribution for the SNP effects. With different prior assumptions, especially with a mixture prior, Bayesian methods are flexible in accommodating the genetic architecture of the trait. We will explore the principles of various Bayesian methods using the toy example and an R script, and then employ GCTB to perform the UKB data analysis.

Note that you are free to download the small data set to your own laptop if you want. However, the UKB data set must not be downloaded, because it contains real individual genotypes from the UK Biobank (although the individual IDs have been masked).

2 Data

2.1 Small data set

This is a toy example consisting of a training population (aka GWAS population, discovery population, or reference population) of 325 individuals, each genotyped at 10 SNPs. The trait was simulated such that the first SNP has an effect size of 2 and the fifth SNP has an effect size of 1 on the phenotype. The trait heritability is 0.5. The validation population (aka target population) is a hold-out sample of 31 individuals.

Path to Data set 1:

ls /data/module5/toy/

2.2 Large data set

This is a larger data set based on real genotypes at 273,604 SNPs from the UK Biobank. We simulated a trait with 100 causal variants and a heritability of 0.5. The entire data set was partitioned into three independent populations:

  • Discovery population for conducting GWAS (n = 3,000).
  • Testing population for finding the best prediction model (n = 500).
  • Target population to be predicted (n = 1,000).

Path to Data set 2:

ls /data/module5/ukb/

3 Understanding the algorithm

This section is designed to help you understand the R code used in the method. You may skip it for now and return to it later if you prefer.

The following code is shown for explanatory purposes as part of the method. There is no need to run the code in this section.

Different Bayesian methods differ only in the prior specification for SNP effects. Here, we will focus on two Bayesian methods, BayesC and BayesR, where BayesC is a special case of BayesR with a two-component mixture. The algorithm for BayesR (i.e., the Markov chain Monte Carlo sampling scheme) has been implemented in an R script, bayesr.R. Let’s load it in R and have a look.

source("/data/module5/scr/bayesr.R")

The script includes a function called bayesr. BayesR is a general model in which each SNP effect is assumed to follow a mixture of normal distributions. The input parameters are

# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2){
  ...
}

where niter is the number of iterations for MCMC sampling, X is the genotype matrix, y is the vector of phenotypes, gamma is a vector of scaling factors for the variances of the mixture components, startPi is a vector of starting values for the \(\pi\) parameter, and startH2 is the starting value of the SNP-based heritability. Note that the numbers of elements in gamma and startPi define the number of mixture components, and therefore must match.

Let’s look in more detail at what the code is doing. Do not run the code below in your R session.

The first step is to declare and initialise the variables. In addition, y is adjusted for the population mean and all SNP effects (which are initialised to zero), so that ycorr is analogous to a vector of residuals.

  # Step 1. Declaration and initialisation of variables
  n = nrow(X)          # number of observations
  m = ncol(X)          # number of SNPs
  ndist = length(startPi)  # number of mixture distributions
  pi = startPi         # starting value for pi
  h2 = startH2         # starting value for heritability
  vary = var(y)        # phenotypic variance
  varg = vary*h2       # starting value for genetic variance
  vare = vary*(1-h2)   # starting value for residual variance
  sigmaSq = varg/(m*sum(gamma*pi))    # common factor of SNP effect variance
  nub = 4              # prior degrees of freedom for SNP effect variance
  nue = 4              # prior degrees of freedom for residual variance
  scaleb = (nub-2)/nub*sigmaSq  # prior scale parameter for SNP effect variance
  scalee = (nue-2)/nue*vare     # prior scale parameter for residual variance
  beta = array(0,m)    # vector of SNP effects
  beta_mcmc = matrix(0,niter,m) # MCMC samples of SNP effects
  mu = mean(y)         # overall mean
  xpx = apply(X, 2, crossprod)  # diagonal elements of X'X
  probDelta = vector("numeric", ndist)
  logDelta = array(0,2)
  keptIter = NULL      # keep MCMC samples
  
  # adjust/correct y for population mean and all SNP effects (which all have zero as initial value)
  ycorr = y - mu 
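
A note on the prior scale parameters above: for a scaled inverse chi-squared distribution with degrees of freedom \(\nu\) and scale \(S\), the mean is \(\nu S/(\nu - 2)\). The code therefore sets

\[ S_\beta = \frac{\nu_\beta - 2}{\nu_\beta}\sigma^2_\beta, \qquad S_e = \frac{\nu_e - 2}{\nu_e}\sigma^2_e, \]

so that the prior mean of each variance component equals its starting value (this is what the lines defining scaleb and scalee do).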

Then the Gibbs sampling begins. The principle of Gibbs sampling is that we sample each parameter from its full conditional distribution, i.e., its posterior distribution conditional on the current values of all other parameters. By iterating this, we obtain a Markov chain whose stationary distribution is the joint posterior distribution. Within each iteration, the first step is to sample the population mean (a fixed effect) from its full conditional distribution, given a constant prior. As you can see from the code, the full conditional distribution is centered at the BLUE solution, with variance equal to the inverse of the coefficient matrix in the “mini” mixed model equation for the mean.
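
In symbols, writing \(\tilde{y}\) for ycorr after adding back the previous sample of \(\mu\), the full conditional distribution sampled in Step 2 is

\[ \mu \mid \cdot \sim N\left(\hat{\mu}, \frac{\sigma^2_e}{n}\right), \qquad \hat{\mu} = \frac{\mathbf{1}'\tilde{y}}{n}, \]

which corresponds to rhs, invLhs, and the rnorm call in the code below.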

  # MCMC begins
  for (iter in 1:niter){
    # Step 2. Sample the mean from a normal posterior
    ycorr = ycorr + mu   # unadjust y with the old sample
    rhs = sum(ycorr)     # right hand side of the mixed model equation
    invLhs = 1/n         # inverse of left hand side of the mixed model equation
    muHat = invLhs*rhs   # BLUE solution
    mu = rnorm(1, muHat, sqrt(invLhs*vare))   # posterior is a normal distribution
    ycorr = ycorr - mu   # adjust y with the new sample

The next step is to sample the distribution membership and effect size for each SNP, conditional on the effects of all other SNPs. We first sample the indicator variable delta for the mixture-distribution membership, and then sample the SNP effect beta from either a normal distribution or a point mass at zero. The likelihood is normal, so the full conditional distribution of the effect given its membership is also normal. The full conditional distribution for the membership is a multinomial (categorical) distribution.
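
In symbols, writing \(\tilde{y}_j\) for ycorr with the previous sample of \(\beta_j\) added back, the full conditional distribution for a SNP assigned to a nonzero mixture component \(k\) is

\[ \beta_j \mid \delta_j = k, \cdot \sim N\left(\hat{\beta}_{jk},\, l_{jk}^{-1}\right), \qquad l_{jk} = \frac{\mathbf{x}_j'\mathbf{x}_j}{\sigma^2_e} + \frac{1}{\gamma_k \sigma^2_\beta}, \qquad \hat{\beta}_{jk} = l_{jk}^{-1}\frac{\mathbf{x}_j'\tilde{y}_j}{\sigma^2_e}, \]

which correspond to invLhs, betaHat, and rhs in the code below.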

    # Step 3. Sample each SNP effect from a multi-normal mixture distribution
    logPi = log(pi)
    logPiComp = log(1-pi)
    invSigmaSq = 1/(gamma*c(sigmaSq))
    logSigmaSq = log(gamma*c(sigmaSq))
    nsnpDist = rep(0, ndist)  # number of SNPs assigned to each mixture component
    ssq = 0  # weighted sum of squares of SNP effects 
    nnz = 0  # number of nonzero effects
    ghat = array(0,n)  # individual genetic value
    # loop over SNPs
    for (j in 1:m){
      oldSample = beta[j]
      rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample  # right hand side of the mixed model equation
      rhs = rhs/vare
      invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq)        # inverse of left hand side of the mixed model equation
      betaHat = invLhs*c(rhs)                           # BLUP solution
      
      # sample the mixture distribution membership
      logDelta = 0.5*(log(invLhs) - logSigmaSq + betaHat*c(rhs)) + logPi  # log likelihood + prior for nonzero effects
      logDelta[1] = logPi[1]                                              # log likelihood + prior for zero effect
      for (k in 1:ndist) {
        probDelta[k] = 1.0/sum(exp(logDelta - logDelta[k]))   # posterior probability for each distribution membership
      }
      
      delta = sample(1:ndist, 1, prob = probDelta)   # indicator variable for the distribution membership
      nsnpDist[delta] = nsnpDist[delta] + 1
      
      if (delta > 1) {
        beta[j] = rnorm(1, betaHat[delta], sqrt(invLhs[delta]))  # given the distribution membership, the posterior is a normal distribution
        ycorr = ycorr + X[,j]*(oldSample - beta[j])              # update ycorr with the new sample
        ghat = ghat + X[,j]*beta[j]                              
        ssq = ssq + beta[j]^2 / gamma[delta]
        nnz = nnz + 1
      } else {
        if (oldSample) ycorr = ycorr + X[,j]*oldSample           # update ycorr with the new sample which is zero
        beta[j] = 0
      }
    }   
    beta_mcmc[iter,] = beta

Next, we sample the other parameters in the model: the mixing probabilities (\(\pi\)), the SNP effect variance (\(\sigma^2_{\beta}\)), and the residual variance (\(\sigma^2_e\)). The full conditional distribution for \(\pi\) is a Dirichlet distribution (a beta distribution in the special case of two components). The full conditional distributions for \(\sigma^2_{\beta}\) and \(\sigma^2_e\) are scaled inverse chi-squared distributions.
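
In symbols, with \(c_k\) the number of SNPs currently assigned to component \(k\), \(n_{nz}\) the number of nonzero effects, and \(\tilde{y}\) the current ycorr, Steps 4-6 sample

\[ \boldsymbol{\pi} \mid \cdot \sim \text{Dirichlet}(c_1+1, \ldots, c_K+1), \qquad \sigma^2_\beta \mid \cdot \sim \frac{\sum_{j:\,\beta_j \neq 0} \beta_j^2/\gamma_{\delta_j} + \nu_\beta S_\beta}{\chi^2_{n_{nz}+\nu_\beta}}, \qquad \sigma^2_e \mid \cdot \sim \frac{\tilde{y}'\tilde{y} + \nu_e S_e}{\chi^2_{n+\nu_e}}. \]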

    # Step 4. Sample the distribution membership probabilities from a Dirichlet distribution
    pi = rdirichlet(1, nsnpDist + 1)
    
    # Step 5. Sample the SNP effect variance from a scaled inverse chi-square distribution
    sigmaSq = (ssq + nub*scaleb)/rchisq(1, nnz+nub)
    
    # Step 6. Sample the residual variance from a scaled inverse chi-square distribution
    vare = (crossprod(ycorr) + nue*scalee)/rchisq(1, n+nue)

Given the sampled genetic values and residual variance, we can compute the SNP-based heritability by its definition. This is an appealing property of the MCMC approach: given the sampled values of the SNP effects, you can easily compute any statistic of interest.

    # Step 7. Compute the SNP-based heritability
    varg = var(ghat)
    h2  = varg/(varg + vare)

The final step in each iteration is to store the MCMC samples of the parameter values:

    keptIter <- rbind(keptIter,c(pi, nnz, sigmaSq, h2, vare, varg))

After MCMC sampling, we can calculate the point estimate and its variance from the MCMC samples.

  colnames(keptIter) <- c(paste0("Pi", 1:length(pi)),"Nnz","SigmaSq","h2","Vare","Varg")
  postMean = apply(keptIter, 2, mean)  # posterior mean of MCMC samples is used as the point estimate for each parameter
  cat("\nPosterior mean:\n")
  print(postMean)
  return(list(par=keptIter, beta=beta_mcmc))

4 Illustration of principle

Read in the data using R.

# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))

# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))

4.1 A Bayesian model with point-normal prior (BayesC)

For the first exercise, we will analyse this small data set using BayesC, which assumes a point-normal prior for the SNP effects:

\[\beta_j \begin{cases} \sim N(0, \sigma^2_\beta) & \text{ with probability } \pi \\ = 0 & \text{ with probability } 1 - \pi \end{cases} \] where the parameter \(\pi\) is estimated from the data. It can be seen that when \(\pi = 1\), the prior collapses to a normal distribution, i.e., \(\beta_j \sim N(0, \sigma^2_\beta)\), which is the assumption in BLUP.

We will analyse the data with a script written in R, bayesr.R. Let’s load it in R.

source("/data/module5/scr/bayesr.R")

The script includes a function called bayesr. BayesR is a general model in which each SNP effect is assumed to follow a mixture of normal distributions; BayesC is a special case of BayesR. The input parameters are

# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2)

where niter is the number of iterations for MCMC sampling, X is the genotype matrix, y is the vector of phenotypes, gamma is a vector of scaling factors for the variances of the mixture components, startPi is a vector of starting values for the \(\pi\) parameter, and startH2 is the starting value of the SNP-based heritability. Note that the numbers of elements in gamma and startPi define the number of mixture components, and therefore must match.

We can use this program to run BayesC by specifying the mixture model as

gamma = c(0, 1)
startPi = c(0.8, 0.2)

This means the SNP effects are, a priori, assumed to be normally distributed with probability 0.2 and to have a point mass at zero with probability 0.8. Now we can run the analysis using the command below (here niter and startH2 are left at their default values in bayesr.R):

res_bayesc = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
## 
##  iter  100, nnz =    7, sigmaSq =  1.850, h2 =  0.484, vare =  3.974, varg =  3.735 
## 
##  iter  200, nnz =    4, sigmaSq =  1.039, h2 =  0.561, vare =  3.261, varg =  4.168 
## 
##  iter  300, nnz =    2, sigmaSq =  2.065, h2 =  0.470, vare =  4.010, varg =  3.560 
## 
##  iter  400, nnz =    2, sigmaSq =  1.212, h2 =  0.498, vare =  3.722, varg =  3.696 
## 
##  iter  500, nnz =    3, sigmaSq =  1.599, h2 =  0.498, vare =  3.568, varg =  3.534 
## 
##  iter  600, nnz =    7, sigmaSq =  0.655, h2 =  0.475, vare =  4.254, varg =  3.849 
## 
##  iter  700, nnz =    6, sigmaSq =  1.393, h2 =  0.496, vare =  4.016, varg =  3.958 
## 
##  iter  800, nnz =    5, sigmaSq =  1.830, h2 =  0.489, vare =  4.622, varg =  4.423 
## 
##  iter  900, nnz =    6, sigmaSq =  0.675, h2 =  0.529, vare =  3.375, varg =  3.793 
## 
##  iter 1000, nnz =    4, sigmaSq =  2.855, h2 =  0.465, vare =  4.327, varg =  3.763 
## 
## Posterior mean:
##       Pi1       Pi2       Nnz   SigmaSq        h2      Vare      Varg 
## 0.5441817 0.4558183 4.4730000 1.7018875 0.4968014 3.8873645 3.8516138

Here the program prints the sampled values of some key model parameters every 100 iterations. For example, Nnz is the number of nonzero effects fitted in the model. As you can see, in contrast to BLUP, which assumes all SNPs contribute equally to the trait, BayesC knocks out some of the SNPs by allowing their effect sizes to be exactly zero. For this reason, Nnz is smaller than the number of SNPs. Of course, different SNPs may be fitted as nonzero in different iterations, due to sampling variation.
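
As a quick check of mixing, you can trace Nnz itself over iterations. This is a minimal sketch, assuming the column names assigned to the result matrix inside bayesr.R:

# Trace plot of the number of nonzero effects across iterations
plot(res_bayesc$par[, "Nnz"], type = "l", xlab = "Iteration", ylab = "Nnz")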

After the MCMC has finished, you can find the sampled values of the model parameters and SNP effects for each iteration in the result list. For example, you can look at the posterior mean and standard deviation of each parameter by

colMeans(res_bayesc$par)
##       Pi1       Pi2       Nnz   SigmaSq        h2      Vare      Varg 
## 0.5441817 0.4558183 4.4730000 1.7018875 0.4968014 3.8873645 3.8516138
apply(res_bayesc$par, 2, sd)
##        Pi1        Pi2        Nnz    SigmaSq         h2       Vare       Varg 
## 0.21126203 0.21126203 1.98070548 1.27101871 0.03563397 0.30474265 0.44320081

The posterior mean is often used as the point estimate of a parameter, and the posterior standard error can be approximated by the standard deviation of the MCMC samples.

You can use the plotting facilities in R to investigate changes in the parameters over iterations. For example, to look at the effect of the first marker across iterations, you would enter the command

# Trace plot for SNP 1, which is the causal variant with effect size of 2
#png("beta1_vs_iter.png")
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,1], xlab="Iteration", ylab="beta[,1]") 
abline(h = 2, col="red")

#dev.off()

Question: Use this command to investigate each of the parameters in turn. Do they appear to be fluctuating around the correct values?

# Trace plot for SNP 2, which is a null SNP
#png("beta2_vs_iter.png")
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,2], xlab="Iteration", ylab="beta[,2]")
abline(h = 0, col="red")

#dev.off()
# Trace plot for SNP 5, which is a causal variant with a smaller effect of 1 and in LD with other SNPs
#png("beta5_vs_iter.png")
plot(1:nrow(res_bayesc$beta), res_bayesc$beta[,5], xlab="Iteration", ylab="beta[,5]")
abline(h = 1, col="red")

#dev.off()

We can also plot the posterior distribution, for example for the effect of the first SNP, discarding the first 100 iterations as “burn-in”:

# Posterior distribution of SNP 1 effect
#png("dist_beta1.png")
plot(density(res_bayesc$beta[100:1000,1]))

#dev.off()

Question: Does the distribution appear to be normal? What about the distributions of the other parameters?

# Posterior distribution of SNP 2 effect
#png("dist_beta2.png")
plot(density(res_bayesc$beta[100:1000,2]))

#dev.off()
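
The same approach works for the model parameters stored in res_bayesc$par. For example, a minimal sketch for the SNP-based heritability (assuming the column name "h2" assigned inside bayesr.R), with the simulated value of 0.5 marked in red:

# Posterior distribution of the SNP-based heritability, discarding burn-in
plot(density(res_bayesc$par[100:1000, "h2"]))
abline(v = 0.5, col = "red")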

Question: The posterior distribution of a SNP effect can be summarised to quantify the evidence that the SNP is associated with the trait. For example, write R code to calculate the probability that SNP 1 has a nonzero effect, which is known as the posterior inclusion probability (PIP).

mean(res_bayesc$beta[100:1000,1] != 0)
## [1] 1

From the posterior samples, we can also calculate the Bayesian 90% credible interval, which represents a 90% probability that the true parameter lies within the interval, given the evidence from the data. What can we conclude from the Bayesian credible interval?

quantile(res_bayesc$beta[100:1000,1], probs=c(0.05, 0.95))
##       5%      95% 
## 1.716366 2.469117

With MCMC samples, it is straightforward to compute the PIP for every SNP. The PIP is a Bayesian measure of SNP-trait association, i.e., the posterior probability that the SNP has a nonzero effect, and is widely used in fine-mapping analyses. It can be calculated as the frequency with which the SNP is fitted with a nonzero effect across MCMC samples.

Question: Can you tell which SNPs are likely to be the causal variants based on their PIP?

delta = (res_bayesc$beta != 0)  # indicator variable for each SNP in each MCMC cycle
pip = colMeans(delta)  # frequency of the indicator variable being one across MCMC cycles
plot(pip, type="h", xlab="SNP", ylab="Posterior inclusion probability")

To get the SNP effect estimates, we would calculate the posterior mean of SNP effects:

betaMean = colMeans(res_bayesc$beta)

Question: What’s the prediction accuracy using BayesC? How does it compare to that from C+PT or BLUP? What could cause the difference?

ghat_bayesc=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesc))$r.squared
## [1] 0.5319675

4.2 Bayesian approach using multi-component mixture prior (BayesR)

The BayesR prior assumes that some SNPs have zero effect, some have small effects, and some have large effects, using a mixture of multiple normal distributions that includes a point mass at zero:

\[ \beta_j \sim \sum_k \pi_k N(0, \gamma_k \sigma^2_\beta) \]

For example, we can set a 4-component mixture model:

gamma = c(0, 0.01, 0.1, 1)
startPi = c(0.5, 0.3, 0.15, 0.05)
bayesr.res = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
## 
##  iter  100, nnz =    8, sigmaSq =  1.234, h2 =  0.490, vare =  3.394, varg =  3.266 
## 
##  iter  200, nnz =   10, sigmaSq =  1.631, h2 =  0.547, vare =  3.795, varg =  4.574 
## 
##  iter  300, nnz =    8, sigmaSq =  1.551, h2 =  0.510, vare =  3.834, varg =  3.989 
## 
##  iter  400, nnz =    8, sigmaSq =  2.449, h2 =  0.465, vare =  4.246, varg =  3.689 
## 
##  iter  500, nnz =    2, sigmaSq =  2.763, h2 =  0.552, vare =  3.957, varg =  4.875 
## 
##  iter  600, nnz =    3, sigmaSq =  5.617, h2 =  0.566, vare =  3.620, varg =  4.727 
## 
##  iter  700, nnz =   10, sigmaSq =  1.563, h2 =  0.544, vare =  3.233, varg =  3.852 
## 
##  iter  800, nnz =    5, sigmaSq =  3.690, h2 =  0.546, vare =  3.669, varg =  4.421 
## 
##  iter  900, nnz =   10, sigmaSq =  2.164, h2 =  0.495, vare =  3.720, varg =  3.648 
## 
##  iter 1000, nnz =    5, sigmaSq = 10.054, h2 =  0.516, vare =  3.758, varg =  4.007 
## 
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.2836194 0.2907440 0.2293404 0.1962962 7.0450000 3.9312372 0.4993304 3.9175322 
##      Varg 
## 3.9214288

Question: Is the prediction accuracy improved? Why or why not?

betaMean = colMeans(bayesr.res$beta)
ghat_bayesr=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesr))$r.squared
## [1] 0.5302406

Take a moment to think about it before continuing.




Bayesian methods differ from each other mainly in the prior distribution assumed for the SNP effects. The best method is the one whose assumed distribution is closest to the true distribution of causal effects. In our simulation there are only two causal variants, so fitting a multi-component model is arguably overkill.

However, you may want to experiment with the number of mixture components and see how sensitive the results are.
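
For example, here is a minimal sketch of a three-component model; the gamma and startPi values are arbitrary illustrative choices, not recommendations:

# A 3-component mixture: point mass at zero plus two normal components
gamma3 = c(0, 0.1, 1)
startPi3 = c(0.8, 0.15, 0.05)
res_3comp = bayesr(X = X, y = y, gamma = gamma3, startPi = startPi3)
betaMean3 = colMeans(res_3comp$beta)
ghat_3comp = Xval %*% betaMean3
summary(lm(yval ~ ghat_3comp))$r.squared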

5 UKB data analysis using GCTB

GCTB (https://cnsgenomics.com/software/gctb/#Overview) is a C++ software tool for performing Bayesian analyses of large-scale genomic data. It takes PLINK bed files as input and generates estimates of model parameters and SNP effects for polygenic prediction. You can use the code below to run the BayesR analysis. It will take about 5 minutes or more to finish, depending on how busy the cluster is, and requires about 4GB of RAM. If you are short of time, you can check the results produced by the same code at /data/module5/ukb/bayesr.parRes and /data/module5/ukb/bayesr.snpRes. You can also run chromosome 1 only by adding the flag --chr 1 for a quick run.

The default MCMC chain length is 3,000 iterations, with the first 1,000 discarded as burn-in. Here, in the interest of time, we set a shorter chain. In your own real-trait analysis, it is recommended to use the default setting or an even longer chain.

bfile="/data/module5/ukb/gwas"
pheno="/data/module5/ukb/gwas.phen"
covar="/data/module5/ukb/gwas.cov"
gctb --bayes R \
     --bfile $bfile \
     --pheno $pheno \
     --covar $covar \
     --gamma 0,0.01,0.1,1 \
     --pi 0.994,0.003,0.002,0.001 \
     --chain-length 1000 \
     --burn-in 200 \
     --out bayesr

Result file bayesr.parRes shows the estimates of the model parameters, and result file bayesr.snpRes shows the estimates of the SNP effects. If you were unable to run the genome-wide analysis, copy /data/module5/ukb/bayesr.snpRes to your current folder.

cat bayesr.parRes

NumSnp1-NumSnp4 are the numbers of SNPs in mixture components 1 to 4 (component 1: zero effect; component 2: small effects whose variance = 0.01\(\times \sigma^2_\beta\); component 3: medium effects whose variance = 0.1\(\times \sigma^2_\beta\); component 4: large effects whose variance = 1\(\times \sigma^2_\beta\)). hsq is the SNP-based heritability estimate. We set the chain length to 1,000 here for demonstration purposes. In practice, we recommend running at least 3,000 iterations and discarding the first 20-40% of the samples (e.g., 1,000) as burn-in.

Question: Are the estimates from the genome-wide BayesR analysis consistent with the true values in simulation? Note that we simulated 100 causal variants with a trait heritability of 0.5.

Let’s also have a look at the SNP effect results:

head bayesr.snpRes

We then use PLINK to perform prediction, taking the SNP identifier, effect allele, and effect size from columns 2, 5, and 8 of the snpRes file:

target="/data/module5/ukb/target"
snpRes="bayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.bayesr

Use the following R script to evaluate the prediction accuracy:

phenFile="/data/module5/ukb/target.phen" 
covFile="/data/module5/ukb/target.cov"
prsFile="target.bayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile

Question: How does the prediction accuracy from BayesR compare to that from C+PT?