Objectives

In this practical you will perform genomic prediction using Bayesian methods. We will use the same data sets in the BLUP session, i.e., a small data set with 10 SNPs and simulated phenotypes for 325 bulls , and a larger data set of 3,000 GWAS individuals and 277,719 SNPs. We will use R script and GCTB to do these analyses.

Scripts

The scripts (source code) used for this practical include

ls /data/module5/scr/bayesr.R
ls /data/module5/scr/get_pred_r2.R

Analysis of the small data set using R script

Read the data in R.

nmarkers <- 10      #number of markers
nrecords <- 325     #number of records

# data for training population
X <- matrix(scan("/data/module5/toy/xmat_trn.inp"), ncol=nmarkers, byrow = TRUE)
y <- matrix(scan("/data/module5/toy/yvec_trn.inp"), byrow = TRUE)

# data for validation population
Xval <- matrix(scan("/data/module5/toy/xmat_val.inp"), ncol=nmarkers, byrow = TRUE)
yval <- matrix(scan("/data/module5/toy/yvec_val.inp"), byrow = TRUE)

SNP-BLUP as benchmark

The code for running BLUP is shown here. You can run BLUP and use the prediction accuracy from BLUP as benchmark to compare those from the Bayesian methods below.

# set value for lambda 
lambda    <- 10

# need a vector of ones and a identity matrix with the size of the number of SNPs
ones <- array(1,c(nrecords))
ident_mat <-diag(nmarkers)

# build the left hand side of the mixed model equations
coeff <- array(0,c(nmarkers+1,nmarkers+1))
coeff[1:1, 1:1] <- t(ones)%*%ones
coeff[1:1,2:(nmarkers+1)] <- t(ones)%*%X
coeff[2:(nmarkers+1), 2:(nmarkers+1)] <- t(X)%*%X + ident_mat*lambda 
coeff[2:(nmarkers+1), 1:1] <- t(coeff[1:1,2:(nmarkers+1)])

# build the right hand side of the mixed model equations
rhs <- array(0, c(nmarkers+1,1))
rhs[1,1]=t(ones)%*%y
rhs[2:(nmarkers+1),1]=t(X)%*%y

# get BLUP solution
solution_vec <- solve(coeff,rhs)

# get the predicted genetic value (GEBV)
ghat_blup=Xval%*%solution_vec[-1]

# prediction R-square
summary(lm(yval~ghat_blup))$r.squared
## [1] 0.5450557

A Bayesian model with point-normal prior (BayesC)

For the first exercise, we will analyse this small data set using BayesC.

\[\beta_j \begin{cases} \sim N(0, \sigma^2_\beta) & \text{ with probability } \pi \\ \\ = 0 & \text{ with probability } 1 - \pi \end{cases} \] where parameter \(\pi\) is estimated from the data. It can be seen that when \(\pi = 0\), the prior collapses to a normal distribution, i.e., \(\beta_j \sim N(0, \sigma^2_\beta)\), which is the assumption in BLUP.

We will analyse the data with a script written in R language, BayesR.R. Let us load it in R.

source("/data/module5/scr/bayesr.R")

The script includes a function called bayesr. BayesR is a general model where each SNP effects is assumed to follow a mixture distribution of normal distributions. BayesC is a special case of BayesR. The input parameters are

# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2)

where niter is the number of iterations for MCMC sampling, X is the genotype matrix, y is the vector of phenotypes, gamma is a vector of scaling factor for the variance of a mixture distribution, startPi is a vector of starting values for the \(\pi\) parameter, startH2 is the starting value of SNP-based heritability. Note that the number of elements in gamma and startPi define the number of mixture components, and therefore need to be matched.

We can use this program to run BayesC by specifying the mixture model as

gamma = c(0, 1)
startPi = c(0.8, 0.2)

This means the SNP effects are assumed to be normally distributed with a probability of 0.2 and have a point mass at zero with a probability of 0.8. Now we can run the analysis using command

bayesc.res = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
## 
##  iter  100, nnz =    8, sigmaSq =  0.517, h2 =  0.471, vare =  4.517, varg =  4.025 
## 
##  iter  200, nnz =    6, sigmaSq =  2.912, h2 =  0.499, vare =  3.820, varg =  3.808 
## 
##  iter  300, nnz =    9, sigmaSq =  1.018, h2 =  0.489, vare =  4.649, varg =  4.451 
## 
##  iter  400, nnz =    2, sigmaSq =  1.066, h2 =  0.470, vare =  4.118, varg =  3.659 
## 
##  iter  500, nnz =    2, sigmaSq =  0.976, h2 =  0.484, vare =  4.167, varg =  3.907 
## 
##  iter  600, nnz =    3, sigmaSq =  0.975, h2 =  0.532, vare =  3.802, varg =  4.324 
## 
##  iter  700, nnz =    3, sigmaSq =  1.253, h2 =  0.515, vare =  3.825, varg =  4.054 
## 
##  iter  800, nnz =    3, sigmaSq =  1.436, h2 =  0.527, vare =  3.673, varg =  4.099 
## 
##  iter  900, nnz =    3, sigmaSq =  0.941, h2 =  0.498, vare =  3.306, varg =  3.284 
## 
##  iter 1000, nnz =    2, sigmaSq =  3.796, h2 =  0.467, vare =  4.010, varg =  3.513 
## 
## Posterior mean:
##       Pi1       Pi2       Nnz   SigmaSq        h2      Vare      Varg 
## 0.5510368 0.4489632 4.4090000 1.8306909 0.4967579 3.9198919 3.8804407

Here it outputs the sampled values for some of the key model parameters in every 100 iterations. For example, Nnz is the number of nonzero effects fitted in the model. As you can see, in contrast to BLUP which assumes all SNPs contribute equally to the trait, BayesC knocks out some of the SNPs by allowing their effect size to be exactly zero. For this reason, Nnz is less than the number of SNPs. Of course, there may be different SNPs fitted as nonzero across iterations, due to sampling variation.

After the MCMC is finished, you can find the sampled values for the model parameters and SNP effects for each iteration in the result list. For example, you can look at the posterior mean and standard deviation for each parameter by

colMeans(bayesc.res$par)
##       Pi1       Pi2       Nnz   SigmaSq        h2      Vare      Varg 
## 0.5510368 0.4489632 4.4090000 1.8306909 0.4967579 3.9198919 3.8804407
apply(bayesc.res$par, 2, sd)
##        Pi1        Pi2        Nnz    SigmaSq         h2       Vare       Varg 
## 0.21723245 0.21723245 2.05424900 1.57415466 0.03376933 0.31119532 0.42613486

The posterior mean is often used as point estimate of the parameter, and the posterior standard error can be approximated by the standard deviation of MCMC samples.

You can use the plotting facilities in R to investigate changes in the parameters over iterations. For example, to look at the effect of the first marker across iterations, you would enter the command

#png("beta1_vs_iter.png")
plot(1:nrow(bayesc.res$beta), bayesc.res$beta[,1], xlab="Iteration", ylab="beta[,1]") 
abline(h = 2, col="red")

#dev.off()

Question: Use this command to investigate each of the parameters in turn. Do they appear to be fluctuating about the correct values?

#png("beta2_vs_iter.png")
plot(1:nrow(bayesc.res$beta), bayesc.res$beta[,2], xlab="Iteration", ylab="beta[,1]") 
abline(h = 0, col="red")

#dev.off()
#png("beta5_vs_iter.png")
plot(1:nrow(bayesc.res$beta), bayesc.res$beta[,5], xlab="Iteration", ylab="beta[,1]") 
abline(h = 1, col="red")

#dev.off()

We can also plot the posterior distribution, for example for the effect of the first SNP. We would discard the first 100 iterations of the program as “burn in”:

#png("dist_beta1.png")
plot(density(bayesc.res$beta[100:1000,1]))

#dev.off()

Question: Does the distribution appear to be normal? What about the distributions of the other parameters?

#png("dist_beta2.png")
plot(density(bayesc.res$beta[100:1000,2]))

#dev.off()

Question: The posterior distribution of the SNP effect can be summarised to quantify the evidence of the SNP associated with the trait. For example, write a R code to cacluate the probability of SNP 1 having a nonzero effect, which is known as PIP (posterior inclusion probability).

mean(bayesc.res$beta[100:1000,1] != 0)
## [1] 1

From the posterior samples, we can also calculate the Bayesian 90% credible interval, which represents 95% probability that the true parameter would lie within the interval, given the evidence from the data. What can we conclude from the Bayesian credible interval?

quantile(bayesc.res$beta[100:1000,1], probs=c(0.05, 00.95))
##       5%      95% 
## 1.747282 2.480154

To get the SNP effect estimates, we would calculate the posterior mean of SNP effects:

betaMean = colMeans(bayesc.res$beta)

Question: What’s the prediction accuracy using BayesC? How does it compare to that from C+PT or BLUP? What could cause the difference?

ghat_bayesc=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesc))$r.squared
## [1] 0.5350523

Bayesian approach using multi-component mixture prior (BayesR)

BayesR prior assumes that some SNPs have zero effect, some have small effects, and some have large effects, by assuming a mixture of multiple normal distributions, including a point mass at zero.

\[ \beta_j \sim \sum_k \pi_k N(0, \gamma_k \sigma^2_\beta) \]

For example, we can set a 4-component mixture model:

gamma = c(0, 0.01, 0.1, 1)
startPi = c(0.5, 0.3, 0.15, 0.05)
bayesr.res = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
## 
##  iter  100, nnz =    9, sigmaSq =  1.971, h2 =  0.519, vare =  3.933, varg =  4.238 
## 
##  iter  200, nnz =    9, sigmaSq =  1.184, h2 =  0.449, vare =  4.550, varg =  3.714 
## 
##  iter  300, nnz =    8, sigmaSq =  1.727, h2 =  0.519, vare =  3.720, varg =  4.012 
## 
##  iter  400, nnz =    7, sigmaSq =  2.398, h2 =  0.515, vare =  4.144, varg =  4.399 
## 
##  iter  500, nnz =   10, sigmaSq =  4.597, h2 =  0.475, vare =  4.192, varg =  3.794 
## 
##  iter  600, nnz =    6, sigmaSq = 14.933, h2 =  0.491, vare =  3.686, varg =  3.562 
## 
##  iter  700, nnz =    9, sigmaSq =  3.926, h2 =  0.507, vare =  3.547, varg =  3.647 
## 
##  iter  800, nnz =    8, sigmaSq =  2.241, h2 =  0.483, vare =  3.700, varg =  3.461 
## 
##  iter  900, nnz =    6, sigmaSq =  1.598, h2 =  0.464, vare =  4.391, varg =  3.805 
## 
##  iter 1000, nnz =    8, sigmaSq =  2.533, h2 =  0.455, vare =  4.462, varg =  3.729 
## 
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.2682418 0.2946106 0.2179885 0.2191591 7.1290000 5.5532313 0.5011064 3.8989628 
##      Varg 
## 3.9279471

Question: Is the prediction accuracy improved? Why or why not?

betaMean = colMeans(bayesr.res$beta)
ghat_bayesr=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesr))$r.squared
## [1] 0.535233

You may want to play around with the number of mixture components and see how sensitive the results are.

Analysis of a larger data set using GCTB

GCTB (https://cnsgenomics.com/software/gctb/#Overview) is a c++ software for performing Bayesian analysis of large-scale genomic data. It takes the PLINK bed file as input and generate estimates for model parameters and SNP effects for polygenic prediction. You can use the following code to run BayesR analysis. It will take about 5 mins or more to finish, depending on how busy the cluster is, and about 4GB RAM. If you are short of time, you can check the results that produced from the same code at /data/module5/sim/bayesr.parRes and /data/module5/sim/bayesr.snpRes. You can also run with chromosome 1 only by adding a flag --chr 1 for a quick run.

bfile="/data/module5/sim/gwas"
pheno="/data/module5/sim/gwas.phen"
covar="/data/module5/sim/gwas.cov"
gctb --bayes R \
     --bfile $bfile \
     --pheno $pheno \
     --covar $covar \
     --gamma 0,0.01,0.1,1 \
     --pi 0.994,0.003,0.002,0.001 \
     --chain-length 1000 \
     --burn-in 200 \
     --out bayesr

Result file bayesr.parRes shows the estimates of model parameters. Result file bayesr.snpRes shows the estimates of SNP effects. If you didn’t make it to run the genome-wide analysis, copy /data/module5/sim/bayesr.snpRes to your current folder.

cat bayesr.parRes

NumSnp1-NumSnp4 are the numbers of SNPs in the mixture components 1 to 4 (component 1: zero effect; component 2: small effects whose variance = 0.01 \(\times \sigma^2_\beta\); component 3: medium effects whose variance = 0.1 \(\times \sigma^2_\beta\); component 4: large effects whose variance = 1 \(\times \sigma^2_\beta\)). hsq is the SNP-based heritability estimate. We set chain length to be 1000 here for demonstration purpose. In practice, we recommend to run at least 3,000 iterations. It is common to discard the first 20-40% of the samples (e.g., 1,000) as burn-in.

Question: Are the estimates from the genome-wide BayesR analysis consistent with the true values in simulation? Note that we simulated 100 causal variants with a trait heritability of 0.5.

Let’s also have a look at the SNP effect results:

head bayesr.snpRes

We then use PLINK to perform prediction:

target="/data/module5/sim/target"
snpRes="bayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.bayesr

Use the R script to evaluate the prediction accuracy

phenFile="/data/module5/sim/target.phen" 
covFile="/data/module5/sim/target.cov"
prsFile="target.bayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile

Question 6: How does the prediction accuracy from BayesR compare to that from C+PT?