In this practical you will perform genomic prediction using Bayesian methods. We will use the same data sets in the BLUP session, i.e., a small data set with 10 SNPs and simulated phenotypes for 325 bulls , and a larger data set of 3,000 GWAS individuals and 277,719 SNPs. We will use R script and GCTB to do these analyses.
The scripts (source code) used for this practical include
ls /data/module5/scr/bayesr.R
ls /data/module5/scr/get_pred_r2.R
Read the data in R.
nmarkers <- 10 #number of markers
nrecords <- 325 #number of records
# data for training population
X <- matrix(scan("/data/module5/toy/xmat_trn.inp"), ncol=nmarkers, byrow = TRUE)
y <- matrix(scan("/data/module5/toy/yvec_trn.inp"), byrow = TRUE)
# data for validation population
Xval <- matrix(scan("/data/module5/toy/xmat_val.inp"), ncol=nmarkers, byrow = TRUE)
yval <- matrix(scan("/data/module5/toy/yvec_val.inp"), byrow = TRUE)
The code for running BLUP is shown here. You can run BLUP and use the prediction accuracy from BLUP as benchmark to compare those from the Bayesian methods below.
# set value for lambda
lambda <- 10
# need a vector of ones and a identity matrix with the size of the number of SNPs
ones <- array(1,c(nrecords))
ident_mat <-diag(nmarkers)
# build the left hand side of the mixed model equations
coeff <- array(0,c(nmarkers+1,nmarkers+1))
coeff[1:1, 1:1] <- t(ones)%*%ones
coeff[1:1,2:(nmarkers+1)] <- t(ones)%*%X
coeff[2:(nmarkers+1), 2:(nmarkers+1)] <- t(X)%*%X + ident_mat*lambda
coeff[2:(nmarkers+1), 1:1] <- t(coeff[1:1,2:(nmarkers+1)])
# build the right hand side of the mixed model equations
rhs <- array(0, c(nmarkers+1,1))
rhs[1,1]=t(ones)%*%y
rhs[2:(nmarkers+1),1]=t(X)%*%y
# get BLUP solution
solution_vec <- solve(coeff,rhs)
# get the predicted genetic value (GEBV)
ghat_blup=Xval%*%solution_vec[-1]
# prediction R-square
summary(lm(yval~ghat_blup))$r.squared
## [1] 0.5450557
For the first exercise, we will analyse this small data set using BayesC.
\[\beta_j \begin{cases} \sim N(0, \sigma^2_\beta) & \text{ with probability } \pi \\ \\ = 0 & \text{ with probability } 1 - \pi \end{cases} \] where parameter \(\pi\) is estimated from the data. It can be seen that when \(\pi = 0\), the prior collapses to a normal distribution, i.e., \(\beta_j \sim N(0, \sigma^2_\beta)\), which is the assumption in BLUP.
We will analyse the data with a script written in R language,
BayesR.R
. Let us load it in R.
source("/data/module5/scr/bayesr.R")
The script includes a function called bayesr
. BayesR is
a general model where each SNP effects is assumed to follow a mixture
distribution of normal distributions. BayesC is a special case of
BayesR. The input parameters are
# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2)
where niter
is the number of iterations for MCMC
sampling, X
is the genotype matrix, y
is the
vector of phenotypes, gamma
is a vector of scaling factor
for the variance of a mixture distribution, startPi
is a
vector of starting values for the \(\pi\) parameter, startH2
is
the starting value of SNP-based heritability. Note that the number of
elements in gamma
and startPi
define the
number of mixture components, and therefore need to be matched.
We can use this program to run BayesC by specifying the mixture model as
gamma = c(0, 1)
startPi = c(0.8, 0.2)
This means the SNP effects are assumed to be normally distributed with a probability of 0.2 and have a point mass at zero with a probability of 0.8. Now we can run the analysis using command
bayesc.res = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
##
## iter 100, nnz = 8, sigmaSq = 0.517, h2 = 0.471, vare = 4.517, varg = 4.025
##
## iter 200, nnz = 6, sigmaSq = 2.912, h2 = 0.499, vare = 3.820, varg = 3.808
##
## iter 300, nnz = 9, sigmaSq = 1.018, h2 = 0.489, vare = 4.649, varg = 4.451
##
## iter 400, nnz = 2, sigmaSq = 1.066, h2 = 0.470, vare = 4.118, varg = 3.659
##
## iter 500, nnz = 2, sigmaSq = 0.976, h2 = 0.484, vare = 4.167, varg = 3.907
##
## iter 600, nnz = 3, sigmaSq = 0.975, h2 = 0.532, vare = 3.802, varg = 4.324
##
## iter 700, nnz = 3, sigmaSq = 1.253, h2 = 0.515, vare = 3.825, varg = 4.054
##
## iter 800, nnz = 3, sigmaSq = 1.436, h2 = 0.527, vare = 3.673, varg = 4.099
##
## iter 900, nnz = 3, sigmaSq = 0.941, h2 = 0.498, vare = 3.306, varg = 3.284
##
## iter 1000, nnz = 2, sigmaSq = 3.796, h2 = 0.467, vare = 4.010, varg = 3.513
##
## Posterior mean:
## Pi1 Pi2 Nnz SigmaSq h2 Vare Varg
## 0.5510368 0.4489632 4.4090000 1.8306909 0.4967579 3.9198919 3.8804407
Here it outputs the sampled values for some of the key model
parameters in every 100 iterations. For example, Nnz
is the
number of nonzero effects fitted in the model. As you can see, in
contrast to BLUP which assumes all SNPs contribute equally to the trait,
BayesC knocks out some of the SNPs by allowing their effect size to be
exactly zero. For this reason, Nnz
is less than the number
of SNPs. Of course, there may be different SNPs fitted as nonzero across
iterations, due to sampling variation.
After the MCMC is finished, you can find the sampled values for the model parameters and SNP effects for each iteration in the result list. For example, you can look at the posterior mean and standard deviation for each parameter by
colMeans(bayesc.res$par)
## Pi1 Pi2 Nnz SigmaSq h2 Vare Varg
## 0.5510368 0.4489632 4.4090000 1.8306909 0.4967579 3.9198919 3.8804407
apply(bayesc.res$par, 2, sd)
## Pi1 Pi2 Nnz SigmaSq h2 Vare Varg
## 0.21723245 0.21723245 2.05424900 1.57415466 0.03376933 0.31119532 0.42613486
The posterior mean is often used as point estimate of the parameter, and the posterior standard error can be approximated by the standard deviation of MCMC samples.
You can use the plotting facilities in R to investigate changes in the parameters over iterations. For example, to look at the effect of the first marker across iterations, you would enter the command
#png("beta1_vs_iter.png")
plot(1:nrow(bayesc.res$beta), bayesc.res$beta[,1], xlab="Iteration", ylab="beta[,1]")
abline(h = 2, col="red")
#dev.off()
Question: Use this command to investigate each of the parameters in turn. Do they appear to be fluctuating about the correct values?
#png("beta2_vs_iter.png")
plot(1:nrow(bayesc.res$beta), bayesc.res$beta[,2], xlab="Iteration", ylab="beta[,1]")
abline(h = 0, col="red")
#dev.off()
#png("beta5_vs_iter.png")
plot(1:nrow(bayesc.res$beta), bayesc.res$beta[,5], xlab="Iteration", ylab="beta[,1]")
abline(h = 1, col="red")
#dev.off()
We can also plot the posterior distribution, for example for the effect of the first SNP. We would discard the first 100 iterations of the program as “burn in”:
#png("dist_beta1.png")
plot(density(bayesc.res$beta[100:1000,1]))
#dev.off()
Question: Does the distribution appear to be normal? What about the distributions of the other parameters?
#png("dist_beta2.png")
plot(density(bayesc.res$beta[100:1000,2]))
#dev.off()
Question: The posterior distribution of the SNP effect can be summarised to quantify the evidence of the SNP associated with the trait. For example, write a R code to cacluate the probability of SNP 1 having a nonzero effect, which is known as PIP (posterior inclusion probability).
mean(bayesc.res$beta[100:1000,1] != 0)
## [1] 1
From the posterior samples, we can also calculate the Bayesian 90% credible interval, which represents 95% probability that the true parameter would lie within the interval, given the evidence from the data. What can we conclude from the Bayesian credible interval?
quantile(bayesc.res$beta[100:1000,1], probs=c(0.05, 00.95))
## 5% 95%
## 1.747282 2.480154
To get the SNP effect estimates, we would calculate the posterior mean of SNP effects:
betaMean = colMeans(bayesc.res$beta)
Question: What’s the prediction accuracy using BayesC? How does it compare to that from C+PT or BLUP? What could cause the difference?
ghat_bayesc=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesc))$r.squared
## [1] 0.5350523
BayesR prior assumes that some SNPs have zero effect, some have small effects, and some have large effects, by assuming a mixture of multiple normal distributions, including a point mass at zero.
\[ \beta_j \sim \sum_k \pi_k N(0, \gamma_k \sigma^2_\beta) \]
For example, we can set a 4-component mixture model:
gamma = c(0, 0.01, 0.1, 1)
startPi = c(0.5, 0.3, 0.15, 0.05)
bayesr.res = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
##
## iter 100, nnz = 9, sigmaSq = 1.971, h2 = 0.519, vare = 3.933, varg = 4.238
##
## iter 200, nnz = 9, sigmaSq = 1.184, h2 = 0.449, vare = 4.550, varg = 3.714
##
## iter 300, nnz = 8, sigmaSq = 1.727, h2 = 0.519, vare = 3.720, varg = 4.012
##
## iter 400, nnz = 7, sigmaSq = 2.398, h2 = 0.515, vare = 4.144, varg = 4.399
##
## iter 500, nnz = 10, sigmaSq = 4.597, h2 = 0.475, vare = 4.192, varg = 3.794
##
## iter 600, nnz = 6, sigmaSq = 14.933, h2 = 0.491, vare = 3.686, varg = 3.562
##
## iter 700, nnz = 9, sigmaSq = 3.926, h2 = 0.507, vare = 3.547, varg = 3.647
##
## iter 800, nnz = 8, sigmaSq = 2.241, h2 = 0.483, vare = 3.700, varg = 3.461
##
## iter 900, nnz = 6, sigmaSq = 1.598, h2 = 0.464, vare = 4.391, varg = 3.805
##
## iter 1000, nnz = 8, sigmaSq = 2.533, h2 = 0.455, vare = 4.462, varg = 3.729
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.2682418 0.2946106 0.2179885 0.2191591 7.1290000 5.5532313 0.5011064 3.8989628
## Varg
## 3.9279471
Question: Is the prediction accuracy improved? Why or why not?
betaMean = colMeans(bayesr.res$beta)
ghat_bayesr=Xval %*% betaMean
summary(lm(yval ~ ghat_bayesr))$r.squared
## [1] 0.535233
You may want to play around with the number of mixture components and see how sensitive the results are.
GCTB (https://cnsgenomics.com/software/gctb/#Overview) is a
c++ software for performing Bayesian analysis of large-scale genomic
data. It takes the PLINK bed file as input and generate estimates for
model parameters and SNP effects for polygenic prediction. You can use
the following code to run BayesR analysis. It will take about 5 mins or
more to finish, depending on how busy the cluster is, and about 4GB RAM.
If you are short of time, you can check the results that produced from
the same code at /data/module5/sim/bayesr.parRes
and
/data/module5/sim/bayesr.snpRes
. You can also run with
chromosome 1 only by adding a flag --chr 1
for a quick
run.
bfile="/data/module5/sim/gwas"
pheno="/data/module5/sim/gwas.phen"
covar="/data/module5/sim/gwas.cov"
gctb --bayes R \
--bfile $bfile \
--pheno $pheno \
--covar $covar \
--gamma 0,0.01,0.1,1 \
--pi 0.994,0.003,0.002,0.001 \
--chain-length 1000 \
--burn-in 200 \
--out bayesr
Result file bayesr.parRes
shows the estimates of model
parameters. Result file bayesr.snpRes
shows the estimates
of SNP effects. If you didn’t make it to run the genome-wide analysis,
copy /data/module5/sim/bayesr.snpRes
to your current
folder.
cat bayesr.parRes
NumSnp1-NumSnp4
are the numbers of SNPs in the mixture
components 1 to 4 (component 1: zero effect; component 2: small effects
whose variance = 0.01 \(\times
\sigma^2_\beta\); component 3: medium effects whose variance =
0.1 \(\times \sigma^2_\beta\);
component 4: large effects whose variance = 1 \(\times \sigma^2_\beta\)). hsq
is the SNP-based heritability estimate. We set chain length to be 1000
here for demonstration purpose. In practice, we recommend to run at
least 3,000 iterations. It is common to discard the first 20-40% of the
samples (e.g., 1,000) as burn-in.
Question: Are the estimates from the genome-wide BayesR analysis consistent with the true values in simulation? Note that we simulated 100 causal variants with a trait heritability of 0.5.
Let’s also have a look at the SNP effect results:
head bayesr.snpRes
We then use PLINK to perform prediction:
target="/data/module5/sim/target"
snpRes="bayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.bayesr
Use the R script to evaluate the prediction accuracy
phenFile="/data/module5/sim/target.phen"
covFile="/data/module5/sim/target.cov"
prsFile="target.bayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question 6: How does the prediction accuracy from BayesR compare to that from C+PT?