In this practical, you will perform polygenic prediction using SBLUP and SBayesR, the variants of BLUP and BayesR that require only GWAS summary statistics. We will use the same data sets as in the session on individual-level BLUP and BayesR analyses.
The scripts (source code) used for this practical include:
ls /data/module5/scr/sbayesr.R
ls /data/module5/scr/get_pred_r2.R
Load the data in R:
nmarkers <- 10   # number of markers
nrecords <- 325  # number of records

# data for training population
X <- matrix(scan("/data/module5/toy/xmat_trn.inp"), ncol = nmarkers, byrow = TRUE)
y <- matrix(scan("/data/module5/toy/yvec_trn.inp"), byrow = TRUE)

# data for validation population
Xval <- matrix(scan("/data/module5/toy/xmat_val.inp"), ncol = nmarkers, byrow = TRUE)
yval <- matrix(scan("/data/module5/toy/yvec_val.inp"), byrow = TRUE)
First, let us assume the genotype matrix is standardised (scaled) to have mean zero and variance one in each column. Run GWAS using the standardised genotypes:
X_unscaled = X
X_scaled = apply(X, 2, scale)

# run GWAS on the scaled genotypes
fit = apply(X_scaled, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b_scaled = fit[1,]

# LD correlation matrix
R = cor(X)
Run SBLUP using the marginal SNP effects from GWAS and the LD correlation matrix:
lambda = 10
I = diag(nmarkers)
coeff = R + I*lambda/nrecords
rhs = b_scaled
beta_sblup = solve(coeff, rhs)
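In matrix notation, the code above solves the summary-data BLUP equations,
\[
\hat{\boldsymbol\beta}_{\text{SBLUP}} = \left(\mathbf{R} + \frac{\lambda}{n}\mathbf{I}\right)^{-1}\mathbf{b},
\]
where \(\mathbf{R}\) is the LD correlation matrix, \(\mathbf{b}\) is the vector of marginal SNP effects on the standardised scale, \(n\) is the GWAS sample size (nrecords) and \(\lambda\) is the shrinkage parameter.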
Run BLUP using the individual-level data as a benchmark:
lambda = 10
I = diag(nmarkers+1)
X = cbind(1, X_scaled)
coeff = crossprod(X) + I*lambda
rhs = crossprod(X, y)
beta_blup = solve(coeff, rhs)[-1]
Question: Are the BLUP and SBLUP solutions the same?
cor(beta_blup, beta_sblup)
## [1] 1
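They are: the correlation is 1. This is expected because, with standardised genotypes, \(\mathbf{X}'\mathbf{X} = (n-1)\mathbf{R}\) and \(\mathbf{X}'\mathbf{y} = (n-1)\mathbf{b}\), so dividing the individual-level BLUP equations by \(n-1\) recovers the summary-level system above (up to the \(n\) versus \(n-1\) factor applied to \(\lambda\), which is negligible here).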
In practice, GWAS is often done with unscaled genotypes (i.e., coded as 0, 1, 2). In this case, we need to first scale the marginal effects and then unscale the joint effect estimates to put them back on the per-allele scale.
# run GWAS on the 0/1/2 genotypes
fit = apply(X_unscaled, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b = fit[1,]
se = fit[2,]

# calculate the scale factor for each SNP (i.e., sqrt(heterozygosity/vary)),
# assuming each SNP effect is vanishingly small
scale = sqrt(1/(nrecords*se^2))

# scale the marginal effects
b_scaled = b*scale
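To see why this works, note that for a simple regression with a vanishingly small SNP effect, the sampling variance of the marginal estimate is approximately
\[
se(b_j)^2 \approx \frac{\mathrm{var}(y)}{n\,\mathrm{var}(x_j)},
\qquad\text{so}\qquad
\sqrt{\frac{1}{n\,se(b_j)^2}} \approx \sqrt{\frac{\mathrm{var}(x_j)}{\mathrm{var}(y)}},
\]
where \(\mathrm{var}(x_j) = 2p_jq_j\) is the heterozygosity of SNP \(j\) under Hardy-Weinberg equilibrium. Multiplying the per-allele effect \(b_j\) by this factor therefore puts it on the standardised-genotype scale.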
Perform SBLUP using the scaled marginal effects:
lambda = 10
I = diag(nmarkers)
coeff = R + I*lambda/nrecords
rhs = b_scaled
beta_sblup_scaled = solve(coeff, rhs)
beta_sblup_unscaled = beta_sblup_scaled/scale
Then we can predict the GEBVs of the selection candidates.
# get the predicted genetic value (GEBV)
ghat = Xval %*% beta_sblup_unscaled
# prediction R-square
summary(lm(yval~ghat))$r.squared
## [1] 0.4759358
In practice, we often use a sparse LD correlation matrix in which near-zero LD correlations are set to zero. Let's see how this influences the SBLUP performance.
# arbitrarily set an LD threshold such that R becomes sparse
R_sp = R
R_sp[abs(R_sp) < 0.2] = 0
lambda = 10
I = diag(nmarkers)
coeff = R_sp + I*lambda/nrecords
rhs = b_scaled
beta_sblup_scaled = solve(coeff, rhs)
beta_sblup_unscaled_sp = beta_sblup_scaled/scale
plot(beta_sblup_unscaled, beta_sblup_unscaled_sp)
abline(a=0, b=1)
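To quantify the influence on prediction rather than just on the effect estimates, you can compare the prediction R-square of the dense and sparse solutions in the validation population; a minimal sketch using the objects defined above:

# prediction R-square with the dense and the sparse LD matrix
ghat_sp = Xval %*% beta_sblup_unscaled_sp
summary(lm(yval ~ ghat))$r.squared
summary(lm(yval ~ ghat_sp))$r.squared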
The SBayesR method is implemented in sbayesr.R. We also include the BayesR method in bayesr.R. Have a look at the code and see what the differences are.
source("/data/module5/scr/sbayesr.R")
Run SBayesR. Note that here we use the marginal effects from the GWAS on unstandardised genotypes as input, and the scaling factors are computed using the standard errors.
sr.res = sbayesr(b, se, nrecords, R)
##
## iter 100, nnz = 8, sigmaSq = 0.888, h2 = 0.954, vare = 1.000
##
## iter 200, nnz = 8, sigmaSq = 1.314, h2 = 0.859, vare = 1.000
##
## iter 300, nnz = 3, sigmaSq = 4.070, h2 = 0.913, vare = 1.000
##
## iter 400, nnz = 4, sigmaSq = 0.492, h2 = 0.938, vare = 1.000
##
## iter 500, nnz = 3, sigmaSq = 3.921, h2 = 0.888, vare = 1.000
##
## iter 600, nnz = 4, sigmaSq = 2.423, h2 = 1.022, vare = 1.000
##
## iter 700, nnz = 9, sigmaSq = 3.864, h2 = 0.925, vare = 1.000
##
## iter 800, nnz = 8, sigmaSq = 1.120, h2 = 1.046, vare = 1.000
##
## iter 900, nnz = 7, sigmaSq = 1.850, h2 = 0.857, vare = 1.000
##
## iter 1000, nnz = 3, sigmaSq = 1.094, h2 = 0.971, vare = 1.000
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.4637807 0.2475237 0.1396782 0.1490173 4.5000000 2.3073678 0.9583929 1.0000000
beta.sr = colMeans(sr.res$beta)
Run BayesR as a benchmark:
source("/data/module5/scr/bayesr.R")
r.res = bayesr(X_unscaled, y)
##
## iter 100, nnz = 5, sigmaSq = 2.204, h2 = 0.439, vare = 3.835, varg = 2.998
##
## iter 200, nnz = 9, sigmaSq = 2.372, h2 = 0.429, vare = 4.095, varg = 3.078
##
## iter 300, nnz = 8, sigmaSq = 3.690, h2 = 0.441, vare = 4.187, varg = 3.300
##
## iter 400, nnz = 10, sigmaSq = 2.293, h2 = 0.505, vare = 3.709, varg = 3.789
##
## iter 500, nnz = 4, sigmaSq = 2.137, h2 = 0.459, vare = 4.279, varg = 3.630
##
## iter 600, nnz = 8, sigmaSq = 3.220, h2 = 0.514, vare = 3.505, varg = 3.713
##
## iter 700, nnz = 4, sigmaSq = 2.091, h2 = 0.509, vare = 3.636, varg = 3.769
##
## iter 800, nnz = 4, sigmaSq = 2.966, h2 = 0.547, vare = 4.081, varg = 4.932
##
## iter 900, nnz = 8, sigmaSq = 2.215, h2 = 0.544, vare = 3.788, varg = 4.527
##
## iter 1000, nnz = 5, sigmaSq = 2.055, h2 = 0.514, vare = 3.689, varg = 3.895
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.2956888 0.2864830 0.2164192 0.2014090 6.8600000 3.9276307 0.5000334 3.9104262
## Varg
## 3.9242224
beta.r = colMeans(r.res$beta)
Question: Are the BayesR and SBayesR SNP effect estimates the same? What could possibly cause the difference? (Hint: the scaling process implies an assumption of a negligible effect size for each individual SNP.)
cor(beta.r, beta.sr)
## [1] 0.9762267
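A scatter plot of the two sets of estimates makes the comparison concrete; points off the identity line are the SNPs for which the summary-level approximation matters most:

plot(beta.r, beta.sr, xlab="BayesR estimate", ylab="SBayesR estimate")
abline(a=0, b=1)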
Question: Can you tell which SNPs are likely to be the causal variants based on the posterior inclusion probability (PIP)? The PIP is a Bayesian measure of SNP-trait association, calculated as the frequency with which the SNP is fitted with a non-zero effect in the model across MCMC samples.
delta = (sr.res$beta != 0)
pip = colMeans(delta)
plot(pip, type="h", xlab="SNP", ylab="Posterior inclusion probability")
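For example, SNPs exceeding a PIP threshold (0.9 here, an arbitrary choice for illustration) would be the strongest candidates:

which(pip > 0.9)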
Next, we run SBLUP on the simulated data using GCTA. The command manual can be found at https://yanglab.westlake.edu.cn/software/gcta/#SBLUP. It requires an LD reference sample with genotypes; here, the test population (n=500) is used as the LD reference sample. This step will take about 5 minutes.
bfile="/data/module5/sim/test"
ma="/data/module5/sim/gwas.ma"
gcta --bfile $bfile \
--cojo-file $ma \
--cojo-sblup 273604 \
--cojo-wind 1000 \
--thread-num 8 \
--out gwas
The SNP results are stored in the file gwas.sblup.cojo (named after the --out prefix). You can use these SBLUP estimates for polygenic prediction.
target="/data/module5/sim/target"
sblup="gwas.sblup.cojo"
plink --bfile $target --score $sblup 1 2 4 sum center --out target.sblup
phenFile="/data/module5/sim/target.phen"
covFile="/data/module5/sim/target.cov"
prsFile="target.sblup.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
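The script get_pred_r2.R computes the prediction R-square of the PRS in the target sample. Its exact contents may differ, but a minimal sketch of the idea is below, assuming the phenotype and covariate files are in PLINK format (FID, IID, then values) and the .profile file contains a SCORESUM column:

# hypothetical sketch of get_pred_r2.R: R-square of the PRS after covariate adjustment
args = commandArgs(trailingOnly = TRUE)
phen = read.table(args[1])                 # FID, IID, phenotype
covs = read.table(args[2])                 # FID, IID, covariates
prs  = read.table(args[3], header = TRUE)  # PLINK .profile with a SCORESUM column
colnames(phen)[1:3] = c("FID", "IID", "phen")
colnames(covs)[1:2] = c("FID", "IID")
dat = merge(merge(phen, covs, by = c("FID", "IID")), prs, by = c("FID", "IID"))
covMat = as.matrix(dat[, 4:(ncol(covs) + 1)])   # covariate columns after the merge
yadj = resid(lm(dat$phen ~ covMat))             # adjust phenotype for covariates
summary(lm(yadj ~ dat$SCORESUM))$r.squared      # prediction R-square of the PRS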
Question: How does it compare to the BLUP result using individual-level data?
A tutorial for using GCTB to run SBayesR can be found at https://cnsgenomics.com/software/gctb/#Tutorial. GCTB also uses a .ma file as input for the GWAS summary data.
Since the computing time and memory consumption of SBayesR are independent of the GWAS sample size and only summary-level data are required, we can run SBayesR with a much larger GWAS data set. Find the GWAS summary data file generated from the data simulated based on the full UK Biobank sample (n=350,000):
ls /data/module5/sim/ukb_matched.ma
DO NOT run this step in the practical because it can take a while and quite a bit of space. Instead, use the data that we have generated for you as shown in the next step.
GCTB provides options to compute different types of LD matrices. In our SBayesR paper (https://doi.org/10.1038/s41467-019-12653-0), we used a shrunk LD matrix, which can be built with the following commands.
# Build shrunk LD matrix for each chromosome
# DO NOT RUN THIS IN CLASS
bfile="/data/module5/sim/gwas"
genmap="/data/module5/sim/interpolated_genetic_map.txt"
> mldm.txt
for i in `seq 1 22`
do
  gctb --bfile $bfile \
       --chr $i \
       --make-shrunk-ldm \
       --gen-map $genmap \
       --out chr$i
  echo "chr$i.ldm.shrunk" >> mldm.txt
done
In practice, if you are using summary statistics from a GWAS of European ancestry, we recommend using the banded matrix or the shrunk sparse matrix, which are computed from 1M HapMap3 SNPs in the UKB and can be downloaded from our website (https://cnsgenomics.com/software/gctb/#Download).
SBayesR can be carried out using the command below. This may take about 5 minutes or more, depending on how busy the cluster is, and about 20GB of RAM. Our cluster may not be able to accommodate this analysis for all students at the same time. You can check the results produced from the same code at /data/module5/sim/sbayesr.parRes and /data/module5/sim/sbayesr.snpRes.
ma="/data/module5/sim/ukb_matched.imputed.ma"
mldm="/data/module5/sim/ldm/mldm.txt"
gctb --sbayes R \
--mldm $mldm \
--gwas-summary $ma \
--gamma 0,0.01,0.1,1 \
--pi 0.994,0.003,0.002,0.001 \
--chain-length 1000 \
--burn-in 200 \
--out sbayesr
The result file sbayesr.parRes shows the posterior estimates of the model parameters, and sbayesr.snpRes shows the estimates of the joint SNP effects.
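If you want to inspect the SNP effect estimates in R, the snpRes file is whitespace-delimited with a header row; columns 2, 5 and 8, which the PLINK scoring command below uses, are the SNP identifier, the effect allele and the joint effect estimate:

snpRes = read.table("sbayesr.snpRes", header = TRUE)
head(snpRes)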
If interested, you can run SBayesR on chromosome 1 only:
ma="/data/module5/sim/ukb_matched.imputed.ma"
ldm="/data/module5/sim/ldm/chr1.ldm.shrunk"
gctb --sbayes R \
--ldm $ldm \
--gwas-summary $ma \
--gamma 0,0.01,0.1,1 \
--pi 0.994,0.003,0.002,0.001 \
--chain-length 1000 \
--burn-in 200 \
--out sbayesr.chr1
You can run polygenic prediction using the effect estimates from SBayesR. First, copy /data/module5/sim/sbayesr.snpRes, the result from the genome-wide analysis, to your current folder. You are welcome to try running the genome-wide analysis after the class when the server is not busy. Then, run
target="/data/module5/sim/target"
snpRes="sbayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.sbayesr
phenFile="/data/module5/sim/target.phen"
covFile="/data/module5/sim/target.cov"
prsFile="target.sbayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question: How many SNPs are fitted with non-zero effects? Is this number consistent with the number of causal variants (100) in the simulation? Is the prediction accuracy better than that of C+PT and SBLUP?
cat sbayesr.parRes
## Mean SD
## NumSnp1 261737.093750 5.634700
## NumSnp2 3.287500 3.892922
## NumSnp3 1.800000 2.619160
## NumSnp4 67.800003 3.779550
## Vg1 0.000000 0.000000
## Vg2 0.000241 0.000383
## Vg3 0.001843 0.003195
## Vg4 0.997916 0.003236
## SigmaSq 2.047411 0.551730
## ResVar 98.869164 0.879301
## GenVar 89.399185 1.834236
## hsq 0.471580 0.009676
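As a quick check on the first question above, the posterior mean number of SNPs with non-zero effects can be read off sbayesr.parRes as the sum of the SNPs in the non-null components:

# posterior mean number of non-zero SNPs (mixture components 2-4)
3.2875 + 1.8 + 67.8   # about 72.9 SNPs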
NumSnp1-NumSnp4 are the numbers of SNPs in mixture components 1 to 4 (component 1: zero effects; component 2: small effects with variance \(0.01 \times \sigma^2_\beta\); component 3: medium effects with variance \(0.1 \times \sigma^2_\beta\); component 4: large effects with variance \(1 \times \sigma^2_\beta\)). hsq is the SNP-based heritability estimate.
A work by Jian Zeng