In this practical you will perform polygenic prediction using SBLUP and SBayesR, variations of BLUP and BayesR that require only GWAS summary statistics. We will use the same data sets as in the session on individual-level BLUP and BayesR analysis.
The scripts (source code) used for this practical include:
ls /data/module5/scr/sbayesr.R
ls /data/module5/scr/get_pred_r2.R
Load the data in R:
nmarkers <- 10 #number of markers
nrecords <- 325 #number of records
# data for training population
X <- matrix(scan("/data/module5/toy/xmat_trn.inp"), ncol=nmarkers, byrow = TRUE)
y <- matrix(scan("/data/module5/toy/yvec_trn.inp"), byrow = TRUE)
# data for validation population
Xval <- matrix(scan("/data/module5/toy/xmat_val.inp"), ncol=nmarkers, byrow = TRUE)
yval <- matrix(scan("/data/module5/toy/yvec_val.inp"), byrow = TRUE)
First, let us assume the genotype matrix is standardised (scaled) to have mean zero and variance one in each column. Run GWAS using the standardised genotypes:
X_unscaled = X
X_scaled = apply(X, 2, scale)
# run GWAS on the scaled genotypes
fit = apply(X_scaled, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b_scaled = fit[1,]
# LD correlation matrix (correlation is invariant to scaling)
R = cor(X)
Run SBLUP using the marginal SNP effects from GWAS and the LD correlation matrix:
lambda = 10
I = diag(nmarkers)
coeff = R + I*lambda/nrecords
rhs = b_scaled
beta_sblup = solve(coeff, rhs)
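To see why this works: for standardised genotypes, \(X'X \approx nR\) and the vector of marginal effects is \(b \approx X'y/n\), so dividing the BLUP equations by \(n\) gives

\[
(X'X + I\lambda)\,\hat{\beta} = X'y
\quad\Longrightarrow\quad
\left(R + I\,\frac{\lambda}{n}\right)\hat{\beta} = b,
\]

which involves only the GWAS marginal effects and an LD correlation matrix, not the individual-level data.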
Run BLUP using the individual-level data as benchmark:
lambda = 10
I = diag(nmarkers+1)
X = cbind(1, X_scaled)
coeff = crossprod(X) + I*lambda
rhs = crossprod(X, y)
beta_blup = solve(coeff, rhs)[-1]
Question: Are the BLUP and SBLUP solutions the same?
cor(beta_blup, beta_sblup)
## [1] 1
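A correlation of 1 can still hide a difference in scale, so it is worth checking the estimates directly. A minimal sketch, noting that small discrepancies may arise because crossprod of the scaled genotypes equals \((n-1)R\) rather than \(nR\):
# compare the two solutions directly
plot(beta_blup, beta_sblup)
abline(a=0, b=1)
max(abs(beta_blup - beta_sblup))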
In practice, GWAS is often done with unscaled genotypes (i.e., coded as 0, 1, 2). In this case, we need to first scale the marginal effects and then unscale the joint effect estimates to put them back on the per-allele scale.
# run GWAS on the 0/1/2 genotypes
fit = apply(X_unscaled, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b = fit[1,]
se = fit[2,]
# calculate the scale factor for each SNP, i.e., sqrt(heterozygosity/var(y)),
# assuming each SNP effect is vanishingly small
scale = sqrt(1/(nrecords*se^2))
# scale the marginal effects
b_scaled = b*scale
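As a quick check of this approximation, the se-based scale factor should closely match the direct ratio of standard deviations, sd(x)/sd(y). A sketch (exact equality is not expected, because each SNP does explain a little variance):
# compare the se-based scale factor with sd(x)/sd(y)
scale_direct = apply(X_unscaled, 2, sd)/sd(y)
plot(scale, scale_direct)
abline(a=0, b=1)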
Perform SBLUP using the scaled marginal effects:
lambda = 10
I = diag(nmarkers)
coeff = R + I*lambda/nrecords
rhs = b_scaled
beta_sblup_scaled = solve(coeff, rhs)
beta_sblup_unscaled = beta_sblup_scaled/scale
Then we can predict the GEBV of the selection candidates.
# get the predicted genetic value (GEBV)
ghat = Xval %*% beta_sblup_unscaled
# prediction R-square
summary(lm(yval ~ ghat))$r.squared
## [1] 0.4759358
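For a simple linear regression, the R-square equals the squared correlation between the observed and predicted values, so the same number can be obtained with:
cor(yval, ghat)^2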
In practice, we often use a sparse LD correlation matrix in which near-zero LD correlations are set to zero. Let's see how this influences the SBLUP performance.
# arbitrarily set a LD threshold such that R becomes sparse
R_sp = R
R_sp[abs(R_sp) < 0.2] = 0
lambda = 10
I = diag(nmarkers)
coeff = R_sp + I*lambda/nrecords
rhs = b_scaled
beta_sblup_scaled = solve(coeff, rhs)
beta_sblup_unscaled_sp = beta_sblup_scaled/scale
# compare the effect estimates from the dense and sparse LD matrices
plot(beta_sblup_unscaled, beta_sblup_unscaled_sp)
abline(a=0, b=1)
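To quantify the impact of sparsifying the LD matrix on prediction, we can compute the prediction R-square with the sparse-LD estimates (a quick sketch reusing the objects above):
ghat_sp = Xval %*% beta_sblup_unscaled_sp
summary(lm(yval ~ ghat_sp))$r.squared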
The SBayesR method is implemented in sbayesr.R. We also include the BayesR method in bayesr.R. Have a look at the code and see what the differences are.
source("/data/module5/scr/sbayesr.R")Run SBayesR. Note that here we are using the marginal effects from GWAS with unstandardised genotypes as input data, and the scaling factors are computed using standard errors.
sr.res = sbayesr(b, se, nrecords, R)
##
## iter 100, nnz = 8, sigmaSq = 0.888, h2 = 0.954, vare = 1.000
##
## iter 200, nnz = 8, sigmaSq = 1.314, h2 = 0.859, vare = 1.000
##
## iter 300, nnz = 3, sigmaSq = 4.070, h2 = 0.913, vare = 1.000
##
## iter 400, nnz = 4, sigmaSq = 0.492, h2 = 0.938, vare = 1.000
##
## iter 500, nnz = 3, sigmaSq = 3.921, h2 = 0.888, vare = 1.000
##
## iter 600, nnz = 4, sigmaSq = 2.423, h2 = 1.022, vare = 1.000
##
## iter 700, nnz = 9, sigmaSq = 3.864, h2 = 0.925, vare = 1.000
##
## iter 800, nnz = 8, sigmaSq = 1.120, h2 = 1.046, vare = 1.000
##
## iter 900, nnz = 7, sigmaSq = 1.850, h2 = 0.857, vare = 1.000
##
## iter 1000, nnz = 3, sigmaSq = 1.094, h2 = 0.971, vare = 1.000
##
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare
## 0.4637807 0.2475237 0.1396782 0.1490173 4.5000000 2.3073678 0.9583929 1.0000000
beta.sr = colMeans(sr.res$beta)
Run BayesR as benchmark:
source("/data/module5/scr/bayesr.R")r.res = bayesr(X_unscaled, y)##
## iter 100, nnz = 5, sigmaSq = 2.204, h2 = 0.439, vare = 3.835, varg = 2.998
##
## iter 200, nnz = 9, sigmaSq = 2.372, h2 = 0.429, vare = 4.095, varg = 3.078
##
## iter 300, nnz = 8, sigmaSq = 3.690, h2 = 0.441, vare = 4.187, varg = 3.300
##
## iter 400, nnz = 10, sigmaSq = 2.293, h2 = 0.505, vare = 3.709, varg = 3.789
##
## iter 500, nnz = 4, sigmaSq = 2.137, h2 = 0.459, vare = 4.279, varg = 3.630
##
## iter 600, nnz = 8, sigmaSq = 3.220, h2 = 0.514, vare = 3.505, varg = 3.713
##
## iter 700, nnz = 4, sigmaSq = 2.091, h2 = 0.509, vare = 3.636, varg = 3.769
##
## iter 800, nnz = 4, sigmaSq = 2.966, h2 = 0.547, vare = 4.081, varg = 4.932
##
## iter 900, nnz = 8, sigmaSq = 2.215, h2 = 0.544, vare = 3.788, varg = 4.527
##
## iter 1000, nnz = 5, sigmaSq = 2.055, h2 = 0.514, vare = 3.689, varg = 3.895
##
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare
## 0.2956888 0.2864830 0.2164192 0.2014090 6.8600000 3.9276307 0.5000334 3.9104262
##      Varg
## 3.9242224
beta.r = colMeans(r.res$beta)
Question: Are the BayesR and SBayesR SNP effect estimates the same? What could cause the difference? (Hint: the scaling process assumes a negligible effect size for each individual SNP.)
cor(beta.r, beta.sr)
## [1] 0.9762267
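We can also compare the two methods on prediction accuracy in the validation set. A minimal sketch, assuming (as the direct comparison above suggests) that sbayesr.R returns effects on the same per-allele scale as bayesr.R:
# predict GEBV in the validation set with both sets of estimates
ghat.r = Xval %*% beta.r
ghat.sr = Xval %*% beta.sr
summary(lm(yval ~ ghat.r))$r.squared
summary(lm(yval ~ ghat.sr))$r.squared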
Question: Can you tell which SNPs are likely to be the causal variants based on the posterior inclusion probability (PIP)? PIP is a Bayesian measure of SNP-trait association, calculated as the proportion of MCMC samples in which the SNP is fitted with a non-zero effect.
delta = (sr.res$beta != 0)
pip = colMeans(delta)
plot(pip, type="h", xlab="SNP", ylab="Posterior inclusion probability")
Next, we run SBLUP in a larger simulated data set using GCTA. The command manual can be found at https://yanglab.westlake.edu.cn/software/gcta/#SBLUP.
It requires an LD reference sample with genotypes. Here, the test population (n=500) is used as the LD reference sample.
This step will take about 5 minutes.
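The number given to --cojo-sblup is the shrinkage parameter \(\lambda = m(1/h^2_{SNP} - 1)\), where \(m\) is the number of SNPs and \(h^2_{SNP}\) is the SNP-based heritability. As an illustration with assumed values (not given in this practical), m = 273,604 SNPs and \(h^2_{SNP} = 0.5\) yield the value used below:
# illustrative only: assumed m and h2
m = 273604
h2 = 0.5
m * (1/h2 - 1)  # = 273604, the value passed to --cojo-sblup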
bfile="/data/module5/sim/test"
ma="/data/module5/sim/gwas.ma"
gcta --bfile $bfile \
--cojo-file $ma \
--cojo-sblup 273604 \
--cojo-wind 1000 \
--thread-num 8 \
--out gwas
The SNP results are stored in the file gwas.sblup.cojo. You can use these SBLUP estimates for polygenic prediction; the plink --score command below reads the SNP ID, the effect allele and the SBLUP effect from columns 1, 2 and 4 of this file.
target="/data/module5/sim/target"
sblup="gwas.sblup.cojo"
plink --bfile $target --score $sblup 1 2 4 sum center --out target.sblup
phenFile="/data/module5/sim/target.phen"
covFile="/data/module5/sim/target.cov"
prsFile="target.sblup.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question: How does this compare to the BLUP result using individual-level data?
A tutorial for using GCTB to run SBayesR can be found at https://cnsgenomics.com/software/gctb/#Tutorial. GCTB also uses a .ma file to input the GWAS summary data.
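A .ma file is plain text with a header line and one row per SNP; the columns are the SNP ID, the effect allele (A1), the other allele (A2), the A1 frequency, the marginal effect estimate, its standard error, the p-value, and the GWAS sample size. For example (hypothetical rows):
SNP A1 A2 freq b se p N
rs1001 A G 0.8493 0.0024 0.0055 0.6653 350000
rs1002 C G 0.0306 -0.0045 0.0115 0.6953 350000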
Since the computing time and memory consumption of SBayesR are independent of the GWAS sample size and only summary-level data are required, we can run SBayesR with a much larger GWAS data set. Find the GWAS summary data file generated from the simulated data based on the full UK Biobank sample (n=350,000):
ls /data/module5/sim/ukb_matched.ma
DO NOT run the LD matrix computation below in the practical because it can take a while and quite a bit of disk space. Instead, use the matrices that we have generated for you, as shown in the next step.
GCTB provides options to compute different types of LD matrix. In our SBayesR paper (https://doi.org/10.1038/s41467-019-12653-0), we used a shrunk LD matrix, which can be built with the following commands.
# Build shrunk LD matrix for each chromosome
# DO NOT RUN THIS IN CLASS
bfile="/data/module5/sim/gwas"
genmap="/data/module5/sim/interpolated_genetic_map.txt"
# start with an empty list of per-chromosome LD matrix files
> mldm.txt
for i in `seq 1 22`
do
gctb --bfile $bfile \
--chr $i \
--make-shrunk-ldm \
--gen-map $genmap \
--out chr$i
echo "chr$i.ldm.shrunk" >> mldm.txt
done
In practice, if you are using summary statistics from a GWAS of European ancestry, we recommend using the banded matrix or the shrunk sparse matrix, which were computed from 1M HapMap3 SNPs in the UKB and can be downloaded from our website (https://cnsgenomics.com/software/gctb/#Download).
SBayesR can be carried out using the command below. This may take about 5 minutes or more, depending on how busy the cluster is, and about 20GB of RAM. Our cluster may not be able to accommodate this analysis for all students at the same time; you can check the results produced from the same code at /data/module5/sim/sbayesr.parRes and /data/module5/sim/sbayesr.snpRes.
ma="/data/module5/sim/ukb_matched.imputed.ma"
mldm="/data/module5/sim/ldm/mldm.txt"
gctb --sbayes R \
--mldm $mldm \
--gwas-summary $ma \
--gamma 0,0.01,0.1,1 \
--pi 0.994,0.003,0.002,0.001 \
--chain-length 1000 \
--burn-in 200 \
--out sbayesr
The result file sbayesr.parRes shows the posterior estimates of the model parameters, and sbayesr.snpRes shows the estimates of the joint SNP effects.
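To inspect the SNP effect estimates in R, a minimal sketch (assuming the standard GCTB snpRes header; as the plink command below indicates, columns 2, 5 and 8 hold the SNP name, the effect allele and the joint effect estimate):
snpres = read.table("/data/module5/sim/sbayesr.snpRes", header=TRUE)
head(snpres)
# distribution of the joint SNP effect estimates (column 8)
hist(snpres[,8], breaks=50, xlab="Joint SNP effect estimate", main="")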
If interested, you can run SBayesR on chromosome 1 only:
ma="/data/module5/sim/ukb_matched.imputed.ma"
ldm="/data/module5/sim/ldm/chr1.ldm.shrunk"
gctb --sbayes R \
--ldm $ldm \
--gwas-summary $ma \
--gamma 0,0.01,0.1,1 \
--pi 0.994,0.003,0.002,0.001 \
--chain-length 1000 \
--burn-in 200 \
--out sbayesr.chr1
You can run polygenic prediction using the effect estimates from SBayesR. First, copy /data/module5/sim/sbayesr.snpRes, the result from the genome-wide analysis, to your current folder. (You are welcome to try running the genome-wide analysis after class when the server is not busy.) Then, run:
target="/data/module5/sim/target"
snpRes="sbayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.sbayesr
phenFile="/data/module5/sim/target.phen"
covFile="/data/module5/sim/target.cov"
prsFile="target.sbayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question: How many SNPs are fitted with non-zero effects? Is this number consistent with the number of causal variants (100) in the simulation? Is the prediction accuracy better than that of C+PT and SBLUP?
cat sbayesr.parRes
##                  Mean        SD
## NumSnp1 261737.093750  5.634700
## NumSnp2      3.287500  3.892922
## NumSnp3      1.800000  2.619160
## NumSnp4     67.800003  3.779550
## Vg1          0.000000  0.000000
## Vg2          0.000241  0.000383
## Vg3          0.001843  0.003195
## Vg4          0.997916  0.003236
## SigmaSq      2.047411  0.551730
## ResVar      98.869164  0.879301
## GenVar      89.399185  1.834236
## hsq          0.471580  0.009676
NumSnp1-NumSnp4 are the numbers of SNPs in mixture components 1 to 4 (component 1: zero effect; component 2: small effects with variance \(0.01 \times \sigma^2_\beta\); component 3: medium effects with variance \(0.1 \times \sigma^2_\beta\); component 4: large effects with variance \(1 \times \sigma^2_\beta\)). hsq is the SNP-based heritability estimate.
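In other words, with the --gamma and --pi settings used above, SBayesR assumes that each SNP effect follows the four-component mixture prior

\[
\beta_j \sim \pi_1\,\delta_0 + \pi_2\,N(0,\, 0.01\,\sigma^2_\beta) + \pi_3\,N(0,\, 0.1\,\sigma^2_\beta) + \pi_4\,N(0,\, \sigma^2_\beta),
\]

where \(\delta_0\) is a point mass at zero and the mixture proportions \(\pi\) (started at 0.994, 0.003, 0.002, 0.001) are estimated from the data.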
A work by Jian Zeng