We have learned Bayesian methods for polygenic prediction using individual-level genotypes and phenotypes. Recently, these methodologies have evolved to require only summary statistics from GWAS. In this practical, you will see how the individual-data- and summary-data-based methods are connected and where they differ (e.g., BayesR vs. SBayesR). Besides removing the need to access individual-level data, methods using summary statistics (sumstats) are also more computationally efficient, allowing us to perform the same analysis with a fraction of the computing resources. In our latest research, we exploit an eigen-decomposition approach to further boost computational efficiency. This new method (SBayesRC) allows us to fit millions of SNPs simultaneously and to incorporate functional annotations. As in the previous practical, we will use a small data set to illustrate the principles of SBayesR, and then use GCTB to run a large data set from the UK Biobank (UKB) with SBayesR and SBayesRC.
Note: the small data set can be freely downloaded. The UK Biobank data set cannot be downloaded because it contains real individual genotypes, even though individual IDs are masked.
This toy data set includes:
The trait was simulated such that SNP 1 has an effect size of 2 and SNP 5 has an effect size of 1. The trait heritability is set at 0.5.
Data location:
ls /data/module5/toy/
This data set is based on real genotypes at 273,604 SNPs from the UK Biobank, with a simulated trait affected by 100 causal variants and heritability 0.5. The samples are split into three independent cohorts:
Data location:
ls /data/module5/ukb/
This section helps you understand the R code used in the method.
Do not run the R code below in this section.
The algorithm for SBayesR (Summary-data-based BayesR) is implemented in sbayesr.R.
cat /data/module5/scr/sbayesr.R
Here we focus on the code that differs from the individual-data-based BayesR implemented in bayesr.R.
Compared to BayesR, SBayesR includes an additional step at the beginning to scale all GWAS marginal effects so that they are in per-genotype-standard-deviation units with the phenotypic variance equal to 1. GWAS is usually performed on the 0/1/2 genotypes, whereas the algorithm assumes standardised genotypes, as shown in the lecture.
# Step 1. scale the GWAS SNP effects
scale = sqrt(1/(n*se^2 + b^2)) # scale factors for marginal SNP effects
b_scaled = b*scale # scaled marginal SNP effects (in units of genotype sd / phenotype sd)
vary = 1 # phenotypic variance = 1 after the scaling
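To see what this scaling does, here is a minimal numerical sketch (simulated toy data; all names made up): the scaled marginal effect should approximately match the effect obtained by regressing the standardised phenotype on the standardised genotype.
set.seed(1)
n = 1000
x = rbinom(n, 2, 0.3)                 # 0/1/2 genotypes
y = c(scale(0.2*x + rnorm(n)))        # phenotype standardised to variance 1
fit = summary(lm(y ~ x))$coefficients[2, 1:2]
b = fit[1]; se = fit[2]               # GWAS marginal effect and its standard error
b_scaled = b*sqrt(1/(n*se^2 + b^2))   # scale factor as in sbayesr.R
coef(lm(y ~ scale(x)))[2]             # approximately equal to b_scaled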
Once MCMC begins, the next step is to sample the SNP effects. Unlike BayesR, SBayesR does not include fixed effects, because they have already been adjusted for in the GWAS. For each SNP, if its effect is not zero, we sample it from a normal distribution whose mean is the BLUP solution to the “per-SNP mixed model equation”. The only difference between SBayesR and BayesR in this part is the use of summary-data equivalents:
rhs = n*bcorr[i] + n*oldSample # right hand side of the mixed model equation
rhs = rhs/vare
invLhs = 1.0/(n/c(vare) + invSigmaSq) # inverse of left hand side of the mixed model equation
betaHat = invLhs*c(rhs) # BLUP solution
# if the effect is not zero, we sample it from a normal distribution
beta[i] = rnorm(1, betaHat[delta], sqrt(invLhs[delta])) # given the distribution membership, the posterior is a normal distribution
###################################################################
# In contrast, this is what we have in individual-data-based BayesR
rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample # right hand side of the mixed model equation
rhs = rhs/vare
invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq) # inverse of left hand side of the mixed model equation
betaHat = invLhs*c(rhs) # BLUP solution
###################################################################
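In both versions, the full conditional of a nonzero SNP effect is normal. Written out for the summary-data version (with standardised genotypes, so \(x_j'x_j = n\)), the update in the code above corresponds to
\[\beta_j \mid \cdot \;\sim\; N\!\left(\frac{n\,(b^{corr}_j + \beta_j^{old})/\sigma^2_e}{n/\sigma^2_e + 1/(\gamma_k\sigma^2_\beta)},\;\left(\frac{n}{\sigma^2_e} + \frac{1}{\gamma_k\sigma^2_\beta}\right)^{-1}\right),\]
where \(\gamma_k\sigma^2_\beta\) is the effect variance of the sampled mixture component (invSigmaSq in the code corresponds to \(1/(\gamma_k\sigma^2_\beta)\)) and \(\beta_j^{old}\) is the current value of the effect (oldSample).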
After sampling the SNP effect \(\beta_j\), instead of adjusting \(y\) (as in BayesR), we update \(b_{corr}\) in SBayesR. This strategy is known as “right-hand side updating”, because it updates the vector of \(X'y\) (\(=nb\)), the right-hand side of the mixed model equations.
if (delta > 1) {
  # ...
  bcorr = bcorr + R[,i]*(oldSample - beta[i]) # update bcorr with the new sample
  # ...
} else {
  if (oldSample) bcorr = bcorr + R[,i]*oldSample # the new sample is zero, so just add back the old sample
  # ...
}
###################################################################
# In contrast, this is what we have in individual-data-based BayesR
if (delta > 1) {
  # ...
  ycorr = ycorr + X[,j]*(oldSample - beta[j]) # update ycorr with the new sample
  # ...
} else {
  if (oldSample) ycorr = ycorr + X[,j]*oldSample # the new sample is zero, so just add back the old sample
  # ...
}
###################################################################
This approach is efficient because \(b_{corr}\) is only updated for SNPs in LD with SNP \(i\), which is typically a smaller subset within an LD block.
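To convince yourself that the incremental update is exact, here is a minimal sketch (toy objects made up for illustration) showing that it reproduces a full recomputation of \(b_{corr} = b - R\beta\):
set.seed(1)
m = 5
R = cor(matrix(rnorm(100*m), ncol = m))        # toy LD correlation matrix
beta = rnorm(m)                                # current effect samples
b = c(R %*% beta) + rnorm(m, sd = 0.01)        # toy marginal effects
bcorr = b - c(R %*% beta)                      # full computation
oldSample = beta[2]; beta[2] = rnorm(1)        # draw a new sample for SNP 2
bcorr = bcorr + R[,2]*(oldSample - beta[2])    # right-hand-side update
all.equal(bcorr, b - c(R %*% beta))            # TRUE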
The Gibbs samplers for \(\pi\) and \(\sigma^2_{\beta}\) are the same as in BayesR. To compute genetic variance, we use
\[\sigma^2_g = \beta'R\beta = \beta'(b - b_{corr})\] because \(b_{corr} = b - R\beta\). Since phenotypic variance is set to 1, SNP-based heritability equals the genetic variance.
# Step 5. Compute the SNP-based heritability
bRb = crossprod(beta, (b-bcorr)) # compute genetic variance = beta' R beta
varg = bRb
h2 = varg
We can also estimate the residual variance. However, unlike BayesR, the residual variance is not guaranteed to be positive. Issues like LD mismatches between GWAS and reference samples, variation in per-SNP sample size, or errors in the GWAS summary statistics can result in a negative residual variance estimate. When this happens, the algorithm cannot proceed, so we force the residual variance to equal 1 (phenotypic variance).
sse = (vary - varg)*n
vare = (sse + nue*scalee)/rchisq(1, n+nue)
if (vare <= 0) vare = vary # sometimes sse can be negative and would cause a problem
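Written as a distribution, the draw above is from a scaled inverse chi-square, with prior degrees of freedom \(\nu_e\) (nue in the code) and prior scale \(S_e\) (scalee):
\[\sigma^2_e \mid \cdot \;\sim\; \frac{\mathrm{SSE} + \nu_e S_e}{\chi^2_{n+\nu_e}}, \qquad \mathrm{SSE} = n(\sigma^2_y - \sigma^2_g).\]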
The algorithm for SBayesRC (an extension of SBayesR to incorporate functional annotations) is implemented in sbayesrc.R.
cat /data/module5/scr/sbayesrc.R
Here are the key differences between SBayesR and SBayesRC.
First, we need to initialise additional variables related to the SNP annotations.
# annotation related variables
snpPi = matrix(rep(pi, m), byrow=TRUE, nrow=m, ncol=ndist) # per-SNP pi given the SNP annotations
alpha_mcmc = matrix(0, niter, ncol(anno)*(ndist-1)) # MCMC samples of annotation effects
p = matrix(nrow = m, ncol = ndist-1) # per-SNP conditional probability p
z = matrix(nrow = m, ncol = ndist-1) # per-SNP conditional distribution membership indicator
alpha = matrix(0, nrow = ncol(anno), ncol = ndist-1) # vector of annotation effects
sigmaSqAlpha = rep(1, ndist-1) # variance of annotation effects
sigmaSqAlpha_mcmc = matrix(0, niter, ndist-1) # MCMC samples of the variance of annotation effects
nAnno = ncol(anno) # number of annotations
annoDiag = apply(anno, 2, crossprod) # diagonal values of A'A where A is the annotation matrix
Second, when sampling SNP effects, we record the conditional distribution membership for the SNP:
if (delta > 1) {
  # ...
  for (j in 1:(delta-1)) { # set the "step-up" indicator variables to one
    z[i,j] = 1
  }
} else {
  # ...
}
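For example, with four mixture components (ndist = 4) there are three step-up indicators per SNP, and the membership encodes a sequence of step-up decisions (a sketch of the encoding, assuming the elided branches record a zero for the failed step):
# delta = 1 (zero effect):   z[i,] = c(0, NA, NA)   # failed the first step-up
# delta = 2 (small effect):  z[i,] = c(1, 0, NA)
# delta = 3 (medium effect): z[i,] = c(1, 1, 0)
# delta = 4 (large effect):  z[i,] = c(1, 1, 1)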
Next, there is an extra step to sample the annotation effects. It may look complicated, but the sampling scheme is similar to how we sample the SNP effects with individual-level data. The main difference is that instead of using the genotype matrix as X and the phenotypes as y, here we use the annotation matrix as X and, as y, a latent variable (sampled from a truncated normal distribution) whose mean is a linear combination of the SNP annotations. The effect of an annotation in this model indicates how much it changes the prior probability that a SNP has a nonzero effect.
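This is the classic probit data-augmentation scheme: given the membership indicator \(z_{ij}\), the latent variable \(l_{ij}\) is drawn from a normal distribution with mean \(A_i\alpha_j\) (the linear predictor from the annotations) truncated to be positive if \(z_{ij}=1\) and negative if \(z_{ij}=0\), and the implied prior step-up probability is a probit function of the annotations:
\[p_{ij} = \Pr(z_{ij}=1) = \Phi(A_i\alpha_j),\]
where \(A_i\) is the \(i\)-th row of the annotation matrix and \(\Phi\) is the standard normal distribution function (pnorm in Step 5 below).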
# Step 4. Sample the annotation effects on the SNP distribution mixing probabilities
for (j in 1:(ndist-1)) { # have ndist-1 number of conditional distributions
  if (j==1) {
    idx = 1:m # for the first conditional distribution, data are all SNPs
    annoDiagj = annoDiag # diagonal values of A'A
  } else {
    idx = which(z[,j-1] > 0) # for the subsequent conditional distributions, data are the SNPs in the previous distribution
    if (length(idx) > 1) {
      annoDiagj = apply(as.matrix(anno[idx,]), 2, crossprod) # recompute A'A depending on the SNP memberships
    }
  }
  if (length(idx)) {
    zj = z[idx,j]
    mu = anno[idx,] %*% matrix(alpha[,j]) # linear combination of annotations, which will be the mean of the truncated normal distribution
    lj = array(0, length(zj)) # latent variable
    if (sum(zj==0)) lj[zj==0] = rtruncnorm(sum(zj==0), mean = mu[zj==0], sd = 1, a = -Inf, b = 0) # sample latent variable
    if (sum(zj==1)) lj[zj==1] = rtruncnorm(sum(zj==1), mean = mu[zj==1], sd = 1, a = 0, b = Inf) # sample latent variable
    # sample annotation effects using a Gibbs sampler (similar to the SNP effect sampler in the individual-level model)
    lj = lj - c(mu) # adjust the latent variable by all annotation effects
    # sample the intercept with a flat prior
    oldSample = alpha[1,j]
    rhs = sum(lj) + m*oldSample
    invLhs = 1.0/m
    ahat = invLhs*rhs
    alpha[1,j] = rnorm(1, ahat, sqrt(invLhs))
    lj = lj + (oldSample - alpha[1,j])
    # sample each annotation effect with a normal prior
    if (nAnno > 1) {
      for (k in 2:nAnno) {
        oldSample = alpha[k,j]
        rhs = crossprod(anno[idx,k], lj) + annoDiagj[k]*oldSample
        invLhs = 1.0/(annoDiagj[k] + 1.0/sigmaSqAlpha[j])
        ahat = invLhs*rhs
        alpha[k,j] = rnorm(1, ahat, sqrt(invLhs))
        lj = lj + anno[idx,k]*(oldSample - alpha[k,j])
      }
      # sample the annotation effect variance from a scaled inverse chi-square distribution
      sigmaSqAlpha[j] = (sum(alpha[-1,j]^2) + 2)/rchisq(1, nAnno-1+2)
    }
  }
}
alpha_mcmc[iter,] = c(alpha) # store MCMC samples of annotation effects
sigmaSqAlpha_mcmc[iter,] = sigmaSqAlpha
Given the sampled annotation effects, we can then compute the conditional probabilities for each SNP which determine how likely the SNP is to belong to higher-effect mixture components.
# Step 5. Compute per-SNP conditional probabilities from the annotation effects
p = apply(alpha, 2, function(x){pnorm(anno %*% x)})
Given the conditional probabilities, we can compute the joint probabilities (\(\pi\)) of each SNP belonging to each of the mixture components:
# Step 6. Compute the joint probabilities (pi) from the conditional probabilities (p)
for (k in 1:ndist) {
  if (k < ndist) { # the last component has no further "step-up" probability
    snpPi[,k] = 1.0 - p[,k]
  } else {
    snpPi[,k] = 1
  }
  if (k > 1) {
    for (kk in 1:(k-1)) {
      snpPi[,k] = snpPi[,k]*p[,kk]
    }
  }
}
# For example, with ndist = 4 we want:
# snpPi[,1] = 1 - p[,1]
# snpPi[,2] = (1-p[,2])*p[,1]
# snpPi[,3] = (1-p[,3])*p[,1]*p[,2]
# snpPi[,4] = p[,1]*p[,2]*p[,3]
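As a quick sanity check with made-up step-up probabilities, the joint probabilities telescope and always sum to one:
p1 = 0.3; p2 = 0.6; p3 = 0.1  # made-up conditional (step-up) probabilities
pi1 = 1 - p1
pi2 = (1 - p2)*p1
pi3 = (1 - p3)*p1*p2
pi4 = p1*p2*p3
pi1 + pi2 + pi3 + pi4         # equals 1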
These \(\pi\) values are then used in the next iteration of SNP effect sampling, just like in SBayesR.
Load data in R.
# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))
# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))
The SBayesR method is implemented in sbayesr.R.
source("/data/module5/scr/sbayesr.R")
## Warning: package 'MCMCpack' was built under R version 4.4.1
First, we need to obtain GWAS effect estimates (using 0/1/2 genotypes):
# run GWAS on the 0/1/2 genotypes
fit = apply(X, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b = fit[1,]
se = fit[2,]
Compute the LD correlation matrix:
R = cor(X)
Then we scale the GWAS effects to be in per-genotype-standard-deviation units:
nind = nrow(X) # GWAS sample size
scale = sqrt(1/(nind*se^2 + b^2)) # calculate the scale factor for each SNP
b_scaled = b*scale # scale the marginal effects
Now we are ready to run SBayesR:
res_sbayesr = sbayesr(b, se, nind, R)
##
## iter 100, nnz = 5, sigmaSq = 1.708, h2 = 0.464, vare = 0.487
##
## iter 200, nnz = 2, sigmaSq = 3.075, h2 = 0.434, vare = 0.642
##
## iter 300, nnz = 5, sigmaSq = 1.136, h2 = 0.488, vare = 0.538
##
## iter 400, nnz = 7, sigmaSq = 1.237, h2 = 0.524, vare = 0.503
##
## iter 500, nnz = 5, sigmaSq = 0.614, h2 = 0.475, vare = 0.526
##
## iter 600, nnz = 6, sigmaSq = 5.741, h2 = 0.510, vare = 0.508
##
## iter 700, nnz = 3, sigmaSq = 1.305, h2 = 0.506, vare = 0.464
##
## iter 800, nnz = 5, sigmaSq = 0.961, h2 = 0.482, vare = 0.533
##
## iter 900, nnz = 3, sigmaSq = 2.066, h2 = 0.504, vare = 0.462
##
## iter 1000, nnz = 6, sigmaSq = 1.701, h2 = 0.514, vare = 0.533
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.4339769 0.2486935 0.1943400 0.1229896 4.8420000 2.1536359 0.5034002 0.4974042
beta_sbayesr = colMeans(res_sbayesr$beta)
Run BayesR as a benchmark:
source("/data/module5/scr/bayesr.R")
res_bayesr = bayesr(X, y)
##
## iter 100, nnz = 10, sigmaSq = 1.418, h2 = 0.451, vare = 4.020, varg = 3.304
##
## iter 200, nnz = 7, sigmaSq = 2.304, h2 = 0.508, vare = 3.825, varg = 3.948
##
## iter 300, nnz = 8, sigmaSq = 4.110, h2 = 0.516, vare = 3.888, varg = 4.150
##
## iter 400, nnz = 5, sigmaSq = 3.808, h2 = 0.546, vare = 4.057, varg = 4.882
##
## iter 500, nnz = 6, sigmaSq = 4.083, h2 = 0.478, vare = 3.769, varg = 3.452
##
## iter 600, nnz = 6, sigmaSq = 2.811, h2 = 0.489, vare = 3.989, varg = 3.818
##
## iter 700, nnz = 2, sigmaSq = 2.374, h2 = 0.467, vare = 3.775, varg = 3.313
##
## iter 800, nnz = 8, sigmaSq = 3.129, h2 = 0.485, vare = 3.627, varg = 3.410
##
## iter 900, nnz = 8, sigmaSq = 1.915, h2 = 0.490, vare = 4.257, varg = 4.094
##
## iter 1000, nnz = 6, sigmaSq = 3.063, h2 = 0.448, vare = 4.323, varg = 3.509
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.2782621 0.2839210 0.2125146 0.2253023 7.1020000 4.3112584 0.4962306 3.9160819
## Varg
## 3.8714410
beta_bayesr = colMeans(res_bayesr$beta)
Question: Are BayesR and SBayesR SNP effect estimates the same? What could possibly cause the difference?
cor(beta_bayesr, beta_sbayesr)
## [1] 0.9192422
plot(beta_bayesr, beta_sbayesr)
abline(a=0, b=1)
Answer: The small differences are mostly due to variation in MCMC sampling, known as Monte Carlo variance.
The posterior inclusion probabilities (PIPs) are also expected to be similar between SBayesR and BayesR:
delta_bayesr = (res_bayesr$beta != 0) # indicator variable for each SNP in each MCMC cycle
pip_bayesr = colMeans(delta_bayesr) # frequency of the indicator variable being one across MCMC cycles
delta_sbayesr = (res_sbayesr$beta != 0)
pip_sbayesr = colMeans(delta_sbayesr)
plot(pip_bayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="BayesR")
plot(pip_sbayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesR")
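Formally, the PIP of SNP \(j\) is the posterior probability that its effect is nonzero, estimated by the Monte Carlo average over the \(T\) MCMC samples:
\[\mathrm{PIP}_j = \Pr(\beta_j \neq 0 \mid \text{data}) \approx \frac{1}{T}\sum_{t=1}^{T} I\big(\beta_j^{(t)} \neq 0\big).\]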
Due to extensive LD among SNPs, it’s often difficult to identify causal variants, especially those with smaller effects. For example, SNP 5 only has a PIP of about 0.5. Functional annotations can provide orthogonal information to LD, helping to better identify causal variants. Here, we demonstrate how this works in principle. Suppose both causal variants are located in coding sequences, while the other SNPs are in non-coding regions. We can incorporate this annotation information using SBayesRC.
A simplified version of the SBayesRC algorithm is implemented in sbayesrc.R. It uses the LD correlation matrix rather than the eigen-decomposition data described in the original paper.
source("/data/module5/scr/sbayesrc.R")
## Warning: package 'truncnorm' was built under R version 4.4.1
SBayesRC requires a table of SNP annotations. Suppose SNPs 1, 3, and 5 (of which SNPs 1 and 5 are the causal variants) are non-synonymous variants. We can add a binary annotation for all SNPs indicating whether they are non-synonymous. For illustration purposes, this annotation is designed to be informative, as it covers the two causal variants, but it also includes a null SNP, making the scenario slightly more realistic.
The annotation matrix has one row per SNP and one column per annotation (there can be multiple annotations), plus a first column of ones (an intercept for the generalised linear model that links annotations to SNP effects).
int = rep(1,10) # a vector of one as intercept
nonsynonymous = c(1,0,1,0,1,0,0,0,0,0) # whether the SNP is non-synonymous
anno = cbind(int, nonsynonymous)
print(anno)
## int nonsynonymous
## [1,] 1 1
## [2,] 1 0
## [3,] 1 1
## [4,] 1 0
## [5,] 1 1
## [6,] 1 0
## [7,] 1 0
## [8,] 1 0
## [9,] 1 0
## [10,] 1 0
We are now ready to run SBayesRC using this annotation matrix. Although we provide annotation information, no prior weights are assigned. The method learns the impact of annotations jointly with SNP effects from the data. This is a unified Bayesian approach, unlike stepwise approaches that estimate annotation enrichment before fitting the model.
res_sbayesrc = sbayesrc(b, se, nind, R, anno)
##
## iter 100, nnz = 3, sigmaSq = 2.455, h2 = 0.461, vare = 0.574
##
## iter 200, nnz = 2, sigmaSq = 0.795, h2 = 0.494, vare = 0.456
##
## iter 300, nnz = 2, sigmaSq = 2.727, h2 = 0.505, vare = 0.521
##
## iter 400, nnz = 2, sigmaSq = 1.679, h2 = 0.533, vare = 0.506
##
## iter 500, nnz = 3, sigmaSq = 3.371, h2 = 0.499, vare = 0.520
##
## iter 600, nnz = 2, sigmaSq = 2.286, h2 = 0.460, vare = 0.496
##
## iter 700, nnz = 3, sigmaSq = 0.354, h2 = 0.542, vare = 0.480
##
## iter 800, nnz = 2, sigmaSq = 2.429, h2 = 0.463, vare = 0.528
##
## iter 900, nnz = 3, sigmaSq = 1.789, h2 = 0.540, vare = 0.473
##
## iter 1000, nnz = 3, sigmaSq = 5.327, h2 = 0.518, vare = 0.463
##
## Posterior mean of model parameters:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq
## 0.745410357 0.031739057 0.222850587 0.006225115 2.518000000 2.649073414
## h2 Vare
## 0.499710242 0.499996485
##
## Annotation effects:
## p2 p3 p4
## int -1.739302 20.7690782 -4.7389488
## nonsynonymous 2.412028 0.2955737 -0.2428682
##
## Conditional probabilities:
## p2 p3 p4
## int 0.0410 1 0
## nonsynonymous 0.7494 1 0
##
## Joint probabilities:
## Pi1 Pi2 Pi3 Pi4
## int 0.9590 0 0.0410 0
## nonsynonymous 0.2506 0 0.7494 0
Let’s have a look at the output. First, you may have noticed that Nnz (the number of nonzero effects) has substantially decreased compared to (S)BayesR above and is now close to the number of simulated causal variants. Second, the largest annotation effect is observed in the cell of p2 and nonsynonymous. This means the annotation has greatly increased the prior probability of a nonzero effect for the annotated SNPs. This can also be seen from the Conditional probabilities result, where the cell of p2 and nonsynonymous (about 0.75, versus 0.04 for the intercept alone) indicates a high probability that an annotated SNP moves from the zero-effect distribution to a nonzero-effect distribution. From the Joint probabilities result, we can see that the annotated SNPs most likely belong to the medium-effect-size distribution (Pi3).
Question: Putting all these results together, what would you conclude?
Answer: These results suggest that the annotation is highly informative. It plays a crucial role in distinguishing causal variants from non-causal ones, significantly enhancing the model’s ability to identify true signals.
To evaluate this, let’s check the PIP of SNPs from SBayesRC.
delta_sbayesrc = (res_sbayesrc$beta != 0) # indicator variable for each SNP in each MCMC cycle
pip_sbayesrc = colMeans(delta_sbayesrc) # frequency of the indicator variable being one across MCMC cycles
plot(pip_sbayesrc, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesRC")
Question: How would you interpret the result?
Answer: Both SNP 1 and SNP 5 (the true causal variants) stand out with high PIPs, indicating a strong posterior probability of being causal, partly due to their functional annotations. Notably, SNP 5 did not show strong association evidence in the (S)BayesR analyses above, which do not incorporate annotations. Although SNP 3 shares the same annotation, it does not have a high PIP, because the posterior probability reflects a combination of the likelihood from the GWAS data and the prior information from the functional annotation.
Let’s also check how incorporating annotations affects polygenic prediction.
beta_sbayesrc = colMeans(res_sbayesrc$beta)
ghat_sbayesrc=Xval %*% beta_sbayesrc
# prediction accuracy
print(summary(lm(yval ~ ghat_sbayesrc))$r.squared)
## [1] 0.5189804
# prediction bias
print(summary(lm(yval ~ ghat_sbayesrc))$coeff)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7227881 0.5388751 1.341291 1.902368e-01
## ghat_sbayesrc 1.0796755 0.1930191 5.593621 4.882015e-06
A tutorial for using GCTB to run SBayesR can be found at https://cnsgenomics.com/software/gctb/#SBayesRCTutorial.
In our previous papers, we used a shrunk LD matrix for SBayesR. Here, we will run SBayesR and SBayesRC with an eigen-decomposition of the LD matrix (the low-rank model). This method was published recently in Nature Genetics (Zheng et al. 2024) and has more robust performance.
GCTB also uses a .ma file as input for the GWAS summary data.
Since the computing time and memory consumption of SBayesR are independent of the GWAS sample size and only summary-level data are required, we are able to run SBayesR with a much larger GWAS data set. Find the GWAS summary data file generated from the simulated trait using a larger set of UK Biobank samples (n = 340,000):
ls /data/module5/ukb/ukb_matched.ma
We will partition the whole genome into approximately independent LD blocks, each at least 4 cM long. The LD block information is given in the file ‘/data/module5/ukb/ref4cM_v37.pos’. Use less to take a look at the file. We will compute the LD correlation matrix for each block. This process can be parallelised with multiple threads.
## SBayesR-eigen
bfile="/data/module5/ukb/gwas"
blockinfo="/data/module5/ukb/ref4cM_v37.pos"
gctb --bfile $bfile \
--make-block-ldm \
--block-info $blockinfo \
--thread 4 \
--out ldm
After running this command, it will generate a folder called ‘ldm’, containing the binary data for the LD matrices and two text files, ‘ldm.info’ and ‘snp.info’, which store the LD block and SNP information. Feel free to have a look at these files.
The next step is to perform eigen decomposition on the generated LD matrices, required by the low-rank model. The command below will do this job.
gctb --ldm ldm --make-ldm-eigen --thread 4 --out ldm
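Conceptually, the low-rank model replaces each block’s LD correlation matrix with a truncated eigen-decomposition that keeps only the leading eigenvalues and eigenvectors (a sketch of the idea; see Zheng et al. 2024 for details):
\[R_b \approx U_b \Lambda_b U_b',\]
where \(\Lambda_b\) holds the top eigenvalues of block \(b\), those explaining most of the variation in LD. Working with these low-rank factors reduces both storage and the per-iteration cost of MCMC, and discarding the small, noisy eigenvalues also makes the model more robust.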
Next, we match the GWAS summary data to the LD reference and impute summary statistics for SNPs that are in the LD reference but missing from the GWAS:
gwas="/data/module5/ukb/ukb_matched.ma"
gctb --ldm-eigen ldm --gwas-summary $gwas --impute-summary --thread 4 --out ukb_matched
It will generate a new file ‘ukb_matched.imputed.ma’ which will be used for the SBayesR analysis in the next step.
Putting the LD eigen decomposition data and GWAS summary data together, we are ready to run SBayesR with the low-rank model.
gctb --ldm-eigen ldm \
--gwas-summary ukb_matched.imputed.ma \
--sbayes R \
--thread 4 \
--out sbayesr
Result file sbayesr.parRes shows the posterior estimates of the model parameters. Result file sbayesr.snpRes shows the estimates of the joint SNP effects.
You can run polygenic prediction using the SNP effect estimates from SBayesR. First, copy /data/module5/sim/sbayesr.snpRes, the result from a genome-wide analysis, to your current folder. You are welcome to try running the genome-wide analysis after the class when the server is not busy. Then, run the commands below (in the plink --score command, columns 2, 5 and 8 of the snpRes file hold the SNP name, the effect allele and the joint effect estimate, respectively):
target="/data/module5/ukb/target"
snpRes="sbayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.sbayesr
phenFile="/data/module5/ukb/target.phen"
covFile="/data/module5/ukb/target.cov"
prsFile="target.sbayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question: Is the SNP-based heritability estimate what we expect? Is the prediction accuracy better than C+PT and SBLUP?
cat sbayesr.parRes
## Parameter Mean SD
## NumSnp1 269822.343750 338.656158
## NumSnp2 2413.254639 282.314545
## NumSnp3 232.940002 44.786907
## NumSnp4 31.829992 17.337288
## NumSnp5 90.369995 9.907162
## Vg1 0.000000 0.000000
## Vg2 0.024740 0.002852
## Vg3 0.021521 0.004173
## Vg4 0.034450 0.019483
## Vg5 0.919289 0.018361
## SigmaSq 0.005169 0.000193
## hsq 0.510662 0.002460
## ResVar 1.006251 0.002156
NumSnp1-NumSnp5 are the numbers of SNPs in mixture components 1 to 5 (component 1: zero effect; components 2 to 5: effects with increasing variances, by default \(0.001 \times \sigma^2_\beta\), \(0.01 \times \sigma^2_\beta\), \(0.1 \times \sigma^2_\beta\) and \(1 \times \sigma^2_\beta\), respectively). hsq is the SNP-based heritability estimate. Vg1-Vg5 are the proportions of the SNP-based heritability explained by each mixture component.
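For reference, the implied prior on each (standardised) SNP effect is a five-component normal mixture; assuming the default \(\gamma\) values noted above, it can be written as
\[\beta_j \sim \pi_1\,\delta_0 + \sum_{k=2}^{5}\pi_k\,N(0,\ \gamma_k\sigma^2_\beta), \qquad \gamma = (0,\ 0.001,\ 0.01,\ 0.1,\ 1),\]
where \(\delta_0\) is a point mass at zero.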
Our simulated data do not include functional annotation information. Here, we will use data from UKB height and 96 functional annotations from the S-LDSC model to demonstrate how to run SBayesRC using GCTB. The command is similar to running SBayesR: just replace --sbayes R with --sbayes RC and add the annotation file with --annot [filename]. Note that you can also run SBayesRC using an R package that we have developed: https://github.com/zhilizheng/SBayesRC.
There are 7.4M SNPs in the summary statistics for UKB height. The first step is to match the summary statistics to the LD reference and impute the SNPs that are missing from the GWAS. We will use the LD reference data generated above.
The default MCMC chain length is 3000 iterations, with the first 1000 as burn-in. Here, in the interest of time, we set a shorter chain length because the model incorporating annotations takes longer to run. In your own real-trait analyses, it is recommended to use the default setting or an even longer chain.
gwas="/data/module5/ukb/HT.ma"
annot="/data/module5/ukb/annotations.txt"
## Match and impute SNPs
gctb --ldm-eigen ldm --gwas-summary $gwas --impute-summary --thread 4 --out HT
## Run SBayesRC
gctb --ldm-eigen ldm \
--gwas-summary HT.imputed.ma \
--annot $annot \
--sbayes RC \
--thread 4 \
--chain-length 1000 \
--burn-in 200 \
--out HT_sbayesrc
Besides the standard output files you have seen with SBayesR, it will also generate the files HT_sbayesrc.parSetRes and HT_sbayesrc.enrich, which show the results for the parameter sets related to the functional annotations.
For example, you can check the SNP-based heritability enrichment across functional annotations in R code:
enrich = read.table("HT_sbayesrc.enrich", header=T)
# plot the per-SNP heritability enrichment for all annotations
library(ggplot2)
enrich$Annotation = factor(enrich$Annotation, levels=enrich$Annotation[order(enrich$Mean, decreasing = T)])
p1 = ggplot(enrich, aes(x=Annotation, y=Mean, fill=Annotation)) + geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD, color=Annotation)) + geom_hline(yintercept=1, linetype="dashed") +
xlab("Functional annotation") + ylab("Per-SNP heritability enrichment") +
guides(fill="none", color="none") +
theme(axis.text = element_text(size=5, angle = 90, vjust = 0.5, hjust=1))
ggsave("height_h2_enrich_all.pdf", p1, width=12, height=6)
p1
Highlight some important functional annotations:
# highlight some annotations
selected = c("Ancient_Sequence_Age_Human_Promoter", "Human_Promoter_Villar_ExAC", "Intron_UCSC", "Coding_UCSC", "Conserved_LindbladToh", "TSS_Hoffman", "UTR_5_UCSC", "Enhancer_Andersson", "UTR_3_UCSC", "Transcr_Hoffman", "Repressed_Hoffman")
p2 = ggplot(subset(enrich, Annotation %in% selected), aes(x=Annotation, y=Mean, fill=Annotation)) + geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD, color=Annotation)) + geom_hline(yintercept=1, linetype="dashed") +
xlab("Functional annotation") + ylab("Per-SNP heritability enrichment") +
guides(fill="none", color="none") +
theme(axis.text = element_text(angle = 90, vjust = 0.5, hjust=1))
ggsave("height_h2_enrich_highlight.pdf", p2, width=12, height=6)
p2
With the SBayesRC results, we can gain some insight into the functional architecture of the trait:
library(stringr)
library(dplyr)
library(tidyr)
# Read results
res = read.table("HT_sbayesrc.parSetRes", header=T)
names(res) = c("Par","Anno","Est","SE","PP") # the columns are parameter, annotation, estimate, standard error, and posterior probability of greater than zero
# Extract Pi parameter estimates
pi = res[str_detect(res$Par, "AnnoJointProb"), ]
pi = pi %>% separate(Par, into = c("Pi", "Component"), sep = "_pi")
# Identify the `Est` value for the `Intercept` row within each component
intercept_values <- pi %>%
filter(Anno == "Intercept") %>%
select(Component, Est) %>%
rename(Intercept_Est = Est)
# Merge these intercept values back to the original data frame
pi_with_intercept <- pi %>% left_join(intercept_values, by = "Component")
# Subtract the intercept value from the `Est` values
pi_with_intercept <- pi_with_intercept %>% mutate(Adjusted_Est = Est - Intercept_Est)
# Create a new data frame with the adjusted values
adjusted_pi <- pi_with_intercept %>% select(Component, Anno, Adjusted_Est, SE, PP)
# plot the result focusing on the selected annotation and omit component 5 because there is no signal in there
facet_labels <- c("1" = "Zero", "2" = "Small effect", "3" = "Moderate effect", "4" = "Large effect")
p3 = ggplot(subset(adjusted_pi, Anno %in% selected & Component != 5), aes(x=Anno, y=Adjusted_Est, fill=Anno)) + geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin=Adjusted_Est-SE, ymax=Adjusted_Est+SE, color=Anno)) + geom_hline(yintercept=0, linetype="dashed") +
facet_grid(.~Component, scales="free", labeller = labeller(Component = facet_labels)) +
xlab("Functional annotation") + ylab("Deviation of Pi to average") +
coord_flip() +
guides(fill="none", color="none") +
theme(axis.text = element_text(angle = 0, vjust = 0.5, hjust=1))
ggsave("height_pi_enrich_highlight.pdf", p3, width=12, height=6)
p3
Despite large standard errors, evolutionarily conserved regions, coding sequences, promoters and transcription start sites (TSS) tend to be enriched for causal variants with large effect sizes, whereas repressed regions and introns tend to be depleted of causal variants. Such information learned from the data will, in turn, help to better estimate the SNP effects and therefore improve prediction. Since we cannot provide a validation set with real genotypes and phenotypes, we are not doing prediction for the UKB height data in this practical exercise.