We have learned the Bayesian methods for polygenic prediction using individual genotypes and phenotypes. Recently, methodologies have evolved to require only summary statistics from GWAS. In this practical, you will see how the individual- and summary-data-based methods are connected and where are the differences (e.g., BayesR vs. SBayesR). In addition to without a need of accessing to individual-level data, methods using summary statistics (sumstats) are also more computationally efficient, allowing us to perform the same analysis with a fraction of computing resource. In our latest research, we exploits eigen decomposition approach to further boost the computation efficiency. This new method (SBayesRC) allows us to fit multi-million SNPs simultaneously and incorporate functional annotations. Similar to the previous practical, we will use a small data set to illustrate the principles of SBayesR, and then use GCTB to run a large data set from UK Biobank (UKB) with SBayesR and SBayesRC.
Note: the small data set can be freely downloaded. The UK Biobank data set cannot be downloaded because it contains real individual genotypes, although individual IDs are masked.
This toy data set includes:
The trait was simulated such that SNP 1 has an effect size of 2 and SNP 5 has an effect size of 1. The trait heritability is set at 0.5.
Data location:
ls /data/module5/toy/This data set is based on real genotypes at 273,604 SNPs from the UK Biobank, with a simulated trait affected by 100 causal variants and heritability 0.5. The samples are split into three independent cohorts:
Data location:
ls /data/module5/ukb/This section helps you understand the R code used in the method.
Do not run the R code below in this section.
The algorithm for SBayesR (Summary-data-based BayesR) is implemented
in sbayesr.R.
cat /data/module5/scr/sbayesr.RHere we focus on the code that is different from
individual-data-based BayesR implemented in bayesr.R.
Compared to BayesR, SBayesR includes an additional step at the beginning to scale all GWAS marginal effects so that they are in the per-genotype-SD unit with phenotypic variance equal to 1. GWAS is usually performed using the 0/1/2 genotypes, whereas the algorithm assumes standardised genotypes, as shown in the lecture.
  # Step 1. scale the GWAS SNP effects
  scale = sqrt(1/(n*se^2 + b^2))  # scale factors for marginal SNP effects
  b_scaled  = b*scale             # scaled marginal SNP effects (in units of genotype sd / phenotype sd)
  vary  = 1                       # phenotypic variance = 1 after the scalingOnce MCMC begins, the next step is to sample SNP effects. Unlike BayesR, SBayesR does not include fixed effects, because they have already been adjusted in the GWAS. For each SNP, if its effect is not zero, we sample it from a normal distribution where the mean is the BLUP solution to the “per-SNP mixed model equation”. The only difference in this part between SBayesR and BayesR is the use of summary data equivalents:
      rhs = n*bcorr[i] + n*oldSample          # right hand side of the mixed model equation
      rhs = rhs/vare
      invLhs = 1.0/(n/c(vare) + invSigmaSq)   # inverse of left hand side of the mixed model equation
      betaHat = invLhs*c(rhs)                 # BLUP solution
      
      # if the effect is not zero, we sample it from a normal distribution 
      beta[i] = rnorm(1, betaHat[delta], sqrt(invLhs[delta]))   # given the distribution membership, the posterior is a normal distribution
      
      ###################################################################
      # In contrast, this is what we have in individual-data-based BayesR
      rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample  # right hand side of the mixed model equation
      rhs = rhs/vare
      invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq)        # inverse of left hand side of the mixed model equation
      betaHat = invLhs*c(rhs)                           # BLUP solution
      ###################################################################After sampling the SNP effect \(\beta_j\), instead of adjusting \(y\) (as in BayesR), we update \(b_{corr}\) in SBayesR. This strategy is known as “right-hand side updating”, because it updates the vector of \(X'y\) (\(=nb\)), the right-hand side of the mixed model equations.
      if (delta > 1) {
        # ...
        bcorr = bcorr + R[,i]*(oldSample - beta[i])         # update bhatcorr with the new sample
        # ...
      } else {
        if (oldSample) bcorr = bcorr + R[,i]*oldSample      # update bhatcorr with the new sample which is zero
        # ...
      }
      ###################################################################
      # In contrast, this is what we have in individual-data-based BayesR
      if (delta > 1) {
        # ...
        ycorr = ycorr + X[,j]*(oldSample - beta[j])              # update ycorr with the new sample
        # ...
      } else {
        if (oldSample) ycorr = ycorr + X[,j]*oldSample           # update ycorr with the new sample which is zero
        # ...
      }
      ###################################################################This approach is efficient because \(b_{corr}\) is only updated for SNPs in LD with SNP \(i\), which is typically a smaller subset within an LD block.
The Gibbs samplers for \(\pi\) and \(\sigma^2_{\beta}\) are the same as in BayesR. To compute genetic variance, we use
\[\sigma^2_g = \beta'R\beta = \beta'(b - b_{corr})\] because \(b_{corr} = b - R\beta\). Since phenotypic variance is set to 1, SNP-based heritability equals the genetic variance.
    # Step 5. Compute the SNP-based heritability
    bRb = crossprod(beta, (b-bcorr))  # compute genetic variance = beta' R beta
    varg = bRb
    h2  = vargWe can also estimate the residual variance. However, unlike BayesR, the residual variance is not guaranteed to be positive. Issues like LD mismatches between GWAS and reference samples, variation in per-SNP sample size, or errors in the GWAS summary statistics can result in a negative residual variance estimate. When this happens, the algorithm cannot proceed, so we force the residual variance to equal 1 (phenotypic variance).
    sse = (vary - varg)*n
    vare = (sse + nue*scalee)/rchisq(1, n+nue)
    if (vare <= 0) vare = vary  # sometimes sse can be negative and would cause a problemThe algorithm for SBayesRC (an extention of SBayesR to incorporate
functional annotations) is implemented in sbayesrc.R.
cat /data/module5/scr/sbayesrc.RHere are the key differences between SBayesR and SBayesRC.
First, we need to initiate additional variables related to the SNP annotations.
  # annotation related variables
  snpPi = matrix(rep(pi, m), byrow=TRUE, nrow=m, ncol=ndist)  # per-SNP pi given the SNP annotations
  alpha_mcmc = matrix(0, niter, ncol(anno)*(ndist-1))         # MCMC samples of annotation effects
  p = matrix(nrow = m, ncol = ndist-1)                        # per-SNP conditional probability p
  z = matrix(nrow = m, ncol = ndist-1)                        # per-SNP conditional distribution membership indicator 
  alpha = matrix(0, nrow = ncol(anno), ncol = ndist-1)        # vector of annotation effects
  sigmaSqAlpha = rep(1, ndist-1)                              # variance of annotation effects
  sigmaSqAlpha_mcmc = matrix(0, niter, ndist-1)               # MCMC samples of the variance of annotation effects
  nAnno = ncol(anno)                                          # number of annotations
  annoDiag = apply(anno, 2, crossprod)                        # diagonal values of A'A where A is the annotation matrixSecond, when sampling SNP effects, we record the conditional distribution membership for the SNP:
      if (delta > 1) {
        # ...
        for(j in 1:(delta-1)){  # set one to the "step up" indicator variable
          z[i,j] = 1
        }
      } else {
        # ...
      }Next, there is an extra step to sample the annotation effects. It may
look complicated, but the sampling scheme is similar to how we sample
the SNP effects with individual-level data. The main difference is that
in stead of using the genotype matrix as X and the
phenotypes as y, here we use the annotation matrix as
X and a latent variable (sampled from truncated normal
distribution) as y whose mean is a linear combination of
SNP annotations. The effect of an annotation in this model shows how
much changes in the prior probability of the SNP with a nonzero
effect.
    # Step 4. Sample the annotation effects on the SNP distribution mixing probabilities
    for (j in 1:(ndist-1)) {  # have ndist-1 number of conditional distributions
      if (j==1) {
        idx = 1:m             # for the first conditional distribution, data are all SNPs
        annoDiagj = annoDiag  # diagonal values of A'A
      } else {
        idx = which(z[,j-1] > 0)  # for the subsequent conditional distributions, data are the SNPs in the previous distribution
        if (length(idx) > 1) {
          annoDiagj = apply(as.matrix(anno[idx,]), 2, crossprod)  # recompute A'A depending on the SNP memberships
        }
      }
      
      if (length(idx)) {
        zj = z[idx,j]
        mu = anno[idx,] %*% matrix(alpha[,j])  # linear combination of annotations, which will be the mean of truncated normal distribution
        lj = array(0, length(zj))   # latent variable
        if (sum(zj==0)) lj[zj==0] = rtruncnorm(sum(zj==0), mean = mu[zj==0], sd = 1, a = -Inf, b = 0)   # sample latent variable
        if (sum(zj==1)) lj[zj==1] = rtruncnorm(sum(zj==1), mean = mu[zj==1], sd = 1, a = 0, b =  Inf)   # sample latent variable
        # sample annotation effects using Gibbs sampler (similar to the SNP effect sampler in the individual-level model)
        lj = lj - c(mu)  # adjust the latent variable by all annotation effects
        # sample intercept with a flat prior
        oldSample = alpha[1,j]
        rhs = sum(lj) + m*oldSample
        invLhs = 1.0/m
        ahat = invLhs*rhs
        alpha[1,j] = rnorm(1, ahat, sqrt(invLhs))
        lj = lj + (oldSample - alpha[1,j])
        # sample each annotation effect with a normal prior
        if (nAnno > 1) {
          for (k in 2:nAnno) {
            oldSample = alpha[k,j]
            rhs = crossprod(anno[idx,k], lj) + annoDiagj[k]*oldSample;
            invLhs = 1.0/(annoDiagj[k] + 1.0/sigmaSqAlpha[j])
            ahat = invLhs*rhs
            alpha[k,j] = rnorm(1, ahat, sqrt(invLhs))
            lj = lj + anno[idx,k]*(oldSample - alpha[k,j])
          }
          # sample annotation effect variance from a scaled inverse chi-square distribution
          sigmaSqAlpha[j] = (sum(alpha[-1,j]^2) + 2)/rchisq(1, nAnno-1+2)
        }
      }
    }
    alpha_mcmc[iter,] = c(alpha)  # store MCMC samples of annotation effects 
    sigmaSqAlpha_mcmc[iter,] = sigmaSqAlphaGiven the sampled annotation effects, we can then compute the conditional probabilities for each SNP which determine how likely the SNP is to belong to higher-effect mixture components.
    # Step 5. Compute per-SNP conditional probabilities from the annotation effects
    p = apply(alpha, 2, function(x){pnorm(anno %*% x)})Given the conditional probabilities, we can compute the joint probabilities (\(\pi\)) for each SNP belong to the mixture distribution components:
    # Step 6. Compute the joint probabilities (pi) from the conditional probabilities (p)
    for (k in 1:ndist) {
      if (k < (ndist-1)) {
        snpPi[,k] = 1.0 - p[,k]
      } else {
        snpPi[,k] = 1
      }
      if (k > 1) {
        for (kk in 1:(k-1)) {
          snpPi[,k] = snpPi[,k]*p[,kk]
        }
      }
    }
    #for example, we want
    #snpPi[,1] = 1 - p[,1]
    #snpPi[,2] = (1-p[,2])*p[,1]
    #snpPi[,3] = (1-p[,3])*p[,1]*p[,2]
    #snpPi[,4] = p[,1]*p[,2]*p[,3]These \(\pi\) values are then used in the next iteration of SNP effect sampling, just like in SBayesR.
Load data in R.
# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))
# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))SBayesR method is implemented in sbayesr.R.
source("/data/module5/scr/sbayesr.R")## Warning: package 'MCMCpack' was built under R version 4.4.1First, we need to obtain GWAS effect estimates (using 0/1/2 genotypes):
# run GWAS on the 0/1/2 genotypes
fit = apply(X, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b = fit[1,]
se = fit[2,] Compute LD correlation matrix:
R = cor(X)Then we scale the GWAS effects to be in per-genotype-standard-deviation unit:
nind = nrow(X)  # GWAS sample size
scale = sqrt(1/(nind*se^2 + b^2))  # calculate the scale factor for each SNP
b_scaled = b*scale  # scale the marginal effectsNow we are ready to run SBayesR:
res_sbayesr = sbayesr(b, se, nind, R)## 
##  iter  100, nnz =    5, sigmaSq =  1.708, h2 =  0.464, vare =  0.487
## 
##  iter  200, nnz =    2, sigmaSq =  3.075, h2 =  0.434, vare =  0.642
## 
##  iter  300, nnz =    5, sigmaSq =  1.136, h2 =  0.488, vare =  0.538
## 
##  iter  400, nnz =    7, sigmaSq =  1.237, h2 =  0.524, vare =  0.503
## 
##  iter  500, nnz =    5, sigmaSq =  0.614, h2 =  0.475, vare =  0.526
## 
##  iter  600, nnz =    6, sigmaSq =  5.741, h2 =  0.510, vare =  0.508
## 
##  iter  700, nnz =    3, sigmaSq =  1.305, h2 =  0.506, vare =  0.464
## 
##  iter  800, nnz =    5, sigmaSq =  0.961, h2 =  0.482, vare =  0.533
## 
##  iter  900, nnz =    3, sigmaSq =  2.066, h2 =  0.504, vare =  0.462
## 
##  iter 1000, nnz =    6, sigmaSq =  1.701, h2 =  0.514, vare =  0.533
## 
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.4339769 0.2486935 0.1943400 0.1229896 4.8420000 2.1536359 0.5034002 0.4974042beta_sbayesr = colMeans(res_sbayesr$beta)Run BayesR as benchmark:
source("/data/module5/scr/bayesr.R")res_bayesr = bayesr(X, y)## 
##  iter  100, nnz =   10, sigmaSq =  1.418, h2 =  0.451, vare =  4.020, varg =  3.304 
## 
##  iter  200, nnz =    7, sigmaSq =  2.304, h2 =  0.508, vare =  3.825, varg =  3.948 
## 
##  iter  300, nnz =    8, sigmaSq =  4.110, h2 =  0.516, vare =  3.888, varg =  4.150 
## 
##  iter  400, nnz =    5, sigmaSq =  3.808, h2 =  0.546, vare =  4.057, varg =  4.882 
## 
##  iter  500, nnz =    6, sigmaSq =  4.083, h2 =  0.478, vare =  3.769, varg =  3.452 
## 
##  iter  600, nnz =    6, sigmaSq =  2.811, h2 =  0.489, vare =  3.989, varg =  3.818 
## 
##  iter  700, nnz =    2, sigmaSq =  2.374, h2 =  0.467, vare =  3.775, varg =  3.313 
## 
##  iter  800, nnz =    8, sigmaSq =  3.129, h2 =  0.485, vare =  3.627, varg =  3.410 
## 
##  iter  900, nnz =    8, sigmaSq =  1.915, h2 =  0.490, vare =  4.257, varg =  4.094 
## 
##  iter 1000, nnz =    6, sigmaSq =  3.063, h2 =  0.448, vare =  4.323, varg =  3.509 
## 
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.2782621 0.2839210 0.2125146 0.2253023 7.1020000 4.3112584 0.4962306 3.9160819 
##      Varg 
## 3.8714410beta_bayesr = colMeans(res_bayesr$beta)Question: Are BayesR and SBayesR SNP effect estimates the same? What could possibly cause the difference?
cor(beta_bayesr, beta_sbayesr)## [1] 0.9192422plot(beta_bayesr, beta_sbayesr)
abline(a=0, b=1)
Answer: The small differences are mostly due to variation in MCMC sampling, known as Monte Carlo variance.
The posterior inclusion probability (PIP) are also expected to be similar between SBayesR and BayesR:
delta_bayesr = (res_bayesr$beta != 0)  # indicator variable for each SNP in each MCMC cycle
pip_bayesr = colMeans(delta_bayesr)    # frequency of the indicator variable being one across MCMC cycles
delta_sbayesr = (res_sbayesr$beta != 0)
pip_sbayesr = colMeans(delta_sbayesr)
plot(pip_bayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="BayesR")plot(pip_sbayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesR")Due to extensive LD among SNPs, it’s often difficult to identify causal variants, especially those with smaller effects. For example, SNP 5 only has a PIP of about 0.5. Functional annotations can provide orthogonal information to LD, helping to better identify causal variants. Here, we demonstrate how this works in principle. Suppose both causal variants are located in coding sequences, while the other SNPs are in non-coding regions. We can incorporate this annotation information using SBayesRC.
A simplified version of SBayesRC algorithm is implemented in
sbayesrc.R. It uses LD correlation matrix rather than
eigen-decomposition data as described in the original paper.
source("/data/module5/scr/sbayesrc.R")## Warning: package 'truncnorm' was built under R version 4.4.1SBayesRC requires a table of SNP annotations. Suppose SNP 1, 3, and 5 (where SNP 1 and 5 are causal variants) are non-synonymous variants. We can add a binary annotate for all SNPs indicating whether they are non-synonymous. For illustration purpose, this annotation is designed to be informative, as it covers the two causal variants but also includes a null SNP, making the scenario slightly more realistic.
The annotation table has the size of number of SNPs (rows) by number of annotations (could be multiple annotations) plus a column of one as the first column (an intercept for the generalised linear model that links annotations to SNP effects).
int = rep(1,10)   # a vector of one as intercept
nonsynonymous = c(1,0,1,0,1,0,0,0,0,0)  # whether the SNP is non-synoymous
anno = cbind(int, nonsynonymous)
print(anno)##       int nonsynonymous
##  [1,]   1             1
##  [2,]   1             0
##  [3,]   1             1
##  [4,]   1             0
##  [5,]   1             1
##  [6,]   1             0
##  [7,]   1             0
##  [8,]   1             0
##  [9,]   1             0
## [10,]   1             0We are now ready to run SBayesRC using this annotation matrix. Although we provide annotation information, no prior weights are assigned. The method learns the impact of annotations jointly with SNP effects from the data. This is a unified Bayesian approach, unlike stepwise approaches that estimate annotation enrichment before fitting the model.
res_sbayesrc = sbayesrc(b, se, nind, R, anno)## 
##  iter  100, nnz =    3, sigmaSq =  2.455, h2 =  0.461, vare =  0.574 
## 
##  iter  200, nnz =    2, sigmaSq =  0.795, h2 =  0.494, vare =  0.456 
## 
##  iter  300, nnz =    2, sigmaSq =  2.727, h2 =  0.505, vare =  0.521 
## 
##  iter  400, nnz =    2, sigmaSq =  1.679, h2 =  0.533, vare =  0.506 
## 
##  iter  500, nnz =    3, sigmaSq =  3.371, h2 =  0.499, vare =  0.520 
## 
##  iter  600, nnz =    2, sigmaSq =  2.286, h2 =  0.460, vare =  0.496 
## 
##  iter  700, nnz =    3, sigmaSq =  0.354, h2 =  0.542, vare =  0.480 
## 
##  iter  800, nnz =    2, sigmaSq =  2.429, h2 =  0.463, vare =  0.528 
## 
##  iter  900, nnz =    3, sigmaSq =  1.789, h2 =  0.540, vare =  0.473 
## 
##  iter 1000, nnz =    3, sigmaSq =  5.327, h2 =  0.518, vare =  0.463 
## 
## Posterior mean of model parameters:
##         Pi1         Pi2         Pi3         Pi4         Nnz     SigmaSq 
## 0.745410357 0.031739057 0.222850587 0.006225115 2.518000000 2.649073414 
##          h2        Vare 
## 0.499710242 0.499996485 
## 
## Annotation effects:
##                      p2         p3         p4
## int           -1.739302 20.7690782 -4.7389488
## nonsynonymous  2.412028  0.2955737 -0.2428682
## 
## Conditional probabilities:
##                   p2 p3 p4
## int           0.0410  1  0
## nonsynonymous 0.7494  1  0
## 
## Joint probabilities:
##                  Pi1 Pi2    Pi3 Pi4
## int           0.9590   0 0.0410   0
## nonsynonymous 0.2506   0 0.7494   0Let’s have a look at the output. First, you may have noticed that
Nnz (the number of nonzero effects) substantially deceased
compared to (S)BayesR about, which is now close to the number of
simulated causal variants. Second, the biggest annotation effect is
observed in the cell of p2 and nonsynonymous.
This mean the annotation has greatly increased the prior probability of
having a nonzero effects for the annotated SNPs. This can also be seen
from the Conditional probabilities result, where the cell
of p2 and nonsynonymous is close to one,
indicating the posterior probability that a annotated SNP will move from
zero effect to a nonzero effect distribution is almost one! From the
Joint probabilities result, we can see that nearly all
annotated SNPs are in the medium effect size distribution
(Pi3).
Question: Putting all these results together, what would you conclude?
Answer: These results suggest that the annotation is highly informative. It plays a crucial role in distinguishing causal variants from non-causal ones, significantly enhancing the model’s ability to identify true signals.
To evaluate this, let’s check the PIP of SNPs from SBayesRC.
delta_sbayesrc = (res_sbayesrc$beta != 0)  # indicator variable for each SNP in each MCMC cycle
pip_sbayesrc = colMeans(delta_sbayesrc)    # frequency of the indicator variable being one across MCMC cycles
plot(pip_sbayesrc, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesRC")Question: How would you interpret the result?
Answer: Both SNP 1 and 5 (the true causal variants) stand out with high PIPs, indicating strong posterior probability of being, partly due to their functional annotations. Notably, SNP 5 did not show strong association evidence in the (S)BayesR analysis above, which does not incorporate annotations. Although SNP 3 shares the same annotation, it does not have a high PIP. This is because posterior probability reflects a combination of likelihood from GWAS and the prior information from functional annotation.
Let’s also check how incorporating annotations affects polygenic prediction.
beta_sbayesrc = colMeans(res_sbayesrc$beta)
ghat_sbayesrc=Xval %*% beta_sbayesrc
# prediction accuracy
print(summary(lm(yval ~ ghat_sbayesrc))$r.squared)## [1] 0.5189804# prediction bias
print(summary(lm(yval ~ ghat_sbayesrc))$coeff)##                Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)   0.7227881  0.5388751 1.341291 1.902368e-01
## ghat_sbayesrc 1.0796755  0.1930191 5.593621 4.882015e-06A tutorial for using GCTB to run SBayesR can be found at https://cnsgenomics.com/software/gctb/#SBayesRCTutorial.
In our previous papers, we use shrunk LD matrix for SBayesR. Here, we will use SBayesR and SBayesRC with eigen decomposition on the LD matrix (low-rank model). This method has been published recently in Nature Genetics (Zheng et al 2024). It has a more robust performance.
GCTB also uses .ma file to input GWAS summary data.
Since the computing time and memory consumption of SBayesR is independent of GWAS sample size and only summary-level data is required, we are able to run SBayesR with a much larger GWAS data set. Find the GWAS summary data file generated from using the simulated data based on the more samples in UK Biobank for GWAS (n=340,000):
ls /data/module5/ukb/ukb_matched.maWe will partition the whole genome into apprixmately independent LD
blocks each with at least 4cM long. The LD block information is shown in
file ‘/data/module5/ukb/ref4cM_v37.pos’. Less less to take
a look at the file. We will compute the LD correlation matrix for each
block. This process can be paralleled with multiple threads.
## SBayesR-eigen
bfile="/data/module5/ukb/gwas"
blockinfo="/data/module5/ukb/ref4cM_v37.pos"
gctb --bfile $bfile \
     --make-block-ldm \
     --block-info $blockinfo \
     --thread 4 \
     --out ldmAfter running this command, it will generate a folder call ‘ldm’. It has the binary data for LD matrices and two text files: ‘ldm.info’ and ‘snp.info’. These text file stores LD block and SNP information. Feel free to have a look at these files.
The next step is to perform eigen decomposition on the generated LD matrices, required by the low-rank model. The command below will do this job.
gctb --ldm ldm --make-ldm-eigen --thread 4 --out ldmgwas="/data/module5/ukb/ukb_matched.ma"
gctb --ldm-eigen ldm --gwas-summary $gwas --impute-summary --thread 4 --out ukb_matchedIt will generate a new file ‘ukb_matched.imputed.ma’ which will be used for the SBayesR analysis in the next step.
Putting the LD eigen decomposition data and GWAS summary data together, we are ready to run SBayesR with the low-rank model.
gctb --ldm-eigen ldm \
     --gwas-summary ukb_matched.imputed.ma \
     --sbayes R \
     --thread 4 \
     --out sbayesrResult file sbayesr.parRes shows the posterior estimates
of model parameters. Result file sbayesr.snpRes shows the
estimates of joint SNP effects.
You can run polygenic prediction using the effect estimates from
SBayesR. Firstly, copy /data/module5/sim/sbayesr.snpRes to
your current folder, which is the result from genome-wide analysis. You
are welcome to try running genome-wide analysis after the class when the
server is not busy. Then, run
target="/data/module5/ukb/target"
snpRes="sbayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.sbayesr
phenFile="/data/module5/ukb/target.phen" 
covFile="/data/module5/ukb/target.cov"
prsFile="target.sbayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFileQuestion: Is the SNP-based heritability estimate what we expect? Is the prediction accuracy better than C+PT and SBLUP?
cat sbayesr.parRes##  Parameter    Mean            SD             
##    NumSnp1    269822.343750   338.656158     
##    NumSnp2    2413.254639     282.314545     
##    NumSnp3    232.940002      44.786907      
##    NumSnp4    31.829992       17.337288      
##    NumSnp5    90.369995       9.907162       
##        Vg1    0.000000        0.000000       
##        Vg2    0.024740        0.002852       
##        Vg3    0.021521        0.004173       
##        Vg4    0.034450        0.019483       
##        Vg5    0.919289        0.018361       
##    SigmaSq    0.005169        0.000193       
##        hsq    0.510662        0.002460       
##     ResVar    1.006251        0.002156NumSnp1-NumSnp4 are the numbers of SNPs in the mixture
components 1 to 4 (component 1: zero effect; component 2: small effects
whose variance = 0.01 \(\times
\sigma^2_\beta\); component 3: medium effects whose variance =
0.1 \(\times \sigma^2_\beta\);
component 4: large effects whose variance = 1 \(\times \sigma^2_\beta\)). hsq
is the SNP-based heritability estimate. Vg1-4 is the
proportion of the SNP-based heritability explained by each mixture
component.
Our simulation data do not include functional annotation information.
Here, we will use the data from UKB height and 96 functional annotations
from S-LDSC model to demonstrate how to run SBayesRC using GCTB. The
command is similar to running SBayesR. Just replace
--sbayes R with --sbayes RC and add annotation
file by --annot [filename]. Note that you can also run
SBayesRC using a R package that we have developed: https://github.com/zhilizheng/SBayesRC.
There are 7.4M SNPs in the summary statistics for UKB height. The first step is to match and impute the missing SNPs in the LD reference. We will use the LD reference data generated above.
The default MCMC chain length is 3000 with the first 1000 iterations as burn-in. Here, for the interest of time, we set a shorter chain length because it takes more time to run model incorporating annotations. In your own real trait analysis, it is recommended to use the default setting or even longer chain.
gwas="/data/module5/ukb/HT.ma"
annot="/data/module5/ukb/annotations.txt"
## Match and impute SNPs
gctb --ldm-eigen ldm --gwas-summary $gwas --impute-summary --thread 4 --out HT
## Run SBayesRC
gctb --ldm-eigen ldm \
     --gwas-summary HT.imputed.ma \
     --annot $annot \
     --sbayes RC \
     --thread 4 \
     --chain-length 1000 \
     --burn-in 200 \
     --out HT_sbayesrcBesides the standard output files as you have seen in SBayesR, it
will also generate files sbayesrc.parSetRes and
sbayesrc.enrich which show the results for parameter sets
regarding to functional annotations.
For example, you can check the SNP-based heritability enrichment across functional annotations in R code:
enrich = read.table("HT_sbayesrc.enrich", header=T)
# plot the per-SNP heritability enrichment for all annotations
library(ggplot2)
enrich$Annotation = factor(enrich$Annotation, levels=enrich$Annotation[order(enrich$Mean, decreasing = T)])
p1 = ggplot(enrich, aes(x=Annotation, y=Mean, fill=Annotation)) + geom_bar(stat="identity", position="dodge") + 
  geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD, color=Annotation)) + geom_hline(yintercept=1, linetype="dashed") +
  xlab("Functional annotation") + ylab("Per-SNP heritability enrichment") +
  guides(fill="none", color="none") +
  theme(axis.text = element_text(size=5, angle = 90, vjust = 0.5, hjust=1))
ggsave("height_h2_enrich_all.pdf", p1, width=12, height=6)
p1Highlight some important functional annotations:
# highlight some annotations
selected = c("Ancient_Sequence_Age_Human_Promoter", "Human_Promoter_Villar_ExAC", "Intron_UCSC", "Coding_UCSC", "Conserved_LindbladToh", "TSS_Hoffman", "UTR_5_UCSC", "Enhancer_Andersson", "UTR_3_UCSC", "Transcr_Hoffman", "Repressed_Hoffman")
p2 = ggplot(subset(enrich, Annotation %in% selected), aes(x=Annotation, y=Mean, fill=Annotation)) + geom_bar(stat="identity", position="dodge") + 
  geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD, color=Annotation)) + geom_hline(yintercept=1, linetype="dashed") +
  xlab("Functional annotation") + ylab("Per-SNP heritability enrichment") +
  guides(fill="none", color="none") +
  theme(axis.text = element_text(angle = 90, vjust = 0.5, hjust=1))
ggsave("height_h2_enrich_highlight.pdf", p2, width=12, height=6)
p2With SBayesRC result, we can gain some insights into the functional architecture of the trait:
library(stringr)
library(dplyr)
library(tidyr)
# Read results
res = read.table("HT_sbayesrc.parSetRes", header=T)
names(res) = c("Par","Anno","Est","SE","PP")  # the columns are parameter, annotation, estimate, standard error, and posterior probability of greater than zero
# Extract Pi parameter estimates
pi = res[str_detect(res$Par, "AnnoJointProb"), ]
pi = pi %>% separate(Par, into = c("Pi", "Component"), sep = "_pi")
# Identify the `Est` value for the `Intercept` row within each component
intercept_values <- pi %>% 
  filter(Anno == "Intercept") %>% 
  select(Component, Est) %>% 
  rename(Intercept_Est = Est)
# Merge these intercept values back to the original data frame
pi_with_intercept <- pi %>% left_join(intercept_values, by = "Component")
# Subtract the intercept value from the `Est` values
pi_with_intercept <- pi_with_intercept %>% mutate(Adjusted_Est = Est - Intercept_Est)
# Create a new data frame with the adjusted values
adjusted_pi <- pi_with_intercept %>% select(Component, Anno, Adjusted_Est, SE, PP)
# plot the result focusing on the selected annotation and omit component 5 because there is no signal in there
facet_labels <- c("1" = "Zero", "2" = "Small effect", "3" = "Moderate effect", "4" = "Large effect")
p3 = ggplot(subset(adjusted_pi, Anno %in% selected & Component != 5), aes(x=Anno, y=Adjusted_Est, fill=Anno)) + geom_bar(stat="identity", position="dodge") + 
  geom_errorbar(aes(ymin=Adjusted_Est-SE, ymax=Adjusted_Est+SE, color=Anno)) + geom_hline(yintercept=0, linetype="dashed") +
  facet_grid(.~Component, scales="free", labeller = labeller(Component = facet_labels)) +
  xlab("Functional annotation") + ylab("Deviation of Pi to average") +
  coord_flip() +
  guides(fill="none", color="none") +
  theme(axis.text = element_text(angle = 0, vjust = 0.5, hjust=1))
ggsave("height_pi_enrich_highlight.pdf", p3, width=12, height=6)
p3Despite large standard errors, evolutionary conserved regions, coding sequences, promoters and transcription start sites (TSS) tend to be enriched in causal variants with large effect sizes, whereas repressed regions and introns tend to be depleted in causal variants. Such information learned from the data will, in turn, help with better estimate the SNP effects and therefore improve prediction. Since we cannot provide validation with real genotypes and phenotypes, we are not doing prediction for this UKB height data in this practical exercise.