1 Objectives

We have learned Bayesian methods for polygenic prediction using individual-level genotypes and phenotypes. Recently, methodologies have evolved to require only summary statistics from GWAS. In this practical, you will see how the individual- and summary-data-based methods are connected and where they differ (e.g., BayesR vs. SBayesR). In addition to not requiring access to individual-level data, methods using summary statistics (sumstats) are also more computationally efficient, allowing us to perform the same analysis with a fraction of the computing resources. In our latest research, we exploit an eigen-decomposition approach to further boost computational efficiency. This new method (SBayesRC) allows us to fit multiple millions of SNPs simultaneously and to incorporate functional annotations. Similar to the previous practical, we will use a small data set to illustrate the principles of SBayesR, and then use GCTB to run a large data set from the UK Biobank (UKB) with SBayesR and SBayesRC.

Note that you are free to download the small data set to your own laptop if you want. However, the UKB data set must not be downloaded because it contains real individual genotypes from the UK Biobank (although individual IDs have been masked).

2 Data

2.1 Small data set

This is a toy example data set consisting of a training population (aka GWAS population, discovery population, or reference population) of 325 individuals, each genotyped at 10 SNPs. The trait was simulated such that the first SNP has an effect size of 2 and the 5th SNP has an effect size of 1 on the phenotype. The trait heritability is 0.5. The validation population (aka target population) is a hold-out sample of 31 individuals.

Path to Data set 1:

ls /data/module5/toy/

2.2 Large data set

This is a larger data set based on real genotypes at 273,604 SNPs from the UK Biobank. We simulated a trait with 100 causal variants and a heritability of 0.5. The entire data set was partitioned into three independent populations:

  • Discovery population for conducting GWAS (n = 3,000).
  • Testing population for finding the best prediction model (n = 500).
  • Target population to be predicted (n = 1,000).

Path to Data set 2:

ls /data/module5/ukb/

3 Understanding the algorithm

This section is designed to help you understand the R code used in the method. You may skip it for now and return to it later if you prefer.

The following code is shown for explanatory purposes as part of the method. There is no need to run the code in this section.

3.1 SBayesR

The algorithm for SBayesR (summary-data-based BayesR) has been implemented in an R script, sbayesr.R. Let’s load it in R and have a look.

source("/data/module5/scr/sbayesr.R")

Here we focus on the code that is different from individual-data-based BayesR, for which the R script bayesr.R can be found here.

source("/data/module5/scr/bayesr.R")

Compared to BayesR, there is an extra step at the beginning to scale all GWAS marginal effects so that they are in per-genotype-SD units with the phenotypic variance equal to one (GWAS is usually performed using the 0/1/2 genotype coding, but the algorithm is written for GWAS effects with standardised genotypes, as shown in the lecture). This works because in simple regression \(se_j^2 \approx (var(y) - b_j^2 var(x_j))/(n\,var(x_j))\), so \(n\,se_j^2 + b_j^2 \approx var(y)/var(x_j)\) and the scale factor below equals \(sd(x_j)/sd(y)\).

  # Step 1. scale the GWAS SNP effects
  scale = sqrt(1/(n*se^2 + b^2))  # scale factors for marginal SNP effects
  b_scaled  = b*scale             # scaled marginal SNP effects (in units of genotype sd / phenotype sd)
  vary  = 1                       # phenotypic variance = 1 after the scaling

As MCMC begins, the next step is to sample the SNP effects. Note that we don’t have fixed effects as in BayesR, because the fixed effects were already accounted for when fitting the GWAS model. For each SNP, as in BayesR, if the SNP effect is not zero, we sample it from a normal distribution whose mean is the BLUP solution of the “per-SNP mixed model equation”. The only difference between SBayesR and BayesR is that, as you can see from the code below, \(X_j'y_{corr}\) is replaced by \(n b_{corr,j}\) and \(X_j'X_j\) is replaced by \(n\); everything else remains the same.

      rhs = n*bcorr[i] + n*oldSample          # right hand side of the mixed model equation
      rhs = rhs/vare
      invLhs = 1.0/(n/c(vare) + invSigmaSq)   # inverse of left hand side of the mixed model equation
      betaHat = invLhs*c(rhs)                 # BLUP solution
      
      # if the effect is not zero, we sample it from a normal distribution 
      beta[i] = rnorm(1, betaHat[delta], sqrt(invLhs[delta]))   # given the distribution membership, the posterior is a normal distribution
      
      ###################################################################
      # In contrast, this is what we have in individual-data-based BayesR
      rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample  # right hand side of the mixed model equation
      rhs = rhs/vare
      invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq)        # inverse of left hand side of the mixed model equation
      betaHat = invLhs*c(rhs)                           # BLUP solution
      ###################################################################

After obtaining a new sample of the SNP effect \(\beta_j\), instead of adjusting \(y\) as in BayesR, we adjust \(b_{corr}\) in SBayesR. This updating strategy is also known as the “right-hand side updating” strategy because essentially it updates the vector \(X'y\) (\(=nb\)), which is the right-hand side of the mixed model equations.

      if (delta > 1) {
        # ...
        bcorr = bcorr + R[,i]*(oldSample - beta[i])         # update bhatcorr with the new sample
        # ...
      } else {
        if (oldSample) bcorr = bcorr + R[,i]*oldSample      # update bhatcorr with the new sample which is zero
        # ...
      }

      ###################################################################
      # In contrast, this is what we have in individual-data-based BayesR
      if (delta > 1) {
        # ...
        ycorr = ycorr + X[,j]*(oldSample - beta[j])              # update ycorr with the new sample
        # ...
      } else {
        if (oldSample) ycorr = ycorr + X[,j]*oldSample           # update ycorr with the new sample which is zero
        # ...
      }
      ###################################################################

The advantage of this updating strategy is that we only need to update the elements of \(b_{corr}\) for SNPs in LD with the SNP being sampled, which can be a relatively small vector when the genome is partitioned into LD blocks.
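For intuition, here is a minimal sketch of a block-wise version of this update; the variables blockStart, blockEnd and Rblk are hypothetical and not part of sbayesr.R. SNPs outside the block have zero LD with SNP \(i\), so their elements of \(b_{corr}\) are unchanged.

      # sketch only: suppose SNP i lies in an LD block spanning blockStart[i]..blockEnd[i],
      # and Rblk holds the LD correlations within that block
      idx = blockStart[i]:blockEnd[i]   # SNPs in the same LD block as SNP i
      bcorr[idx] = bcorr[idx] + Rblk[, i - blockStart[i] + 1]*(oldSample - beta[i])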

The Gibbs samplers for \(\pi\) and \(\sigma^2_{\beta}\) are the same as in BayesR. To compute the genetic variance, we have

\[\sigma^2_g = \beta'R\beta = \beta'(b - b_{corr})\] because \(b_{corr} = b - R\beta\). Since the phenotypic variance is one, the SNP-based heritability equals the genetic variance.

    # Step 5. Compute the SNP-based heritability
    bRb = crossprod(beta, (b-bcorr))  # compute genetic variance = beta' R beta
    varg = bRb
    h2  = varg

We can also use the calculated genetic variance to estimate the residual variance. One caveat is that, unlike in BayesR, the residual variance is not guaranteed to be positive. When there are substantial LD differences between the GWAS and reference samples, substantial variation in per-SNP sample size, or even errors in the GWAS summary statistics, the residual variance can be negative. As the algorithm cannot proceed with a negative residual variance, we force it to be the phenotypic variance when this happens. In GCTB, we report an error message and require the user to check their data.

    sse = (vary - varg)*n
    vare = (sse + nue*scalee)/rchisq(1, n+nue)
    if (vare <= 0) vare = vary  # sometimes sse can be negative and would cause a problem

3.2 SBayesRC

The algorithm for SBayesRC (an extension of SBayesR to incorporate functional annotations) has been implemented in an R script, sbayesrc.R. Let’s load it in R and have a look.

source("/data/module5/scr/sbayesrc.R")

The differences between SBayesR and SBayesRC are highlighted and explained below.

First, we need to initiate additional variables related to the SNP annotations.

  # annotation related variables
  snpPi = matrix(rep(pi, m), byrow=TRUE, nrow=m, ncol=ndist)  # per-SNP pi given the SNP annotations
  alpha_mcmc = matrix(0, niter, ncol(anno)*(ndist-1))         # MCMC samples of annotation effects
  p = matrix(nrow = m, ncol = ndist-1)                        # per-SNP conditional probability p
  z = matrix(nrow = m, ncol = ndist-1)                        # per-SNP conditional distribution membership indicator 
  alpha = matrix(0, nrow = ncol(anno), ncol = ndist-1)        # vector of annotation effects
  sigmaSqAlpha = rep(1, ndist-1)                              # variance of annotation effects
  sigmaSqAlpha_mcmc = matrix(0, niter, ndist-1)               # MCMC samples of the variance of annotation effects
  nAnno = ncol(anno)                                          # number of annotations
  annoDiag = apply(anno, 2, crossprod)                        # diagonal values of A'A where A is the annotation matrix

Second, when sampling SNP effects, we record the conditional distribution membership:

      if (delta > 1) {
        # ...
        for(j in 1:(delta-1)){  # set the "step up" indicator variables to one
          z[i,j] = 1
        }
      } else {
        # ...
      }

Next, there is an extra step to sample the annotation effects. It may look complicated, but the sampling scheme is similar to how we sample the SNP effects with individual-level data. The main difference is that instead of using the genotype matrix as X and the phenotypes as y, here we use the annotation matrix as X and, as y, a latent variable (sampled from a truncated normal distribution) whose mean is a linear combination of the SNP annotations. The effect of an annotation in this model measures how much the annotation changes the prior probability that a SNP has a nonzero effect.

    # Step 4. Sample the annotation effects on the SNP distribution mixing probabilities
    for (j in 1:(ndist-1)) {  # there are ndist-1 conditional distributions
      if (j==1) {
        idx = 1:m             # for the first conditional distribution, data are all SNPs
        annoDiagj = annoDiag  # diagonal values of A'A
      } else {
        idx = which(z[,j-1] > 0)  # for the subsequent conditional distributions, data are the SNPs in the previous distribution
        if (length(idx) > 1) {
          annoDiagj = apply(as.matrix(anno[idx,]), 2, crossprod)  # recompute A'A depending on the SNP memberships
        }
      }
      
      if (length(idx)) {
        zj = z[idx,j]
        mu = anno[idx,] %*% matrix(alpha[,j])  # linear combination of annotations, which will be the mean of truncated normal distribution
        lj = array(0, length(zj))   # latent variable
        if (sum(zj==0)) lj[zj==0] = rtruncnorm(sum(zj==0), mean = mu[zj==0], sd = 1, a = -Inf, b = 0)   # sample latent variable
        if (sum(zj==1)) lj[zj==1] = rtruncnorm(sum(zj==1), mean = mu[zj==1], sd = 1, a = 0, b =  Inf)   # sample latent variable
        # sample annotation effects using Gibbs sampler (similar to the SNP effect sampler in the individual-level model)
        lj = lj - c(mu)  # adjust the latent variable by all annotation effects
        # sample intercept with a flat prior
        oldSample = alpha[1,j]
        rhs = sum(lj) + length(idx)*oldSample  # the number of observations is the number of SNPs in this conditional model
        invLhs = 1.0/length(idx)
        ahat = invLhs*rhs
        alpha[1,j] = rnorm(1, ahat, sqrt(invLhs))
        lj = lj + (oldSample - alpha[1,j])
        # sample each annotation effect with a normal prior
        if (nAnno > 1) {
          for (k in 2:nAnno) {
            oldSample = alpha[k,j]
            rhs = crossprod(anno[idx,k], lj) + annoDiagj[k]*oldSample;
            invLhs = 1.0/(annoDiagj[k] + 1.0/sigmaSqAlpha[j])
            ahat = invLhs*rhs
            alpha[k,j] = rnorm(1, ahat, sqrt(invLhs))
            lj = lj + anno[idx,k]*(oldSample - alpha[k,j])
          }
          # sample annotation effect variance from a scaled inverse chi-square distribution
          sigmaSqAlpha[j] = (sum(alpha[-1,j]^2) + 2)/rchisq(1, nAnno-1+2)
        }
      }
    }
    alpha_mcmc[iter,] = c(alpha)  # store MCMC samples of annotation effects 
    sigmaSqAlpha_mcmc[iter,] = sigmaSqAlpha

Given the sampled annotation effects, we can then compute, for each SNP, the conditional probability of “moving up” one step towards a larger effect size:

    # Step 5. Compute per-SNP conditional probabilities from the annotation effects
    p = apply(alpha, 2, function(x){pnorm(anno %*% x)})

Given the conditional probabilities, we can compute the joint probabilities (\(\pi\)) of each SNP belonging to each of the mixture distribution components:

    # Step 6. Compute the joint probabilities (pi) from the conditional probabilities (p)
    for (k in 1:ndist) {
      if (k < ndist) {  # all but the last component take a (1 - p) factor
        snpPi[,k] = 1.0 - p[,k]
      } else {
        snpPi[,k] = 1
      }
      if (k > 1) {
        for (kk in 1:(k-1)) {
          snpPi[,k] = snpPi[,k]*p[,kk]
        }
      }
    }
    #for example, we want
    #snpPi[,1] = 1 - p[,1]
    #snpPi[,2] = (1-p[,2])*p[,1]
    #snpPi[,3] = (1-p[,3])*p[,1]*p[,2]
    #snpPi[,4] = p[,1]*p[,2]*p[,3]

These \(\pi\) values will then be used to sample the SNP effects in the next iteration, just as in SBayesR.
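To make the connection concrete, here is a minimal sketch of how the SNP-specific priors enter the membership sampling step (logLik is a hypothetical vector of per-component log likelihoods for SNP \(i\); see sbayesrc.R for the actual code):

      probDelta = snpPi[i, ]*exp(logLik - max(logLik))             # the prior is now SNP-specific
      delta = sample(1:ndist, 1, prob = probDelta/sum(probDelta))  # sample the mixture component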

4 Illustration of principle

Load data in R.

# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))

# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))

4.1 SBLUP vs. BLUP

Run BLUP using the individual-level data as benchmark:

blup = function(X, y, lambda){
  nsnp = ncol(X)
  D = diag(c(0, rep(lambda,nsnp))) # first element is 0 because we do not add shrinkage parameter to the intercept
  W = cbind(1, X)
  coeff = crossprod(W) + D
  rhs = crossprod(W, y)
  return(solve(coeff, rhs)[-1]) # omit the first element which is for the intercept
}

lambda = 10
beta_blup = blup(X, y, lambda)

For SBLUP, we first need to obtain GWAS effect estimates (using 0/1/2 genotypes):

# run GWAS on the 0/1/2 genotypes
fit = apply(X, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b = fit[1,]
se = fit[2,] 

Compute LD correlation matrix:

R = cor(X)

Then we scale the GWAS effects to be in per-genotype-standard-deviation units:

nind = nrow(X)  # GWAS sample size
scale = sqrt(1/(nind*se^2 + b^2))  # calculate the scale factor for each SNP
b_scaled = b*scale  # scale the marginal effects

Now we are ready to run SBLUP:

sblup = function(b_scaled, scale, R, lambda, n){
  nsnp = nrow(R)                   # number of SNPs, taken from the LD matrix (avoid relying on the global X)
  I = diag(nsnp)
  coeff = R + I*lambda/n
  rhs = b_scaled
  return(solve(coeff, rhs)/scale)  # put the solution back to the original scale
}

beta_sblup = sblup(b_scaled, scale, R, lambda, nind)

Check whether BLUP and SBLUP solutions are the same.

print(cor(beta_blup, beta_sblup))
## [1] 0.9994172
plot(beta_blup, beta_sblup)
abline(a=0, b=1)

As you can see, when the full LD matrix from the GWAS sample is used, SBLUP is equivalent to BLUP. However, it is computationally prohibitive to use the genome-wide full LD matrix, and the individual genotypes of the GWAS sample are often not accessible. When a sparse LD matrix from a reference sample is used instead, the computation is much cheaper, but the predictive performance can only approach, not exceed, that of the analysis using individual-level data.
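As an optional check, you can also compare the out-of-sample accuracies of the two sets of effect estimates, using the Xval and yval objects loaded above (output not shown):

ghat_blup = Xval %*% beta_blup
ghat_sblup = Xval %*% beta_sblup
print(summary(lm(yval ~ ghat_blup))$r.squared)
print(summary(lm(yval ~ ghat_sblup))$r.squared)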

4.2 SBayesR vs. BayesR

The SBayesR method is implemented in sbayesr.R.

source("/data/module5/scr/sbayesr.R")

As in the SBLUP example, we use the marginal effects from the GWAS with unstandardised genotypes as the input data; the scaling factors are computed internally from the standard errors.

res_sbayesr = sbayesr(b, se, nind, R)
## 
##  iter  100, nnz =    5, sigmaSq =  1.866, h2 =  0.431, vare =  0.579
## 
##  iter  200, nnz =    8, sigmaSq =  1.189, h2 =  0.461, vare =  0.599
## 
##  iter  300, nnz =    8, sigmaSq =  1.342, h2 =  0.522, vare =  0.474
## 
##  iter  400, nnz =    8, sigmaSq =  2.079, h2 =  0.531, vare =  0.475
## 
##  iter  500, nnz =    3, sigmaSq =  1.116, h2 =  0.485, vare =  0.504
## 
##  iter  600, nnz =    2, sigmaSq =  6.982, h2 =  0.432, vare =  0.514
## 
##  iter  700, nnz =    3, sigmaSq =  0.926, h2 =  0.481, vare =  0.482
## 
##  iter  800, nnz =    2, sigmaSq =  1.573, h2 =  0.434, vare =  0.563
## 
##  iter  900, nnz =    8, sigmaSq =  1.147, h2 =  0.452, vare =  0.676
## 
##  iter 1000, nnz =    4, sigmaSq =  5.711, h2 =  0.453, vare =  0.572
## 
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.4352539 0.2682846 0.1784209 0.1180406 4.8800000 2.3695175 0.5001994 0.5021219
beta_sbayesr = colMeans(res_sbayesr$beta)

Run BayesR as benchmark:

source("/data/module5/scr/bayesr.R")
res_bayesr = bayesr(X, y)
## 
##  iter  100, nnz =    5, sigmaSq =  3.320, h2 =  0.515, vare =  3.840, varg =  4.071 
## 
##  iter  200, nnz =   10, sigmaSq =  1.314, h2 =  0.482, vare =  4.011, varg =  3.737 
## 
##  iter  300, nnz =    7, sigmaSq =  2.202, h2 =  0.518, vare =  3.753, varg =  4.037 
## 
##  iter  400, nnz =    9, sigmaSq =  9.493, h2 =  0.433, vare =  4.215, varg =  3.222 
## 
##  iter  500, nnz =    8, sigmaSq =  7.024, h2 =  0.458, vare =  3.910, varg =  3.303 
## 
##  iter  600, nnz =   10, sigmaSq =  1.969, h2 =  0.469, vare =  3.835, varg =  3.390 
## 
##  iter  700, nnz =    7, sigmaSq =  3.303, h2 =  0.486, vare =  4.011, varg =  3.798 
## 
##  iter  800, nnz =    6, sigmaSq =  4.349, h2 =  0.548, vare =  3.692, varg =  4.482 
## 
##  iter  900, nnz =    2, sigmaSq =  1.167, h2 =  0.413, vare =  4.414, varg =  3.104 
## 
##  iter 1000, nnz =    3, sigmaSq =  2.773, h2 =  0.528, vare =  4.015, varg =  4.497 
## 
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.2859364 0.2966387 0.2160459 0.2013790 7.0340000 3.7725760 0.4996304 3.8990448 
##      Varg 
## 3.9049359
beta_bayesr = colMeans(res_bayesr$beta)

Question: Are BayesR and SBayesR SNP effect estimates the same? What could possibly cause the difference?

cor(beta_bayesr, beta_sbayesr)
## [1] 0.9979863
plot(beta_bayesr, beta_sbayesr)
abline(a=0, b=1)

The small differences are mostly due to variation in MCMC sampling, known as Monte Carlo variance.
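You can also verify that the two methods give similar prediction accuracy in the validation sample (sketch; output not shown):

ghat_bayesr = Xval %*% beta_bayesr
ghat_sbayesr = Xval %*% beta_sbayesr
print(summary(lm(yval ~ ghat_bayesr))$r.squared)
print(summary(lm(yval ~ ghat_sbayesr))$r.squared)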

The posterior inclusion probabilities (PIPs) are also expected to be similar between SBayesR and BayesR:

delta_bayesr = (res_bayesr$beta != 0)  # indicator variable for each SNP in each MCMC cycle
pip_bayesr = colMeans(delta_bayesr)    # frequency of the indicator variable being one across MCMC cycles

delta_sbayesr = (res_sbayesr$beta != 0)
pip_sbayesr = colMeans(delta_sbayesr)

plot(pip_bayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="BayesR")

plot(pip_sbayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesR")

4.3 SBayesRC

Due to the extensive LD among SNPs, it is hard to identify a causal variant with a smaller effect (SNP 5 only has a PIP of about 0.5). Incorporating functional annotations provides orthogonal information about the causal variants and may therefore help pinpoint them. Here, we will show in principle how this works. Suppose both causal variants are in coding sequence and the other SNPs are in non-coding sequence. We can incorporate this functional annotation information in SBayesRC.

A slightly simplified R implementation of SBayesRC (it uses the LD correlation matrix rather than the eigen-decomposition data described in the paper) can be loaded with:

source("/data/module5/scr/sbayesrc.R")

SBayesRC requires a table of SNP annotations. For illustration purposes, we will annotate each SNP by whether it is a causal variant. Of course, this is an extreme case. In practice, however, we do expect the causal variants to share some important functional annotations, such as being non-synonymous variants.

The SNP annotation table has dimensions of the number of SNPs (rows) by the number of annotations (there can be multiple) plus a leading column of ones (the intercept of the generalised linear model that links annotations to SNP effects).

int = rep(1,10)   # a vector of one as intercept
cv = c(1,0,0,0,1,0,0,0,0,0)  # use causal variant indication as annotation
anno = cbind(int, cv)
print(anno)
##       int cv
##  [1,]   1  1
##  [2,]   1  0
##  [3,]   1  0
##  [4,]   1  0
##  [5,]   1  1
##  [6,]   1  0
##  [7,]   1  0
##  [8,]   1  0
##  [9,]   1  0
## [10,]   1  0

Now we are ready to run SBayesRC with this annotation data. Note that although we provide SNP annotations, no weights are given to the annotations a priori. The method estimates the impact of the annotations jointly with the SNP effects from the data. This is a unified Bayesian learning method, in contrast to other methods that require a stepwise analysis in which the annotation impact is estimated in a first step.

res_sbayesrc = sbayesrc(b, se, nind, R, anno)
## 
##  iter  100, nnz =    2, sigmaSq =  8.347, h2 =  0.483, vare =  0.519 
## 
##  iter  200, nnz =    2, sigmaSq =  2.033, h2 =  0.428, vare =  0.623 
## 
##  iter  300, nnz =    2, sigmaSq =  4.797, h2 =  0.462, vare =  0.529 
## 
##  iter  400, nnz =    2, sigmaSq =  2.974, h2 =  0.538, vare =  0.463 
## 
##  iter  500, nnz =    4, sigmaSq =  1.874, h2 =  0.495, vare =  0.548 
## 
##  iter  600, nnz =    2, sigmaSq =  2.610, h2 =  0.438, vare =  0.516 
## 
##  iter  700, nnz =    3, sigmaSq =  1.392, h2 =  0.539, vare =  0.403 
## 
##  iter  800, nnz =    2, sigmaSq =  2.637, h2 =  0.492, vare =  0.563 
## 
##  iter  900, nnz =    2, sigmaSq =  3.112, h2 =  0.549, vare =  0.495 
## 
##  iter 1000, nnz =    2, sigmaSq =  1.627, h2 =  0.631, vare =  0.386 
## 
## Posterior mean of model parameters:
##         Pi1         Pi2         Pi3         Pi4         Nnz     SigmaSq 
## 0.783880984 0.001908917 0.214210099 0.001866270 2.116000000 2.738060922 
##          h2        Vare 
## 0.500648554 0.498333755 
## 
## Annotation effects:
##            p2         p3         p4
## int -2.221570 9.04731014 -6.2827290
## cv   4.240664 0.04895424 -0.1120171
## 
## Conditional probabilities:
##         p2 p3 p4
## int 0.0132  1  0
## cv  0.9783  1  0
## 
## Joint probabilities:
##        Pi1 Pi2    Pi3 Pi4
## int 0.9868   0 0.0132   0
## cv  0.0217   0 0.9783   0

Let’s have a look at the output. First, you may have noticed that Nnz (the number of nonzero effects) is about 2, the same as the number of simulated causal variants, and h2 (the SNP-based heritability) is very close to the true value of 0.5. Second, the largest annotation effect is in the cell at row cv and column p2. This means the annotation has greatly increased the prior probability of a nonzero effect for the annotated SNPs. The same can be seen in the Conditional probabilities result, where the cell at row cv and column p2 is close to one, indicating that the probability of an annotated SNP moving from the zero-effect distribution to a nonzero-effect distribution is almost one! From the Joint probabilities result, we can see that nearly all annotated SNPs are in the medium effect size distribution (Pi3).

Question: Putting all these results together, what would you conclude?




The model parameter estimates suggest that our annotation is very important. Now, how has it helped with identifying the causal variants and with polygenic prediction?

Let’s check the PIP of SNPs from SBayesRC.

delta_sbayesrc = (res_sbayesrc$beta != 0)  # indicator variable for each SNP in each MCMC cycle
pip_sbayesrc = colMeans(delta_sbayesrc)    # frequency of the indicator variable being one across MCMC cycles

plot(pip_sbayesrc, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesRC")

Question: How would you interpret the result?




Now both SNP 1 and SNP 5 (our causal variants) pass the threshold of 0.9, suggesting that the posterior probability of each being a causal variant is greater than 0.9. It also implies that the false discovery rate for identifying them as causal variants is less than 0.1.

You can also check the prediction accuracy.

beta_sbayesrc = colMeans(res_sbayesrc$beta)
ghat_sbayesrc=Xval %*% beta_sbayesrc
# prediction accuracy
print(summary(lm(yval ~ ghat_sbayesrc))$r.squared)
## [1] 0.5160346
# prediction bias
print(summary(lm(yval ~ ghat_sbayesrc))$coeff)
##               Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)   0.699861  0.5451592 1.283774 2.093842e-01
## ghat_sbayesrc 1.074184  0.1931734 5.560723 5.347445e-06

5 UKB data analysis using GCTB

5.1 Run SBayesR using GCTB

A tutorial for using GCTB to run SBayesR can be found at https://cnsgenomics.com/software/gctb/#SBayesRCTutorial.

In our previous papers, we used a shrunk LD matrix for SBayesR. Here, we will run SBayesR and SBayesRC with an eigen-decomposition of the LD matrix (the low-rank model). This method was published recently in Nature Genetics (Zheng et al. 2024) and has more robust performance.

5.1.1 Step 1: Get GWAS summary data.

GCTB also uses a .ma file to input the GWAS summary data.
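The .ma file follows the GCTA-COJO format: a header line followed by one row per SNP, with columns for SNP ID, effect allele (A1), other allele (A2), A1 frequency, marginal effect estimate, standard error, p-value, and sample size. A made-up example (the values are illustrative only):

SNP A1 A2 freq b se p N
rs1001 A G 0.8493 0.0024 0.0055 0.6653 340000
rs1002 C G 0.0306 -0.0045 0.0115 0.6958 340000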

Since the computing time and memory consumption of SBayesR are independent of the GWAS sample size, and only summary-level data are required, we are able to run SBayesR with a much larger GWAS data set. Find the GWAS summary data file generated from the trait simulated using a larger UK Biobank GWAS sample (n = 340,000):

ls /data/module5/ukb/ukb_matched.ma

5.1.2 Step 2: Compute LD matrices

We will partition the whole genome into approximately independent LD blocks, each at least 4 cM long. The LD block information is in the file ‘/data/module5/ukb/ref4cM_v37.pos’. Use less to take a look at the file. We will compute the LD correlation matrix for each block. This process can be parallelised with multiple threads.

## SBayesR-eigen
bfile="/data/module5/ukb/gwas"
blockinfo="/data/module5/ukb/ref4cM_v37.pos"

gctb --bfile $bfile \
     --make-block-ldm \
     --block-info $blockinfo \
     --thread 4 \
     --out ldm

After running this command, a folder called ‘ldm’ will be generated. It contains the binary data for the LD matrices and two text files, ‘ldm.info’ and ‘snp.info’, which store the LD block and SNP information. Feel free to have a look at these files.
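For example, to peek at the first few lines of each text file:

head ldm/ldm.info
head ldm/snp.info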

5.1.3 Step 3: Compute eigen decomposition on the LD matrices.

The next step is to perform an eigen decomposition of each generated LD matrix, as required by the low-rank model. The command below does this job.

gctb --ldm ldm --make-ldm-eigen --thread 4 --out ldm

5.1.4 Step 4: Impute the SNPs that are in the LD reference but missing from the GWAS summary data, and perform quality control (QC)

gwas="/data/module5/ukb/ukb_matched.ma"

gctb --ldm-eigen ldm --gwas-summary $gwas --impute-summary --thread 4 --out ukb_matched

It will generate a new file ‘ukb_matched.imputed.ma’ which will be used for the SBayesR analysis in the next step.

5.1.5 Step 5: Run SBayesR analysis

Putting the LD eigen decomposition data and GWAS summary data together, we are ready to run SBayesR with the low-rank model.

gctb --ldm-eigen ldm \
     --gwas-summary ukb_matched.imputed.ma \
     --sbayes R \
     --thread 4 \
     --out sbayesr

The result file sbayesr.parRes shows the posterior estimates of the model parameters, and sbayesr.snpRes shows the estimates of the joint SNP effects.
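To turn the snpRes file into polygenic scores for a target sample, one common option is PLINK’s --score function. The command below is a sketch only: the target PLINK file path is hypothetical, and the column indices (2 = SNP name, 5 = effect allele, 8 = joint effect estimate) should be checked against the header of your own snpRes file before running.

target="/data/module5/ukb/target"   # hypothetical path to the target PLINK files
plink --bfile $target --score sbayesr.snpRes 2 5 8 header sum center --out sbayesr_pgs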

5.2 Run SBayesRC with functional annotations using GCTB

Our simulated data do not include functional annotation information. Here, we will use data for UKB height and 96 functional annotations from the S-LDSC model to demonstrate how to run SBayesRC using GCTB. The command is similar to running SBayesR: just replace --sbayes R with --sbayes RC and add the annotation file with --annot [filename]. Note that you can also run SBayesRC using an R package that we have developed: https://github.com/zhilizheng/SBayesRC.

There are 7.4M SNPs in the summary statistics for UKB height. The first step is to match the SNPs to the LD reference and impute those that are missing. We will use the LD reference data generated above.

The default MCMC chain length is 3,000 iterations, with the first 1,000 as burn-in. Here, in the interest of time, we set a shorter chain length because the model incorporating annotations takes longer to run. For your own real-trait analysis, it is recommended to use the default setting or an even longer chain.

gwas="/data/module5/ukb/HT.ma"
annot="/data/module5/ukb/annotations.txt"

## Match and impute SNPs
gctb --ldm-eigen ldm --gwas-summary $gwas --impute-summary --thread 4 --out HT

## Run SBayesRC
gctb --ldm-eigen ldm \
     --gwas-summary HT.imputed.ma \
     --annot $annot \
     --sbayes RC \
     --thread 4 \
     --chain-length 1000 \
     --burn-in 200 \
     --out HT_sbayesrc

Besides the standard output files you have seen for SBayesR, it will also generate a file ‘HT_sbayesrc.parSetRes’, which shows the results for the parameter sets relating to the functional annotations.

For example, you can check the SNP-based heritability enrichment across the functional annotations in R:

res = read.table("HT_sbayesrc.parSetRes", header=F)
names(res) = c("Par","Anno","Est","SE","PP")  # the columns are parameter, annotation, estimate, standard error, and posterior probability of greater than zero (or 1 if for enrichment)
hsq = subset(res, Par == "AnnoPerSnpHsq_Enrichment")

# plot the per-SNP heritability enrichment for all annotations
library(ggplot2)
hsq$Anno = factor(hsq$Anno, levels=hsq$Anno[order(hsq$Est, decreasing = T)])
p1 = ggplot(hsq, aes(x=Anno, y=Est, fill=Anno)) + geom_bar(stat="identity", position="dodge") + 
  geom_errorbar(aes(ymin=Est-SE, ymax=Est+SE, color=Anno)) + geom_hline(yintercept=1, linetype="dashed") +
  xlab("Functional annotation") + ylab("Per-SNP heritability enrichment") +
  guides(fill="none", color="none") +
  theme(axis.text = element_text(size=5, angle = 90, vjust = 0.5, hjust=1))
ggsave("height_h2_enrich_all.pdf", p1, width=12, height=6)
p1

Highlight some important functional annotations:

# highlight some annotations
selected = c("Ancient_Sequence_Age_Human_Promoter", "Human_Promoter_Villar_ExAC", "Intron_UCSC", "Coding_UCSC", "Conserved_LindbladToh", "TSS_Hoffman", "UTR_5_UCSC", "Enhancer_Andersson", "UTR_3_UCSC", "Transcr_Hoffman", "Repressed_Hoffman")
p2 = ggplot(subset(hsq, Anno %in% selected), aes(x=Anno, y=Est, fill=Anno)) + geom_bar(stat="identity", position="dodge") + 
  geom_errorbar(aes(ymin=Est-SE, ymax=Est+SE, color=Anno)) + geom_hline(yintercept=1, linetype="dashed") +
  xlab("Functional annotation") + ylab("Per-SNP heritability enrichment") +
  guides(fill="none", color="none") +
  theme(axis.text = element_text(angle = 90, vjust = 0.5, hjust=1))
ggsave("height_h2_enrich_highlight.pdf", p2, width=12, height=6)
p2

With the SBayesRC results, we can gain some insight into the functional architecture of the trait:

library(stringr)
library(dplyr)
library(tidyr)

# Extract Pi parameter estimates
pi = res[str_detect(res$Par, "AnnoJointProb"), ]
pi = pi %>% separate(Par, into = c("Pi", "Component"), sep = "_pi")

# Identify the `Est` value for the `Intercept` row within each component
intercept_values <- pi %>% 
  filter(Anno == "Intercept") %>% 
  select(Component, Est) %>% 
  rename(Intercept_Est = Est)

# Merge these intercept values back to the original data frame
pi_with_intercept <- pi %>% left_join(intercept_values, by = "Component")

# Subtract the intercept value from the `Est` values
pi_with_intercept <- pi_with_intercept %>% mutate(Adjusted_Est = Est - Intercept_Est)

# Create a new data frame with the adjusted values
adjusted_pi <- pi_with_intercept %>% select(Component, Anno, Adjusted_Est, SE, PP)

# plot the result focusing on the selected annotation and omit component 5 because there is no signal in there
facet_labels <- c("1" = "Zero", "2" = "Small effect", "3" = "Moderate effect", "4" = "Large effect")
p3 = ggplot(subset(adjusted_pi, Anno %in% selected & Component != 5), aes(x=Anno, y=Adjusted_Est, fill=Anno)) + geom_bar(stat="identity", position="dodge") + 
  geom_errorbar(aes(ymin=Adjusted_Est-SE, ymax=Adjusted_Est+SE, color=Anno)) + geom_hline(yintercept=0, linetype="dashed") +
  facet_grid(.~Component, scales="free", labeller = labeller(Component = facet_labels)) +
  xlab("Functional annotation") + ylab("Deviation of Pi to average") +
  coord_flip() +
  guides(fill="none", color="none") +
  theme(axis.text = element_text(angle = 0, vjust = 0.5, hjust=1))
ggsave("height_pi_enrich_highlight.pdf", p3, width=12, height=6)
p3

Despite large standard errors, evolutionarily conserved regions, coding sequences, promoters and transcription start sites (TSS) tend to be enriched for causal variants with large effect sizes, whereas repressed regions and introns tend to be depleted of causal variants. Such information learned from the data will, in turn, help to better estimate the SNP effects and therefore improve prediction. Since we cannot provide validation with real genotypes and phenotypes, we do not perform prediction with the UKB height data in this practical exercise.