We have learned Bayesian methods for polygenic prediction using individual-level genotypes and phenotypes. Recently, these methodologies have evolved to require only summary statistics from GWAS. In this practical, you will see how the individual-data- and summary-data-based methods are connected and where they differ (e.g., BayesR vs. SBayesR). Besides removing the need to access individual-level data, methods using summary statistics (sumstats) are also more computationally efficient, allowing us to perform the same analysis with a fraction of the computing resources. In our latest research, we exploit an eigen-decomposition approach to further boost computational efficiency. This new method (SBayesRC) allows us to fit millions of SNPs simultaneously and to incorporate functional annotations. As in the previous practical, we will use a small data set to illustrate the principles of SBayesR, and then use GCTB to run a large data set from the UK Biobank (UKB) with SBayesR and SBayesRC.
Note: the small data set can be freely downloaded. The UK Biobank data set cannot be downloaded because it contains real individual genotypes, even though individual IDs are masked.
This toy data set includes:
The trait was simulated such that SNP 1 has an effect size of 2 and SNP 5 has an effect size of 1. The trait heritability is set at 0.5.
Data location:
ls /data/module5/toy/
This data set is based on real genotypes at 273,604 SNPs from the UK Biobank, with a simulated trait affected by 100 causal variants and heritability 0.5. The samples are split into three independent cohorts:
Data location:
ls /data/module5/ukb/
This section helps you understand the R code used in the method.
Do not run the R code below in this section.
The algorithm for SBayesR (Summary-data-based BayesR) is implemented in sbayesr.R.
cat /data/module5/scr/sbayesr.R
Here we focus on the code that differs from the individual-data-based BayesR implemented in bayesr.R.
Compared to BayesR, SBayesR includes an additional step at the beginning to scale all GWAS marginal effects so that they are in per-genotype-standard-deviation units with the phenotypic variance equal to 1. GWAS is usually performed on the 0/1/2 genotypes, whereas the algorithm assumes standardised genotypes, as shown in the lecture.
# Step 1. scale the GWAS SNP effects
scale = sqrt(1/(n*se^2 + b^2)) # scale factors for marginal SNP effects
b_scaled = b*scale # scaled marginal SNP effects (in units of genotype sd / phenotype sd)
vary = 1 # phenotypic variance = 1 after the scaling
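To see what this scaling does, here is a minimal numerical sketch (simulated toy data; all names made up): the scaled marginal effect should approximately match the effect obtained by regressing the standardised phenotype on the standardised genotype.
set.seed(1)
n = 1000
x = rbinom(n, 2, 0.3)                 # 0/1/2 genotypes
y = c(scale(0.2*x + rnorm(n)))        # phenotype standardised to variance 1
fit = summary(lm(y ~ x))$coefficients[2, 1:2]
b = fit[1]; se = fit[2]               # GWAS marginal effect and its standard error
b_scaled = b*sqrt(1/(n*se^2 + b^2))   # scale factor as in sbayesr.R
coef(lm(y ~ scale(x)))[2]             # approximately equal to b_scaled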
Once MCMC begins, the next step is to sample the SNP effects. Unlike BayesR, SBayesR does not include fixed effects, because they have already been adjusted for in the GWAS. For each SNP, if its effect is not zero, we sample it from a normal distribution whose mean is the BLUP solution to the “per-SNP mixed model equation”. The only difference between SBayesR and BayesR in this part is the use of summary-data equivalents:
rhs = n*bcorr[i] + n*oldSample # right hand side of the mixed model equation
rhs = rhs/vare
invLhs = 1.0/(n/c(vare) + invSigmaSq) # inverse of left hand side of the mixed model equation
betaHat = invLhs*c(rhs) # BLUP solution
# if the effect is not zero, we sample it from a normal distribution
beta[i] = rnorm(1, betaHat[delta], sqrt(invLhs[delta])) # given the distribution membership, the posterior is a normal distribution
###################################################################
# In contrast, this is what we have in individual-data-based BayesR
rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample # right hand side of the mixed model equation
rhs = rhs/vare
invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq) # inverse of left hand side of the mixed model equation
betaHat = invLhs*c(rhs) # BLUP solution
###################################################################
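In both versions, the full conditional of a nonzero SNP effect is normal. Written out for the summary-data version (with standardised genotypes, so \(x_j'x_j = n\)), the update in the code above corresponds to
\[\beta_j \mid \cdot \;\sim\; N\!\left(\frac{n\,(b^{corr}_j + \beta_j^{old})/\sigma^2_e}{n/\sigma^2_e + 1/(\gamma_k\sigma^2_\beta)},\;\left(\frac{n}{\sigma^2_e} + \frac{1}{\gamma_k\sigma^2_\beta}\right)^{-1}\right),\]
where \(\gamma_k\sigma^2_\beta\) is the effect variance of the sampled mixture component (invSigmaSq in the code corresponds to \(1/(\gamma_k\sigma^2_\beta)\)) and \(\beta_j^{old}\) is the current value of the effect (oldSample).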
After sampling the SNP effect \(\beta_j\), instead of adjusting \(y\) (as in BayesR), we update \(b_{corr}\) in SBayesR. This strategy is known as “right-hand side updating”, because it updates the vector of \(X'y\) (\(=nb\)), the right-hand side of the mixed model equations.
if (delta > 1) {
  # ...
  bcorr = bcorr + R[,i]*(oldSample - beta[i]) # update bcorr with the new sample
  # ...
} else {
  if (oldSample) bcorr = bcorr + R[,i]*oldSample # the new sample is zero, so just add back the old sample
  # ...
}
###################################################################
# In contrast, this is what we have in individual-data-based BayesR
if (delta > 1) {
  # ...
  ycorr = ycorr + X[,j]*(oldSample - beta[j]) # update ycorr with the new sample
  # ...
} else {
  if (oldSample) ycorr = ycorr + X[,j]*oldSample # the new sample is zero, so just add back the old sample
  # ...
}
###################################################################
This approach is efficient because \(b_{corr}\) is only updated for SNPs in LD with SNP \(i\), which is typically a smaller subset within an LD block.
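To convince yourself that the incremental update is exact, here is a minimal sketch (toy objects made up for illustration) showing that it reproduces a full recomputation of \(b_{corr} = b - R\beta\):
set.seed(1)
m = 5
R = cor(matrix(rnorm(100*m), ncol = m))        # toy LD correlation matrix
beta = rnorm(m)                                # current effect samples
b = c(R %*% beta) + rnorm(m, sd = 0.01)        # toy marginal effects
bcorr = b - c(R %*% beta)                      # full computation
oldSample = beta[2]; beta[2] = rnorm(1)        # draw a new sample for SNP 2
bcorr = bcorr + R[,2]*(oldSample - beta[2])    # right-hand-side update
all.equal(bcorr, b - c(R %*% beta))            # TRUE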
The Gibbs samplers for \(\pi\) and \(\sigma^2_{\beta}\) are the same as in BayesR. To compute genetic variance, we use
\[\sigma^2_g = \beta'R\beta = \beta'(b - b_{corr})\] because \(b_{corr} = b - R\beta\). Since phenotypic variance is set to 1, SNP-based heritability equals the genetic variance.
# Step 5. Compute the SNP-based heritability
bRb = crossprod(beta, (b-bcorr)) # compute genetic variance = beta' R beta
varg = bRb
h2 = varg
We can also estimate the residual variance. However, unlike BayesR, the residual variance is not guaranteed to be positive. Issues like LD mismatches between GWAS and reference samples, variation in per-SNP sample size, or errors in the GWAS summary statistics can result in a negative residual variance estimate. When this happens, the algorithm cannot proceed, so we force the residual variance to equal 1 (phenotypic variance).
sse = (vary - varg)*n
vare = (sse + nue*scalee)/rchisq(1, n+nue)
if (vare <= 0) vare = vary # sometimes sse can be negative and would cause a problem
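Written as a distribution, the draw above is from a scaled inverse chi-square, with prior degrees of freedom \(\nu_e\) (nue in the code) and prior scale \(S_e\) (scalee):
\[\sigma^2_e \mid \cdot \;\sim\; \frac{\mathrm{SSE} + \nu_e S_e}{\chi^2_{n+\nu_e}}, \qquad \mathrm{SSE} = n(\sigma^2_y - \sigma^2_g).\]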
The algorithm for SBayesRC (an extension of SBayesR to incorporate functional annotations) is implemented in sbayesrc.R.
cat /data/module5/scr/sbayesrc.R
Here are the key differences between SBayesR and SBayesRC.
First, we need to initialise additional variables related to the SNP annotations.
# annotation related variables
snpPi = matrix(rep(pi, m), byrow=TRUE, nrow=m, ncol=ndist) # per-SNP pi given the SNP annotations
alpha_mcmc = matrix(0, niter, ncol(anno)*(ndist-1)) # MCMC samples of annotation effects
p = matrix(nrow = m, ncol = ndist-1) # per-SNP conditional probability p
z = matrix(nrow = m, ncol = ndist-1) # per-SNP conditional distribution membership indicator
alpha = matrix(0, nrow = ncol(anno), ncol = ndist-1) # vector of annotation effects
sigmaSqAlpha = rep(1, ndist-1) # variance of annotation effects
sigmaSqAlpha_mcmc = matrix(0, niter, ndist-1) # MCMC samples of the variance of annotation effects
nAnno = ncol(anno) # number of annotations
annoDiag = apply(anno, 2, crossprod) # diagonal values of A'A where A is the annotation matrix
Second, when sampling SNP effects, we record the conditional distribution membership for the SNP:
if (delta > 1) {
  # ...
  for (j in 1:(delta-1)) { # set the "step-up" indicator variables to one
    z[i,j] = 1
  }
} else {
  # ...
}
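For example, with four mixture components (ndist = 4) there are three step-up indicators per SNP, and the membership encodes a sequence of step-up decisions (a sketch of the encoding, assuming the elided branches record a zero for the failed step):
# delta = 1 (zero effect):   z[i,] = c(0, NA, NA)   # failed the first step-up
# delta = 2 (small effect):  z[i,] = c(1, 0, NA)
# delta = 3 (medium effect): z[i,] = c(1, 1, 0)
# delta = 4 (large effect):  z[i,] = c(1, 1, 1)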
Next, there is an extra step to sample the annotation effects. It may look complicated, but the sampling scheme is similar to how we sample the SNP effects with individual-level data. The main difference is that instead of using the genotype matrix as X and the phenotypes as y, here we use the annotation matrix as X and, as y, a latent variable (sampled from a truncated normal distribution) whose mean is a linear combination of the SNP annotations. The effect of an annotation in this model indicates how much it changes the prior probability that a SNP has a nonzero effect.
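This is the classic probit data-augmentation scheme: given the membership indicator \(z_{ij}\), the latent variable \(l_{ij}\) is drawn from a normal distribution with mean \(A_i\alpha_j\) (the linear predictor from the annotations) truncated to be positive if \(z_{ij}=1\) and negative if \(z_{ij}=0\), and the implied prior step-up probability is a probit function of the annotations:
\[p_{ij} = \Pr(z_{ij}=1) = \Phi(A_i\alpha_j),\]
where \(A_i\) is the \(i\)-th row of the annotation matrix and \(\Phi\) is the standard normal distribution function (pnorm in Step 5 below).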
# Step 4. Sample the annotation effects on the SNP distribution mixing probabilities
for (j in 1:(ndist-1)) { # have ndist-1 number of conditional distributions
  if (j==1) {
    idx = 1:m # for the first conditional distribution, data are all SNPs
    annoDiagj = annoDiag # diagonal values of A'A
  } else {
    idx = which(z[,j-1] > 0) # for the subsequent conditional distributions, data are the SNPs in the previous distribution
    if (length(idx) > 1) {
      annoDiagj = apply(as.matrix(anno[idx,]), 2, crossprod) # recompute A'A depending on the SNP memberships
    }
  }
  if (length(idx)) {
    zj = z[idx,j]
    mu = anno[idx,] %*% matrix(alpha[,j]) # linear combination of annotations, which will be the mean of the truncated normal distribution
    lj = array(0, length(zj)) # latent variable
    if (sum(zj==0)) lj[zj==0] = rtruncnorm(sum(zj==0), mean = mu[zj==0], sd = 1, a = -Inf, b = 0) # sample latent variable
    if (sum(zj==1)) lj[zj==1] = rtruncnorm(sum(zj==1), mean = mu[zj==1], sd = 1, a = 0, b = Inf) # sample latent variable
    # sample annotation effects using a Gibbs sampler (similar to the SNP effect sampler in the individual-level model)
    lj = lj - c(mu) # adjust the latent variable by all annotation effects
    # sample the intercept with a flat prior
    oldSample = alpha[1,j]
    rhs = sum(lj) + m*oldSample
    invLhs = 1.0/m
    ahat = invLhs*rhs
    alpha[1,j] = rnorm(1, ahat, sqrt(invLhs))
    lj = lj + (oldSample - alpha[1,j])
    # sample each annotation effect with a normal prior
    if (nAnno > 1) {
      for (k in 2:nAnno) {
        oldSample = alpha[k,j]
        rhs = crossprod(anno[idx,k], lj) + annoDiagj[k]*oldSample
        invLhs = 1.0/(annoDiagj[k] + 1.0/sigmaSqAlpha[j])
        ahat = invLhs*rhs
        alpha[k,j] = rnorm(1, ahat, sqrt(invLhs))
        lj = lj + anno[idx,k]*(oldSample - alpha[k,j])
      }
      # sample the annotation effect variance from a scaled inverse chi-square distribution
      sigmaSqAlpha[j] = (sum(alpha[-1,j]^2) + 2)/rchisq(1, nAnno-1+2)
    }
  }
}
alpha_mcmc[iter,] = c(alpha) # store MCMC samples of annotation effects
sigmaSqAlpha_mcmc[iter,] = sigmaSqAlpha
Given the sampled annotation effects, we can then compute the conditional probabilities for each SNP which determine how likely the SNP is to belong to higher-effect mixture components.
# Step 5. Compute per-SNP conditional probabilities from the annotation effects
p = apply(alpha, 2, function(x){pnorm(anno %*% x)})
Given the conditional probabilities, we can compute the joint probabilities (\(\pi\)) of each SNP belonging to each of the mixture components:
# Step 6. Compute the joint probabilities (pi) from the conditional probabilities (p)
for (k in 1:ndist) {
  if (k < ndist) { # the last component has no further "step-up" probability
    snpPi[,k] = 1.0 - p[,k]
  } else {
    snpPi[,k] = 1
  }
  if (k > 1) {
    for (kk in 1:(k-1)) {
      snpPi[,k] = snpPi[,k]*p[,kk]
    }
  }
}
# For example, with ndist = 4 we want:
# snpPi[,1] = 1 - p[,1]
# snpPi[,2] = (1-p[,2])*p[,1]
# snpPi[,3] = (1-p[,3])*p[,1]*p[,2]
# snpPi[,4] = p[,1]*p[,2]*p[,3]
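As a quick sanity check with made-up step-up probabilities, the joint probabilities telescope and always sum to one:
p1 = 0.3; p2 = 0.6; p3 = 0.1  # made-up conditional (step-up) probabilities
pi1 = 1 - p1
pi2 = (1 - p2)*p1
pi3 = (1 - p3)*p1*p2
pi4 = p1*p2*p3
pi1 + pi2 + pi3 + pi4         # equals 1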
These \(\pi\) values are then used in the next iteration of SNP effect sampling, just like in SBayesR.
Load data in R.
# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))
# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))
The SBayesR method is implemented in sbayesr.R.
source("/data/module5/scr/sbayesr.R")
## Warning: package 'MCMCpack' was built under R version 4.4.1
First, we need to obtain GWAS effect estimates (using 0/1/2 genotypes):
# run GWAS on the 0/1/2 genotypes
fit = apply(X, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b = fit[1,]
se = fit[2,]
Compute the LD correlation matrix:
R = cor(X)
Then we scale the GWAS effects to be in per-genotype-standard-deviation units:
nind = nrow(X) # GWAS sample size
scale = sqrt(1/(nind*se^2 + b^2)) # calculate the scale factor for each SNP
b_scaled = b*scale # scale the marginal effects
Now we are ready to run SBayesR:
res_sbayesr = sbayesr(b, se, nind, R)
##
## iter 100, nnz = 5, sigmaSq = 1.708, h2 = 0.464, vare = 0.487
##
## iter 200, nnz = 2, sigmaSq = 3.075, h2 = 0.434, vare = 0.642
##
## iter 300, nnz = 5, sigmaSq = 1.136, h2 = 0.488, vare = 0.538
##
## iter 400, nnz = 7, sigmaSq = 1.237, h2 = 0.524, vare = 0.503
##
## iter 500, nnz = 5, sigmaSq = 0.614, h2 = 0.475, vare = 0.526
##
## iter 600, nnz = 6, sigmaSq = 5.741, h2 = 0.510, vare = 0.508
##
## iter 700, nnz = 3, sigmaSq = 1.305, h2 = 0.506, vare = 0.464
##
## iter 800, nnz = 5, sigmaSq = 0.961, h2 = 0.482, vare = 0.533
##
## iter 900, nnz = 3, sigmaSq = 2.066, h2 = 0.504, vare = 0.462
##
## iter 1000, nnz = 6, sigmaSq = 1.701, h2 = 0.514, vare = 0.533
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.4339769 0.2486935 0.1943400 0.1229896 4.8420000 2.1536359 0.5034002 0.4974042
beta_sbayesr = colMeans(res_sbayesr$beta)
Run BayesR as a benchmark:
source("/data/module5/scr/bayesr.R")
res_bayesr = bayesr(X, y)
##
## iter 100, nnz = 10, sigmaSq = 1.418, h2 = 0.451, vare = 4.020, varg = 3.304
##
## iter 200, nnz = 7, sigmaSq = 2.304, h2 = 0.508, vare = 3.825, varg = 3.948
##
## iter 300, nnz = 8, sigmaSq = 4.110, h2 = 0.516, vare = 3.888, varg = 4.150
##
## iter 400, nnz = 5, sigmaSq = 3.808, h2 = 0.546, vare = 4.057, varg = 4.882
##
## iter 500, nnz = 6, sigmaSq = 4.083, h2 = 0.478, vare = 3.769, varg = 3.452
##
## iter 600, nnz = 6, sigmaSq = 2.811, h2 = 0.489, vare = 3.989, varg = 3.818
##
## iter 700, nnz = 2, sigmaSq = 2.374, h2 = 0.467, vare = 3.775, varg = 3.313
##
## iter 800, nnz = 8, sigmaSq = 3.129, h2 = 0.485, vare = 3.627, varg = 3.410
##
## iter 900, nnz = 8, sigmaSq = 1.915, h2 = 0.490, vare = 4.257, varg = 4.094
##
## iter 1000, nnz = 6, sigmaSq = 3.063, h2 = 0.448, vare = 4.323, varg = 3.509
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.2782621 0.2839210 0.2125146 0.2253023 7.1020000 4.3112584 0.4962306 3.9160819
## Varg
## 3.8714410
beta_bayesr = colMeans(res_bayesr$beta)
Question: Are BayesR and SBayesR SNP effect estimates the same? What could possibly cause the difference?
cor(beta_bayesr, beta_sbayesr)
## [1] 0.9192422
plot(beta_bayesr, beta_sbayesr)
abline(a=0, b=1)
Answer: The small differences are mostly due to variation in MCMC sampling, known as Monte Carlo variance.
The posterior inclusion probabilities (PIPs) are also expected to be similar between SBayesR and BayesR:
delta_bayesr = (res_bayesr$beta != 0) # indicator variable for each SNP in each MCMC cycle
pip_bayesr = colMeans(delta_bayesr) # frequency of the indicator variable being one across MCMC cycles
delta_sbayesr = (res_sbayesr$beta != 0)
pip_sbayesr = colMeans(delta_sbayesr)
plot(pip_bayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="BayesR")
plot(pip_sbayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesR")
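Formally, the PIP of SNP \(j\) is the posterior probability that its effect is nonzero, estimated by the Monte Carlo average over the \(T\) MCMC samples:
\[\mathrm{PIP}_j = \Pr(\beta_j \neq 0 \mid \text{data}) \approx \frac{1}{T}\sum_{t=1}^{T} I\big(\beta_j^{(t)} \neq 0\big).\]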
Due to extensive LD among SNPs, it’s often difficult to identify causal variants, especially those with smaller effects. For example, SNP 5 only has a PIP of about 0.5. Functional annotations can provide orthogonal information to LD, helping to better identify causal variants. Here, we demonstrate how this works in principle. Suppose both causal variants are located in coding sequences, while the other SNPs are in non-coding regions. We can incorporate this annotation information using SBayesRC.
A simplified version of the SBayesRC algorithm is implemented in sbayesrc.R. It uses the LD correlation matrix rather than the eigen-decomposition data described in the original paper.
source("/data/module5/scr/sbayesrc.R")
## Warning: package 'truncnorm' was built under R version 4.4.1
SBayesRC requires a table of SNP annotations. Suppose SNPs 1, 3, and 5 (of which SNPs 1 and 5 are the causal variants) are non-synonymous variants. We can add a binary annotation for all SNPs indicating whether they are non-synonymous. For illustration purposes, this annotation is designed to be informative, as it covers the two causal variants, but it also includes a null SNP, making the scenario slightly more realistic.
The annotation matrix has one row per SNP and one column per annotation (there can be multiple annotations), plus a first column of ones (an intercept for the generalised linear model that links annotations to SNP effects).
int = rep(1,10) # a vector of one as intercept
nonsynonymous = c(1,0,1,0,1,0,0,0,0,0) # whether the SNP is non-synonymous
anno = cbind(int, nonsynonymous)
print(anno)
## int nonsynonymous
## [1,] 1 1
## [2,] 1 0
## [3,] 1 1
## [4,] 1 0
## [5,] 1 1
## [6,] 1 0
## [7,] 1 0
## [8,] 1 0
## [9,] 1 0
## [10,] 1 0
We are now ready to run SBayesRC using this annotation matrix. Although we provide annotation information, no prior weights are assigned. The method learns the impact of annotations jointly with SNP effects from the data. This is a unified Bayesian approach, unlike stepwise approaches that estimate annotation enrichment before fitting the model.
res_sbayesrc = sbayesrc(b, se, nind, R, anno)
##
## iter 100, nnz = 3, sigmaSq = 2.455, h2 = 0.461, vare = 0.574
##
## iter 200, nnz = 2, sigmaSq = 0.795, h2 = 0.494, vare = 0.456
##
## iter 300, nnz = 2, sigmaSq = 2.727, h2 = 0.505, vare = 0.521
##
## iter 400, nnz = 2, sigmaSq = 1.679, h2 = 0.533, vare = 0.506
##
## iter 500, nnz = 3, sigmaSq = 3.371, h2 = 0.499, vare = 0.520
##
## iter 600, nnz = 2, sigmaSq = 2.286, h2 = 0.460, vare = 0.496
##
## iter 700, nnz = 3, sigmaSq = 0.354, h2 = 0.542, vare = 0.480
##
## iter 800, nnz = 2, sigmaSq = 2.429, h2 = 0.463, vare = 0.528
##
## iter 900, nnz = 3, sigmaSq = 1.789, h2 = 0.540, vare = 0.473
##
## iter 1000, nnz = 3, sigmaSq = 5.327, h2 = 0.518, vare = 0.463
##
## Posterior mean of model parameters:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq
## 0.745410357 0.031739057 0.222850587 0.006225115 2.518000000 2.649073414
## h2 Vare
## 0.499710242 0.499996485
##
## Annotation effects:
## p2 p3 p4
## int -1.739302 20.7690782 -4.7389488
## nonsynonymous 2.412028 0.2955737 -0.2428682
##
## Conditional probabilities:
## p2 p3 p4
## int 0.0410 1 0
## nonsynonymous 0.7494 1 0
##
## Joint probabilities:
## Pi1 Pi2 Pi3 Pi4
## int 0.9590 0 0.0410 0
## nonsynonymous 0.2506 0 0.7494 0
Let’s have a look at the output. First, you may have noticed that Nnz (the number of nonzero effects) has substantially decreased compared to (S)BayesR above and is now close to the number of simulated causal variants. Second, the largest annotation effect is observed in the cell of p2 and nonsynonymous. This means the annotation has greatly increased the prior probability of a nonzero effect for the annotated SNPs. This can also be seen from the Conditional probabilities result, where the cell of p2 and nonsynonymous (about 0.75, versus 0.04 for the intercept alone) indicates a high probability that an annotated SNP moves from the zero-effect distribution to a nonzero-effect distribution. From the Joint probabilities result, we can see that the annotated SNPs most likely belong to the medium-effect-size distribution (Pi3).
Question: Putting all these results together, what would you conclude?
Answer: These results suggest that the annotation is highly informative. It plays a crucial role in distinguishing causal variants from non-causal ones, significantly enhancing the model’s ability to identify true signals.
To evaluate this, let’s check the PIP of SNPs from SBayesRC.
delta_sbayesrc = (res_sbayesrc$beta != 0) # indicator variable for each SNP in each MCMC cycle
pip_sbayesrc = colMeans(delta_sbayesrc) # frequency of the indicator variable being one across MCMC cycles
plot(pip_sbayesrc, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesRC")
Question: How would you interpret the result?
Answer: Both SNP 1 and SNP 5 (the true causal variants) stand out with high PIPs, indicating a strong posterior probability of being causal, partly due to their functional annotations. Notably, SNP 5 did not show strong association evidence in the (S)BayesR analyses above, which do not incorporate annotations. Although SNP 3 shares the same annotation, it does not have a high PIP, because the posterior probability reflects a combination of the likelihood from the GWAS data and the prior information from the functional annotation.
Let’s also check how incorporating annotations affects polygenic prediction.
beta_sbayesrc = colMeans(res_sbayesrc$beta)
ghat_sbayesrc=Xval %*% beta_sbayesrc
# prediction accuracy
print(summary(lm(yval ~ ghat_sbayesrc))$r.squared)
## [1] 0.5189804
# prediction bias
print(summary(lm(yval ~ ghat_sbayesrc))$coeff)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7227881 0.5388751 1.341291 1.902368e-01
## ghat_sbayesrc 1.0796755 0.1930191 5.593621 4.882015e-06
A tutorial for using GCTB to run SBayesR can be found at https://cnsgenomics.com/software/gctb/#SBayesRCTutorial.
In our previous papers, we used a shrunk LD matrix for SBayesR. Here, we will run SBayesR and SBayesRC with an eigen-decomposition of the LD matrix (the low-rank model). This method was published recently in Nature Genetics (Zheng et al. 2024) and has more robust performance.
GCTB also uses a .ma file as input for the GWAS summary data.
Since the computing time and memory consumption of SBayesR are independent of the GWAS sample size and only summary-level data are required, we are able to run SBayesR with a much larger GWAS data set. Find the GWAS summary data file generated from the simulated trait using a larger set of UK Biobank samples (n = 340,000):
ls /data/module5/ukb/ukb_matched.ma
We will partition the whole genome into approximately independent LD blocks, each at least 4 cM long. The LD block information is given in the file ‘/data/module5/ukb/ref4cM_v37.pos’. Use less to take a look at the file. We will compute the LD correlation matrix for each block. This process can be parallelised with multiple threads.
## SBayesR-eigen
bfile="/data/module5/ukb/gwas"
blockinfo="/data/module5/ukb/ref4cM_v37.pos"
gctb --bfile $bfile \
--make-block-ldm \
--block-info $blockinfo \
--thread 4 \
--out ldm
After running this command, it will generate a folder called ‘ldm’, containing the binary data for the LD matrices and two text files, ‘ldm.info’ and ‘snp.info’, which store the LD block and SNP information. Feel free to have a look at these files.
The next step is to perform eigen decomposition on the generated LD matrices, required by the low-rank model. The command below will do this job.
gctb --ldm ldm --make-ldm-eigen --thread 4 --out ldm
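Conceptually, the low-rank model replaces each block’s LD correlation matrix with a truncated eigen-decomposition that keeps only the leading eigenvalues and eigenvectors (a sketch of the idea; see Zheng et al. 2024 for details):
\[R_b \approx U_b \Lambda_b U_b',\]
where \(\Lambda_b\) holds the top eigenvalues of block \(b\), those explaining most of the variation in LD. Working with these low-rank factors reduces both storage and the per-iteration cost of MCMC, and discarding the small, noisy eigenvalues also makes the model more robust.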
Next, we match the GWAS summary data to the LD reference and impute summary statistics for SNPs that are in the LD reference but missing from the GWAS:
gwas="/data/module5/ukb/ukb_matched.ma"
gctb --ldm-eigen ldm --gwas-summary $gwas --impute-summary --thread 4 --out ukb_matched
It will generate a new file ‘ukb_matched.imputed.ma’ which will be used for the SBayesR analysis in the next step.
Putting the LD eigen decomposition data and GWAS summary data together, we are ready to run SBayesR with the low-rank model.
gctb --ldm-eigen ldm \
--gwas-summary ukb_matched.imputed.ma \
--sbayes R \
--thread 4 \
--out sbayesr
Result file sbayesr.parRes shows the posterior estimates of the model parameters. Result file sbayesr.snpRes shows the estimates of the joint SNP effects.
You can run polygenic prediction using the SNP effect estimates from SBayesR. First, copy /data/module5/sim/sbayesr.snpRes, the result from a genome-wide analysis, to your current folder. You are welcome to try running the genome-wide analysis after the class when the server is not busy. Then, run the commands below (in the plink --score command, columns 2, 5 and 8 of the snpRes file hold the SNP name, the effect allele and the joint effect estimate, respectively):
target="/data/module5/ukb/target"
snpRes="sbayesr.snpRes"
plink --bfile $target --score $snpRes 2 5 8 header sum center --out target.sbayesr
phenFile="/data/module5/ukb/target.phen"
covFile="/data/module5/ukb/target.cov"
prsFile="target.sbayesr.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question: Is the SNP-based heritability estimate what we expect? Is the prediction accuracy better than C+PT and SBLUP?
cat sbayesr.parRes
## Parameter Mean SD
## NumSnp1 269822.343750 338.656158
## NumSnp2 2413.254639 282.314545
## NumSnp3 232.940002 44.786907
## NumSnp4 31.829992 17.337288
## NumSnp5 90.369995 9.907162
## Vg1 0.000000 0.000000
## Vg2 0.024740 0.002852
## Vg3 0.021521 0.004173
## Vg4 0.034450 0.019483
## Vg5 0.919289 0.018361
## SigmaSq 0.005169 0.000193
## hsq 0.510662 0.002460
## ResVar 1.006251 0.002156
NumSnp1-NumSnp5 are the numbers of SNPs in mixture components 1 to 5 (component 1: zero effect; components 2 to 5: effects with increasing variances, by default \(0.001 \times \sigma^2_\beta\), \(0.01 \times \sigma^2_\beta\), \(0.1 \times \sigma^2_\beta\) and \(1 \times \sigma^2_\beta\), respectively). hsq is the SNP-based heritability estimate. Vg1-Vg5 are the proportions of the SNP-based heritability explained by each mixture component.
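For reference, the implied prior on each (standardised) SNP effect is a five-component normal mixture; assuming the default \(\gamma\) values noted above, it can be written as
\[\beta_j \sim \pi_1\,\delta_0 + \sum_{k=2}^{5}\pi_k\,N(0,\ \gamma_k\sigma^2_\beta), \qquad \gamma = (0,\ 0.001,\ 0.01,\ 0.1,\ 1),\]
where \(\delta_0\) is a point mass at zero.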
Our simulated data do not include functional annotation information. Here, we will use data from UKB height and 96 functional annotations from the S-LDSC model to demonstrate how to run SBayesRC using GCTB. The command is similar to running SBayesR: just replace --sbayes R with --sbayes RC and add the annotation file with --annot [filename]. Note that you can also run SBayesRC using an R package that we have developed: https://github.com/zhilizheng/SBayesRC.
There are 7.4M SNPs in the summary statistics for UKB height. The first step is to match the summary statistics to the LD reference and impute the SNPs that are missing from the GWAS. We will use the LD reference data generated above.
The default MCMC chain length is 3000 iterations, with the first 1000 as burn-in. Here, in the interest of time, we set a shorter chain length because the model incorporating annotations takes longer to run. In your own real-trait analyses, it is recommended to use the default setting or an even longer chain.
gwas="/data/module5/ukb/HT.ma"
annot="/data/module5/ukb/annotations.txt"
## Match and impute SNPs
gctb --ldm-eigen ldm --gwas-summary $gwas --impute-summary --thread 4 --out HT
## Run SBayesRC
gctb --ldm-eigen ldm \
--gwas-summary HT.imputed.ma \
--annot $annot \
--sbayes RC \
--thread 4 \
--chain-length 1000 \
--burn-in 200 \
--out HT_sbayesrc
Besides the standard output files you have seen with SBayesR, it will also generate the files HT_sbayesrc.parSetRes and HT_sbayesrc.enrich, which show the results for the parameter sets related to the functional annotations.
For example, you can check the SNP-based heritability enrichment across functional annotations in R code:
enrich = read.table("HT_sbayesrc.enrich", header=T)
# plot the per-SNP heritability enrichment for all annotations
library(ggplot2)
enrich$Annotation = factor(enrich$Annotation, levels=enrich$Annotation[order(enrich$Mean, decreasing = T)])
p1 = ggplot(enrich, aes(x=Annotation, y=Mean, fill=Annotation)) + geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD, color=Annotation)) + geom_hline(yintercept=1, linetype="dashed") +
xlab("Functional annotation") + ylab("Per-SNP heritability enrichment") +
guides(fill="none", color="none") +
theme(axis.text = element_text(size=5, angle = 90, vjust = 0.5, hjust=1))
ggsave("height_h2_enrich_all.pdf", p1, width=12, height=6)
p1
Highlight some important functional annotations:
# highlight some annotations
selected = c("Ancient_Sequence_Age_Human_Promoter", "Human_Promoter_Villar_ExAC", "Intron_UCSC", "Coding_UCSC", "Conserved_LindbladToh", "TSS_Hoffman", "UTR_5_UCSC", "Enhancer_Andersson", "UTR_3_UCSC", "Transcr_Hoffman", "Repressed_Hoffman")
p2 = ggplot(subset(enrich, Annotation %in% selected), aes(x=Annotation, y=Mean, fill=Annotation)) + geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD, color=Annotation)) + geom_hline(yintercept=1, linetype="dashed") +
xlab("Functional annotation") + ylab("Per-SNP heritability enrichment") +
guides(fill="none", color="none") +
theme(axis.text = element_text(angle = 90, vjust = 0.5, hjust=1))
ggsave("height_h2_enrich_highlight.pdf", p2, width=12, height=6)
p2
With the SBayesRC results, we can gain some insight into the functional architecture of the trait:
library(stringr)
library(dplyr)
library(tidyr)
# Read results
res = read.table("HT_sbayesrc.parSetRes", header=T)
names(res) = c("Par","Anno","Est","SE","PP") # the columns are parameter, annotation, estimate, standard error, and posterior probability of greater than zero
# Extract Pi parameter estimates
pi = res[str_detect(res$Par, "AnnoJointProb"), ]
pi = pi %>% separate(Par, into = c("Pi", "Component"), sep = "_pi")
# Identify the `Est` value for the `Intercept` row within each component
intercept_values <- pi %>%
filter(Anno == "Intercept") %>%
select(Component, Est) %>%
rename(Intercept_Est = Est)
# Merge these intercept values back to the original data frame
pi_with_intercept <- pi %>% left_join(intercept_values, by = "Component")
# Subtract the intercept value from the `Est` values
pi_with_intercept <- pi_with_intercept %>% mutate(Adjusted_Est = Est - Intercept_Est)
# Create a new data frame with the adjusted values
adjusted_pi <- pi_with_intercept %>% select(Component, Anno, Adjusted_Est, SE, PP)
# plot the result focusing on the selected annotation and omit component 5 because there is no signal in there
facet_labels <- c("1" = "Zero", "2" = "Small effect", "3" = "Moderate effect", "4" = "Large effect")
p3 = ggplot(subset(adjusted_pi, Anno %in% selected & Component != 5), aes(x=Anno, y=Adjusted_Est, fill=Anno)) + geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin=Adjusted_Est-SE, ymax=Adjusted_Est+SE, color=Anno)) + geom_hline(yintercept=0, linetype="dashed") +
facet_grid(.~Component, scales="free", labeller = labeller(Component = facet_labels)) +
xlab("Functional annotation") + ylab("Deviation of Pi to average") +
coord_flip() +
guides(fill="none", color="none") +
theme(axis.text = element_text(angle = 0, vjust = 0.5, hjust=1))
ggsave("height_pi_enrich_highlight.pdf", p3, width=12, height=6)
p3
Despite large standard errors, evolutionarily conserved regions, coding sequences, promoters and transcription start sites (TSS) tend to be enriched for causal variants with large effect sizes, whereas repressed regions and introns tend to be depleted of causal variants. Such information learned from the data will, in turn, help to better estimate the SNP effects and therefore improve prediction. Since we cannot provide a validation set with real genotypes and phenotypes, we are not doing prediction for the UKB height data in this practical exercise.