1 Objectives

In this practical you will learn how to perform Best Linear Unbiased Prediction (BLUP). BLUP is a method to simultaneously estimate the effects of all SNPs by treating them as random effects. In contrast to ordinary least squares (OLS), a standard method used in GWAS, BLUP does not require SNP selection/filtering and imposes shrinkage that ensures unbiasedness (\(E(u \mid \hat{u}) = \hat{u}\)). We will use a small toy data set to illustrate the principle of the methodology and compute two equivalent BLUP models by hand in R. Then, we will use the GCTA software to run SNP-BLUP on the larger simulated data set from the UK Biobank.

Note that the small data set can be freely downloaded to your own laptop if you want. However, the UKB data set must not be downloaded, because it contains real individual genotypes from the UK Biobank (although the individual IDs have been masked).

2 Data

2.1 Small data set

This toy example data set consists of a training population (aka GWAS population, discovery population, or reference population) of 325 individuals. Each of these individuals has been genotyped for 10 SNPs. The trait was simulated such that the first SNP has an effect size of 2 and the 5th SNP has an effect size of 1 on the phenotype. The trait heritability is 0.5. The validation population (aka target population) is a hold-out sample of 31 individuals.

Path to Data set 1:

ls /data/module5/toy/

2.2 Large data set

This is a larger data set based on real genotypes at 273,604 SNPs from the UK Biobank. We simulated a trait with 100 causal variants and a heritability of 0.5. The entire data set was partitioned into three independent populations:

  • Discovery population for conducting GWAS (n = 3,000).
  • Testing population for finding the best prediction model (n = 500).
  • Target population to be predicted (n = 1,000).

Path to Data set 2:

ls /data/module5/ukb/

3 Illustration of principle

Let’s start with reading in the data by R.

# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))

# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))
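
As a quick sanity check, you can confirm that these objects have the dimensions expected from the data description above (325 training and 31 validation individuals, each genotyped for 10 SNPs):

dim(X)     # expect 325 x 10
dim(y)     # expect 325 x 1
dim(Xval)  # expect 31 x 10
dim(yval)  # expect 31 x 1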

3.1 SNP-BLUP

In SNP-BLUP, the model is \(y = 1_n\mu + X\beta + e\), where the SNP effects \(\beta\) are treated as random with \(\beta \sim \mathcal{N}(0, I\sigma_\beta^2)\). We predict the effects of the 10 SNPs in the reference population by solving the mixed model equations:

\[\begin{equation} \begin{bmatrix} 1_n'1_n & 1_n'X \\ \\ X'1_n & X'X+I\lambda \end{bmatrix} \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \begin{bmatrix} 1_n'y \\ X'y \end{bmatrix} \end{equation}\]

where \(1_n\) is a vector of ones \((325 \times 1)\), \(X\) is a design matrix allocating SNP genotypes to records, \(\mu\) is the overall mean, and \(\lambda = \sigma_e^2/\sigma_\beta^2\) is the shrinkage parameter. We will use R to solve these equations.

First, we specify the value of lambda as \((1-h^2)/(h^2/m) = 10\), where \(m=10\) is the number of SNPs and the per-SNP effect variance is \(\sigma_\beta^2 = \sigma_g^2/m\). Then, we build the mixed model equations.
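
As a small worked example of where this value comes from (using the simulation parameters given above):

h2 <- 0.5             # trait heritability used in the simulation
m  <- 10              # number of SNPs
(1 - h2) / (h2 / m)   # lambda = sigma_e^2 / sigma_beta^2 = 10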

nind <- nrow(X)     # number of training individuals
nsnp <- ncol(X)     # number of SNPs
lambda   <- 10      # value for lambda

ones <- array(1, c(nind)) # a vector of ones
I <- diag(nsnp)   # an identity matrix with dimension equal to the number of SNPs

# build coefficient matrix by blocks
coeff <- array(0, c(nsnp + 1, nsnp + 1))
coeff[1:1, 1:1] <- t(ones) %*% ones
coeff[1:1, 2:(nsnp+1)] <- t(ones) %*% X
coeff[2:(nsnp+1), 1] <- t(X) %*% ones
coeff[2:(nsnp+1), 2:(nsnp+1)] <- t(X) %*% X + I * lambda

# build the right hand side
rhs = rbind(t(ones) %*% y, t(X) %*% y)

# get BLUP solution
solution_vec <- solve(coeff, rhs)




Question 1: Which SNP has the largest effect? Which SNP has the second largest effect? Is it consistent with the simulation setting? (hint: what’s the LD correlation between SNP 5 and 6?)

print(solution_vec[-1])  # the first element is the intercept estimate
##  [1]  1.90752815 -0.14131984 -0.14131984  0.16483065  0.22968939  0.27095334
##  [7]  0.09624804 -0.16483065 -0.07477323 -0.16483065

Take a moment to think about it before continuing.




The 1st SNP is a causal variant with a large effect (true effect size = 2). The BLUP estimate of its effect is 1.9. Nearly spot on! The 5th SNP is the other causal variant, with a smaller effect (true effect size = 1). However, the second largest BLUP estimate is at the 6th SNP (BLUP estimate = 0.27) and the third largest is at the 5th SNP (BLUP estimate = 0.23). This is because SNPs 5 and 6 are in high LD (LD correlation 0.9) and SNP 6 happens to have the larger effect estimate due to sampling. Apart from SNPs 1 and 5, the other SNPs are null SNPs, yet their BLUP estimates are not zero. This is because BLUP assumes that all SNPs contribute to the trait with an equal effect variance. As a result, the effect of a causal variant is smeared over the SNPs in LD with it.
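
You can verify the LD yourself by computing the genotype correlation between SNPs 5 and 6 in the training data (assuming the columns of X are in the same order as the SNP numbering used above):

cor(X[, 5], X[, 6])   # LD correlation between SNP 5 and SNP 6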




Question 2: The genotypes for the validation individuals are stored in Xval. Can you write a small R script to calculate their polygenic scores?




PGS_BLUP = Xval %*% solution_vec[-1]  # the first element is the intercept estimate




Question 3: What is the prediction accuracy (squared correlation between validation phenotypes and PGS) and bias (regression slope of validation phenotype on PGS)?




# prediction accuracy
print(cor(PGS_BLUP, yval)^2)
##             V1
## [1,] 0.5450557
# prediction bias
yval = yval - mean(yval)
PGS_BLUP = PGS_BLUP - mean(PGS_BLUP)

plot(PGS_BLUP, yval, xlab="PGS using BLUP", ylab="Phenotype")
abline(a=0, b=1)  # add y=x line
abline(lm(yval ~  PGS_BLUP), col="red")  # add regression line

print(lm(yval ~ PGS_BLUP))
## 
## Call:
## lm(formula = yval ~ PGS_BLUP)
## 
## Coefficients:
## (Intercept)     PGS_BLUP  
##  -2.361e-17    1.124e+00




3.2 GBLUP

We will analyse the same data using the GBLUP (BLUP based on the genomic relationship matrix) approach. The mixed model is

\[y = 1_n\mu + Zg +e\]

where terms are as above, and \(Z\) is a design matrix allocating records to individuals, and \(g\) is a vector of (genomic) breeding values. The \(g\) are random effects, assumed to be distributed \(\mathcal{N}(0, G\sigma_g^2)\), where \(G\) is the genomic relationship matrix (GRM).

The solutions to the mixed model equations are \[\begin{bmatrix} \hat{\mu} \\ \hat{g} \end{bmatrix} = \begin{bmatrix} 1_n'1_n & 1_n'Z \\ \\ Z'1_n & Z'Z+G^{-1}\dfrac{\sigma_e^2}{\sigma_g^2} \end{bmatrix}^{-1} \begin{bmatrix} 1_n'y \\ Z'y \end{bmatrix} \]

The first step in building the mixed model equations is to build \(G\).

\(G\) is constructed as \[G = W W' /m\] where \(m\) is the number of SNPs, \(w_{ij} = (x_{ij} - 2p_j)/\sqrt{2p_j(1-p_j)}\) and \(p_j\) is the allele frequency of the 2nd allele for SNP \(j\). Remember that in GBLUP, \(G\) (and \(Z\)) have to include all individuals, including the individuals with no phenotypes of their own that we wish to predict. So the genotype matrix used to build \(G\) has to include all individuals, including the validation individuals.

You can create this combined genotype matrix from the X and Xval matrices read in earlier using rbind (the join-by-rows command in R):

Xall = rbind(X, Xval)
nval = nrow(Xval)
ntotal = nrow(Xall)  # the total number of individuals

We next calculate GRM and its inverse.

# calculate allele frequency
p = colMeans(Xall)/2

# obtain standardised genotypes
W = do.call("cbind",lapply(1:nsnp,function(j) (Xall[,j]-2*p[j])/sqrt(2*p[j]*(1-p[j]))  ))

# compute GRM
G = W%*%t(W)/nsnp
# The next line adds a small amount to the diagonal of G, 
# otherwise G is not invertible in this small example!
G <- G + diag(ntotal)*0.01   

# compute the inverse of GRM
Ginv <- solve(G)
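
It can be useful to check that the GRM looks sensible: the diagonal elements should average close to 1 and the off-diagonals close to 0 for nominally unrelated individuals. A quick check on the G built above:

mean(diag(G))           # should be close to 1 (slightly above, because of the 0.01 added to the diagonal)
mean(G[upper.tri(G)])   # off-diagonal mean, should be close to 0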

The only other matrix we do not have is \(Z\), which has dimensions nind \(\times\) ntotal. \(Z\) is an identity block for the training individuals, and a block of zeros for the validation individuals to be predicted (new individuals without phenotypes of their own).

\(Z\) can be constructed as

Z1 <- diag(nind)
Z2 <- matrix(0, nind, nval)
Z <- cbind(Z1, Z2)

Now go ahead and build the mixed model equations above. For \(\dfrac{\sigma_e^2}{\sigma_g^2}\), use a value of \(1\) (since \(h^2 = 0.5\), \(\sigma_e^2 = \sigma_g^2\)).

# coeff
coeff <- array(0, c(ntotal + 1, ntotal + 1))
coeff[1:1, 1:1] <- t(ones) %*% ones
coeff[1:1, 2:(ntotal+1)] <- t(ones) %*% Z
coeff[2:(ntotal+1), 1] <- t(Z) %*% ones
coeff[2:(ntotal+1), 2:(ntotal+1)] <- t(Z) %*% Z + Ginv

# right hand side
rhs = rbind(t(ones) %*% y, t(Z) %*% y)

# BLUP solutions: the intercept followed by the breeding value of each individual
gblup <- solve(coeff, rhs)




Question 4: What is the prediction accuracy and bias for the \(31\) selection candidates from the GBLUP, e.g. \(r^2(\hat{g},\text{yval})\)? How does this compare with the result of SNP-BLUP?




# the genomic prediction for the 31 selection candidates:
# drop the intercept and the 325 training individuals (the first 326 elements)
PGS_GBLUP = gblup[-c(1:326)]

# the accuracy is
cor(yval, PGS_GBLUP)^2
##         [,1]
## V1 0.5423429
# prediction bias
print(lm(yval ~ PGS_GBLUP))
## 
## Call:
## lm(formula = yval ~ PGS_GBLUP)
## 
## Coefficients:
## (Intercept)    PGS_GBLUP  
##      0.1807       1.1027




In conclusion, SNP-BLUP and GBLUP are two equivalent models. The small numerical differences between the SNP-BLUP and GBLUP results here arise because SNP-BLUP used the raw genotype matrix whereas GBLUP used the standardised genotype matrix. Given the correct variance components, the BLUP estimates are, in theory, unbiased and give minimum prediction error variance. In practice, BLUP is often superior to C+PT, but the outcome depends on the genetic architecture of the trait. In the next session, we will explore how to use GCTA to perform BLUP on a data set of realistic size.
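
The equivalence can also be seen by back-solving SNP effects from the GBLUP breeding values, which is conceptually what the --blup-snp step in GCTA does below. Here is a minimal sketch using the objects created above; note that these back-solved effects are on the standardised-genotype scale, so they are not numerically identical to the raw-genotype SNP-BLUP estimates:

ghat    <- gblup[-1]                       # breeding values (drop the intercept)
beta_sd <- t(W) %*% Ginv %*% ghat / nsnp   # back-solved SNP effects on the standardised scale
W_val   <- W[(nind+1):ntotal, ]            # standardised genotypes of the validation individuals
PGS_bs  <- W_val %*% beta_sd               # polygenic scores from the back-solved SNP effects
cor(PGS_bs, PGS_GBLUP)                     # should be very close to 1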

4 UKB data analysis using GCTA

You can use GCTA (https://yanglab.westlake.edu.cn/software/gcta/#BLUP) to run BLUP, which is more computationally efficient than R. We demonstrate GCTA-BLUP using the simulated data set from the UK Biobank.

Feel free to check the result file from each step!

4.1 Step 1: build the genomic relationship matrix (GRM).

## code to generate GRM
bfile="/data/module5/ukb/gwas"
gcta64 --bfile $bfile --make-grm --threads 4 --out gwas
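
As suggested above, it is worth inspecting the result files from each step. If you want to peek at the GRM in R, here is a minimal sketch assuming the standard GCTA binary GRM format (single-precision values storing the lower triangle of the GRM, including the diagonal, in row-wise order):

ids <- read.table("gwas.grm.id")   # family and individual IDs of the genotyped individuals
n   <- nrow(ids)                   # number of individuals in the GRM
grm <- readBin("gwas.grm.bin", what = numeric(), n = n * (n + 1) / 2, size = 4)
mean(grm[cumsum(1:n)])             # mean of the diagonal elements, should be close to 1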

4.2 Step 2: run GBLUP

grm="gwas"
pheno="/data/module5/ukb/gwas.phen"
covar="/data/module5/ukb/gwas.cov"
gcta64 --reml --grm $grm --pheno $pheno --reml-pred-rand --qcovar $covar --out gwas

4.3 Step 3: obtain BLUP solutions for SNP effects

bfile="/data/module5/ukb/gwas"
indi_blp="gwas.indi.blp"
gcta64 --bfile $bfile --blup-snp $indi_blp --out gwas

4.4 Step 4: prediction

To compute the polygenic scores in the target population (note that no testing population is needed here, unlike in C+PT, where one is required to find the optimal P-value threshold):

target="/data/module5/ukb/target"
plink --bfile $target --score gwas.snp.blp sum 1 2 3 --out target.snp.blp

Use R to evaluate the prediction accuracy

phenFile="/data/module5/ukb/target.phen" 
covFile="/data/module5/ukb/target.cov"
prsFile="target.snp.blp.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
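
If you would like to see roughly what such an evaluation involves, below is a minimal sketch of computing the prediction \(R^2\) directly in R. It is not necessarily identical to what get_pred_r2.R does (the provided script presumably also adjusts for the covariates in target.cov, which this sketch skips), and it assumes the .phen file has columns FID, IID and phenotype, and that the PLINK profile file contains IID and SCORESUM columns:

phen <- read.table("/data/module5/ukb/target.phen")   # assumed columns: FID, IID, phenotype
prs  <- read.table("target.snp.blp.profile", header = TRUE)

# match individuals by ID and compute the squared correlation
dat <- merge(phen, prs, by.x = "V2", by.y = "IID")
cor(dat$V3, dat$SCORESUM)^2   # prediction R^2 (without covariate adjustment)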

Question 5: Is the prediction accuracy from BLUP higher than that from C+PT? Why or why not? (hint: what’s the assumption for the SNP effect distribution in BLUP? Is the genetic model in the simulation consistent with this assumption?)