1 Objectives

In this practical, we will learn how to perform Best Linear Unbiased Prediction (BLUP). BLUP is a method that estimates the effects of all SNPs simultaneously by treating them as random effects. Unlike ordinary least squares (OLS), the standard method used in GWAS, BLUP does not require SNP selection or filtering. It applies shrinkage to ensure unbiasedness of the estimates, in the sense that \(E(u \mid \hat{u}) = \hat{u}\).

We will use a small toy data set to illustrate the basic principles and manually compute BLUP estimates in R. This data set is available to download. You can also use R on our server to do this practical.

2 Data

This toy data set includes:

  • A training population (discovery GWAS) of 325 individuals, genotyped for 10 SNPs.
  • A validation population of 31 individuals (hold-out sample).

The trait was simulated such that SNP 1 has an effect size of 2 and SNP 5 has an effect size of 1. The trait heritability is set at 0.5.
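For intuition, here is a minimal sketch of how such a trait could be simulated (an illustration only, not the actual script used to generate the data; it assumes the training genotype matrix X loaded in Section 3):

h2 <- 0.5                                     # target heritability
beta_true <- c(2, 0, 0, 0, 1, 0, 0, 0, 0, 0)  # SNPs 1 and 5 are the causal variants
g  <- c(X %*% beta_true)                      # true genetic values
ve <- var(g) * (1 - h2) / h2                  # residual variance that gives h2 = 0.5
y_sim <- g + rnorm(length(g), mean = 0, sd = sqrt(ve))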

Data location:

ls /data/module5/toy/

3 Estimate SNP effects using BLUP

Let’s start by loading the data in R.

# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))

# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))

BLUP estimates SNP effects using the following system of equations:

\[\begin{equation} \begin{bmatrix} 1_n'1_n & 1_n'X \\ X'1_n & X'X+I\lambda \end{bmatrix} \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \begin{bmatrix} 1_n'y \\ X'y \end{bmatrix} \end{equation}\]

where \(1_n\) is a vector of ones \((325 \times 1)\), \(X\) is the genotype matrix \((325 \times 10)\), \(I\) is an identity matrix, \(\lambda = \frac{1 - h^2}{h^2/m}\) is the shrinkage parameter, \(\mu\) is the overall mean, and \(\beta\) is the vector of SNP effects. In the literature, this system of equations is known as the mixed model equations (MME), and the large matrix on the left-hand side is called the coefficient matrix.

We will use R to build the components and solve these equations.
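Before building the equations, it may help to confirm that the data dimensions match the definitions above (an optional sanity check):

dim(X)     # 325 x 10: training genotypes
dim(y)     # 325 x 1:  training phenotypes
dim(Xval)  # 31 x 10:  validation genotypes
dim(yval)  # 31 x 1:   validation phenotypes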

First, we need to know the value of \(\lambda\).

Question: What’s the interpretation of \(\lambda\)? Can you write code to compute it?


Take a moment to think about it before continuing.




Answer: \(\lambda\) is the ratio of the residual variance to the random-effect variance, which controls the degree to which the random effect estimates are shrunk toward zero. The higher the \(\lambda\), the more uncertain we are about the SNP effects, so the more strongly we shrink them toward zero. It can be computed as \((1-h^2)/(h^2/m) = 10\), where \(m=10\) is the number of SNPs.

h2 = 0.5           # trait heritability
nind = nrow(X)     # number of training individuals
nsnp = ncol(X)     # number of SNPs
lambda = (1-h2)/(h2/nsnp)
lambda
## [1] 10

We already have the matrix X and the vector y. We still need a vector of ones and an identity matrix with dimension equal to the number of SNPs.

Question: Can you write code to create the vector of ones and the identity matrix?


Take a moment to think about it before continuing.




Answer:

ones = array(1, nind) # a vector of ones (one element per training individual)
I = diag(nsnp)  # an identity matrix of dimension nsnp x nsnp


Now we are ready to build the coefficient matrix.

Question: Can you write code to initialise a placeholder for the coefficient matrix? What’s the size of the coefficient matrix?


Take a moment to think about it before continuing.




Answer:

coeff <- array(0, c(nsnp + 1, nsnp + 1))  # (nsnp + 1) x (nsnp + 1): one row/column for the mean plus one per SNP


We can fill in the matrix by blocks. For example,

# build coefficient matrix by blocks
coeff[1, 1] <- t(ones) %*% ones
coeff[1, 2:(nsnp+1)] <- t(ones) %*% X

Question: Can you build the other blocks of the coefficient matrix?


Take a moment to think about it before continuing.




Answer:

coeff[2:(nsnp+1), 1] <- t(X) %*% ones
coeff[2:(nsnp+1), 2:(nsnp+1)] <- t(X) %*% X + I * lambda
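As an optional sanity check, the coefficient matrix should be symmetric, and its top-left element \(1_n'1_n\) is simply the number of training individuals:

all.equal(coeff, t(coeff))  # TRUE: the MME coefficient matrix is symmetric
coeff[1, 1] == nind         # TRUE: 1n'1n equals the sample size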

Question: Can you build the right hand side of the MME?


Take a moment to think about it before continuing.




Answer:

# build the right hand side
rhs = rbind(t(ones) %*% y, t(X) %*% y)

The solutions can be obtained easily using R’s built-in function solve:

# get BLUP solution
solution_vec <- solve(coeff, rhs)
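For readability in the later steps, you can split the solution vector into the intercept and the SNP effects (the names mu_hat and beta_hat below are just illustrative):

mu_hat   = solution_vec[1]   # estimate of the overall mean
beta_hat = solution_vec[-1]  # estimates of the 10 SNP effects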


Question: What is the solution for the mean? Which SNP has the largest effect? Which SNP has the second largest effect? Does the result make sense to you? [Hint: What’s the LD correlation between SNPs 5 and 6?]


Take a moment to think about it before continuing.




Answer:

print(solution_vec)
##                V1
##  [1,]  0.64651875
##  [2,]  1.90752815
##  [3,] -0.14131984
##  [4,] -0.14131984
##  [5,]  0.16483065
##  [6,]  0.22968939
##  [7,]  0.27095334
##  [8,]  0.09624804
##  [9,] -0.16483065
## [10,] -0.07477323
## [11,] -0.16483065
cor(X)
##             V1         V2         V3         V4          V5          V6
## V1   1.0000000  0.1338359  0.1338359  0.7284017  0.57898634  0.64984918
## V2   0.1338359  1.0000000  1.0000000  0.3789184  0.25731616  0.36895246
## V3   0.1338359  1.0000000  1.0000000  0.3789184  0.25731616  0.36895246
## V4   0.7284017  0.3789184  0.3789184  1.0000000  0.81429611  0.81303937
## V5   0.5789863  0.2573162  0.2573162  0.8142961  1.00000000  0.90141987
## V6   0.6498492  0.3689525  0.3689525  0.8130394  0.90141987  1.00000000
## V7   0.1023013  0.1726592  0.1726592  0.2053259  0.10824685  0.18231275
## V8  -0.7284017 -0.3789184 -0.3789184 -1.0000000 -0.81429611 -0.81303937
## V9   0.3062689  0.4491011  0.4491011  0.4820692  0.07706692  0.05818721
## V10 -0.7284017 -0.3789184 -0.3789184 -1.0000000 -0.81429611 -0.81303937
##              V7         V8          V9        V10
## V1   0.10230128 -0.7284017  0.30626887 -0.7284017
## V2   0.17265922 -0.3789184  0.44910112 -0.3789184
## V3   0.17265922 -0.3789184  0.44910112 -0.3789184
## V4   0.20532585 -1.0000000  0.48206922 -1.0000000
## V5   0.10824685 -0.8142961  0.07706692 -0.8142961
## V6   0.18231275 -0.8130394  0.05818721 -0.8130394
## V7   1.00000000 -0.2053259  0.07758962 -0.2053259
## V8  -0.20532585  1.0000000 -0.48206922  1.0000000
## V9   0.07758962 -0.4820692  1.00000000 -0.4820692
## V10 -0.20532585  1.0000000 -0.48206922  1.0000000

The 1st SNP is a causal variant with a large effect (true effect size = 2). The BLUP estimate of its effect is 1.9. Nearly spot on! The 5th SNP is the other causal variant, with a smaller effect (true effect size = 1). However, the second largest BLUP estimate is at the 6th SNP (BLUP estimate = 0.27) and the third largest is at the 5th SNP (BLUP estimate = 0.23). This is because SNPs 5 and 6 are in high LD (LD correlation 0.90), and SNP 6 happened to receive the larger effect estimate due to sampling. Apart from SNPs 1 and 5, the remaining SNPs are null, yet their BLUP estimates are not zero. This is because BLUP assumes that all SNPs contribute to the trait with an equal effect variance. As a result, the effect of a causal variant is smeared across the SNPs in LD with it.
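A quick way to verify this in the output (an optional check): SNPs in perfect LD are indistinguishable to the model and receive identical BLUP estimates.

cor(X[, 2], X[, 3])    # SNPs 2 and 3 are in perfect LD (correlation = 1)
solution_vec[c(3, 4)]  # their BLUP estimates (elements 3 and 4) are identical
cor(X[, 5], X[, 6])    # SNPs 5 and 6 are in high but imperfect LD (~0.90)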


4 Predict polygenic scores

Question: The genotypes for the validation individuals are stored in Xval. Can you write R code to calculate their polygenic scores?


Take a moment to think about it before continuing.




Answer: We use all SNPs with their BLUP solutions to predict PGS. No SNP selection or filtering is needed.

PGS_BLUP = Xval %*% solution_vec[-1]  # drop the first element, which is the intercept estimate
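Equivalently, using the beta_hat vector defined earlier, with a quick dimension check:

PGS_BLUP2 = Xval %*% beta_hat         # same scores via the named effect vector
all.equal(c(PGS_BLUP), c(PGS_BLUP2))  # TRUE
dim(Xval)                             # 31 x 10: one row per validation individual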




Question: What is the prediction accuracy (squared correlation between validation phenotypes and PGS) and bias (regression slope of validation phenotype on PGS)?


Take a moment to think about it before continuing.




Answer:

# prediction accuracy
print(cor(PGS_BLUP, yval)^2)
##             V1
## [1,] 0.5450557
# prediction bias: centre the phenotype and PGS so the y = x line through the origin is meaningful
yval = yval - mean(yval)
PGS_BLUP = PGS_BLUP - mean(PGS_BLUP)

plot(PGS_BLUP, yval, xlab="PGS using BLUP", ylab="Phenotype")
abline(a=0, b=1)  # add y=x line
abline(lm(yval ~  PGS_BLUP), col="red")  # add regression line

print(lm(yval ~ PGS_BLUP))
## 
## Call:
## lm(formula = yval ~ PGS_BLUP)
## 
## Coefficients:
## (Intercept)     PGS_BLUP  
##   5.709e-17    1.124e+00

The prediction accuracy (R²) is about 0.55, and the regression slope is about 1.12, reasonably close to the value of 1 expected for an unbiased PGS.


Question: Do these results make sense to you? How do these results compare to those from the basic clumping and thresholding (C+PT) method?




5 A compact code for BLUP

blup = function(X, y, lambda){
  nsnp = ncol(X)
  D = diag(c(0, rep(lambda,nsnp))) # the first element is 0 because we do not shrink the intercept
  W = cbind(1, X)
  coeff = crossprod(W) + D
  rhs = crossprod(W, y)
  return(solve(coeff, rhs))
}

lambda = 10
solution_blup = blup(X, y, lambda)
solution_blup
##              V1
##      0.64651875
## V1   1.90752815
## V2  -0.14131984
## V3  -0.14131984
## V4   0.16483065
## V5   0.22968939
## V6   0.27095334
## V7   0.09624804
## V8  -0.16483065
## V9  -0.07477323
## V10 -0.16483065
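As a final (optional) check, the compact function should reproduce the solution we built element by element in Section 3:

all.equal(as.numeric(solution_blup), as.numeric(solution_vec))  # TRUE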