In this practical you will perform genomic prediction in a small toy example data set using two equivalent BLUP models in R. Then, we will use GCTA software to run SNP-BLUP using the larger simulated data set.
The scripts (source code) used for this practical include
ls /data/module5/scr/get_pred_r2.R
Data set 1 (toy) is a toy example data set consists of a training population (aka GWAS population, discovery population or reference population) of 325 individuals each has been genotyped for 10 SNPs. The trait was simulated in a way that the first SNP has an effect size of 2 and the 5th SNP has an effect size of 1 on the phenotype. The trait heritability is 0.5. There are a set of 31 individuals who are the target of PRS prediction (aka validation population or target population).
Path to Data set 1:
ls /data/module5/toy/
Let us start with reading in the data.
nmarkers <- 10 #number of markers
nrecords <- 325 #number of records
nvalidation <- 31 #number of records for validation
# data for training population
X <- matrix(scan("/data/module5/toy/xmat_trn.inp"), ncol=nmarkers, byrow = TRUE)
y <- matrix(scan("/data/module5/toy/yvec_trn.inp"), byrow = TRUE)
# data for validation population
Xval <- matrix(scan("/data/module5/toy/xmat_val.inp"), ncol=nmarkers, byrow = TRUE)
yval <- matrix(scan("/data/module5/toy/yvec_val.inp"), byrow = TRUE)
In SNP-BLUP, we will need to predict the effects of the 10 SNPs in the reference population, using the equations:
\[\begin{equation} \begin{bmatrix} 1_n'1_n & 1_n'X \\ \\ X'1_n & X'X+I\lambda \end{bmatrix} \begin{bmatrix} \hat{\mu} \\ \hat{\alpha} \end{bmatrix} = \begin{bmatrix} 1_n'y \\ X'y \end{bmatrix} \end{equation}\]
where \(\alpha\) are the SNP effects, \(1_n\) is a vector of ones \((325 \times 1)\), \(X\) is a design matrix allocating SNP genotype to records, \(\mu\) is the overall mean. We will use R to solve these equations.
First, we specify the value of lambda as \((1-h^2)/(h^2/m) = 10\) where \(m=10\) is the number of SNPs. Then, we build the mixed model equations.
lambda <- 10 # value for lambda
ones <- array(1, c(nrecords)) # a vector of ones
I <-diag(nmarkers) # an identity matrix with dimention of m by m
# build coefficient matrix by blocks
coeff <- array(0, c(nmarkers + 1, nmarkers + 1))
coeff[1:1, 1:1] <- t(ones) %*% ones
coeff[1:1, 2:(nmarkers+1)] <- t(ones) %*% X
coeff[2: 2:(nmarkers+1), 1] <- t(X) %*% ones
coeff[2:(nmarkers+1), 2:(nmarkers+1)] <- t(X) %*% X + I * lambda
# build the right hand side
rhs = rbind(t(ones) %*% y, t(X) %*% y)
# get BLUP solution
solution_vec <- solve(coeff, rhs)
Question 1: Which SNP has the largest effect? Which SNP has the second largest effect? Is it consistent with the simulation setting? (hint: what’s the LD correlation between SNP 2 and 3?)
print(solution_vec)
## [,1]
## [1,] 0.64651875
## [2,] 1.90752815
## [3,] -0.14131984
## [4,] -0.14131984
## [5,] 0.16483065
## [6,] 0.22968939
## [7,] 0.27095334
## [8,] 0.09624804
## [9,] -0.16483065
## [10,] -0.07477323
## [11,] -0.16483065
Question 2: The genotypes for the selection
candidates are stored in Xval
. Can you write a small R
script to calculate the GEBV?
GEBV = Xval %*% solution_vec[-1] # the first element is the intercept estimate
Question 3: Fours years later, all the selection
candidates receive a phenotypic record from a progeny test. The results
are in yval
. What is the squared correlation between your
GEBV and the progeny test result?
cor(GEBV, yval)**2
## [,1]
## [1,] 0.5450557
We will analyse the same data using the GBLUP (BLUP based on the genomic relationship matrix) approach. The mixed model is
\[y = 1_n\mu + Zg +e\]
where terms are as above, and \(Z\) is a design matrix allocating records to individuals, and \(g\) is a vector of (genomic) breeding values. The \(g\) are random effects, assumed to be distributed \(\mathcal{N}(0, G\sigma_g^2)\), where \(G\) is the genomic relationship matrix (GRM).
The solutions to the mixed model equations are \[\begin{bmatrix} \hat{\mu} \\ \hat{g} \end{bmatrix} = \begin{bmatrix} 1_n'1_n & 1_n'Z \\ \\ Z'1_n & Z'Z+G^{-1}\dfrac{\sigma_e^2}{\sigma_g^2} \end{bmatrix}^{-1} \begin{bmatrix} 1_n'y \\ Z'y \end{bmatrix} \]
The first step in building the mixed model equations is to build \(G\).
\(G\) is constructed as \[G = W W' /m\] where \(m\) is the number of markers, \(w_{ij} = (x_{ij} - 2p_j)/\sqrt{2p_j(1-p_j)}\) and \(p_j\) is the allele frequency of the 2nd allele for SNP \(j\). Remember in the GBLUP, \(G\) (and \(Z\)) have to include all individuals, including the individuals with no phenotypes of their own that we wish to predict. So the \(X\) matrix has to include all individuals, including the progeny.
You can create this new \(X\) from the original X matrices in the first practical using the rbind (join by row command in R):
Xall <- rbind(X, Xval)
nanims = nrow(Xall)
We next calculate GRM and its inverse.
# calculate allele frequency
p = colMeans(Xall)/2
# obtain standardised genotypes
W = do.call("cbind",lapply(1:nmarkers,function(j) (Xall[,j]-2*p[j])/sqrt(2*p[j]*(1-p[j])) ))
# compute GRM
G = W%*%t(W)/nmarkers
# The next line adds a small amount to the diagonal of G,
# otherwise G is not invertable in this small example!
G <- G + diag(nanims)*0.01
# compute the inverse of GRM
Ginv <- solve(G)
The only other matrix we do not have is \(Z\), which is has dimensions nrecords \(\times\) nanims. \(Z\) is a diagonal matrix for those animals with records, and a block of zeros for those animals without records (as there are no records to allocate these animals to).
\(Z\) can be constructed as
Z1 <-diag(nrecords)
Z2 <-matrix(0, nrecords, nvalidation)
Z <- cbind(Z1, Z2)
Now go ahead and build the mixed model solution equations above. For \(\dfrac{\sigma_e^2}{\sigma_g^2}\) use a value of \(1\).
# coeff
coeff <- array(0, c(nanims + 1, nanims + 1))
coeff[1:1, 1:1] <- t(ones) %*% ones
coeff[1:1, 2:(nanims+1)] <- t(ones) %*% Z
coeff[2: 2:(nanims+1), 1] <- t(Z) %*% ones
coeff[2:(nanims+1), 2:(nanims+1)] <- t(Z) %*% Z + Ginv
# right hand side
rhs = rbind(t(ones) %*% y, t(Z) %*% y)
# BLUP solution for the breeding value of each individual
gblup <- solve(coeff, rhs)
Question 4: What is the accuracy of the genomic predictions for the \(31\) selection candidates from the GBLUP, eg \(r(\hat{g},\text{yval})\)? How does this compare with the accuracy of SNP-BLUP?
# the genomic prediction for the 31 selection candidates is
yval_pred = gblup[-c(1:326)]
# the accuracy is
cor(yval, yval_pred)**2
## [,1]
## [1,] 0.5423429
Question 5: If you plot GEBV for the SNP-BLUP prac against \(\hat{g}\), do you get a regression line with a slope of 1, indicating these are equivalent models? Why, or why not (hint, did we centre and standardise the \(X\) matrix to the \(W\) matrix in SNP-BLUP)? If you use the same genotype matrix (\(W\)) in both SNP-BLUP and GBLUP are the genomic predictions identical?
#png("snpblup_vs_gblup.png")
plot(GEBV, yval_pred, xlab = "GEBV from SNP-BLUP", ylab = "GEBV from G-BLUP")
lm(yval_pred ~ GEBV)
##
## Call:
## lm(formula = yval_pred ~ GEBV)
##
## Coefficients:
## (Intercept) GEBV
## -2.124 1.017
abline(lm(yval_pred ~ GEBV))
#dev.off()
SNP-BLUP with standardised genotype matrix.
X = W[1:325,]
coeff <- array(0, c(nmarkers + 1, nmarkers + 1))
coeff[1:1, 1:1] <- t(ones) %*% ones
coeff[1:1, 2:(nmarkers+1)] <- t(ones) %*% X
coeff[2: 2:(nmarkers+1), 1] <- t(X) %*% ones
coeff[2:(nmarkers+1), 2:(nmarkers+1)] <- t(X) %*% X + lambda * I
rhs = rbind(t(ones) %*% y, t(X) %*% y)
solution_vec <- solve(coeff, rhs)
Wprog = W[-c(1:325),] #only W for the progeny
GEBV_W = Wprog %*% solution_vec[-1]
cor(yval, GEBV_W)**2
## [,1]
## [1,] 0.5423038
#png("snpblup_scaled_vs_gblup.png")
plot(GEBV_W, yval_pred, xlab = "GEBV from SNP-BLUP with scaled genotypes", ylab = "GEBV from G-BLUP")
lm(yval_pred ~ GEBV_W)
##
## Call:
## lm(formula = yval_pred ~ GEBV_W)
##
## Coefficients:
## (Intercept) GEBV_W
## -6.409e-05 9.998e-01
abline(lm(yval_pred ~ GEBV_W))
#dev.off()
You can use GCTA (https://yanglab.westlake.edu.cn/software/gcta/#BLUP) to run BLUP, which is more computationally efficient than R. We demonstrate GCTA-BLUP using the simulated data set used in Practical 1.
## code to generate GRM
bfile="/data/module5/sim/gwas"
gcta --bfile $bfile --make-grm --threads 4 --out gwas
grm="gwas"
pheno="/data/module5/sim/gwas.phen"
covar="/data/module5/sim/gwas.cov"
gcta --reml --grm $grm --pheno $pheno --reml-pred-rand --qcovar $covar --out gwas
bfile="/data/module5/sim/gwas"
indi_blp="gwas.indi.blp"
gcta --bfile $bfile --blup-snp $indi_blp --out gwas
To compute the polygenic scores in the target population (do not need a testing population as required in C+PT for finding the optimal P-value threshold):
target="/data/module5/sim/target"
plink --bfile $target --score gwas.snp.blp sum 1 2 3 --out target.snp.blp
Use R to evaluate the prediction accuracy
phenFile="/data/module5/sim/target.phen"
covFile="/data/module5/sim/target.cov"
prsFile="target.snp.blp.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Question 6: Is the prediction accuracy from BLUP higher than that from C+PT? Why or why not? (hint: what’s the assumption for the SNP effect distribution in BLUP? Is the genetic model in the simulation consistent with this assumption?)