1 Objectives

In this practical, we will learn how to evaluate the prediction accuracy in quantitative and disease traits using various statistics. We will also use simulation to demonstrate potential pitfalls in the PGS prediction analysis.

All data are simulated with the simulation code provided below.

2 How to evaluate the prediction performance

2.1 R-square for quantitative traits

### simulate the data
set.seed(100)  # set this seed so you can get the same result
n = 10000
### quantitative trait
yhat = rnorm(n)
age = as.integer(runif(n, 20, 60))
sex = rbinom(n, 1, 0.55)
y = yhat*sqrt(0.1) + scale(age)*sqrt(0.02) +  rnorm(n)*sqrt(0.88)
ids = paste("indi", 1:n, sep="")
data = data.frame(ID=ids, pheno=y, prs=yhat, age=age, sex=sex)
### linear regression
lmR = lm(pheno ~ age + sex, data=data)  ### reduced module
lmF = lm(pheno ~ age + sex + prs, data=data) ### full module
### look at the summary results for the two models
summary(lmR)
summary(lmF)
### incremental r-square
summary(lmF)$"r.square" - summary(lmR)$"r.square"
## [1] 0.09699013

Interpretation: This is the incremental \(R^2\) – the proportion of variance in phenotype explained by the PRS after adjusting for age and sex.

2.2 Nagelkerke’s R-sqaure

### simulate the data
### binary trait
set.seed(101)
yhat = rnorm(n)
age = as.integer(runif(n, 20, 60))
sex = rbinom(n, 1, 0.55)
y = yhat*sqrt(0.1) + scale(age)*sqrt(0.02) +  rnorm(n)*sqrt(0.88)
y = as.numeric(y>=0)
ids = paste("indi", 1:n, sep="")
data = data.frame(ID=ids, pheno=y, prs=yhat, age=age, sex=sex)
### logistic regression
glmR = glm(pheno ~ age + sex, data=data, family=binomial(logit))  ### reduced module
glmF = glm(pheno ~ age + sex + prs, data=data, family=binomial(logit))  ### full module
### look at the summary results for the two models
summary(glmR)
summary(glmF)
### log-likelihood
N = nrow(data)
LLF = logLik(glmF)
LLR = logLik(glmR)

### Cox&Snell R2
CSv <-  1-exp((2/N)*(LLR[1]-LLF[1]))
CSv
## [1] 0.065939
### Nagelkerke's R2
NKv <- CSv/(1-exp((2/N)*LLR[1]))
NKv
## [1] 0.08815463

Interpretation: Nagelkerke’s \(R^2\) is a pseudo-\(R^2\) metric appropriate for logistic regression. It approximates the proportion of variance explained on the observed binary scale.

2.3 AUC

### AUC
### install.packages('pROC')
library('pROC')
## Warning: package 'pROC' was built under R version 4.4.1
aucF = auc(data$pheno, glmF$linear.predictors)  ### AUC for full module
aucR = auc(data$pheno, glmR$linear.predictors)  ### AUC for reduced module
aucF; aucR
## Area under the curve: 0.6549
## Area under the curve: 0.5509
aucF - aucR
## [1] 0.104029
### draw the ROC
#install.packages("PredictABEL")
library(PredictABEL)
plotROC(data=data, cOutcome=2, predrisk=glmF$linear.predictors)

## AUC [95% CI] for the model 1 :  0.655 [ 0.644  -  0.666 ]

Interpretation: AUC measures the discriminative ability of the model. The difference aucF - aucR shows how much PRS improves case/control classification beyond age and sex.

2.4 Variance explained on liability scale

### function to convert R-square from 0-1 observed scale
### to liability scale
h2l_R2 <- function(k, r2, p) {
  # k baseline disease risk
  # r2 from a linear regression model of genomic profile risk score
  # p proportion of sample that are cases
  # calculates proportion of variance explained on the liability scale
  # from ABC at http://www.complextraitgenomics.com/software/
  # Lee SH, Goddard ME, Wray NR, Visscher PM. (2012) A better coefficient of determination for genetic profile analysis. Genet Epidemiol. 2012 Apr;36(3):214-24.
  x = qnorm(1-k)
  z = dnorm(x)
  i = z/k
  C = k*(1-k)*k*(1-k)/(z^2*p*(1-p))
  theta = i*((p-k)/(1-k))*(i*((p-k)/(1-k))-x)
  h2l_R2 = C*r2 / (1 + C*theta*r2)
  return(h2l_R2)
}
K=0.1  ## population prevalence
P = sum(data$pheno)/nrow(data)  ### propotion of cases in the sample
P
## [1] 0.5073
### linear regression with 0/1 values
lmR = lm(pheno~age+sex, data=data)
lmF = lm(pheno~age+sex+prs, data=data)
#R2v = summary(lmF)$"r.square" - summary(lmR)$"r.square"
#To strictly follow the equation 7 in Lee et al.
R2v = 1-exp((2/N)*(logLik(lmR)[1]-logLik(lmF)[1]))
R2v
## [1] 0.06595017
### convert to liability scale
h2l_R2(K,R2v,P)
## [1] 0.07130684

Interpretation: Converts observed-scale \(R^2\) to liability-scale, accounting for case-control ascertainment and population prevalence.

2.5 Decile Odds Ratio

### cut into deciles
data$decile = cut(data$prs, breaks=c(quantile(data$prs, probs=seq(0,1,by=0.1))), labels=1:10, include.lowest=T)
### calculate manually the odds in each decile
#### install.packages("tidyverse")
library(dplyr)
data %>% group_by(decile) %>% 
    summarise(n_case=sum(pheno==1), n_control=sum(pheno==0)) %>%
    mutate(odds = n_case/n_control) %>%
    mutate(ORs = odds/odds[1])
## # A tibble: 10 × 5
##    decile n_case n_control  odds   ORs
##    <fct>   <int>     <int> <dbl> <dbl>
##  1 1         280       720 0.389  1   
##  2 2         362       638 0.567  1.46
##  3 3         438       562 0.779  2.00
##  4 4         440       560 0.786  2.02
##  5 5         507       493 1.03   2.64
##  6 6         518       482 1.07   2.76
##  7 7         570       430 1.33   3.41
##  8 8         584       416 1.40   3.61
##  9 9         643       357 1.80   4.63
## 10 10        731       269 2.72   6.99
### calculate ORs using logistic regression
glmD <- glm(pheno ~ decile, data = data, family = binomial(logit))
### Odds for being a case compared to control in each decile
ORD <- exp(glmD$coefficients)
ORD
## (Intercept)     decile2     decile3     decile4     decile5     decile6 
##   0.3888889   1.4590237   2.0040671   2.0204082   2.6444509   2.7634855 
##     decile7     decile8     decile9    decile10 
##   3.4086379   3.6098901   4.6314526   6.9877854
### Plot odds
ORDL <- exp(glmD$coefficients-1.96*summary(glmD)$coefficients[,2])
ORDH <- exp(glmD$coefficients+1.96*summary(glmD)$coefficients[,2])
plot(ORD,ylim=c(min(ORDL),max(ORDH)))
arrows(seq(1,10,1), ORD, seq(1,10,1), ORDH, angle=90,length=0.10) # Draw error bars
arrows(seq(1,10,1), ORD, seq(1,10,1), ORDL, angle=90,length=0.10) # Draw error bars

Interpretation: Shows how odds of disease increase with higher PRS deciles. ORs relative to the lowest decile highlight risk stratification ability.

3 Pitfalls in PGS evaluation

3.1 Simulate data

### we simulate data now
set.seed(612)    ### set a seed for reproducation
n=1000; m=100    ### n: sample size; m: the number of SNPs

### simulate gneotype from binomial distribution
### minor allele frequency from uniform distribution
mafs = runif(m, 0.05, 0.5)
x = do.call("cbind", lapply(1:m, function(x) rbinom(n, 2, mafs[x]))) 
colnames(x) = paste("SNP", 1:m, sep="")

### simulate a phenotype from an independent standard normal distribution
### null hypothesis: prediction accuracy should be 0
y = rnorm(n)
data = data.frame(y, x)

3.2 Directly report R2 in the discovery sample

### E(R2) = m/n when m<n
summary(lm(as.formula(paste("y~", paste("SNP", 1:m, sep="", collapse="+"), sep="")),data=data.frame(data)))$"r.square"
## [1] 0.08951179

Q: What’s the interpretation of this \(R^2\)? [Hint: Is it close to m/n, where m is the number of SNPs and n is the sample size?]


Take a moment to think about it before continuing.




Answer: It reflects in-sample fit, not true predictive power. Under the null (no association), the expected \(R^2\) is approximately \(m/n\) (i.e., \(100/1000 = 0.1\)). This is overfitting, i.e., the model captures noise as signal.

3.3 Winner’s curse

### perform GWAS: association using R lm() function
### instead of Plink software (only practical for small data)
### select the top 10 SNPs
pvals = sapply(1:m, function(i) coef(summary(lm(y~x[,i])))[2,4])
### No p-values passed Bonferroni threshold of 5e-4 (0.05/100)
head(sort(pvals))  
## [1] 0.0007738269 0.0144049898 0.0185012440 0.0612186306 0.0612745215
## [6] 0.0616904469
top10SNPs = colnames(x)[head(order(pvals), 10)]
summary(lm(as.formula(paste("y~", paste(top10SNPs, collapse="+"), sep="")),data=data))$"r.square"
## [1] 0.04027331

Q: What’s the interpretation of this \(R^2\)?


Take a moment to think about it before continuing.




Answer: It shows inflated \(R^2\) due to the winner’s curse—selecting SNPs based on smallest p-values in the same data set used for model evaluation. This leads to optimistically biased performance.

3.4 Estimate the SNPs in the total sample, and evaluate in the target sample

### first 900 samples into discovery set
y_dis = y[1:900]
x_dis = x[1:900,]
### the remaining 100 samples into target set
y_target = y[901:1000]
x_target = x[901:1000,]

### estimate effect sizes in the total sample
b_total = sapply(1:m, function(i) coef(summary(lm(y~x[,i])))[2,1])
### evaluate the PRS in the target sample
prs1 = x_target %*% b_total
summary(lm(y_target~prs1))$"r.square"
## [1] 0.1573479
### estimate effect sizes only in the discovery sample
b_dis = sapply(1:m, function(i) coef(summary(lm(y_dis~x_dis[,i])))[2,1])
### evaluate the PRS in the target sample
prs2 = x_target %*% b_dis
summary(lm(y_target~prs2))$"r.square"
## [1] 0.001510061

Q: What are the interpretation of these \(R^2\)?


Take a moment to think about it before continuing.




Answer:

  • The \(R^2\) using b_total is slightly inflated, as the effects were estimated using the full sample (including the target).
  • The \(R^2\) using b_dis is more realistic, as it reflects out-of-sample prediction performance.