1 Objectives

We will learn how to compute polygenic scores (PGS) for complex traits (or polygenic risk scores, PRS, for complex diseases) using a basic method, i.e., clumping + P-value thresholding (C+PT) based on the GWAS results. In this practical, we will use two simulated data sets. One is a small data set with 10 SNPs, which is easy to manage and compute in R. Another data set has a size closer to the real world problem, including ~300k SNPs across 22 chromosomes. We will use the small data set to illustrate the principle of the method and calculate PGS by hands in R. We will then analyse the larger data set using a formal pipeline with PLINK.

Note that the small data set is free to download to your own laptop if you want. However, the UKB data set is not allowed to download because it contains real individual genotypes from the UK Biobank (although individual IDs has been masked).

2 Data

2.1 Small data set

This is a toy example data set consists of a training population (aka GWAS population, discovery population, or reference population) of 325 individuals. Each of these individuals had been genotyped for 10 SNPs. The trait was simulated in a way that the first SNP has an effect size of 2 and the 5th SNP has an effect size of 1 on the phenotype. The trait heritability is 0.5. The validation population (aka target population) is a hold-out sample of 31 individuals.

Path to Data set 1:

ls /data/module5/toy/

2.2 Large data set

This is a larger data set based on real genotypes at 273,604 SNPs from the UK Biobank. We simulated a trait with 100 causal variants with heritability of 0.5. The entire data set was partitioned into three independent populations:

  • Discovery population for conducting GWAS (n = 3,000).
  • Testing population for finding the best prediction model (n = 500).
  • Target population to be predicted (n = 1,00).

Path to Data set 2:

ls /data/module5/ukb/

The GWAS results were stored in a file named gwas.ma:

sumstat="/data/module5/ukb/gwas.ma"
head $sumstat

3 Illustration of principle

Let’s start with using R to calculate PGS in the small data set. The aim is to predict PRS for the 31 validation individuals. You can use the following R code to read the data into R (if you are using our server, you can enter R by typing ‘R’ in the terminal):

# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))

# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))

Firstly, we perform GWAS in the training (or discovery) population, using a simple linear regression of phenotypes on genotypes for one SNP at a time. The slope of the regression is the SNP effect estimate. Such an effect estimate from GWAS is known as the marginal effect estimate for its ignorance of linkage disequilibrium (LD) correlations between SNPs.

fit = apply(X, 2, function(x){summary(lm(y~x))$coef[2,]})
bhat = fit[1,]

Now you have the SNP effect estimates bhat and the genotypes of validation samples Xval. Can you write a R code to predict the PGS for the 31 validation individuals? Remember that PGS is a weighted sum of allele counts across SNPs.

# Given the SNP effect vector
print(bhat)
##         V1         V2         V3         V4         V5         V6         V7 
##  2.6026705  0.6301389  0.6301389  2.2707780  2.0718035  2.2161266  0.5976793 
##         V8         V9        V10 
## -2.2707780  0.8994498 -2.2707780
# and validation genotype matrix
head(Xval)
##      V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## [1,]  1  1  1  1  1  1  2  1  0   1
## [2,]  1  0  0  1  1  1  2  1  0   1
## [3,]  1  0  0  1  1  1  2  1  0   1
## [4,]  1  0  0  1  1  1  2  1  0   1
## [5,]  0  0  0  0  0  0  1  2  0   2
## [6,]  1  1  1  1  1  1  2  1  0   1

Take a moment to think about it before continuing.




Suppose one wrote the following R code using all SNPs with their marginal effect estimates from GWAS in the prediction model.

PGS = Xval %*% bhat

Let’s plot the phenotype against PGS. What do you find?

# center PGS and phenotype to remove the intercept because the intercept is irrelevant
PGS = PGS - mean(PGS)
yval = yval - mean(yval)

# plot phenotype against PGS
plot(PGS, yval, xlab="PGS using all SNPs", ylab="Phenotype")
abline(a=0, b=1)  # add y=x line
abline(lm(yval ~ PGS), col="red")  # add regression line

# you can check the value of the regression slope by
print(lm(yval ~ PGS))
## 
## Call:
## lm(formula = yval ~ PGS)
## 
## Coefficients:
## (Intercept)          PGS  
##   2.955e-17    1.675e-01

You may have noticed that the regression slope of phenotypes (y-axis) on PGS (x-axis) is substantially smaller than one. What does it mean?

Take a moment to think about it before continuing.




Ideally, we want the slope to be equal to one, which would indicate that an unit change in PGS leads to an unit change in phenotype. This is often referred to as “unbiasedness”. When the slope is below one, it means the PGS is upward biased, i.e., the absolute predicted value is larger than the true value. You can see this from the difference in the range of values between x and y axes.

What could be the cause of such a bias?

Take a moment to think about it before continuing.




3.1 LD pruning

To answer the question above, let’s check the LD correlations of genotypes between SNPs.

R = cor(X)
print(R)
##             V1         V2         V3         V4          V5          V6
## V1   1.0000000  0.1338359  0.1338359  0.7284017  0.57898634  0.64984918
## V2   0.1338359  1.0000000  1.0000000  0.3789184  0.25731616  0.36895246
## V3   0.1338359  1.0000000  1.0000000  0.3789184  0.25731616  0.36895246
## V4   0.7284017  0.3789184  0.3789184  1.0000000  0.81429611  0.81303937
## V5   0.5789863  0.2573162  0.2573162  0.8142961  1.00000000  0.90141987
## V6   0.6498492  0.3689525  0.3689525  0.8130394  0.90141987  1.00000000
## V7   0.1023013  0.1726592  0.1726592  0.2053259  0.10824685  0.18231275
## V8  -0.7284017 -0.3789184 -0.3789184 -1.0000000 -0.81429611 -0.81303937
## V9   0.3062689  0.4491011  0.4491011  0.4820692  0.07706692  0.05818721
## V10 -0.7284017 -0.3789184 -0.3789184 -1.0000000 -0.81429611 -0.81303937
##              V7         V8          V9        V10
## V1   0.10230128 -0.7284017  0.30626887 -0.7284017
## V2   0.17265922 -0.3789184  0.44910112 -0.3789184
## V3   0.17265922 -0.3789184  0.44910112 -0.3789184
## V4   0.20532585 -1.0000000  0.48206922 -1.0000000
## V5   0.10824685 -0.8142961  0.07706692 -0.8142961
## V6   0.18231275 -0.8130394  0.05818721 -0.8130394
## V7   1.00000000 -0.2053259  0.07758962 -0.2053259
## V8  -0.20532585  1.0000000 -0.48206922  1.0000000
## V9   0.07758962 -0.4820692  1.00000000 -0.4820692
## V10 -0.20532585  1.0000000 -0.48206922  1.0000000

You can see that some SNPs are in high LD correlations with others. This may result in double counting of the SNP effects because the SNP effects estimated from GWAS are not conditional on other SNPs (the marginal SNP effects).

Below is a simple R code to remove one of the SNPs in pairwise LD \(r^2\) greater than a threshold (e.g., 0.5). This procedure is known as LD pruning.

rsqThreshold = 0.5
nsnp = ncol(X)
removed = c()
for (i in 1:(nsnp-1)) {
  if (! i %in% removed) { 
    for (j in (i+1):nsnp) {
      if(R[i,j]^2 > rsqThreshold) removed = c(removed, j)
    }
  }
}
removed = unique(removed)
all = 1:nsnp
keep = all[!all %in% removed] # remaining SNPs after LD pruning
print(R[keep, keep])  # LD correlations among LD pruned SNPs
##           V1        V2         V5         V7         V9
## V1 1.0000000 0.1338359 0.57898634 0.10230128 0.30626887
## V2 0.1338359 1.0000000 0.25731616 0.17265922 0.44910112
## V5 0.5789863 0.2573162 1.00000000 0.10824685 0.07706692
## V7 0.1023013 0.1726592 0.10824685 1.00000000 0.07758962
## V9 0.3062689 0.4491011 0.07706692 0.07758962 1.00000000

Now, let’s use the LD-pruned SNPs only to calculate PRS and check the result.

PGS2 = Xval[,keep] %*% bhat[keep]
PGS2 = PGS2 - mean(PGS2)

plot(PGS2, yval, xlab="PGS using LD-pruned SNPs", ylab="Phenotype")
abline(a=0, b=1)
abline(lm(yval~PGS2), col="red")

print(lm(yval ~ PGS2))
## 
## Call:
## lm(formula = yval ~ PGS2)
## 
## Coefficients:
## (Intercept)         PGS2  
##   4.558e-17    5.428e-01

3.2 P-value thresholding

The slope has been improved but is still low. Checking the P-values of the LD-pruned SNPs finds that some SNP effects are statistically insignificant.

pval = fit[4,]
print(pval[keep])
##           V1           V2           V5           V7           V9 
## 4.491829e-49 6.581417e-02 8.597549e-23 5.433139e-02 5.818065e-04

Including insignificant SNPs in the PGS may add noise to the prediction. Let’s try to keep genome-wide significant (P < 5e-8) SNPs only in the equation. This step of filtering is called P-value thresholding.

pvalThreshold = 5e-8
keepFinal = keep[keep %in% which(pval < pvalThreshold)]
PGS3 = Xval[,keepFinal] %*% bhat[keepFinal]
PGS3 = PGS3 - mean(PGS3)

plot(PGS3, yval, xlab="PGS using LD-pruned and P-value thresholded SNPs", ylab="Phenotype")
abline(a=0, b=1)
abline(lm(yval~PGS3), col="red")

print(lm(yval ~ PGS3))
## 
## Call:
## lm(formula = yval ~ PGS3)
## 
## Coefficients:
## (Intercept)         PGS3  
##  -5.944e-17    6.651e-01

You can also compare the prediction accuracy (squared correlation between phenotypes and PGS) from different criteria of SNP selection. Note that the theoretical upper bound of the prediction accuracy is the heritability of the simulated trait (0.5).

print(cor(yval, PGS)^2)   ## prediction accuracy using all SNPs
##         [,1]
## V1 0.3696879
print(cor(yval, PGS2)^2)  ## prediction accuracy using LD pruned SNPs
##         [,1]
## V1 0.4361839
print(cor(yval, PGS3)^2)  ## prediction accuracy using LD pruned + P-value thresholded SNPs 
##         [,1]
## V1 0.4794083

In conclusion, LD pruning and P-value thresholding improves prediction accuracy and unbiasness by avoiding double counting of SNP effects and excluding possible false positives. However, in practice, it is often not clear where to draw the line between the false and true positives. In addition, LD clumping is usually better than LD pruning, because the clumping algorithm takes P-values into account when filtering SNPs in LD. In the next session, we will use PLINK to carry out the clumping + P-value thresholding (C+PT) method for the analysis of the simulated UKB data.

4 UKB data analysis

In the subsequent analysis, we will use the simulated data set based on the UKB data (/data/module5/ukb/) to demonstrate the C+PT method using software PLINK (https://www.cog-genomics.org/plink/1.9/). Code is adapted from https://choishingwan.github.io/PRS-Tutorial/plink/.

4.1 Step 1: Clumping

Select independent SNPs. This step requires individual genotype data for LD computation. Here we use the GWAS sample as LD reference. In practice, the genotype data in GWAS may not be available and you will need to use a reference sample that matches the GWAS sample in genetics. Note that the following code is bash script that needs to be run in Terminal.

LDref="/data/module5/ukb/gwas"
sumstat="/data/module5/ukb/gwas.ma"
plink \
    --bfile $LDref \
    --clump-p1 1 \
    --clump-r2 0.1 \
    --clump-kb 250 \
    --clump $sumstat \
    --clump-snp-field SNP \
    --clump-field p \
    --out gwas
  • clump-p1 1 P-value threshold for a SNP to be included as an index SNP. 1 is selected such that all SNPs are include for clumping.
  • clump-r2 0.1 SNPs having \(r^2\) higher than 0.1 with the index SNPs will be removed.
  • clump-kb 250 SNPs within 250k of the index SNP are considered for clumping.
  • clump $sumstat Base data (summary statistic) file containing the P-value information.
  • clump-snp-field SNP Specifies that the column SNP contains the SNP IDs.
  • clump-field p Specifies that the column P contains the P-value information.

This will generate gwas.clumped, containing the index SNPs after clumping is performed. We can extract the index SNP ID by performing the following command ($3 because the third column contains the SNP ID):

awk 'NR!=1{print $3}' gwas.clumped >  gwas.clumped.snp

It’s always good to check the result file and see if everything works fine. You can look at the clumping result file by

less -S gwas.clumped

4.2 Step 2: Generate PRS based on a range of P-value thresholds

plink provides a convenient function --score and --q-score-range for calculating polygenic scores given a range of P-value thresholds.

We will need three files:

  • The base data file: gwas.ma
  • A file containing SNP IDs and their corresponding P-values ($1 because SNP ID is located in the first column; $7 because the P-value is located in the seventh column).
sumstat="/data/module5/ukb/gwas.ma"
awk '{print $1,$7}' $sumstat > SNP.pvalue
  • A file containing the different P-value thresholds for inclusion of SNPs in the PRS. Here calculate PRS corresponding to a few thresholds for illustration purposes:
echo "1e-8 0 1e-8" > range_list 
echo "1e-6 0 1e-6" >> range_list
echo "1e-4 0 1e-4" >> range_list
echo "0.01 0 0.01" >> range_list
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.5 0 0.5" >> range_list
echo "1 0 1" >> range_list

We can then calculate the PRS with the following PLINK command in the testing population

test="/data/module5/ukb/test"
sumstat="/data/module5/ukb/gwas.ma"
plink \
    --bfile $test \
    --score $sumstat 1 2 5 sum header \
    --q-score-range range_list SNP.pvalue \
    --extract gwas.clumped.snp \
    --out test
  • score $sumstat 1 2 5 sum header Read from the file, assuming that the 1st column is the SNP ID; 2nd column is the effective allele information; the 5th column is the effect size estimate; the file contains a header and we want to calculate the score as the weighted sum without dividing by the total number of SNPs.
  • q-score-range range_list SNP.pvalue We want to calculate PRS based on the thresholds defined in range_list, where the threshold values (P-values) were stored in SNP.pvalue.

4.3 Step 3: Finding the “best-fit” PRS in the testing population

The P-value threshold that provides the “best-fit” PRS under the C+PT method is usually unknown. To approximate the “best-fit” PRS, we can perform a regression between PRS calculated at a range of P-value thresholds and then select the PRS that explains the highest phenotypic variance. This can be achieved using R as follows:

dataDir = "/data/module5/ukb/"
phenFile = paste0(dataDir, "test.phen")
covFile = paste0(dataDir, "test.cov")

p.threshold = c("1e-8","1e-6","1e-4","0.01","0.05","0.1","0.5","1")
# Read in the phenotype file 
phenotype = read.table(phenFile, header=F)
colnames(phenotype) = c("FID", "IID", "Trait") 
# Read in the covariates
covariates = read.table(covFile, header=F)
colnames(covariates) = c("FID", "IID", "Sex", "Age", paste0("PC",seq(10)))
# Now merge the files
pheno = merge(phenotype, covariates, by=c("FID", "IID"))
# Calculate the null model (model with PRS) using a linear regression 
null.model = lm(Trait~., data=pheno)
# And the R2 of the null model is 
null.r2 = summary(null.model)$r.squared
prs.result = NULL
for(i in p.threshold){
    # Go through each p-value threshold
    prs = read.table(paste0("test.", i, ".profile"), header=T)
    # Merge the prs with the phenotype matrix
    # We only want the FID, IID and PRS from the PRS file, therefore we only select the 
    # relevant columns
    pheno.prs = merge(pheno, prs[,c("FID","IID", "SCORESUM")], by=c("FID", "IID"))
    # Now perform a linear regression on trait with PRS and the covariates
    # ignoring the FID and IID from our model
    model = lm(Trait~., data=pheno.prs)
    # model R2 is obtained as 
    model.r2 = summary(model)$r.squared
    # R2 of PRS is simply calculated as the model R2 minus the null R2
    prs.r2 = model.r2-null.r2
    # We can also obtain the coeffcient and p-value of association of PRS as follow
    prs.coef = summary(model)$coeff["SCORESUM",]
    prs.beta = as.numeric(prs.coef[1])
    prs.se = as.numeric(prs.coef[2])
    prs.p = as.numeric(prs.coef[4])
    # We can then store the results
    prs.result = rbind(prs.result, data.frame(Threshold=i, 
                                               R2=prs.r2, 
                                               P=prs.p, 
                                               BETA=prs.beta,
                                               SE=prs.se))
}
# Best result is:
prs.result[which.max(prs.result$R2),]
PT = prs.result[which.max(prs.result$R2),1]
PT = as.character(PT)
write.table(t(c(PT, 0, PT)), "selected_range", quote=F, row.names=F, col.names=F)

4.4 Step 4: visualisation

Viz the result using R.

# ggplot2 is a handy package for plotting
library(ggplot2)
# generate a pretty format for p-value output
prs.result$print.p <- round(prs.result$P, digits = 3)
prs.result$print.p[!is.na(prs.result$print.p) &
                    prs.result$print.p == 0] <-
    format(prs.result$P[!is.na(prs.result$print.p) &
                            prs.result$print.p == 0], digits = 2)
prs.result$print.p <- sub("e", "*x*10^", prs.result$print.p)
# Initialize ggplot, requiring the threshold as the x-axis 
# (use factor so that it is uniformly distributed)
p = ggplot(data = prs.result, aes(x = factor(Threshold, levels = p.threshold), y = R2)) +
    # Specify that we want to print p-value on top of the bars
    geom_text(
        aes(label = paste(print.p)),
        vjust = -1.5,
        hjust = 0,
        angle = 45,
        cex = 4,
        parse = T
    )  +
    # Specify the range of the plot, *1.25 to provide enough space for the p-values
    scale_y_continuous(limits = c(0, max(prs.result$R2) * 1.25)) +
    # Specify the axis labels
    xlab(expression(italic(P) - value ~ threshold ~ (italic(P)[T]))) +
    ylab(expression(paste("PRS model fit:  ", R ^ 2))) +
    # Draw a bar plot
    geom_bar(aes(fill = -log10(P)), stat = "identity") +
    # Specify the colors
    scale_fill_gradient2(
        low = "dodgerblue",
        high = "firebrick",
        mid = "dodgerblue",
        midpoint = 1e-4,
        name = bquote(atop(-log[10] ~ model, italic(P) - value),)
    ) +
    # Some beautification of the plot
    theme_classic() + theme(
        axis.title = element_text(face = "bold", size = 18),
        axis.text = element_text(size = 14),
        legend.title = element_text(face = "bold", size =
                                        18),
        legend.text = element_text(size = 14),
        axis.text.x = element_text(angle = 45, hjust =
                                    1)
    )
p
# save the plot
ggsave("Pred_R2_barplot.png", p, height = 7, width = 7)

Questions

  1. Which P-value threshold generates the “best-fit” PRS?

  2. How much phenotypic variation does the “best-fit” PRS explain?

4.5 Step 5: apply the best model to the target population

Generate PRS for the target population based on the selected selected P-value threshold.

target="/data/module5/ukb/target"
sumstat="/data/module5/ukb/gwas.ma"
plink \
    --bfile $target \
    --score $sumstat 1 2 5 sum header \
    --q-score-range selected_range SNP.pvalue \
    --extract gwas.clumped.snp \
    --out target

Evaluate the prediction accuracy in the target individuals using provided R script.

phenFile="/data/module5/ukb/target.phen" 
covFile="/data/module5/ukb/target.cov"
prsFile="target.1e-6.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile

Is the prediction accuracy expected?