In this practical exervise, we will learn how to compute polygenic scores (PGS) for complex traits, or polygenic risk scores (PRS) for complex diseases, using a basic approach: clumping + P-value thresholding (C+PT) based on GWAS results. We will work with two simulated data sets:
We will use the small data set to illustrate the principles and calculate PGS by hand in R. Then, the larger data set will be analysed with a pipeline using PLINK (https://www.cog-genomics.org/plink/1.9/).
Note: the small data set can be freely downloaded. The UK Biobank data set cannot be downloaded because it contains real individual genotypes, although individual IDs are masked.
This toy data set includes:
The trait was simulated such that SNP 1 has an effect size of 2 and SNP 5 has an effect size of 1. The trait heritability is set at 0.5.
Data location:
ls /data/module5/toy/Example R code to load the data:
# data for training population
data_path="/data/module5/toy/"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))
# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))You can download this small data set using the following Linux command in the Terminal (or Powershell for Windows):
scp -r username@hostname:/data/module5/toy your/local/pathThis data set is based on real genotypes at 273,604 SNPs from the UK Biobank, with a simulated trait affected by 100 causal variants and heritability 0.5. The samples are split into three independent cohorts:
Data location:
ls /data/module5/ukb/The GWAS summary statistics are stored in gwas.ma:
head /data/module5/ukb/gwas.ma## SNP A1 A2 freq b se p N
## rs10000010 C T 0.477993 -0.2796 0.3719 0.4522 2999
## rs10000030 A G 0.135302 -0.09771 0.5496 0.8589 2997
## rs1000007 C T 0.275333 -0.1548 0.4153 0.7094 3000
## rs10000121 G A 0.468092 -0.5717 0.3729 0.1253 2993
## rs10000141 A G 0.087725 0.6495 0.6649 0.3287 2998
## rs1000016 G A 0.068402 0.6553 0.7509 0.3829 2997
## rs10000169 C T 0.250833 -0.1971 0.4317 0.648 3000
## rs1000022 T C 0.415138 0.07813 0.3815 0.8377 2999
## rs10000272 C T 0.056985 0.8443 0.8181 0.3024 2992Let’s start with using R to calculate PGS in the small data set. Start R and load the data as shown above.
Perform GWAS in the training population by regressing the phenotype
on each SNP one at a time. This gives marginal SNP effect estimate
(bhat), ignoring linkage disequilibrium (LD) between
SNPs.
fit = apply(X, 2, function(x){summary(lm(y ~ x))$coef[2,]})
bhat = fit[1,]Now, using the validation genotype matrix Xval and the
SNP effect estimates bhat,
Question: Can you write a R code to predict the PGS for the 31 validation individuals? [Hint: PGS is a weighted sum of allele counts across SNPs.]
# Given the SNP effect vector
print(bhat)##         V1         V2         V3         V4         V5         V6         V7 
##  2.6026705  0.6301389  0.6301389  2.2707780  2.0718035  2.2161266  0.5976793 
##         V8         V9        V10 
## -2.2707780  0.8994498 -2.2707780# and validation genotype matrix
head(Xval)##      V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## [1,]  1  1  1  1  1  1  2  1  0   1
## [2,]  1  0  0  1  1  1  2  1  0   1
## [3,]  1  0  0  1  1  1  2  1  0   1
## [4,]  1  0  0  1  1  1  2  1  0   1
## [5,]  0  0  0  0  0  0  1  2  0   2
## [6,]  1  1  1  1  1  1  2  1  0   1
Take a moment to think about it before continuing.
Suppose we predict PGS using all SNPs with their marginal effect estimates from GWAS:
PGS = Xval %*% bhatLet’s check the prediction accuracy by taking the square of correlation between phenotype and PGS.
print(cor(yval, PGS)^2)##         [,1]
## V1 0.3696879Question: Is this prediction accuracy high? [Hint: What value do you expect if PGS can perfectly predict the true genetic value?]
Take a moment to think about it before continuing.
Answer: Not too high, as the upper bound of prediction accuracy is the trait heritability (0.5), which determines the maximum proportion of variation that can be explained by the genetic predictor.
Let’s also check the bias of PGS by regressing the phenotype on the PGS.
# center PGS and phenotype to remove the intercept because the intercept is irrelevant
PGS = PGS - mean(PGS)
yval = yval - mean(yval)
# plot phenotype against PGS
plot(PGS, yval, xlab="PGS using all SNPs", ylab="Phenotype")
abline(a=0, b=1)  # add y=x line
abline(lm(yval ~ PGS), col="red")  # add regression line# you can check the value of the regression slope by
print(lm(yval ~ PGS))## 
## Call:
## lm(formula = yval ~ PGS)
## 
## Coefficients:
## (Intercept)          PGS  
##   2.152e-17    1.675e-01You may have noticed that the regression slope of phenotypes on PGS is substantially smaller than one.
Question: What does it mean? [Hint: slope < 1 means that one unit increase in x leads to less than an unit increase in y. Does PGS tend to be inflated or deflated relative to the true phenotype value?]
Take a moment to think about it before continuing.
Answer: Ideally, we want the slope to be equal to one, which would indicate that an unit change in PGS leads to an unit change in phenotype. This is often referred to as “unbiasedness”. When the slope is below one, it means the PGS is upward biased, i.e., the absolute predicted value is larger than the true value. You can also see this from the difference in the range of values between x and y axes.
Question: What could be the cause(s) of such a relative low prediction accuracy and a large bias?
Take a moment to think about it before continuing.
To answer the question above, let’s check the LD correlations between SNPs.
R = cor(X)
print(R)##             V1         V2         V3         V4          V5          V6
## V1   1.0000000  0.1338359  0.1338359  0.7284017  0.57898634  0.64984918
## V2   0.1338359  1.0000000  1.0000000  0.3789184  0.25731616  0.36895246
## V3   0.1338359  1.0000000  1.0000000  0.3789184  0.25731616  0.36895246
## V4   0.7284017  0.3789184  0.3789184  1.0000000  0.81429611  0.81303937
## V5   0.5789863  0.2573162  0.2573162  0.8142961  1.00000000  0.90141987
## V6   0.6498492  0.3689525  0.3689525  0.8130394  0.90141987  1.00000000
## V7   0.1023013  0.1726592  0.1726592  0.2053259  0.10824685  0.18231275
## V8  -0.7284017 -0.3789184 -0.3789184 -1.0000000 -0.81429611 -0.81303937
## V9   0.3062689  0.4491011  0.4491011  0.4820692  0.07706692  0.05818721
## V10 -0.7284017 -0.3789184 -0.3789184 -1.0000000 -0.81429611 -0.81303937
##              V7         V8          V9        V10
## V1   0.10230128 -0.7284017  0.30626887 -0.7284017
## V2   0.17265922 -0.3789184  0.44910112 -0.3789184
## V3   0.17265922 -0.3789184  0.44910112 -0.3789184
## V4   0.20532585 -1.0000000  0.48206922 -1.0000000
## V5   0.10824685 -0.8142961  0.07706692 -0.8142961
## V6   0.18231275 -0.8130394  0.05818721 -0.8130394
## V7   1.00000000 -0.2053259  0.07758962 -0.2053259
## V8  -0.20532585  1.0000000 -0.48206922  1.0000000
## V9   0.07758962 -0.4820692  1.00000000 -0.4820692
## V10 -0.20532585  1.0000000 -0.48206922  1.0000000You can see that some SNPs are in high LD correlations with others. This may result in double counting of SNP effects because the SNP effects estimated from GWAS do not account for LD between SNPs.
Below is a simple R code to remove one of the SNPs in pairwise LD \(r^2\) greater than a threshold (e.g., 0.5). This procedure is known as LD pruning.
rsqThreshold = 0.5
nsnp = ncol(X)
removed = c()
for (i in 1:(nsnp-1)) {
  if (! i %in% removed) { 
    for (j in (i+1):nsnp) {
      if(R[i,j]^2 > rsqThreshold) removed = c(removed, j)
    }
  }
}
removed = unique(removed)
all = 1:nsnp
keep = all[!all %in% removed] # remaining SNPs after LD pruning
print(R[keep, keep])  # LD correlations among LD pruned SNPs##           V1        V2         V5         V7         V9
## V1 1.0000000 0.1338359 0.57898634 0.10230128 0.30626887
## V2 0.1338359 1.0000000 0.25731616 0.17265922 0.44910112
## V5 0.5789863 0.2573162 1.00000000 0.10824685 0.07706692
## V7 0.1023013 0.1726592 0.10824685 1.00000000 0.07758962
## V9 0.3062689 0.4491011 0.07706692 0.07758962 1.00000000Now, let’s use the LD-pruned SNPs only to calculate PRS and check the prediction accuracy and bias.
PGS2 = Xval[,keep] %*% bhat[keep]
print(cor(yval, PGS2)^2)##         [,1]
## V1 0.4361839PGS2 = PGS2 - mean(PGS2)
plot(PGS2, yval, xlab="PGS using LD-pruned SNPs", ylab="Phenotype")
abline(a=0, b=1)
abline(lm(yval~PGS2), col="red")print(lm(yval ~ PGS2))## 
## Call:
## lm(formula = yval ~ PGS2)
## 
## Coefficients:
## (Intercept)         PGS2  
##   2.390e-16    5.428e-01
Question: Are the prediction accuracy and bias improved? If so, why?
Take a moment to think about it before continuing.
Answer: Yes, both prediction accuracy and bias are improved. This improvement occurs because LD pruning helps to avoid double counting of SNP effects.
Checking the P-values of the LD-pruned SNPs finds that some SNP effects are statistically insignificant different from zero.
pval = fit[4,]
print(pval[keep])##           V1           V2           V5           V7           V9 
## 4.491829e-49 6.581417e-02 8.597549e-23 5.433139e-02 5.818065e-04Including SNPs that do not affect the phenotype in the PGS may add noise to the prediction. But where is the cutoff line? Let’s try to keep genome-wide significant (P-value < 5e-8) SNPs only in the equation. This step of filtering is called P-value thresholding.
pvalThreshold = 5e-8
keepFinal = keep[keep %in% which(pval < pvalThreshold)]
PGS3 = Xval[,keepFinal] %*% bhat[keepFinal]
print(cor(yval, PGS3)^2)##         [,1]
## V1 0.4794083PGS3 = PGS3 - mean(PGS3)
plot(PGS3, yval, xlab="PGS using LD-pruned and P-value thresholded SNPs", ylab="Phenotype")
abline(a=0, b=1)
abline(lm(yval~PGS3), col="red")print(lm(yval ~ PGS3))## 
## Call:
## lm(formula = yval ~ PGS3)
## 
## Coefficients:
## (Intercept)         PGS3  
##   3.378e-16    6.651e-01
Question: Are the prediction accuracy and bias improved? If so, why?
Take a moment to think about it before continuing.
You may try other significance thresholds and see how the result changes.
In conclusion, LD pruning and P-value thresholding improves prediction accuracy and unbiasness by avoiding double counting of SNP effects and excluding possible false positives. However, in practice, it is often not clear where to draw the line between the false and true positives. In addition, LD clumping is usually better than LD pruning, because the clumping algorithm takes P-values into account when filtering SNPs in LD. In the next session, we will use PLINK to carry out the clumping + P-value thresholding (C+PT) method for the analysis of the simulated UKB data.
In the subsequent analysis, we will use the simulated data set based
on the UKB data (/data/module5/ukb/) to demonstrate the
C+PT method using software PLINK (https://www.cog-genomics.org/plink/1.9/). Code is
adapted from https://choishingwan.github.io/PRS-Tutorial/plink/.
Select independent SNPs. This step requires individual genotype data for LD computation. Here we use the GWAS sample as LD reference. In practice, the genotype data in GWAS may not be available and you will need to use a reference sample that matches the GWAS sample in ancestry Note that the following code is bash script that needs to be run in Terminal.
LDref="/data/module5/ukb/gwas"
sumstat="/data/module5/ukb/gwas.ma"
plink \
    --bfile $LDref \
    --clump-p1 1 \
    --clump-r2 0.1 \
    --clump-kb 250 \
    --clump $sumstat \
    --clump-snp-field SNP \
    --clump-field p \
    --out gwasclump-p1 1 P-value threshold for a SNP to be included
as an index SNP. 1 is selected such that all SNPs are include for
clumping.clump-r2 0.1 SNPs having \(r^2\) higher than 0.1 with the index SNPs
will be removed.clump-kb 250 SNPs within 250k of the index SNP are
considered for clumping.clump    $sumstat Base data (summary statistic) file
containing the P-value information.clump-snp-field  SNP Specifies that the column SNP
contains the SNP IDs.clump-field  p Specifies that the column P contains the
P-value information.This will generate gwas.clumped, containing the index SNPs after
clumping is performed. We can extract the index SNP ID by performing the
following command ($3 because the third column contains the
SNP ID):
awk 'NR!=1{print $3}' gwas.clumped >  gwas.clumped.snpIt’s always good to check the result file and see if everything works fine. You can look at the clumping result file by
less -S gwas.clumpedplink provides a convenient function
--score and --q-score-range for calculating
polygenic scores given a range of P-value thresholds.
We will need three files:
$1 because SNP ID is located in the first column;
$7 because the P-value is located in the seventh
column).sumstat="/data/module5/ukb/gwas.ma"
awk '{print $1,$7}' $sumstat > SNP.pvalueecho "1e-8 0 1e-8" > range_list 
echo "1e-6 0 1e-6" >> range_list
echo "1e-4 0 1e-4" >> range_list
echo "0.01 0 0.01" >> range_list
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.5 0 0.5" >> range_list
echo "1 0 1" >> range_listWe can then calculate the PRS with the following PLINK command in the testing population
test="/data/module5/ukb/test"
sumstat="/data/module5/ukb/gwas.ma"
plink \
    --bfile $test \
    --score $sumstat 1 2 5 sum header \
    --q-score-range range_list SNP.pvalue \
    --extract gwas.clumped.snp \
    --out testscore $sumstat 1 2 5 sum header Read from the file,
assuming that the 1st column is the SNP ID;
2nd column is the effective allele information; the
5th column is the effect size estimate; the file contains a
header and we want to calculate the score as the weighted
sum without dividing by the total number of SNPs.q-score-range range_list SNP.pvalue We want to
calculate PRS based on the thresholds defined in
range_list, where the threshold values (P-values) were
stored in SNP.pvalue.The P-value threshold that provides the “best-fit” PRS under the C+PT method is usually unknown. To approximate the “best-fit” PRS, we can perform a regression between PRS calculated at a range of P-value thresholds and then select the PRS that explains the highest phenotypic variance. This can be achieved using R as follows:
dataDir = "/data/module5/ukb/"
phenFile = paste0(dataDir, "test.phen")
covFile = paste0(dataDir, "test.cov")
p.threshold = c("1e-8","1e-6","1e-4","0.01","0.05","0.1","0.5","1")
# Read in the phenotype file 
phenotype = read.table(phenFile, header=F)
colnames(phenotype) = c("FID", "IID", "Trait") 
# Read in the covariates
covariates = read.table(covFile, header=F)
colnames(covariates) = c("FID", "IID", "Sex", "Age", paste0("PC",seq(10)))
# Now merge the files
pheno = merge(phenotype, covariates, by=c("FID", "IID"))
# Calculate the null model (model with PRS) using a linear regression 
null.model = lm(Trait~., data=pheno)
# And the R2 of the null model is 
null.r2 = summary(null.model)$r.squared
prs.result = NULL
for(i in p.threshold){
    # Go through each p-value threshold
    prs = read.table(paste0("test.", i, ".profile"), header=T)
    # Merge the prs with the phenotype matrix
    # We only want the FID, IID and PRS from the PRS file, therefore we only select the 
    # relevant columns
    pheno.prs = merge(pheno, prs[,c("FID","IID", "SCORESUM")], by=c("FID", "IID"))
    # Now perform a linear regression on trait with PRS and the covariates
    # ignoring the FID and IID from our model
    model = lm(Trait~., data=pheno.prs)
    # model R2 is obtained as 
    model.r2 = summary(model)$r.squared
    # R2 of PRS is simply calculated as the model R2 minus the null R2
    prs.r2 = model.r2-null.r2
    # We can also obtain the coeffcient and p-value of association of PRS as follow
    prs.coef = summary(model)$coeff["SCORESUM",]
    prs.beta = as.numeric(prs.coef[1])
    prs.se = as.numeric(prs.coef[2])
    prs.p = as.numeric(prs.coef[4])
    # We can then store the results
    prs.result = rbind(prs.result, data.frame(Threshold=i, 
                                               R2=prs.r2, 
                                               P=prs.p, 
                                               BETA=prs.beta,
                                               SE=prs.se))
}
# Best result is:
prs.result[which.max(prs.result$R2),]
PT = prs.result[which.max(prs.result$R2),1]
PT = as.character(PT)
write.table(t(c(PT, 0, PT)), "selected_range", quote=F, row.names=F, col.names=F)Viz the result using R.
# ggplot2 is a handy package for plotting
library(ggplot2)
# generate a pretty format for p-value output
prs.result$print.p <- round(prs.result$P, digits = 3)
prs.result$print.p[!is.na(prs.result$print.p) &
                    prs.result$print.p == 0] <-
    format(prs.result$P[!is.na(prs.result$print.p) &
                            prs.result$print.p == 0], digits = 2)
prs.result$print.p <- sub("e", "*x*10^", prs.result$print.p)
# Initialize ggplot, requiring the threshold as the x-axis 
# (use factor so that it is uniformly distributed)
p = ggplot(data = prs.result, aes(x = factor(Threshold, levels = p.threshold), y = R2)) +
    # Specify that we want to print p-value on top of the bars
    geom_text(
        aes(label = paste(print.p)),
        vjust = -1.5,
        hjust = 0,
        angle = 45,
        cex = 4,
        parse = T
    )  +
    # Specify the range of the plot, *1.25 to provide enough space for the p-values
    scale_y_continuous(limits = c(0, max(prs.result$R2) * 1.25)) +
    # Specify the axis labels
    xlab(expression(italic(P) - value ~ threshold ~ (italic(P)[T]))) +
    ylab(expression(paste("PRS model fit:  ", R ^ 2))) +
    # Draw a bar plot
    geom_bar(aes(fill = -log10(P)), stat = "identity") +
    # Specify the colors
    scale_fill_gradient2(
        low = "dodgerblue",
        high = "firebrick",
        mid = "dodgerblue",
        midpoint = 1e-4,
        name = bquote(atop(-log[10] ~ model, italic(P) - value),)
    ) +
    # Some beautification of the plot
    theme_classic() + theme(
        axis.title = element_text(face = "bold", size = 18),
        axis.text = element_text(size = 14),
        legend.title = element_text(face = "bold", size =
                                        18),
        legend.text = element_text(size = 14),
        axis.text.x = element_text(angle = 45, hjust =
                                    1)
    )
p
# save the plot
ggsave("Pred_R2_barplot.png", p, height = 7, width = 7)Questions
Which P-value threshold generates the “best-fit” PRS?
How much phenotypic variation does the “best-fit” PRS explain?
Take a moment to think about it before continuing.
Generate PRS for the target population based on the selected selected P-value threshold.
target="/data/module5/ukb/target"
sumstat="/data/module5/ukb/gwas.ma"
plink \
    --bfile $target \
    --score $sumstat 1 2 5 sum header \
    --q-score-range selected_range SNP.pvalue \
    --extract gwas.clumped.snp \
    --out targetEvaluate the prediction accuracy in the target individuals using provided R script.
phenFile="/data/module5/ukb/target.phen" 
covFile="/data/module5/ukb/target.cov"
prsFile="target.1e-6.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFileQuestion: What’s the prediction accuracy? Is this value expected?