We will learn how to compute polygenic scores (PGS) (or polygenic risk scores (PRS) for diseases) using a commonly used simple method, i.e., clumping + P-value thresholding (C+PT) based on GWAS results.

Code modified from https://choishingwan.github.io/PRS-Tutorial/plink/

Data

We will work on a data set based on real genotypes and simulated phenotypes. The entire data set was partitioned into three independent data sets:

The GWAS results were stored in a file named simu.ma:

sumstat="/data/module4/prac1/simu.ma"
head $sumstat
## SNP A1 A2 MAF BETA SE P NMISS
## rs1000002 1 2 0.4867 -0.4164 0.6243 0.5048 9989
## rs1000005 2 1 0.41 0.4399 0.6395 0.4916 9961
## rs10000085 1 2 0.1337 -0.5903 0.9221 0.5221 9991
## rs10000091 2 1 0.3619 -0.4232 0.6463 0.5126 9989
## rs10000150 1 2 0.2235 -0.08279 0.7524 0.9124 9959
## rs10000169 2 1 0.2532 -0.9035 0.7158 0.2069 9986
## rs1000017 1 2 0.4545 -0.5301 0.6344 0.4034 9948
## rs10000189 1 2 0.4169 -0.875 0.6356 0.1687 9893
## rs1000019 2 1 0.1914 0.5932 0.7965 0.4564 9907

Step 1: Clumping

Select independent SNPs. This step requires individual genotype data for LD computation. Here we use the GWAS sample as LD reference. In practice, the genotype data in GWAS may not be available and you will need to use a reference sample that matches the GWAS sample in genetics.

LDref="/data/module4/prac1/gwas"
sumstat="/data/module4/prac1/simu.ma"
plink \
    --bfile $LDref \
    --clump-p1 1 \
    --clump-r2 0.1 \
    --clump-kb 250 \
    --clump $sumstat \
    --clump-snp-field SNP \
    --clump-field P \
    --out simu

This will generate simu.clumped, containing the index SNPs after clumping is performed. We can extract the index SNP ID by performing the following command ($3 because the third column contains the SNP ID):

awk 'NR!=1{print $3}' simu.clumped >  simu.clumped.snp

Step 2: Generate PRS based on a range of P-value thresholds

plink provides a convenient function --score and --q-score-range for calculating polygenic scores given a range of P-value thresholds.

We will need three files:

sumstat="/data/module4/prac1/simu.ma"
awk '{print $1,$7}' $sumstat > SNP.pvalue
echo "1e-8 0 1e-8" > range_list 
echo "1e-6 0 1e-6" >> range_list
echo "1e-4 0 1e-4" >> range_list
echo "0.01 0 0.01" >> range_list
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.5 0 0.5" >> range_list
echo "1 0 1" >> range_list

We can then calculate the PRS with the following plink command in the testing population

test="/data/module4/prac1/test"
sumstat="/data/module4/prac1/simu.ma"
plink \
    --bfile $test \
    --score $sumstat 1 2 5 sum header \
    --q-score-range range_list SNP.pvalue \
    --extract simu.clumped.snp \
    --out test

Step 3: Finding the “best-fit” PRS in the testing population

The P-value threshold that provides the “best-fit” PRS under the C+PT method is usually unknown. To approximate the “best-fit” PRS, we can perform a regression between PRS calculated at a range of P-value thresholds and then select the PRS that explains the highest phenotypic variance. This can be achieved using R as follows:

p.threshold = c("1e-8","1e-6","1e-4","0.01","0.05","0.1","0.5","1")
# Read in the phenotype file 
phenotype = read.table("/data/module4/prac1/simu.phen", header=F)
colnames(phenotype) <- c("FID", "IID", "Trait") 
# Read in the covariates
covariates = read.table("/data/module4/prac1/covariates.cov", header=T)
# Read in the individual list for testing
indlist = read.table("/data/module4/prac1/test.indlist", header=F)
colnames(indlist) <- c("FID", "IID") 
# Now merge the files
pheno <- merge(merge(phenotype, covariates, by=c("FID", "IID")), indlist, by=c("FID","IID"))
# We can then calculate the null model (model with PRS) using a linear regression 
null.model <- lm(Trait~., data=pheno[,!colnames(pheno)%in%c("FID","IID")])
# And the R2 of the null model is 
null.r2 <- summary(null.model)$r.squared
prs.result <- NULL
for(i in p.threshold){
    # Go through each p-value threshold
    prs <- read.table(paste0("test.",i,".profile"), header=T)
    # Merge the prs with the phenotype matrix
    # We only want the FID, IID and PRS from the PRS file, therefore we only select the 
    # relevant columns
    pheno.prs <- merge(pheno, prs[,c("FID","IID", "SCORESUM")], by=c("FID", "IID"))
    # Now perform a linear regression on trait with PRS and the covariates
    # ignoring the FID and IID from our model
    model <- lm(Trait~., data=pheno.prs[,!colnames(pheno.prs)%in%c("FID","IID")])
    # model R2 is obtained as 
    model.r2 <- summary(model)$r.squared
    # R2 of PRS is simply calculated as the model R2 minus the null R2
    prs.r2 <- model.r2-null.r2
    # We can also obtain the coeffcient and p-value of association of PRS as follow
    prs.coef <- summary(model)$coeff["SCORESUM",]
    prs.beta <- as.numeric(prs.coef[1])
    prs.se <- as.numeric(prs.coef[2])
    prs.p <- as.numeric(prs.coef[4])
    # We can then store the results
    prs.result <- rbind(prs.result, data.frame(Threshold=i, 
                                               R2=prs.r2, 
                                               P=prs.p, 
                                               BETA=prs.beta,
                                               SE=prs.se))
}
# Best result is:
prs.result[which.max(prs.result$R2),]
PT = prs.result[which.max(prs.result$R2),1]
write.table(t(c(PT, 0, PT)), "selected_range", quote=F, row.names=F, col.names=F)

Step 4: visualisation

# ggplot2 is a handy package for plotting
library(ggplot2)
# generate a pretty format for p-value output
prs.result$print.p <- round(prs.result$P, digits = 3)
prs.result$print.p[!is.na(prs.result$print.p) &
                    prs.result$print.p == 0] <-
    format(prs.result$P[!is.na(prs.result$print.p) &
                            prs.result$print.p == 0], digits = 2)
prs.result$print.p <- sub("e", "*x*10^", prs.result$print.p)
# Initialize ggplot, requiring the threshold as the x-axis 
# (use factor so that it is uniformly distributed)
p = ggplot(data = prs.result, aes(x = factor(Threshold, levels = p.threshold), y = R2)) +
    # Specify that we want to print p-value on top of the bars
    geom_text(
        aes(label = paste(print.p)),
        vjust = -1.5,
        hjust = 0,
        angle = 45,
        cex = 4,
        parse = T
    )  +
    # Specify the range of the plot, *1.25 to provide enough space for the p-values
    scale_y_continuous(limits = c(0, max(prs.result$R2) * 1.25)) +
    # Specify the axis labels
    xlab(expression(italic(P) - value ~ threshold ~ (italic(P)[T]))) +
    ylab(expression(paste("PRS model fit:  ", R ^ 2))) +
    # Draw a bar plot
    geom_bar(aes(fill = -log10(P)), stat = "identity") +
    # Specify the colors
    scale_fill_gradient2(
        low = "dodgerblue",
        high = "firebrick",
        mid = "dodgerblue",
        midpoint = 1e-4,
        name = bquote(atop(-log[10] ~ model, italic(P) - value),)
    ) +
    # Some beautification of the plot
    theme_classic() + theme(
        axis.title = element_text(face = "bold", size = 18),
        axis.text = element_text(size = 14),
        legend.title = element_text(face = "bold", size =
                                        18),
        legend.text = element_text(size = 14),
        axis.text.x = element_text(angle = 45, hjust =
                                    1)
    )
# save the plot
ggsave("Pred_R2_barplot.png", p, height = 7, width = 7)

Prediction R2

Questions

  1. Which P-value threshold generates the “best-fit” PRS?

  2. How much phenotypic variation does the “best-fit” PRS explain?

Step 5: apply the best model to the target population

Generate PRS for the target population based on the selected selected P-value threshold.

target="/data/module4/prac1/target"
sumstat="/data/module4/prac1/simu.ma"
plink \
    --bfile $target \
    --score $sumstat 1 2 5 sum header \
    --q-score-range selected_range SNP.pvalue \
    --extract simu.clumped.snp \
    --out target

Evaluate the prediction accuracy in the target individuals using R script.

phenFile="/data/module4/prac1/simu.phen" 
covFile="/data/module4/prac1/covariates.cov"
indlistFile="/data/module4/prac1/target.indlist"
prsFile="target.1e-4.profile"
Rscript /data/module4/prac1/get_pred_r2.R $phenFile $covFile $indlistFile $prsFile

Is the prediction accuracy expected?