We will learn how to compute polygenic scores (PGS) (or polygenic risk scores (PRS) for diseases) using a commonly used simple method, i.e., clumping + P-value thresholding (C+PT) based on GWAS results.
The scripts used for this practical include
ls /data/module5/scr/get_pred_r2.R
We will use two sets of data.
This is a toy example data set consists of a training population (aka GWAS population, discovery population or reference population) of 325 individuals each has been genotyped for 10 SNPs. The trait was simulated in a way that the first SNP has an effect size of 2 and the 5th SNP has an effect size of 1 on the phenotype. The trait heritability is 0.5. There are a set of 31 individuals who are the target of PRS prediction (aka validation population or target population).
Path to Data set 1:
ls /data/module5/toy/
This is a larger data set based on real genotypes and simulated phenotypes. We simulated a trait with 100 causal variants with heritability of 0.5. The entire data set was partitioned into three independent populations:
Path to Data set 2:
ls /data/module5/sim/
The GWAS results were stored in a file named
gwas.ma
:
sumstat="/data/module5/sim/gwas.ma"
head $sumstat
For your reference, this is the code used to generate the GWAS result. You do not need to run it in the practical.
## perform GWAS
sim="/data/module5/sim/"
plink \
--bfile ${sim}gwas \
--pheno ${sim}gwas.phen \
--covar ${sim}gwas.cov \
--linear hide-covar \
--out gwas
## generate .ma file
Rscript /data/module5/scr/generate_ma.R ${sim}gwas.bim ${sim}gwas.frq ${sim}gwas.assoc.linear gwas.ma
Let us start with playing around with the toy example data set. Your task is to predict PRS for the 31 validation individuals. First, you can use the following R code to read data into R:
<- 10 #number of markers
nmarkers
# data for training population
<- matrix(scan("/data/module5/toy/xmat_trn.inp"), ncol=nmarkers, byrow = TRUE)
X <- matrix(scan("/data/module5/toy/yvec_trn.inp"), byrow = TRUE)
y
# data for validation population
<- matrix(scan("/data/module5/toy/xmat_val.inp"), ncol=nmarkers, byrow = TRUE)
Xval <- matrix(scan("/data/module5/toy/yvec_val.inp"), byrow = TRUE) yval
Then, we can do GWAS in the training (or discovery) population.
= apply(X, 2, function(x){summary(lm(y~x))$coefficients[2,]})
fit = fit[1,] bhat
These are the SNP effect estimates (i.e., SNP weights) we will use in the calculation of PRS.
Now can you write a R code to predict the PRS of the 31 validation individuals?
Suppose one wrote the following R code using all SNPs with their marginal effect estimates from GWAS in the prediction model.
= Xval %*% bhat prs
Let’s check the prediction accuracy (measured by squared correlation of phenotypes and PRS) and the biasness of PRS (measured by the slope of regressing the phenotypes on PRS).
print(cor(yval, prs)^2)
lm(yval ~ prs)
plot(prs, yval)
abline(lm(yval~prs), col="red")
What’s the prediction accuracy? What does the slope tell us? Ideally, we want the slope to be equal to one, which would indicate that an unit change in PRS leads to an unit change in phenotype, so that our PRS is unbiased. If this is not the case, what could be the cause of that?
Let’s check the correlations of genotypes between SNPs (linkage disequilibrium or LD).
= cor(X)
R print(R)
It should be obvious that some SNPs are in high LD correlation with others. This may result in double counting of the SNP effects because the SNP effects estimated from GWAS are not conditional on other SNPs (known as marginal SNP effects).
Below is a simple R code to remove one of the SNPs in pair-wise LD greater than a threshold.
= 0.2
threshold = c()
removed for (i in 1:(nmarkers-1)) {
for (j in (i+1):nmarkers) {
if(abs(R[i,j]) > threshold) removed = c(removed, j)
}
}= unique(removed)
removed = 1:nmarkers
all = all[!all %in% removed]
keep R[keep, keep]
Now, let’s only use the approximately independent SNPs to calculate PRS and check the result.
= Xval[,keep] %*% bhat[keep]
prs2
print(cor(yval, prs2)^2)
lm(yval ~ prs2)
plot(prs2, yval)
abline(lm(yval~prs), col="red")
What can we conclude from the prediction accuracy and the slope?
In the subsequent analysis, we will use the simulated data set (/data/module5/sim/) to demonstrate the C+PT method using software PLINK (https://www.cog-genomics.org/plink/1.9/). Code modified from https://choishingwan.github.io/PRS-Tutorial/plink/
Select independent SNPs. This step requires individual genotype data for LD computation. Here we use the GWAS sample as LD reference. In practice, the genotype data in GWAS may not be available and you will need to use a reference sample that matches the GWAS sample in genetics. Note that the following code is bash script that needs to be run in Terminal.
LDref="/data/module5/sim/gwas"
sumstat="/data/module5/sim/gwas.ma"
plink \
--bfile $LDref \
--clump-p1 1 \
--clump-r2 0.1 \
--clump-kb 250 \
--clump $sumstat \
--clump-snp-field SNP \
--clump-field p \
--out gwas
This will generate gwas.clumped, containing the index SNPs after
clumping is performed. We can extract the index SNP ID by performing the
following command ($3
because the third column contains the
SNP ID):
awk 'NR!=1{print $3}' gwas.clumped > gwas.clumped.snp
plink
provides a convenient function
--score
and --q-score-range
for calculating
polygenic scores given a range of P-value thresholds.
We will need three files:
$1
because SNP ID is located in the first column;
$7
because the P-value is located in the seventh
column).sumstat="/data/module5/sim/gwas.ma"
awk '{print $1,$7}' $sumstat > SNP.pvalue
echo "1e-8 0 1e-8" > range_list
echo "1e-6 0 1e-6" >> range_list
echo "1e-4 0 1e-4" >> range_list
echo "0.01 0 0.01" >> range_list
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.5 0 0.5" >> range_list
echo "1 0 1" >> range_list
We can then calculate the PRS with the following plink
command in the testing population
test="/data/module5/sim/test"
sumstat="/data/module5/sim/gwas.ma"
plink \
--bfile $test \
--score $sumstat 1 2 5 sum header \
--q-score-range range_list SNP.pvalue \
--extract gwas.clumped.snp \
--out test
1st
column is the SNP ID;
2nd
column is the effective allele information; the
5th
column is the effect size estimate; the file contains a
header
and we want to calculate the score as the weighted
sum without dividing by the total number of SNPs.range_list
, where the threshold values (P-values) were
stored in SNP.pvalue
.The P-value threshold that provides the “best-fit” PRS under the C+PT method is usually unknown. To approximate the “best-fit” PRS, we can perform a regression between PRS calculated at a range of P-value thresholds and then select the PRS that explains the highest phenotypic variance. This can be achieved using R as follows:
= "/data/module5/sim/"
dataDir = paste0(dataDir, "test.phen")
phenFile = paste0(dataDir, "test.cov")
covFile
= c("1e-8","1e-6","1e-4","0.01","0.05","0.1","0.5","1")
p.threshold # Read in the phenotype file
= read.table(phenFile, header=F)
phenotype colnames(phenotype) = c("FID", "IID", "Trait")
# Read in the covariates
= read.table(covFile, header=F)
covariates colnames(covariates) = c("FID", "IID", "Sex", "Age", paste0("PC",seq(10)))
# Now merge the files
= merge(phenotype, covariates, by=c("FID", "IID"))
pheno # Calculate the null model (model with PRS) using a linear regression
= lm(Trait~., data=pheno)
null.model # And the R2 of the null model is
= summary(null.model)$r.squared
null.r2 = NULL
prs.result for(i in p.threshold){
# Go through each p-value threshold
= read.table(paste0("test.", i, ".profile"), header=T)
prs # Merge the prs with the phenotype matrix
# We only want the FID, IID and PRS from the PRS file, therefore we only select the
# relevant columns
= merge(pheno, prs[,c("FID","IID", "SCORESUM")], by=c("FID", "IID"))
pheno.prs # Now perform a linear regression on trait with PRS and the covariates
# ignoring the FID and IID from our model
= lm(Trait~., data=pheno.prs)
model # model R2 is obtained as
= summary(model)$r.squared
model.r2 # R2 of PRS is simply calculated as the model R2 minus the null R2
= model.r2-null.r2
prs.r2 # We can also obtain the coeffcient and p-value of association of PRS as follow
= summary(model)$coeff["SCORESUM",]
prs.coef = as.numeric(prs.coef[1])
prs.beta = as.numeric(prs.coef[2])
prs.se = as.numeric(prs.coef[4])
prs.p # We can then store the results
= rbind(prs.result, data.frame(Threshold=i,
prs.result R2=prs.r2,
P=prs.p,
BETA=prs.beta,
SE=prs.se))
}# Best result is:
which.max(prs.result$R2),]
prs.result[= prs.result[which.max(prs.result$R2),1]
PT = as.character(PT)
PT write.table(t(c(PT, 0, PT)), "selected_range", quote=F, row.names=F, col.names=F)
This is a R script.
# ggplot2 is a handy package for plotting
library(ggplot2)
# generate a pretty format for p-value output
$print.p <- round(prs.result$P, digits = 3)
prs.result$print.p[!is.na(prs.result$print.p) &
prs.result$print.p == 0] <-
prs.resultformat(prs.result$P[!is.na(prs.result$print.p) &
$print.p == 0], digits = 2)
prs.result$print.p <- sub("e", "*x*10^", prs.result$print.p)
prs.result# Initialize ggplot, requiring the threshold as the x-axis
# (use factor so that it is uniformly distributed)
= ggplot(data = prs.result, aes(x = factor(Threshold, levels = p.threshold), y = R2)) +
p # Specify that we want to print p-value on top of the bars
geom_text(
aes(label = paste(print.p)),
vjust = -1.5,
hjust = 0,
angle = 45,
cex = 4,
parse = T
+
) # Specify the range of the plot, *1.25 to provide enough space for the p-values
scale_y_continuous(limits = c(0, max(prs.result$R2) * 1.25)) +
# Specify the axis labels
xlab(expression(italic(P) - value ~ threshold ~ (italic(P)[T]))) +
ylab(expression(paste("PRS model fit: ", R ^ 2))) +
# Draw a bar plot
geom_bar(aes(fill = -log10(P)), stat = "identity") +
# Specify the colors
scale_fill_gradient2(
low = "dodgerblue",
high = "firebrick",
mid = "dodgerblue",
midpoint = 1e-4,
name = bquote(atop(-log[10] ~ model, italic(P) - value),)
+
) # Some beautification of the plot
theme_classic() + theme(
axis.title = element_text(face = "bold", size = 18),
axis.text = element_text(size = 14),
legend.title = element_text(face = "bold", size =
18),
legend.text = element_text(size = 14),
axis.text.x = element_text(angle = 45, hjust =
1)
)
p# save the plot
ggsave("Pred_R2_barplot.png", p, height = 7, width = 7)
Questions
Which P-value threshold generates the “best-fit” PRS?
How much phenotypic variation does the “best-fit” PRS explain?
Generate PRS for the target population based on the selected selected P-value threshold.
target="/data/module5/sim/target"
sumstat="/data/module5/sim/gwas.ma"
plink \
--bfile $target \
--score $sumstat 1 2 5 sum header \
--q-score-range selected_range SNP.pvalue \
--extract gwas.clumped.snp \
--out target
Evaluate the prediction accuracy in the target individuals using provided R script.
phenFile="/data/module5/sim/target.phen"
covFile="/data/module5/sim/target.cov"
prsFile="target.1e-6.profile"
Rscript /data/module5/scr/get_pred_r2.R $phenFile $covFile $prsFile
Is the prediction accuracy expected?
A work by Jian Zeng