The gsmr R-package implements the GSMR (Generalised Summary-data-based Mendelian Randomisation) method to test for a causal association between a risk factor and a disease (Ref 1). The R package is developed by Zhihong Zhu, Zhili Zheng, Futao Zhang and Jian Yang at the Institute for Molecular Bioscience, the University of Queensland. Bug reports or questions: z.zhu1@uq.edu.au or jian.yang@uq.edu.au.
Ref 1: Zhu, Z. et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. bioRxiv, 168674.
The gsmr package requires R >= 2.15. You can install it in R with:
# gsmr requires the R-package survey
install.packages("survey")
# install gsmr
install.packages("http://cnsgenomics.com/software/gsmr/static/gsmr_1.0.3.tar.gz",repos=NULL,type="source")
The gsmr source code is available in gsmr_1.0.3.tar.gz.
This online document is also integrated into the gsmr R-package; the documentation for each function can be viewed with the standard “?function_name” command in R.
The GSMR analysis only requires summary-level data from genome-wide association studies (GWAS). Here is an example, where the risk factor (x) is LDL cholesterol (LDL-c) and the disease (y) is coronary artery disease (CAD). GWAS summary data for both LDL-c and CAD are available in the public domain (Global Lipids Genetics Consortium et al. 2013, Nature Genetics; Nikpay, M. et al. 2015, Nature Genetics).
library("gsmr")
data("gsmr")
head(gsmr_data)
## SNP a1 a2 freq bzx bzx_se bzx_pval bzx_n bzy
## 1 rs2419604 A G 0.2830715 0.0302 0.0040 7.490e-14 172807.0 0.010183
## 2 rs676385 A G 0.3116318 -0.0354 0.0043 1.169e-15 171609.0 -0.022094
## 3 rs648673 C G 0.1315721 -0.0503 0.0057 1.155e-18 163522.0 -0.026150
## 4 rs17035630 A G 0.1352071 0.0505 0.0061 1.438e-16 167679.5 0.031693
## 5 rs646776 C T 0.2241335 -0.1602 0.0044 1.630e-272 173021.0 -0.101049
## 6 rs10410 A G 0.1075555 0.0410 0.0061 6.197e-11 168300.0 0.037495
## bzy_se bzy_pval bzy_n
## 1 0.0103044 3.230472e-01 184305
## 2 0.0101795 2.997370e-02 184305
## 3 0.0141759 6.508460e-02 184305
## 4 0.0131027 1.557110e-02 184305
## 5 0.0114222 9.010000e-19 184305
## 6 0.0173902 3.107610e-02 184305
dim(gsmr_data)
## [1] 151 12
The summary data contain 151 genetic instruments (i.e. SNPs).
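To run the same analysis on other traits, the summary data would need the same layout. The following is a minimal sketch of a sanity check, assuming the column names shown in the output above. The helper `check_gsmr_input` is hypothetical and not part of the gsmr package; the three rows are copied by hand from the example output.

```r
# Hypothetical helper (not part of gsmr): basic sanity checks on a
# summary-data frame laid out like gsmr_data.
check_gsmr_input <- function(dat) {
  required <- c("SNP", "a1", "a2", "freq",
                "bzx", "bzx_se", "bzx_pval", "bzx_n",
                "bzy", "bzy_se", "bzy_pval", "bzy_n")
  missing_cols <- setdiff(required, colnames(dat))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  stopifnot(!anyDuplicated(dat$SNP),            # one row per SNP
            all(dat$freq > 0 & dat$freq < 1),   # valid allele frequencies
            all(dat$bzx_se > 0), all(dat$bzy_se > 0))
  invisible(TRUE)
}

# First three rows of the example data, typed in by hand
toy <- data.frame(
  SNP = c("rs2419604", "rs676385", "rs648673"),
  a1 = c("A", "A", "C"), a2 = c("G", "G", "G"),
  freq = c(0.2830715, 0.3116318, 0.1315721),
  bzx = c(0.0302, -0.0354, -0.0503),
  bzx_se = c(0.0040, 0.0043, 0.0057),
  bzx_pval = c(7.490e-14, 1.169e-15, 1.155e-18),
  bzx_n = c(172807, 171609, 163522),
  bzy = c(0.010183, -0.022094, -0.026150),
  bzy_se = c(0.0103044, 0.0101795, 0.0141759),
  bzy_pval = c(3.230472e-01, 2.997370e-02, 6.508460e-02),
  bzy_n = c(184305, 184305, 184305),
  stringsAsFactors = FALSE
)
check_gsmr_input(toy)
```

The checks are deliberately loose: they catch a missing column or a duplicated SNP, but they do not verify that the effect sizes refer to the coded allele a1.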
# Save the genetic variants and coded alleles in R
write.table(gsmr_data[,c(1,2)], "gsmr_example_snps.allele", col.names=F, row.names=F, quote=F)
# Extract the genotype data from a PLINK file using GCTA (command line); --recode writes the genotypes to gsmr_example.xmat.gz
gcta64 --bfile gsmr_example --extract gsmr_example_snps.allele --update-ref-allele gsmr_example_snps.allele --recode --out gsmr_example
Note: the two steps above guarantee that the LD correlations are calculated based on the coded alleles (sometimes called effect alleles) for the SNP effects.
# Estimate LD correlation matrix in R
snp_coeff_id = scan("gsmr_example.xmat.gz", what="", nlines=1)
snp_coeff = read.table("gsmr_example.xmat.gz", header=F, skip=2)
snp_order = match(gsmr_data[,1], snp_coeff_id)
snp_coeff_id = snp_coeff_id[snp_order]
snp_coeff = snp_coeff[, snp_order]
ldrho = cor(snp_coeff)
colnames(ldrho) = rownames(ldrho) = snp_coeff_id
# Check the size of the correlation matrix and double-check if the order of the SNPs in the LD correlation matrix is consistent with that in the GWAS summary data.
dim(ldrho)
## [1] 151 151
# show the first 5 rows and columns of the matrix
ldrho[1:5,1:5]
## rs2419604 rs676385 rs648673 rs17035630 rs646776
## rs2419604 1.000000000 0.01225467 0.003622746 -0.003508759 0.008039383
## rs676385 0.012254667 1.00000000 -0.086363592 0.032564923 0.167010220
## rs648673 0.003622746 -0.08636359 1.000000000 -0.033264311 0.204437659
## rs17035630 -0.003508759 0.03256492 -0.033264311 1.000000000 -0.195795791
## rs646776 0.008039383 0.16701022 0.204437659 -0.195795791 1.000000000
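The order check mentioned above can be made explicit. The following is a minimal sketch on simulated genotypes: the SNP names and the 0/1/2 genotype matrix are made up for illustration, and no gsmr functions are used. It reorders the genotype columns to follow the summary data and confirms that the names of the correlation matrix line up.

```r
set.seed(1)
# Simulated genotype matrix (0/1/2 allele counts), 200 individuals x 4 SNPs
geno_ids <- c("snpC", "snpA", "snpD", "snpB")     # order in the genotype file
geno <- matrix(rbinom(200 * 4, size = 2, prob = 0.3), ncol = 4,
               dimnames = list(NULL, geno_ids))
summary_ids <- c("snpA", "snpB", "snpC", "snpD")  # order in the summary data

# Reorder the genotype columns to follow the summary data
ord <- match(summary_ids, colnames(geno))
if (anyNA(ord)) stop("some SNPs are missing from the genotype data")
geno <- geno[, ord]

toy_ldrho <- cor(geno)
# The matrix should be square, symmetric and in summary-data order
stopifnot(identical(colnames(toy_ldrho), summary_ids),
          identical(rownames(toy_ldrho), summary_ids),
          isSymmetric(toy_ldrho))
```

With the real data, the analogous check would compare `colnames(ldrho)` with `as.character(gsmr_data$SNP)`.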
Note: all the analyses implemented in this R-package only require the summary data (e.g. “gsmr_data”) and the LD correlation matrix (e.g. “ldrho”) listed above.
If the risk factor was not standardised in GWAS, we need to re-scale the effect sizes using the method below. This process requires allele frequencies, z-statistics and sample size.
snpfreq = gsmr_data$freq # minor allele frequency of SNPs
bzx = gsmr_data$bzx # effects of instruments on risk factor
bzx_se = gsmr_data$bzx_se # standard errors of bzx
bzx_n = gsmr_data$bzx_n # sample size for GWAS of the risk factor
std_zx = std_effect(snpfreq, bzx, bzx_se, bzx_n) # standardise the effect sizes
gsmr_data$std_bzx = std_zx$b # standardized bzx
gsmr_data$std_bzx_se = std_zx$se # standardized bzx_se
head(gsmr_data)
## SNP a1 a2 freq bzx bzx_se bzx_pval bzx_n bzy
## 1 rs2419604 A G 0.2830715 0.0302 0.0040 7.490e-14 172807.0 0.010183
## 2 rs676385 A G 0.3116318 -0.0354 0.0043 1.169e-15 171609.0 -0.022094
## 3 rs648673 C G 0.1315721 -0.0503 0.0057 1.155e-18 163522.0 -0.026150
## 4 rs17035630 A G 0.1352071 0.0505 0.0061 1.438e-16 167679.5 0.031693
## 5 rs646776 C T 0.2241335 -0.1602 0.0044 1.630e-272 173021.0 -0.101049
## 6 rs10410 A G 0.1075555 0.0410 0.0061 6.197e-11 168300.0 0.037495
## bzy_se bzy_pval bzy_n std_bzx std_bzx_se
## 1 0.0103044 3.230472e-01 184305 0.02850320 0.003775258
## 2 0.0101795 2.997370e-02 184305 -0.03033421 0.003684664
## 3 0.0141759 6.508460e-02 184305 -0.04563918 0.005171835
## 4 0.0131027 1.557110e-02 184305 0.04179863 0.005048943
## 5 0.0114222 9.010000e-19 184305 -0.14785679 0.004060986
## 6 0.0173902 3.107610e-02 184305 0.03738796 0.005562599
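For readers who want to see how allele frequency, z-statistic and sample size combine, the sketch below uses a common z-statistic-based rescaling. This is an assumption about what `std_effect` computes, not code taken from the gsmr source, and `std_effect_sketch` is a hypothetical name. Under this assumption the result for the first SNP agrees with the std_bzx and std_bzx_se values shown above.

```r
# Assumed rescaling (a common z-statistic-based formula, not taken from the
# gsmr source):
#   b  = z / sqrt(2 * p * (1 - p) * (n + z^2))
#   se = 1 / sqrt(2 * p * (1 - p) * (n + z^2))
std_effect_sketch <- function(freq, b, se, n) {
  z <- b / se                                       # z-statistic of each SNP
  denom <- sqrt(2 * freq * (1 - freq) * (n + z^2))
  list(b = z / denom, se = 1 / denom)
}

# First SNP of the example data (rs2419604)
res <- std_effect_sketch(freq = 0.2830715, b = 0.0302, se = 0.0040, n = 172807)
print(res)
```

Because the term 2p(1-p) is symmetric in p, the result does not depend on whether the frequency refers to the coded allele or to the other allele.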
The estimate of the causal effect of a risk factor on a disease can be biased by pleiotropy (see Ref 1 for details). The HEIDI-outlier analysis detects and removes instruments that show significant pleiotropic effects on both the risk factor and the disease. It requires bzx (effect of genetic instrument on risk factor), bzx_se (standard error of bzx), bzy (effect of genetic instrument on disease), bzy_se (standard error of bzy) and ldrho (LD matrix of instruments). Note that the LD matrix can be estimated from a reference sample with individual-level genotype data.
Here is an example to perform a HEIDI-outlier analysis.
bzx = gsmr_data$std_bzx # SNP effects on risk factor
bzx_se = gsmr_data$std_bzx_se # standard errors of bzx
bzx_pval = gsmr_data$bzx_pval # p-values for bzx
bzy = gsmr_data$bzy # SNP effects on disease
bzy_se = gsmr_data$bzy_se # standard errors of bzy
gwas_thresh = 5e-8 # GWAS threshold to select SNPs as the instruments for the GSMR analysis
heidi_thresh = 0.01 # HEIDI-outlier threshold
filtered_index = heidi_outlier(bzx, bzx_se, bzx_pval, bzy, bzy_se, ldrho, snp_coeff_id, gwas_thresh, heidi_thresh) # perform HEIDI-outlier analysis
filtered_gsmr_data = gsmr_data[filtered_index,] # select data that passed HEIDI-outlier filtering
filtered_snp_id = snp_coeff_id[filtered_index] # select SNPs that passed HEIDI-outlier filtering
dim(gsmr_data)
## [1] 151 14
dim(filtered_gsmr_data)
## [1] 138 14
HEIDI-outlier filters out 13 instruments; the remaining 138 instruments are retained for further analysis.
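To see which instruments were removed, the returned index vector can be compared with the full set of rows. The following is a minimal sketch, assuming (as in the example above) that `heidi_outlier` returns the row indices of the retained SNPs. The SNP names and the `filtered_index` values below are made up for illustration.

```r
# Toy stand-ins for gsmr_data$SNP and the output of heidi_outlier()
all_snps <- c("snp1", "snp2", "snp3", "snp4", "snp5")
filtered_index <- c(1, 2, 4, 5)   # hypothetical: rows kept after filtering

# Rows that are not in the retained set were filtered out
removed_index <- setdiff(seq_along(all_snps), filtered_index)
removed_snps <- all_snps[removed_index]
cat("Removed", length(removed_snps), "instrument(s):", removed_snps, "\n")
```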
The GSMR analysis is the main analysis of this R-package; it uses multiple genetic instruments to test for a causal effect of the risk factor on the disease.
bzx = filtered_gsmr_data$std_bzx # SNP effects on risk factor
bzx_se = filtered_gsmr_data$std_bzx_se # standard errors of bzx
bzx_pval = filtered_gsmr_data$bzx_pval # p-values for bzx
bzy = filtered_gsmr_data$bzy # SNP effects on disease
bzy_se = filtered_gsmr_data$bzy_se # standard errors of bzy
filtered_ldrho = ldrho[filtered_gsmr_data$SNP,filtered_gsmr_data$SNP] # LD correlation matrix of SNPs
gsmr_results = gsmr(bzx, bzx_se, bzx_pval, bzy, bzy_se, filtered_ldrho, filtered_snp_id) # GSMR analysis
cat("Effect of exposure on outcome: ",gsmr_results$bxy)
## Effect of exposure on outcome: 0.4080517
cat("Standard error of bxy: ",gsmr_results$bxy_se)
## Standard error of bxy: 0.02249235
cat("P-value of bxy: ", gsmr_results$bxy_pval)
## P-value of bxy: 1.490807e-73
cat("Used index to GSMR analysis: ", gsmr_results$used_index[1:5], "...")
## Used index to GSMR analysis: 1 2 3 4 5 ...
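For intuition about what bxy summarises: each instrument gives a ratio estimate bzy/bzx, and these can be pooled across instruments. The sketch below is a simplified inverse-variance-weighted average that assumes independent SNPs and ignores the sampling error in bzx. It is not the GSMR estimator, which takes the LD matrix as an additional input (see Ref 1 for the method), so its result is not expected to match the output above. The inputs are the six SNPs shown in the standardised output, and `ivw_sketch` is a hypothetical name.

```r
# Simplified inverse-variance-weighted pooling of per-SNP ratio estimates.
# Assumptions (NOT the GSMR method): SNPs are independent, bzx is error-free.
ivw_sketch <- function(bzx, bzy, bzy_se) {
  ratio <- bzy / bzx               # per-SNP estimate of bxy
  ratio_se <- bzy_se / abs(bzx)    # first-order standard error of the ratio
  w <- 1 / ratio_se^2              # inverse-variance weights
  bxy <- sum(w * ratio) / sum(w)
  bxy_se <- sqrt(1 / sum(w))
  list(bxy = bxy, bxy_se = bxy_se,
       bxy_pval = 2 * pnorm(-abs(bxy / bxy_se)),
       ratio = ratio)
}

# Six SNPs from the example output (standardised bzx)
bzx_toy <- c(0.02850320, -0.03033421, -0.04563918,
             0.04179863, -0.14785679, 0.03738796)
bzy_toy <- c(0.010183, -0.022094, -0.026150,
             0.031693, -0.101049, 0.037495)
bzy_se_toy <- c(0.0103044, 0.0101795, 0.0141759,
                0.0131027, 0.0114222, 0.0173902)
fit <- ivw_sketch(bzx_toy, bzy_toy, bzy_se_toy)
print(fit[c("bxy", "bxy_se", "bxy_pval")])
```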
effect_col = colors()[75]
vals = c(bzx-bzx_se, bzx+bzx_se)
xmin = min(vals); xmax = max(vals)
vals = c(bzy-bzy_se, bzy+bzy_se)
ymin = min(vals); ymax = max(vals)
par(mar=c(5,5,4,2))
plot(bzx, bzy, pch=20, cex=0.8, bty="n", cex.axis=1.1, cex.lab=1.2,
     col=effect_col, xlim=c(xmin, xmax), ylim=c(ymin, ymax),
     xlab=expression(LDL~cholesterol~(italic(b[zx]))),
     ylab=expression(Coronary~artery~disease~(italic(b[zy]))))
abline(0, gsmr_results$bxy, lwd=1.5, lty=2, col="dim grey")
nsnps = length(bzx)
for( i in 1:nsnps ) {
    # x axis
    xstart = bzx[i] - bzx_se[i]; xend = bzx[i] + bzx_se[i]
    ystart = bzy[i]; yend = bzy[i]
    segments(xstart, ystart, xend, yend, lwd=1.5, col=effect_col)
    # y axis
    xstart = bzx[i]; xend = bzx[i]
    ystart = bzy[i] - bzy_se[i]; yend = bzy[i] + bzy_se[i]
    segments(xstart, ystart, xend, yend, lwd=1.5, col=effect_col)
}
Note: The dashed line is not a fitted regression line but a line with slope of bxy and intercept of 0.
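Because `segments()` is vectorised, the error bars can also be drawn without a loop. The following is a minimal sketch on three SNPs taken from the example output; the slope passed to `abline()` is the bxy estimate reported above, and the helper name `plot_effect_sizes` is hypothetical.

```r
# Hypothetical helper: scatter plot of SNP effects with +/- 1 SE bars
plot_effect_sizes <- function(bzx, bzx_se, bzy, bzy_se, bxy, col = "darkgrey") {
  plot(bzx, bzy, pch = 20, bty = "n", col = col,
       xlim = range(bzx - bzx_se, bzx + bzx_se),
       ylim = range(bzy - bzy_se, bzy + bzy_se),
       xlab = "bzx", ylab = "bzy")
  # Horizontal bars: +/- one standard error of bzx
  segments(bzx - bzx_se, bzy, bzx + bzx_se, bzy, col = col)
  # Vertical bars: +/- one standard error of bzy
  segments(bzx, bzy - bzy_se, bzx, bzy + bzy_se, col = col)
  # Line through the origin with slope bxy (not a fitted regression line)
  abline(0, bxy, lty = 2)
  invisible(NULL)
}

plot_effect_sizes(
  bzx = c(0.02850320, -0.03033421, -0.14785679),
  bzx_se = c(0.003775258, 0.003684664, 0.004060986),
  bzy = c(0.010183, -0.022094, -0.101049),
  bzy_se = c(0.0103044, 0.0101795, 0.0114222),
  bxy = 0.4080517
)
```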