The gsmr R-package implements the GSMR (Generalised Summary-data-based Mendelian Randomisation) method to test for causal association between a risk factor and disease1. The R package is developed by Zhihong Zhu, Zhili Zheng, Futao Zhang and Jian Yang at Institute for Molecular Bioscience, the University of Queensland. Bug reports or questions: z.zhu1@uq.edu.au or jian.yang@uq.edu.au.
Zhu, Z. et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. BioRxiv, 168674.
The gsmr requires R >= 2.15, you can install it in R by:
# gsmr requires the R-package survey
install.packages("survey")
# install gsmr
install.packages("http://cnsgenomics.com/software/gsmr/static/gsmr_1.0.3.tar.gz",repos=NULL,type="source")
The gsmr source codes are available in gsmr_1.0.3.tar.gz.
This online document has been integrated in the gsmr R-package, we can check that by the standard “?function_name” command in R.
The GSMR analysis only requires summary-level data from genome-wide association studies (GWAS). Here is an example, where the risk factor (x) is LDL cholesterol (LDL-c) and the disease (y) is coronary artery disease (CAD). GWAS summary data for both LDL-c and CAD are available in the public domain (Global Lipids Genetics Consortium et al. 2013, Nature Genetics; Nikpay, M. et al. 2015, Nature Genetics).
library("gsmr")
data("gsmr")
head(gsmr_data)
## SNP a1 a2 freq bzx bzx_se bzx_pval bzx_n bzy
## 1 rs2419604 A G 0.2830715 0.0302 0.0040 7.490e-14 172807.0 0.010183
## 2 rs676385 A G 0.3116318 -0.0354 0.0043 1.169e-15 171609.0 -0.022094
## 3 rs648673 C G 0.1315721 -0.0503 0.0057 1.155e-18 163522.0 -0.026150
## 4 rs17035630 A G 0.1352071 0.0505 0.0061 1.438e-16 167679.5 0.031693
## 5 rs646776 C T 0.2241335 -0.1602 0.0044 1.630e-272 173021.0 -0.101049
## 6 rs10410 A G 0.1075555 0.0410 0.0061 6.197e-11 168300.0 0.037495
## bzy_se bzy_pval bzy_n
## 1 0.0103044 3.230472e-01 184305
## 2 0.0101795 2.997370e-02 184305
## 3 0.0141759 6.508460e-02 184305
## 4 0.0131027 1.557110e-02 184305
## 5 0.0114222 9.010000e-19 184305
## 6 0.0173902 3.107610e-02 184305
dim(gsmr_data)
## [1] 151 12
The summary data contain 151 genetic instruments (i.e. SNPs).
# Save the genetic variants and coded alleles in R
write.table(gsmr_data[,c(1,2)], "gsmr_example_snps.allele", col.names=F, row.names=F, quote=F)
# Extract the genotype data from a PLINK file using GCTA (command line)
gcta64 --bfile gsmr_example --extract gsmr_example_snps.allele --update-ref-allele gsmr_example_snps.allele --out gsmr_example
Note: the two steps above guarantee that the LD correlations are calculated based on the coded alleles (sometimes called effect alleles) for the SNP effects.
# Estimate LD correlation matrix in R
snp_coeff_id = scan("gsmr_example.xmat.gz", what="", nlines=1)
snp_coeff = read.table("gsmr_example.xmat.gz", header=F, skip=2)
snp_order = match(gsmr_data[,1], snp_coeff_id)
snp_coeff_id = snp_coeff_id[snp_order]
snp_coeff = snp_coeff[, snp_order]
ldrho = cor(snp_coeff)
colnames(ldrho) = rownames(ldrho) = snp_coeff_id
# Check the size of the correlation matrix and double-check if the order of the SNPs in the LD correlation matrix is consistent with that in the GWAS summary data.
dim(ldrho)
## [1] 151 151
# show the first 5 rows and columns of the matrix
ldrho[1:5,1:5]
## rs2419604 rs676385 rs648673 rs17035630 rs646776
## rs2419604 1.000000000 0.01225467 0.003622746 -0.003508759 0.008039383
## rs676385 0.012254667 1.00000000 -0.086363592 0.032564923 0.167010220
## rs648673 0.003622746 -0.08636359 1.000000000 -0.033264311 0.204437659
## rs17035630 -0.003508759 0.03256492 -0.033264311 1.000000000 -0.195795791
## rs646776 0.008039383 0.16701022 0.204437659 -0.195795791 1.000000000
Note: all the analyses implemented in this R-package only require the summary data (e.g. “gsmr_data”) and the LD correlation matrix (e.g. “ldrho”) listed above.
If the risk factor was not standardised in GWAS, we need to re-scale the effect sizes using the method below. This process requires allele frequencies, z-statistics and sample size.
snpfreq = gsmr_data$freq # minor allele frequency of SNPs
bzx = gsmr_data$bzx # effects of instruments on risk factor
bzx_se = gsmr_data$bzx_se # standard errors of bzx
bzx_n = gsmr_data$bzx_n # sample size for GWAS of the risk factor
std_zx = std_effect(snpfreq, bzx, bzx_se, bzx_n) # perform standardize
gsmr_data$std_bzx = std_zx$b # standardized bzx
gsmr_data$std_bzx_se = std_zx$se # standardized bzx_se
head(gsmr_data)
## SNP a1 a2 freq bzx bzx_se bzx_pval bzx_n bzy
## 1 rs2419604 A G 0.2830715 0.0302 0.0040 7.490e-14 172807.0 0.010183
## 2 rs676385 A G 0.3116318 -0.0354 0.0043 1.169e-15 171609.0 -0.022094
## 3 rs648673 C G 0.1315721 -0.0503 0.0057 1.155e-18 163522.0 -0.026150
## 4 rs17035630 A G 0.1352071 0.0505 0.0061 1.438e-16 167679.5 0.031693
## 5 rs646776 C T 0.2241335 -0.1602 0.0044 1.630e-272 173021.0 -0.101049
## 6 rs10410 A G 0.1075555 0.0410 0.0061 6.197e-11 168300.0 0.037495
## bzy_se bzy_pval bzy_n std_bzx std_bzx_se
## 1 0.0103044 3.230472e-01 184305 0.02850320 0.003775258
## 2 0.0101795 2.997370e-02 184305 -0.03033421 0.003684664
## 3 0.0141759 6.508460e-02 184305 -0.04563918 0.005171835
## 4 0.0131027 1.557110e-02 184305 0.04179863 0.005048943
## 5 0.0114222 9.010000e-19 184305 -0.14785679 0.004060986
## 6 0.0173902 3.107610e-02 184305 0.03738796 0.005562599
The estimate of causal effect of risk factor on disease can be biased by pleiotropy (see Ref 1 for details). This is an analysis to detect and eliminate from the analysis instruments that show significant pleiotropic effects on both risk factor and disease. The HEIDI-outlier analysis requires bzx (effect of genetic instrument on risk factor), bzx_se (standard error of bzx), bzy (effect of genetic instrument on disease), bzy_se (standard error of bzy) and ldrho (LD matrix of instruments). Note that LD matrix can be estimated from a reference sample with individual-level genotype data.
Here is an example to perform a HEIDI-outlier analysis.
bzx = gsmr_data$std_bzx # SNP effects on risk factor
bzx_se = gsmr_data$std_bzx_se # standard errors of bzx
bzx_pval = gsmr_data$bzx_pval # p-values for bzx
bzy = gsmr_data$bzy # SNP effects on disease
bzy_se = gsmr_data$bzy_se # standard errors of bzy
gwas_thresh = 5e-8 # GWAS threshold to select SNPs as the instruments for the GSMR analysis
heidi_thresh = 0.01 # HEIDI-outlier threshold
filtered_index = heidi_outlier(bzx, bzx_se, bzx_pval, bzy, bzy_se, ldrho, snp_coeff_id, gwas_thresh, heidi_thresh) # perform HEIDI-outlier analysis
filtered_gsmr_data = gsmr_data[filtered_index,] # select data passed HEIDI-outlier filtering
filtered_snp_id = snp_coeff_id[filtered_index] # select SNPs that passed HEIDI-outlier filtering
dim(gsmr_data)
## [1] 151 14
dim(filtered_gsmr_data)
## [1] 138 14
There are 13 instruments filtered out by HEIDI-outlier and 138 instruments are retained for further analysis.
This is the main analysis of this R-package which utilises multiple genetic instruments to test for causal effect of risk factor on disease.
bzx = filtered_gsmr_data$std_bzx # SNP effects on risk factor
bzx_se = filtered_gsmr_data$std_bzx_se # standard errors of bzx
bzx_pval = filtered_gsmr_data$bzx_pval # p-values for bzx
bzy = filtered_gsmr_data$bzy # SNP effects on disease
bzy_se = filtered_gsmr_data$bzy_se # standard errors of bzy
filtered_ldrho = ldrho[filtered_gsmr_data$SNP,filtered_gsmr_data$SNP] # LD correlation matrix of SNPs
gsmr_results = gsmr(bzx, bzx_se, bzx_pval, bzy, bzy_se, filtered_ldrho, filtered_snp_id) # GSMR analysis
cat("Effect of exposure on outcome: ",gsmr_results$bxy)
## Effect of exposure on outcome: 0.4080517
cat("Standard error of bxy: ",gsmr_results$bxy_se)
## Standard error of bxy: 0.02249235
cat("P-value of bxy: ", gsmr_results$bxy_pval)
## P-value of bxy: 1.490807e-73
cat("Used index to GSMR analysis: ", gsmr_results$used_index[1:5], "...")
## Used index to GSMR analysis: 1 2 3 4 5 ...
effect_col = colors()[75]
vals = c(bzx-bzx_se, bzx+bzx_se)
xmin = min(vals); xmax = max(vals)
vals = c(bzy-bzy_se, bzy+bzy_se)
ymin = min(vals); ymax = max(vals)
par(mar=c(5,5,4,2))
plot(bzx, bzy, pch=20, cex=0.8, bty="n", cex.axis=1.1, cex.lab=1.2,
col=effect_col, xlim=c(xmin, xmax), ylim=c(ymin, ymax),
xlab=expression(LDL~cholesterol~(italic(b[zx]))),
ylab=expression(Coronary~artery~disease~(italic(b[zy]))))
abline(0, gsmr_results$bxy, lwd=1.5, lty=2, col="dim grey")
nsnps = length(bzx)
for( i in 1:nsnps ) {
# x axis
xstart = bzx[i] - bzx_se[i]; xend = bzx[i] + bzx_se[i]
ystart = bzy[i]; yend = bzy[i]
segments(xstart, ystart, xend, yend, lwd=1.5, col=effect_col)
# y axis
xstart = bzx[i]; xend = bzx[i]
ystart = bzy[i] - bzy_se[i]; yend = bzy[i] + bzy_se[i]
segments(xstart, ystart, xend, yend, lwd=1.5, col=effect_col)
}
Note: The dashed line is not a fitted regression line but a line with slope of bxy and intercept of 0.
GSMR (Generalised Summary-data-based Mendelian Randomisation) is a flexible and powerful approach that utilises multiple genetic instruments to test for causal association between a risk factor and disease using summary-level data from independent genome-wide association studies.
An analysis to detect and eliminate from the analysis instruments that show significant pleiotropic effects on both risk factor and disease
Standardization of SNP effect and its standard error using z-statistic, allele frequency and sample size