Here we will meta-analyse results from a genome-wide association study (GWAS) of height by Wood et al. 2014 Nat Genet and a GWAS of height by Jiang et al. 2019 Nat Genet.
Download Height GZIP
).To download sumstats directly to the cluster, copy the link to the
file you want to download and use
wget -O <file_name.txt> <URL>
.
NOTE: This step has already been run for you, so that you don’t have to wait for the files to be downloaded.
# ****************************
# Download summary statistics
# ****************************
# Go to directory where files will be saved
cd /data/module1/7_metaPrac
## Wood et al
url=http://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz
wget -O Height_Wood2014.gz $url
## Jiang et al
url=https://yanglab.westlake.edu.cn/data/fastgwa_data/UKB/50.v1.1.fastGWA.gz
wget -O Height_Jiang2019.gz $url
Use zcat
to see the first lines of the file and assess
the information provided.
# Wood et al. 2014
zcat /data/module1/7_metaPrac/Height_Wood2014.gz | head
## MarkerName Allele1 Allele2 Freq.Allele1.HapMapCEU b SE p N
## rs4747841 A G 0.551 -0.0011 0.0029 0.70 253213
## rs4749917 T C 0.436 0.0011 0.0029 0.70 253213
## rs737656 A G 0.367 -0.0062 0.0030 0.042 253116
## rs737657 A G 0.358 -0.0062 0.0030 0.041 252156
## rs7086391 T C 0.12 -0.0087 0.0038 0.024 248425
## rs878177 T C 0.3 0.014 0.0031 8.2e-06 251271
## rs878178 A T 0.644 0.0067 0.0031 0.029 253086
## rs12219605 T G 0.427 0.0011 0.0029 0.70 253213
## rs3763688 C G 0.144 -0.0022 0.0045 0.62 253056
Columns are: MarkerName, SNP rsID; Allele1, effect allelle,; Allele2, other allele; Freq.Allele1.HapMapCEU, Allele 1 frequency in EUR individuals from HapMap; b, SE, and p, effect size, standard error and P-value of the effect of Allele 1; N, sample size.
# Jiang et al. 2019
zcat /data/module1/7_metaPrac/Height_Jiang2019.gz | head
## CHR SNP POS A1 A2 N AF1 BETA SE P
## 1 rs144155419 717587 G A 445877 0.989495 0.00183444 0.00722402 0.799545
## 1 rs58276399 731718 T C 435350 0.888746 -2.38349e-05 0.00237193 0.991982
## 1 rs141242758 734349 T C 436593 0.888847 -9.42285e-05 0.00236962 0.96828
## 1 rs28544273 751343 T A 447512 0.878876 0.000566575 0.00225319 0.801463
## 1 rs28527770 751756 T C 447816 0.878591 0.000583722 0.00225047 0.795343
## 1 rs3115860 753405 C A 452415 0.127989 -0.00104942 0.00218969 0.631757
## 1 rs3131970 753425 T C 452802 0.124182 -0.000304986 0.00221626 0.890546
## 1 rs2073813 753541 G A 451742 0.872927 0.000910638 0.00219765 0.678604
## 1 rs3131969 754182 A G 452129 0.128369 -0.00110491 0.00218736 0.613465
Columns are: CHR, Chromosome; SNP, SNP rsID; POS, SNP base pair position; A1, effect allelle,; A2, other allele; ; N, sample size; AF1, A1 frequency in GWAS sample (EUR individuals from UKB); BETA, SE, and P, effect size, standard error and P-value of the effect of A1.
NOTE: In some datasets the effects may have been estimated for the alternative allele. It is very important to read through the documentation associated with the summary statistics to understand what is the effect allele.
We will use PLINK to meta-analyse the Wood et al. and Jiang et al. GWAS. As outlined in PLINK’s website, the program requires the effect size estimates to be in a column named “BETA”. As we saw above, the Wood et al. GWAS has betas in a column called “b”. Here, we will change the name of that column to “BETA”.
We also saw above that the Wood et al. summary statistics do not include information about chromosome (CHR) and SNP base pair position (BP). Although PLINK does not require these fields to meta-analyse results, it expects those columns as input. Here, we will add two dummy columns to the Wood et al. sumstats, representing the CHR and BP.
We will run these steps in the command line with UNIX code. Alternatively, you could open the file in R, make these changes and save, but it would take longer.
# Go to your home directory
cd ~
# Change column name
zcat /data/module1/7_metaPrac/Height_Wood2014.gz |
sed -e '1s/b/BETA/' |
awk '{if (NR==1) {print $0,"CHR","BP"} else {print $0,1,1}}' |
gzip > Height_Wood2014_edited.gz
## The 1st line unzips the original file. The pipe at the end of the line ("|") makes turns the output of that line into the input of the next command
## The 2nd line changes the name of the column containing betas from "b" to "BETA"
## The 3rd line adds two extra dummy columns where the first line is "CHR" and "BP"
## The 4th line zips the output and saves it into a file named "Height_Wood2014_edited.gz"
# View edited file
zcat Height_Wood2014_edited.gz | head
# Note, we already have a copy of this in /data/module1/7_metaPrac/
## MarkerName Allele1 Allele2 Freq.Allele1.HapMapCEU BETA SE p N CHR BP
## rs4747841 A G 0.551 -0.0011 0.0029 0.70 253213 1 1
## rs4749917 T C 0.436 0.0011 0.0029 0.70 253213 1 1
## rs737656 A G 0.367 -0.0062 0.0030 0.042 253116 1 1
## rs737657 A G 0.358 -0.0062 0.0030 0.041 252156 1 1
## rs7086391 T C 0.12 -0.0087 0.0038 0.024 248425 1 1
## rs878177 T C 0.3 0.014 0.0031 8.2e-06 251271 1 1
## rs878178 A T 0.644 0.0067 0.0031 0.029 253086 1 1
## rs12219605 T G 0.427 0.0011 0.0029 0.70 253213 1 1
## rs3763688 C G 0.144 -0.0022 0.0045 0.62 253056 1 1
Now we will use PLINK to run an inverse variance-based meta-analysis.
The main flag to do this is
--meta-analysis <file1> <file2>
. PLINK’s
website includes a section with a comprehensive description
of PLINK’s meta-analysis options.
We will use the following options in our analysis:
+ qt
: to specify that the phenotype is
quantitative--meta-analysis-snp-field <field name(s)...>
: to
specify columns with variant IDs--meta-analysis-bp-field <field name(s)...>
: to
specify columns with variant base pair position--meta-analysis-a1-field <field name(s)...>
: to
specify columns with variant A1 allele--meta-analysis-a2-field <field name(s)...>
: to
specify columns with variant A1 allele--out
: to specify the output file namePLINK’s output file (with results from the meta-analysis) will report
the CHR and BP from the first input file. Since we do not have accurate
CHR and BP in the Wood et al. summary statistics, we will provide the
summary statistics from Jiang et al. as the first input file to
--meta-analysis
.
# Change to home directory to run analysis
cd ~
# Define directory where data are
data=/data/module1/7_metaPrac
# Files to meta-analyse
sumstats1=$data/Height_Jiang2019.gz
sumstats2=$data/Height_Wood2014_edited.gz
# Meta-analysis
plink --meta-analysis $sumstats1 $sumstats2 + qt \
--meta-analysis-snp-field MarkerName SNP \
--meta-analysis-bp-field POS BP \
--meta-analysis-a1-field Allele1 A1 \
--meta-analysis-a2-field Allele2 A2 \
--out Height
There are 3 new files in the working directory:
Here we look at the first lines of the Height.meta
file:
head Height.meta
## CHR BP SNP A1 A2 N P P(R) BETA BETA(R) Q I
## 1 753541 rs2073813 G A 2 0.6133 0.6133 0.0011 0.0011 0.7719 0.00
## 1 754192 rs3131968 A G 2 0.4227 0.4227 -0.0017 -0.0017 0.3584 0.00
## 1 768448 rs12562034 G A 2 0.2176 0.2176 -0.0028 -0.0028 0.3395 0.00
## 1 775659 rs2905035 A G 2 0.4751 0.4751 -0.0015 -0.0015 0.8957 0.00
## 1 777122 rs2980319 A T 2 0.3915 0.3915 -0.0018 -0.0018 0.8668 0.00
## 1 779322 rs4040617 A G 2 0.6368 0.6368 0.0010 0.0010 0.9100 0.00
## 1 780785 rs2977612 T A 2 0.4819 0.4819 -0.0014 -0.0014 0.8212 0.00
## 1 785050 rs2905062 G A 2 0.5731 0.5731 -0.0012 -0.0012 0.9654 0.00
## 1 785989 rs2980300 T C 2 0.5467 0.5467 -0.0012 -0.0012 0.9677 0.00
Columns are: CHR, chromosome; BP, base pair position; SNP, SNP rsID; A1, effect allele; A2, alternative allele; N, number of valid studies for variant; P, Fixed-effects meta-analysis p-value; P(R), Random-effects meta-analysis p-value; BETA, Fixed-effects BETA estimate; BETA(R), Random-effects BETA estimate; Q, p-value for Cochran’s Q statistic; I, I2 heterogeneity index (0-100 scale).
Now we can compare the results from the individual studies with the meta-analysis. What is the number of genome-wide significant (GWS) associations (P<5x10-8) in each individual study? And in the meta-analysis? We will focus on the results from the fixed-effect meta-analysis. Remember to restrict this analysis to the SNPs in common between Wood et al. and Jiang et al. (SNPs for which we have meta-analysis results for).
# *********************
# Load data in R
# *********************
# load tidyverse library, which is much faster that native R for large files
library(tidyverse)
# Load results from Wood et al.
wood2014 <- read_table("/data/module1/7_metaPrac/Height_Wood2014.gz")
# Load results from Jiang et al.
jiang2019 <- read_table("/data/module1/7_metaPrac/Height_Jiang2019.gz")
# Load mta-analysis results
meta <- read_table("Height.meta")
# ***********************************************
# Determine number of GWS (P<5e-8) associations
# ***********************************************
# Define SNPs in common between the two GWAS (those meta-analysed)
meta_snps <- select(meta, SNP)
# Restrict datasets to common set of SNPs
wood2014 <- wood2014 %>% inner_join(meta_snps, by = c("MarkerName" = "SNP"))
jiang2019 <- jiang2019 %>% inner_join(meta_snps, by = c("SNP" = "SNP"))
# Determine number of GWS (P<5-8) associations
## In Wood et al.
nGWS_wood <- wood2014 %>% filter(p < 5e-8) %>% count
## In Jiang et al.
nGWS_jiang <- jiang2019 %>% filter(P < 5e-8) %>% count
## In meta-analysis
nGWS_meta <- meta %>% filter(P < 5e-8) %>% count
paste0("Number of GWS (P<5-8) associations in Wood et al.: ", nGWS_wood)
paste0("Number of GWS (P<5-8) associations in Jiang et al.: ", nGWS_jiang)
paste0("Number of GWS (P<5-8) associations in meta-analysis: ", nGWS_meta)
## [1] "Number of GWS (P<5-8) associations in Wood et al.: 25884"
## [1] "Number of GWS (P<5-8) associations in Jiang et al.: 77692"
## [1] "Number of GWS (P<5-8) associations in meta-analysis: 108203"
We see that the meta-analysis increased the number of GWS associations detected due to increase power of the larger sample size relative to the individual studies.
Now, as a last exercise, we will meta-analyse one SNP (rs10000940) in R to understand how PLINK works. We will use the formulas presented in Table 1 of Willer et al. 2010 for an inverse variance-based meta-analysis:
snp <- "rs10000940"
# Get rs10000940 summary statistics from Wood et al.
rs10000940_wood <- wood2014 %>% filter(MarkerName == snp)
b_wood <- rs10000940_wood$b
se_wood <- rs10000940_wood$SE
# Get rs10000940 summary statistics from Jiang et al.
rs10000940_jiang <- jiang2019 %>% filter(SNP == snp)
b_jiang <- rs10000940_jiang$BETA
se_jiang <- rs10000940_jiang$SE
# Compute weights for each study: the inverse of the variance
w_wood <- 1 / se_wood^2
w_jiang <- 1 / se_jiang^2
# Calculate the SE
se_meta <- sqrt(1 / (w_wood + w_jiang))
# Calculate the effect size BETA
b_meta <- (b_wood * w_wood + b_jiang * w_jiang) / (w_wood + w_jiang)
# Calculate the Z-score
z <- b_meta / se_meta
# Calculate the p-value
p_meta <- 2 * pnorm(z)
# Compare with plink results
filter(meta, SNP == snp) %>% select(BETA, P)
## # A tibble: 1 × 2
## BETA P
## <dbl> <dbl>
## 1 -0.0074 0.0000000166
b_meta
## [1] -0.007358003
p_meta
## [1] 1.654751e-08