In the previous practical we meta-analysed two GWAS of height (by Jiang et al. and by Wood et al.) and saw an increase in the number of genome-wide significant (P < 5 x 10-8) associations compared to the individual studies. Here we will explore two approaches to identify which of those associations are independent. We will also use an R package (LocusZoom) to see associations at a particular locus in greater detail.
We saw before that there were 108,203 genome-wide significant (GWS) associations in the height GWAS meta-analysis. Here, we will use two different methods to assess how many of those are independent:
NOTE: In the interest of time, we will only run these analyses on chromosome 22.
LD clumping can be performed with PLINK using --clump
.
PLINK includes a number of flags that allow the user to provide further
details about the input file. It also includes several flags for the
user to define the parameters used in the analysis (see PLINK clump
options).
Two files are required as input for this analysis:
/data/module1/8_independentLociPrac/g1000_eur_chr22.*
). In real data
analysis, you could use the same sample that was used for GWAS as your
LD reference. If that is not avaiable, you need to use a reference
sample that is genetically similar to the GWAS sample to reflect the
same LD patterns.We will use the following settings in our analysis:
--clump-p1 5e-8
: Clumps are formed around central “lead
variants”, starting with those with the lowest P-value. This
flag specifies the maximum P-value for those SNPs, i.e. no
clumps will be formed around variants that are not GWS.--clump-r2 0.5
(default): Variants with LD
r2>0.5 with the lead variant will be included in the clump
if they are at within a distance specified by
--clump-kb
--clump-kb 500
: The maximum distance from the lead
variant for SNPs to be considered to be clumped with it. This distance
is provided in kb.# Change to home directory to run analysis
cd ~
# ***************************************
# Clump meta-analysis results
# ***************************************
# Files to use
ld_ref=/data/module1/8_independentLociPrac/g1000_eur_chr22
sumstats=Height.meta
# Run clumping
plink --bfile $ld_ref \
--clump $sumstats \
--clump-p1 5e-8 \
--clump-r2 0.5 \
--clump-kb 500 \
--clump-field "P" \
--clump-snp-field "SNP" \
--out Height_meta
There are 2 new files in the working directory:
Here is what the Height_meta.clumped
looks like (columns
are described here.):
awk '{$NF=""}1' Height_meta.clumped | head
CHR | F | SNP | BP | P | TOTAL | NSIG | S05 | S01 | S001 | S0001 |
---|---|---|---|---|---|---|---|---|---|---|
22 | 1 | rs4821083 | 33056341 | 0 | 16 | 0 | 0 | 0 | 0 | 16 |
22 | 1 | rs6007043 | 45838646 | 0 | 14 | 0 | 0 | 0 | 0 | 14 |
22 | 1 | rs7291090 | 33108150 | 0 | 9 | 0 | 0 | 0 | 0 | 9 |
22 | 1 | rs7286215 | 28397709 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
22 | 1 | rs226493 | 45715887 | 0 | 65 | 0 | 0 | 0 | 0 | 65 |
22 | 1 | rs2899193 | 33109382 | 0 | 9 | 0 | 0 | 0 | 0 | 9 |
COJO is a summary-statistics-based method implemented in GCTA. As outlined in the GCTA website, the GWAS summary statistics need to be in the format shown below (.ma format). The columns can have other names (since GCTA does not use that information) but it is crucial to have columns in the correct order.
SNP | A1 | A2 | freq | b | se | p | N |
---|---|---|---|---|---|---|---|
rs1001 | A | G | 0.8493 | 0.0024 | 0.0055 | 0.6653 | 129850 |
rs1002 | C | G | 0.0306 | 0.0034 | 0.0115 | 0.7659 | 129799 |
rs1003 | A | C | 0.5128 | 0.0045 | 0.0038 | 0.2319 | 129830 |
… |
Columns are: SNP, variant ID; A1, the effect allele; A2, the other allele; freq, frequency of the effect allele; b, effect size; se, standard error; p, p-value; N, sample size.
Below we provide example code to format the summary statistics in R. Note that the meta-analysis file does not include three of the fields required to convert it to .ma format. Here is how we can obtain information for those fields:
where p is the allele frequency, b is the effect size estimate, and N is the sample size.
NOTE: For efficiency, these steps have already been run for you.
# ***************************************
# Format meta-analysis sumstats in R
# ***************************************
# Load meta-analysis results
meta=read.table("Height.meta",h=T)
# Get allele frequencies from Jiang et al.
jiang=read.table("/data/module1/7_metaPrac/Height_Jiang2019.gz",h=T)
meta=merge(meta, jiang[,c("SNP","A1","A2","AF1")], , by="SNP", suffixes=c(".meta",".jiang"))
## Make sure the alleles match
identical(tmp$A1.meta,tmp$A1.jiang)
# Calculate sample size from N reported in both Wood et al. and Jiang et al.
wood=read.table("/data/module1/7_metaPrac/Height_Wood2014.gz",h=T)
names(wood)=c("SNP","A1","A2","AF1","BETA","SE","P","N")
sample_sizes=merge(jiang[,c("SNP","N")], wood[,c("SNP","N")], by="SNP", suffixes=c(".jiang",".wood"))
sample_sizes$N=sample_sizes$N.jiang+sample_sizes$N.wood
meta$N=NULL
meta=merge(meta, sample_sizes)
# Compute standard error
p=meta$AF1
b=meta$BETA
n=meta$N
meta$SE=sqrt((1-2*p*(1-p)*b^2) / (2*p*(1-p)*n))
# Select columns in the correct order
maFile=meta[,c("SNP","A1.meta","A2.meta","AF1","BETA","SE","P","N")]
names(maFile)=c("SNP","A1","A2","AF1","BETA","SE","P","N")
# Save
write.table(maFile, "/data/module1/8_independentLociPrac/Height_meta.ma", quote=F, row.names=F)
Now we can see how the new sumstats file look like:
# Meta-analysis .ma file
head /data/module1/8_independentLociPrac/Height_meta.ma
## SNP A1 A2 AF1 BETA SE P N
## rs10 A C 0.0525684 -0.0062 0.00407606036351674 0.04768 604247
## rs1000000 G A 0.773915 -1e-04 0.00201206059303702 0.9689 705867
## rs10000010 T C 0.517192 -0.0025 0.00169574979053677 0.05427 696336
## rs10000012 C G 0.860014 0.0116 0.00242123808881992 7.137e-10 708420
## rs10000013 C A 0.214239 -8e-04 0.00205426839374803 0.6185 703828
## rs10000017 C T 0.77966 -0.0014 0.00203613604904229 0.3637 702033
## rs1000002 C T 0.514329 -0.003 0.00168091044034944 0.02269 708428
## rs10000023 G T 0.420343 -0.0017 0.00172348646770316 0.2018 690842
## rs10000027 C G 0.779381 -0.001 0.00235347133390067 0.5581 525000
We will use the formated sumstats file to identify independent associations using the settings below:
--cojo-file <file_name>
: To specify the file with
the GWAS summary statistics (.ma file created above)--bfile
: Like PLINK, COJO estimates the LD between
genetic variants using individual-level data from a reference sample. We
will use the same reference sample that we used for LD clumping
(/data/module1/8_independentLociPrac/g1000_eur_chr22.*
)--cojo-slct
: Performs a step-wise model selection of
independently associated SNPs--cojo-wind 10000
(default): Distance (in KB) from
which SNPs are considered in linkage equilibrium# Change to home directory to run analysis
cd ~
# ***************************************
# Apply COJO to meta-analysis results
# ***************************************
# Define files to use
ld_ref=/data/module1/8_independentLociPrac/g1000_eur_chr22
sumstats=/data/module1/8_independentLociPrac/Height_meta.ma
# Run COJO
gcta64 --bfile $ld_ref \
--cojo-slct \
--cojo-file $sumstats \
--out Height_meta
There are 4 new files in the working directory:
Here is what the Height_meta.jma.cojo
looks like
(columns are described here):
head Height_meta.jma.cojo
Chr | SNP | bp | refA | freq | b | se | p | n | freq_geno | bJ | bJ_se | pJ | LD_r |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | rs2540653 | 18973549 | G | 0.908917 | -0.0087 | 0.0029209 | 0.0028961 | 707907 | 0.912525 | -0.0397809 | 0.0029702 | 0 | 0.0034146 |
22 | rs1155419 | 20787850 | T | 0.569456 | -0.0133 | 0.0017065 | 0.0000000 | 700238 | 0.536779 | -0.0264867 | 0.0017225 | 0 | 0.0486505 |
22 | rs5754102 | 21916272 | C | 0.822584 | 0.0121 | 0.0022163 | 0.0000000 | 697485 | 0.824056 | 0.0317249 | 0.0022486 | 0 | -0.0341618 |
22 | rs552212 | 27644949 | A | 0.259075 | 0.0047 | 0.0020007 | 0.0188120 | 650766 | 0.241551 | 0.1205900 | 0.0022494 | 0 | -0.0068221 |
22 | rs10222213 | 27873353 | G | 0.596638 | -0.0067 | 0.0017248 | 0.0001025 | 698392 | 0.562624 | -0.0707486 | 0.0018392 | 0 | -0.0283411 |
22 | rs9613498 | 28044325 | G | 0.555965 | -0.0100 | 0.0017015 | 0.0000000 | 699544 | 0.553678 | -0.1070160 | 0.0019497 | 0 | -0.0087659 |
Note: Even though we used a significance threshold of
P<5x10-8, we can see variants above this
threshold in the p
column (marginal association results
from the meta-analysis). This is because this threshold is used to
select variants in the conditional and joint analysis. You can see the
P-values from that analysis in the column pJ
(all
<5x10-8, as we requested in our GCTA command).
Now we will summarise the number of independent loci identified with clumping and COJO.
# ********************************
# Restrict GWAS results to chr 22
# ********************************
## This will make reading sumstats into R faster
awk 'NR==1 || $1==22' Height.meta > Height.meta_chr22
# ********************************
# Summarise results in R
# ********************************
# Load GWAS results
meta=read.table("Height.meta_chr22",h=T)
# Load clump results
clump_meta=read.table("Height_meta.clumped", h=T)
# Load COJO results
cojo_meta=read.table("Height_meta.jma.cojo", h=T)
# Assess number of GWS associations on chr 22
nGWS_meta=nrow(meta[meta$P<5e-8,])
# Assess number of clumps formed
nClumps_meta=length(unique(clump_meta$SNP))
# Assess number of independent variants identified with COJO
nCojo_meta=length(unique(cojo_meta$SNP))
paste0("In the meta-analysis we have: ", nGWS_meta, " GWS associations (P<5e-8); ",nClumps_meta," clumps; and ", nCojo_meta, " idependent asociations identified with COJO.")
In the meta-analysis we have:
As we see, the two methods identify similar number of independent loci. However, with clumping the user can change settings such as the LD r2 between variants that are grouped, which will make the definition of independent loci more or less stringent. For example, clumping with an LD r2 threshold of 0.01, would you expect more or less independent associations? You can test this by changing the settings in PLINK.
Now we will zoom in on particular loci to see what the associations
look like at particular regions. We will do this with a LocusZoom R script,
which plots associations along with useful information about the locus,
such as the genes it includes and linkage disequilibrium coefficients.
The R script can be downloaded from functions/locus_zoom.R
in this GitHub
repository. For the purpose of this tutorial, a copy of this script
is already in
/data/module1/8_independentLociPrac/locus_zoom.R
.
As an example, we will focus on the two loci mentioned in the clump section:
We will use the locus.zoom
R function with following
settings:
data <file_name>
: The GWAS summary statistics. In
this case, the meta-analysis results for chromosome 22.snp <SNP_ID>
: A SNP of interest to be
annotatedignore.lead T
: This option is required when using the
--snp
option aboveoffset_bp 500000
: To plot all variants within 500,000
bp of the SNP defined with the --snp
option aboveld.file <file_name>
: A data.frame with LD values
relevant to the SNP specified by --snp
. For this example,
we calculate the LD between the lead variants that we will analyse and
all variants within 500 Kb, i.e., in a 1-Mb region centred on those
SNPs.genes.data <file_name>
: A data.frame with gene
locations to plot beneath the graph (requires the columns “Gene”,
“Chrom”, “Start”, “End”, and “Coding”). Here we use a file available in
the GitHub
repository for this R function
(Gencode_GRCh37_Genes_UniqueList.txt
).plot.title
: a title to go above your plotfile.name
: a filename for your plot to be saved to# *********************
# Create LD files for the locus plots
# *********************
# NOTE: There is no need to run this step. We already have a copy of these files in /data/module1/8_independentLociPrac/
for snp in rs226493 rs7286215
do
ref=/data/module1/8_independentLociPrac/g1000_eur_chr22
plink --bfile $ref \
--r2 with-freqs \
--ld-window-r2 0 \
--ld-snp $snp \
--ld-window 1000000\
--out $snp.LDr2
done
# *********************
# Generate locus plot
# for rs226493
# *********************
# Load locuszoom function
source("/data/module1/8_independentLociPrac/locus_zoom.R")
# Load data
meta=read.table("Height.meta_chr22",h=T)
ld_ref=read.table("/data/module1/8_independentLociPrac/rs226493.LDr2.ld", h=T)
UCSC_GRCh37_Genes_UniqueList.txt=read.delim("/data/module1/8_independentLociPrac/UCSC_GRCh37_Genes_UniqueList.txt", stringsAsFactors=F, h=T)
# Format GWAS sumstats
# Create LocusZoom-like plot - EUR
locus.zoom(data=meta,
snp="rs226493",
ignore.lead=T,
offset_bp = 500000,
ld.file=ld_ref,
genes.data=UCSC_GRCh37_Genes_UniqueList.txt,
plot.title = paste0("Locus plot for rs226493"),
file.name=paste0("rs226493_locusPlot.jpg"))
NOTE: The plots generated by this function will be automatically saved in your working directory, i.e., they will not be displayed like other R plots. You need to open them to see them.
We can see that the lead SNP in this clump (rs226493) is close to other SNPs with stronger association (light blue dots). However, the LD with those variants (LD r2 < 0.4) was lower than the threshold that we used to consider them the same signal. Using a more stringent threshold (e.g. LD r2 > 0.1 for clumping) would have assigned all these variants to a single clump.
# *********************
# Generate locus plot
# for rs7286215
# *********************
# Load locuszoom function
source("/data/module1/8_independentLociPrac/locus_zoom.R")
# Load data
meta=read.table("Height.meta_chr22",h=T)
ld_ref=read.table("/data/module1/8_independentLociPrac/rs7286215.LDr2.ld", h=T)
UCSC_GRCh37_Genes_UniqueList.txt=read.delim("/data/module1/8_independentLociPrac/UCSC_GRCh37_Genes_UniqueList.txt", stringsAsFactors=F, h=T)
# Format GWAS sumstats
# Create LocusZoom-like plot - EUR
locus.zoom(data=meta,
snp="rs7286215",
ignore.lead=T,
offset_bp = 500000,
ld.file=ld_ref,
genes.data=UCSC_GRCh37_Genes_UniqueList.txt,
plot.title = paste0("Locus plot for rs7286215"),
file.name=paste0("rs7286215_locusPlot.jpg"))