Independent associations

In the previous practical we meta-analysed two GWAS of height (by Jiang et al. and by Wood et al.) and saw an increase in the number of genome-wide significant (P < 5 x 10^-8) associations compared to the individual studies. Here we will explore two approaches to identify which of those associations are independent. We will also use an R package (LocusZoom) to see associations at a particular locus in greater detail.

We saw before that there were 108,203 genome-wide significant (GWS) associations in the height GWAS meta-analysis. Here, we will use two different methods to assess how many of those are independent:

LD clumping: a method that groups together genetic variants that are in linkage disequilibrium (LD). This method starts by identifying the most significant variant. Then, variants that are nearby (based on a user-defined distance) and that are in LD with the lead SNP (based on a user-defined LD threshold) are clumped/grouped together with that lead variant. The next most significant variant that has not been clumped starts the next clump. The process is repeated until all variants are assigned to a clump.
Conditional & joint association analysis (COJO): to identify independent associations, COJO starts by identifying the most associated SNP. Then, it conditions association results on that SNP and finds the second most associated SNP. Further conditioning on the second SNP, COJO then finds the third most associated SNP. These cycles of conditional analyses are repeated until there are no GWS SNPs left.

NOTE: In the interest of time, we will only run these analyses on chromosome 22.

Clumping

LD clumping can be performed with PLINK using --clump. PLINK includes a number of flags that allow the user to provide further details about the input file. It also includes several flags for the user to define the parameters used in the analysis (see PLINK clump options).

Two files are required as input for this analysis:

Associations results: Two fields are required in this file: (1) variant ID, and (2) P-value.
LD reference: To estimate the LD between genetic variants, PLINK requires individual-level data from a reference sample. For this we will use a subset of European individuals from the 1000 Genomes Project. The files are in the data directory (/data/module1/Ref_bfile/g1000_eur_chr22.*). In real data analysis, you could use the same sample that was used for GWAS as your LD reference. If that is not avaiable, you need to use a reference sample that is genetically similar to the GWAS sample to reflect the same LD patterns.

We will use the following settings in our analysis:

--clump-p1 5e-8: Clumps are formed around central “lead variants”, starting with those with the lowest P-value. This flag specifies the maximum P-value for those SNPs, i.e. no clumps will be formed around variants that are not GWS.
--clump-r2 0.5 (default): Variants with LD r²>0.5 with the lead variant will be included in the clump if they are at within a distance specified by --clump-kb
--clump-kb 500 : The maximum distance from the lead variant for SNPs to be considered to be clumped with it. This distance is provided in kb.

# Change to home directory to run analysis
cd ~

# ***************************************
# Clump meta-analysis results
# ***************************************
# Files to use
ld_ref=/data/module1/Ref_bfile/g1000_eur_chr22
sumstats=Height.meta

# Run clumping
plink --bfile $ld_ref \
      --clump $sumstats \
      --clump-p1 5e-8 \
      --clump-r2 0.5 \
      --clump-kb 500 \
      --clump-field "P" \
      --clump-snp-field "SNP" \
      --out Height_meta

There are 2 new files in the working directory:

Height_meta.clumped: Meta-analysis clump results
Height_meta.log: PLINK log file detailing the command used and each of the steps in LD clumping of the meta-analysis

Here is what the Height_meta.clumped looks like (columns are described here.):

awk '{$NF=""}1' Height_meta.clumped | head

CHR	F	SNP	BP	TOTAL	S0001
22	1	rs4821083	33056341	16	16
22	1	rs6007043	45838646	14	14
22	1	rs7291090	33108150	9	9
22	1	rs7286215	28397709	0	0
22	1	rs226493	45715887	65	65
22	1	rs2899193	33109382	9	9

We omitted the last column, which is a long list of SNPs that were assigned to the same clump as the lead SNP indicated in the “SNP” column.
We can see examples of clumps with varying numbers of SNPs. For example:
- rs226493 is a lead SNP in a 1-Mb locus that includes 65 other SNPs that are in moderate LD with the lead variant (LD r² >0.5) and have P<0.0001
- rs7286215 is a lead SNP in a in a 1-Mb locus with no other SNPs in moderate LD with the lead variant and have P<0.0001

COJO

COJO is a summary-statistics-based method implemented in GCTA. As outlined in the GCTA website, the GWAS summary statistics need to be in the format shown below (.ma format). The columns can have other names (since GCTA does not use that information) but it is crucial to have columns in the correct order.

SNP	A1	A2	freq	b	se	p	N
rs1001	A	G	0.8493	0.0024	0.0055	0.6653	129850
rs1002	C	G	0.0306	0.0034	0.0115	0.7659	129799
rs1003	A	C	0.5128	0.0045	0.0038	0.2319	129830
…

Columns are: SNP, variant ID; A1, the effect allele; A2, the other allele; freq, frequency of the effect allele; b, effect size; se, standard error; p, p-value; N, sample size.

Format sumstats

Below we provide example code to format the summary statistics in R. Note that the meta-analysis file does not include three of the fields required to convert it to .ma format. Here is how we can obtain information for those fields:

Allele frequency: This could be calculated based on information across the individual studies that were meta-analysed. However, for simplicity, we assume the allele frequency in the meta-analysed sample is the same as that reported in the larger GWAS (by Jiang et al.).
Sample size (N): To calculate this, we add the N in Wood et al. GWAS and the N in the Jiang et al. GWAS.
Standard error (SE): We estimate the SE of the meta-analysed results using the formula below (from Zhu et al. 2016):

where p is the allele frequency, b is the effect size estimate, and N is the sample size.

NOTE: For efficiency, these steps have already been run for you.

# ***************************************
# Format meta-analysis sumstats in R
# ***************************************
# Load meta-analysis results
meta=read.table("Height.meta",h=T)

# Get allele frequencies from Jiang et al.
jiang=read.table("/data/module1/8_metaAnalysis/Height_Jiang2019.gz",h=T)
meta=merge(meta, jiang[,c("SNP","A1","A2","AF1")], , by="SNP", suffixes=c(".meta",".jiang"))
  ## Make sure the alleles match
  identical(tmp$A1.meta,tmp$A1.jiang)

# Calculate sample size from N reported in both Wood et al. and Jiang et al.
wood=read.table("/data/module1/8_metaAnalysis/Height_Wood2014.gz",h=T)
names(wood)=c("SNP","A1","A2","AF1","BETA","SE","P","N")
sample_sizes=merge(jiang[,c("SNP","N")], wood[,c("SNP","N")], by="SNP", suffixes=c(".jiang",".wood"))
sample_sizes$N=sample_sizes$N.jiang+sample_sizes$N.wood
meta$N=NULL
meta=merge(meta, sample_sizes)

# Compute standard error
p=meta$AF1
b=meta$BETA
n=meta$N
meta$SE=sqrt((1-2*p*(1-p)*b^2) / (2*p*(1-p)*n))

# Select columns in the correct order
maFile=meta[,c("SNP","A1.meta","A2.meta","AF1","BETA","SE","P","N")]
names(maFile)=c("SNP","A1","A2","AF1","BETA","SE","P","N")

# Save
write.table(maFile, "/data/module1/9_independentLoci/Height_meta.ma", quote=F, row.names=F)

Now we can see how the new sumstats file look like:

# Meta-analysis .ma file
head /data/module1/9_independentLoci/Height_meta.ma

## SNP A1 A2 AF1 BETA SE P N
## rs10 A C 0.0525684 -0.0062 0.00407606036351674 0.04768 604247
## rs1000000 G A 0.773915 -1e-04 0.00201206059303702 0.9689 705867
## rs10000010 T C 0.517192 -0.0025 0.00169574979053677 0.05427 696336
## rs10000012 C G 0.860014 0.0116 0.00242123808881992 7.137e-10 708420
## rs10000013 C A 0.214239 -8e-04 0.00205426839374803 0.6185 703828
## rs10000017 C T 0.77966 -0.0014 0.00203613604904229 0.3637 702033
## rs1000002 C T 0.514329 -0.003 0.00168091044034944 0.02269 708428
## rs10000023 G T 0.420343 -0.0017 0.00172348646770316 0.2018 690842
## rs10000027 C G 0.779381 -0.001 0.00235347133390067 0.5581 525000

Run COJO

We will use the formated sumstats file to identify independent associations using the settings below:

--cojo-file <file_name>: To specify the file with the GWAS summary statistics (.ma file created above)
--bfile: Like PLINK, COJO estimates the LD between genetic variants using individual-level data from a reference sample. We will use the same reference sample that we used for LD clumping (/data/module1/Ref_bfile/g1000_eur_chr22.*)
--cojo-slct: Performs a step-wise model selection of independently associated SNPs
--cojo-wind 10000 (default): Distance (in KB) from which SNPs are considered in linkage equilibrium

# Change to home directory to run analysis
cd ~

# ***************************************
# Apply COJO to meta-analysis results
# ***************************************
# Define files to use
ld_ref=/data/module1/Ref_bfile/g1000_eur_chr22
sumstats=/data/module1/9_independentLoci/Height_meta.ma
  
# Run COJO
gcta --bfile $ld_ref \
     --cojo-slct \
     --cojo-file $sumstats \
     --out Height_meta

There are 4 new files in the working directory:

Height_meta.jma.cojo: file with independent associations identified in the stepwise model selection
Height_meta.ldr.cojo: file showing the LD correlations between the SNPs
Height_meta.cma.cojo: file with conditional analysis results for all other SNPs that were not selected as independent associations
Height_meta.log: log file detailing the command used and each of the steps in the conditional analysis.

Here is what the Height_meta.jma.cojo looks like (columns are described here):

head Height_meta.jma.cojo

Chr	SNP	bp	refA	freq	b	se	p	n	freq_geno	bJ	bJ_se	LD_r
22	rs2540653	18973549	G	0.908917	-0.0087	0.0029209	0.0028961	707907	0.912525	-0.0397809	0.0029702	0.0034146
22	rs1155419	20787850	T	0.569456	-0.0133	0.0017065	0.0000000	700238	0.536779	-0.0264867	0.0017225	0.0486505
22	rs5754102	21916272	C	0.822584	0.0121	0.0022163	0.0000000	697485	0.824056	0.0317249	0.0022486	-0.0341618
22	rs552212	27644949	A	0.259075	0.0047	0.0020007	0.0188120	650766	0.241551	0.1205900	0.0022494	-0.0068221
22	rs10222213	27873353	G	0.596638	-0.0067	0.0017248	0.0001025	698392	0.562624	-0.0707486	0.0018392	-0.0283411
22	rs9613498	28044325	G	0.555965	-0.0100	0.0017015	0.0000000	699544	0.553678	-0.1070160	0.0019497	-0.0087659

Note: Even though we used a significance threshold of P<5x10^-8, we can see variants above this threshold in the p column (marginal association results from the meta-analysis). This is because this threshold is used to select variants in the conditional and joint analysis. You can see the P-values from that analysis in the column pJ (all <5x10^-8, as we requested in our GCTA command).

Clump and COJO results

Now we will summarise the number of independent loci identified with clumping and COJO.

# ********************************
# Restrict GWAS results to chr 22
# ********************************
## This will make reading sumstats into R faster
awk 'NR==1 || $1==22' Height.meta > Height.meta_chr22
 


# ********************************
# Summarise results in R
# ********************************
# Load GWAS results
meta=read.table("Height.meta_chr22",h=T)

# Load clump results
clump_meta=read.table("Height_meta.clumped", h=T)

# Load COJO results
cojo_meta=read.table("Height_meta.jma.cojo", h=T)

# Assess number of GWS associations on chr 22
nGWS_meta=nrow(meta[meta$P<5e-8,])

# Assess number of clumps formed
nClumps_meta=length(unique(clump_meta$SNP))

# Assess number of independent variants identified with COJO
nCojo_meta=length(unique(cojo_meta$SNP))


paste0("In the meta-analysis we have: ", nGWS_meta, " GWS associations (P<5e-8); ",nClumps_meta," clumps; and ", nCojo_meta, " idependent asociations identified with COJO.")

In the meta-analysis we have:

933 GWS associations (P<5e-8)
102 clumps
96 independence associations identified with COJO

As we see, the two methods identify similar number of independent loci. However, with clumping the user can change settings such as the LD r² between variants that are grouped, which will make the definition of independent loci more or less stringent. For example, clumping with an LD r² threshold of 0.01, would you expect more or less independent associations? You can test this by changing the settings in PLINK.

Locus plot

Now we will zoom in on particular loci to see what the associations look like at particular regions. We will do this with a LocusZoom R script, which plots associations along with useful information about the locus, such as the genes it includes and linkage disequilibrium coefficients. The R script can be downloaded from functions/locus_zoom.R in this GitHub repository. For the purpose of this tutorial, a copy of this script is already in /data/module1/9_independentLoci/locus_zoom.R.

As an example, we will focus on the two loci mentioned in the clump section:

rs226493: a large clump with 65 SNPs with P<0.0001 and in moderate LD with the lead variant
rs7286215: a small clump consisting only of the lead SNP, i.e., no other SNPs in a 1-Mb region were in moderate LD with the lead variant and had P<0.0001

We will use the locus.zoom R function with following settings:

data <file_name>: The GWAS summary statistics. In this case, the meta-analysis results for chromosome 22.
snp <SNP_ID>: A SNP of interest to be annotated
ignore.lead T: This option is required when using the --snp option above
offset_bp 500000: To plot all variants within 500,000 bp of the SNP defined with the --snp option above
ld.file <file_name>: A data.frame with LD values relevant to the SNP specified by --snp. For this example, we calculate the LD between the lead variants that we will analyse and all variants within 500 Kb, i.e., in a 1-Mb region centred on those SNPs.
genes.data <file_name>: A data.frame with gene locations to plot beneath the graph (requires the columns “Gene”, “Chrom”, “Start”, “End”, and “Coding”). Here we use a file available in the GitHub repository for this R function (Gencode_GRCh37_Genes_UniqueList.txt).
plot.title: a title to go above your plot
file.name: a filename for your plot to be saved to

# *********************
# Create LD files for the locus plots
# *********************
# NOTE: There is no need to run this step. We already have a copy of these files in /data/module1/9_independentLoci/

for snp in  rs226493 rs7286215
do
  ref=/data/module1/Ref_bfile/g1000_eur_chr22
  plink --bfile $ref \
        --r2 with-freqs \
        --ld-window-r2 0 \
        --ld-snp $snp \
        --ld-window 1000000\
        --out $snp.LDr2
done

# *********************
# Generate locus plot
# for rs226493
# *********************

# Load locuszoom function
source("/data/module1/9_independentLoci/locus_zoom.R")

# Load data
meta=read.table("Height.meta_chr22",h=T)
ld_ref=read.table("/data/module1/9_independentLoci/rs226493.LDr2.ld", h=T) 
UCSC_GRCh37_Genes_UniqueList.txt=read.delim("/data/module1/9_independentLoci/UCSC_GRCh37_Genes_UniqueList.txt", stringsAsFactors=F, h=T)

# Format GWAS sumstats


# Create LocusZoom-like plot - EUR
locus.zoom(data=meta,
           snp="rs226493", 
           ignore.lead=T,
           offset_bp = 500000,
           ld.file=ld_ref,
           genes.data=UCSC_GRCh37_Genes_UniqueList.txt,
           plot.title = paste0("Locus plot for rs226493"),
           file.name=paste0("rs226493_locusPlot.jpg"))

NOTE: The plots generated by this function will be automatically saved in your working directory, i.e., they will not be displayed like other R plots. You need to open them to see them.

We can see that the lead SNP in this clump (rs226493) is close to other SNPs with stronger association (light blue dots). However, the LD with those variants (LD r² < 0.4) was lower than the threshold that we used to consider them the same signal. Using a more stringent threshold (e.g. LD r² > 0.1 for clumping) would have assigned all these variants to a single clump.

# *********************
# Generate locus plot
# for rs7286215
# *********************

# Load locuszoom function
source("/data/module1/9_independentLoci/locus_zoom.R")

# Load data
meta=read.table("Height.meta_chr22",h=T)
ld_ref=read.table("/data/module1/9_independentLoci/rs7286215.LDr2.ld", h=T) 
UCSC_GRCh37_Genes_UniqueList.txt=read.delim("/data/module1/9_independentLoci/UCSC_GRCh37_Genes_UniqueList.txt", stringsAsFactors=F, h=T)

# Format GWAS sumstats


# Create LocusZoom-like plot - EUR
locus.zoom(data=meta,
           snp="rs7286215", 
           ignore.lead=T,
           offset_bp = 500000,
           ld.file=ld_ref,
           genes.data=UCSC_GRCh37_Genes_UniqueList.txt,
           plot.title = paste0("Locus plot for rs7286215"),
           file.name=paste0("rs7286215_locusPlot.jpg"))