In the previous practical we meta-analysed two GWAS of height (by Jiang et al. and by Wood et al.) and saw an increase in the number of genome-wide significant (P < 5 × 10⁻⁸) associations compared to the individual studies. Here we will explore two approaches to identify which of those associations are independent. We will also use a LocusZoom-style R script to examine the associations at a particular locus in greater detail.



We saw before that there were 108,203 genome-wide significant (GWS) associations in the height GWAS meta-analysis. Here, we will use two different methods to assess how many of those are independent:

  1. LD clumping: a method that groups genetic variants that are in linkage disequilibrium (LD). It starts by identifying the most significant variant. Variants that are nearby (within a user-defined distance) and in LD with that lead SNP (above a user-defined LD threshold) are then clumped/grouped together with the lead variant. The next most significant variant that has not yet been clumped starts the next clump, and the process is repeated until all variants are assigned to a clump (a toy sketch of this greedy procedure is shown just after this list).
  2. Conditional & joint association analysis (COJO): to identify independent associations, COJO starts by identifying the most associated SNP. Then, it conditions association results on that SNP and finds the second most associated SNP. Further conditioning on the second SNP, COJO then finds the third most associated SNP. These cycles of conditional analyses are repeated until there are no GWS SNPs left.
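
To make the greedy clumping procedure in point 1 concrete, below is a minimal R sketch on made-up toy data. The SNP IDs, P-values and r2 values are hypothetical, and for simplicity the sketch ignores the distance window (--clump-kb) and PLINK's secondary P-value threshold (--clump-p2); the real analysis is run with PLINK below.

# Toy illustration of greedy LD clumping (hypothetical SNPs, P-values and r2 values)
snps <- data.frame(SNP = c("rsA", "rsB", "rsC", "rsD"),
                   P   = c(1e-12, 3e-9, 4e-10, 0.2),
                   stringsAsFactors = FALSE)
r2 <- matrix(c(1.0, 0.8, 0.1, 0.0,
               0.8, 1.0, 0.2, 0.0,
               0.1, 0.2, 1.0, 0.6,
               0.0, 0.0, 0.6, 1.0),
             nrow = 4, dimnames = list(snps$SNP, snps$SNP))

clump_of <- setNames(rep(NA_character_, nrow(snps)), snps$SNP)
for (lead in snps$SNP[order(snps$P)]) {            # most significant variant first
  if (!is.na(clump_of[lead])) next                 # already clumped with an earlier lead
  if (snps$P[snps$SNP == lead] > 5e-8) break       # only GWS variants start clumps (--clump-p1)
  in_ld <- snps$SNP[r2[lead, snps$SNP] > 0.5]      # variants in LD with the lead (--clump-r2)
  clump_of[in_ld[is.na(clump_of[in_ld])]] <- lead  # assign any unassigned ones to this clump
}
split(names(clump_of), clump_of)                   # clumps keyed by their lead SNP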


NOTE: In the interest of time, we will only run these analyses on chromosome 22.



Clumping

LD clumping can be performed with PLINK using --clump. PLINK includes a number of flags that allow the user to provide further details about the input file. It also includes several flags for the user to define the parameters used in the analysis (see PLINK clump options).


Two files are required as input for this analysis:

  1. Association results: Two fields are required in this file: (1) variant ID, and (2) P-value.
  2. LD reference: To estimate the LD between genetic variants, PLINK requires individual-level data from a reference sample. For this we will use a subset of European individuals from the 1000 Genomes Project. The files are in the data directory (/data/module1/8_independentLociPrac/g1000_eur_chr22.*). In real data analysis, you could use the same sample that was used for the GWAS as your LD reference. If that is not available, you should use a reference sample that is genetically similar to the GWAS sample so that it reflects the same LD patterns. (A quick look at both input files is sketched just after this list.)
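
As noted above, here is a quick optional look at both inputs in R (base read.table is used, which assumes the summary statistics are whitespace-delimited with a header, as they are read elsewhere in this practical):

# Check that the association results contain the variant ID (SNP) and P-value (P) columns
sumstats_check <- read.table("Height.meta", header = TRUE, nrows = 5)
names(sumstats_check)

# The .fam file of the LD reference has one row per individual
fam <- read.table("/data/module1/8_independentLociPrac/g1000_eur_chr22.fam")
nrow(fam)    # number of individuals available for estimating LD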


We will use the following settings in our analysis:

  • --clump-p1 5e-8: Clumps are formed around central “lead variants”, starting with those with the lowest P-value. This flag specifies the maximum P-value for those lead variants, i.e. no clumps will be formed around variants that are not GWS.
  • --clump-r2 0.5 (default): Variants with LD r2 > 0.5 with the lead variant will be included in the clump if they are within the distance specified by --clump-kb.
  • --clump-kb 500: The maximum distance (in kb) from the lead variant for SNPs to be considered for clumping with it.


# Change to home directory to run analysis
cd ~

# ***************************************
# Clump meta-analysis results
# ***************************************
# Files to use
ld_ref=/data/module1/8_independentLociPrac/g1000_eur_chr22
sumstats=Height.meta

# Run clumping
plink --bfile $ld_ref \
      --clump $sumstats \
      --clump-p1 5e-8 \
      --clump-r2 0.5 \
      --clump-kb 500 \
      --clump-field "P" \
      --clump-snp-field "SNP" \
      --out Height_meta


There are 2 new files in the working directory:

  1. Height_meta.clumped: Meta-analysis clump results
  2. Height_meta.log: PLINK log file detailing the command used and each of the steps in LD clumping of the meta-analysis


Here is what the Height_meta.clumped file looks like (columns are described here):

awk '{$NF=""}1' Height_meta.clumped | head
CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001
22 1 rs4821083 33056341 0 16 0 0 0 0 16
22 1 rs6007043 45838646 0 14 0 0 0 0 14
22 1 rs7291090 33108150 0 9 0 0 0 0 9
22 1 rs7286215 28397709 0 0 0 0 0 0 0
22 1 rs226493 45715887 0 65 0 0 0 0 65
22 1 rs2899193 33109382 0 9 0 0 0 0 9
  • We omitted the last column, which is a long list of SNPs that were assigned to the same clump as the lead SNP indicated in the “SNP” column.
  • We can see examples of clumps with varying numbers of SNPs. For example:
    • rs226493 is the lead SNP of a 1-Mb locus that includes 65 other SNPs that are in moderate LD with the lead variant (LD r2 > 0.5) and have P<0.0001
    • rs7286215 is the lead SNP of a 1-Mb locus with no other SNPs that are both in moderate LD with the lead variant and have P<0.0001




COJO

COJO is a summary-statistics-based method implemented in GCTA. As outlined on the GCTA website, the GWAS summary statistics need to be supplied in the format shown below (.ma format). The columns can have other names (since GCTA does not use that information), but it is crucial that they appear in the order shown.


SNP A1 A2 freq b se p N
rs1001 A G 0.8493 0.0024 0.0055 0.6653 129850
rs1002 C G 0.0306 0.0034 0.0115 0.7659 129799
rs1003 A C 0.5128 0.0045 0.0038 0.2319 129830

Columns are: SNP, variant ID; A1, the effect allele; A2, the other allele; freq, frequency of the effect allele; b, effect size; se, standard error; p, p-value; N, sample size.



Format sumstats

Below we provide example code to format the summary statistics in R. Note that the meta-analysis file does not include three of the fields required to convert it to .ma format. Here is how we can obtain information for those fields:

  • Allele frequency: This could be calculated based on information across the individual studies that were meta-analysed. However, for simplicity, we assume the allele frequency in the meta-analysed sample is the same as that reported in the larger GWAS (by Jiang et al.).
  • Sample size (N): To calculate this, for each SNP we add the N reported in the Wood et al. GWAS to the N reported in the Jiang et al. GWAS.
  • Standard error (SE): We estimate the SE of the meta-analysed results using the formula below (from Zhu et al. 2016):
SE = sqrt( (1 − 2p(1 − p)b²) / (2p(1 − p)N) )

where p is the allele frequency, b is the effect size estimate, and N is the sample size.


NOTE: For efficiency, these steps have already been run for you.

# ***************************************
# Format meta-analysis sumstats in R
# ***************************************
# Load meta-analysis results
meta=read.table("Height.meta",h=T)

# Get allele frequencies from Jiang et al.
jiang=read.table("/data/module1/7_metaPrac/Height_Jiang2019.gz",h=T)
meta=merge(meta, jiang[,c("SNP","A1","A2","AF1")], by="SNP", suffixes=c(".meta",".jiang"))
  ## Make sure the alleles match (should return TRUE)
  identical(meta$A1.meta, meta$A1.jiang)

# Calculate sample size from N reported in both Wood et al. and Jiang et al.
wood=read.table("/data/module1/7_metaPrac/Height_Wood2014.gz",h=T)
names(wood)=c("SNP","A1","A2","AF1","BETA","SE","P","N")
sample_sizes=merge(jiang[,c("SNP","N")], wood[,c("SNP","N")], by="SNP", suffixes=c(".jiang",".wood"))
sample_sizes$N=sample_sizes$N.jiang+sample_sizes$N.wood
meta$N=NULL
meta=merge(meta, sample_sizes)

# Compute standard error
p=meta$AF1
b=meta$BETA
n=meta$N
meta$SE=sqrt((1-2*p*(1-p)*b^2) / (2*p*(1-p)*n))

# Select columns in the correct order
maFile=meta[,c("SNP","A1.meta","A2.meta","AF1","BETA","SE","P","N")]
names(maFile)=c("SNP","A1","A2","AF1","BETA","SE","P","N")

# Save
write.table(maFile, "/data/module1/8_independentLociPrac/Height_meta.ma", quote=F, row.names=F)


Now we can see what the new sumstats file looks like:

# Meta-analysis .ma file
head /data/module1/8_independentLociPrac/Height_meta.ma
## SNP A1 A2 AF1 BETA SE P N
## rs10 A C 0.0525684 -0.0062 0.00407606036351674 0.04768 604247
## rs1000000 G A 0.773915 -1e-04 0.00201206059303702 0.9689 705867
## rs10000010 T C 0.517192 -0.0025 0.00169574979053677 0.05427 696336
## rs10000012 C G 0.860014 0.0116 0.00242123808881992 7.137e-10 708420
## rs10000013 C A 0.214239 -8e-04 0.00205426839374803 0.6185 703828
## rs10000017 C T 0.77966 -0.0014 0.00203613604904229 0.3637 702033
## rs1000002 C T 0.514329 -0.003 0.00168091044034944 0.02269 708428
## rs10000023 G T 0.420343 -0.0017 0.00172348646770316 0.2018 690842
## rs10000027 C G 0.779381 -0.001 0.00235347133390067 0.5581 525000
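
As a quick sanity check of the SE formula above, we can plug in the values shown for rs1000000 (AF1 = 0.773915, BETA = -1e-04, N = 705867):

# Recompute SE for rs1000000 using SE = sqrt((1 - 2p(1-p)b^2) / (2p(1-p)N))
p <- 0.773915; b <- -1e-04; n <- 705867
sqrt((1 - 2*p*(1-p)*b^2) / (2*p*(1-p)*n))
## should reproduce the SE reported above (~0.00201206)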



Run COJO

We will use the formatted sumstats file to identify independent associations using the settings below:

  • --cojo-file <file_name>: To specify the file with the GWAS summary statistics (.ma file created above)
  • --bfile: Like PLINK, COJO estimates the LD between genetic variants using individual-level data from a reference sample. We will use the same reference sample that we used for LD clumping (/data/module1/8_independentLociPrac/g1000_eur_chr22.*)
  • --cojo-slct: Performs a step-wise model selection of independently associated SNPs
  • --cojo-wind 10000 (default): Distance (in kb) beyond which SNPs are assumed to be in linkage equilibrium

# Change to home directory to run analysis
cd ~

# ***************************************
# Apply COJO to meta-analysis results
# ***************************************
# Define files to use
ld_ref=/data/module1/8_independentLociPrac/g1000_eur_chr22
sumstats=/data/module1/8_independentLociPrac/Height_meta.ma
  
# Run COJO
gcta64 --bfile $ld_ref \
     --cojo-slct \
     --cojo-file $sumstats \
     --out Height_meta

There are 4 new files in the working directory:

  1. Height_meta.jma.cojo: file with independent associations identified in the stepwise model selection
  2. Height_meta.ldr.cojo: file showing the LD correlations between the SNPs
  3. Height_meta.cma.cojo: file with conditional analysis results for all other SNPs that were not selected as independent associations
  4. Height_meta.log: log file detailing the command used and each of the steps in the conditional analysis.


Here is what the Height_meta.jma.cojo looks like (columns are described here):

head Height_meta.jma.cojo
Chr SNP bp refA freq b se p n freq_geno bJ bJ_se pJ LD_r
22 rs2540653 18973549 G 0.908917 -0.0087 0.0029209 0.0028961 707907 0.912525 -0.0397809 0.0029702 0 0.0034146
22 rs1155419 20787850 T 0.569456 -0.0133 0.0017065 0.0000000 700238 0.536779 -0.0264867 0.0017225 0 0.0486505
22 rs5754102 21916272 C 0.822584 0.0121 0.0022163 0.0000000 697485 0.824056 0.0317249 0.0022486 0 -0.0341618
22 rs552212 27644949 A 0.259075 0.0047 0.0020007 0.0188120 650766 0.241551 0.1205900 0.0022494 0 -0.0068221
22 rs10222213 27873353 G 0.596638 -0.0067 0.0017248 0.0001025 698392 0.562624 -0.0707486 0.0018392 0 -0.0283411
22 rs9613498 28044325 G 0.555965 -0.0100 0.0017015 0.0000000 699544 0.553678 -0.1070160 0.0019497 0 -0.0087659

Note: Even though we used a significance threshold of P < 5×10⁻⁸, the p column (the marginal association results from the meta-analysis) contains some P-values above this threshold. This is because the threshold is applied when selecting variants in the conditional and joint analysis; the P-values from that analysis are in the pJ column (all < 5×10⁻⁸, as requested in our GCTA command).
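
If you want to verify this in R (reading the COJO output from your working directory), you can compare the two columns directly:

# Compare marginal (p) and joint (pJ) P-values in the COJO output
jma <- read.table("Height_meta.jma.cojo", header = TRUE)
sum(jma$p  >= 5e-8)    # some marginal P-values exceed the threshold
sum(jma$pJ >= 5e-8)    # none of the joint P-values should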



Clump and COJO results

Now we will summarise the number of independent loci identified with clumping and COJO.

# ********************************
# Restrict GWAS results to chr 22
# ********************************
## This will make reading sumstats into R faster
awk 'NR==1 || $1==22' Height.meta > Height.meta_chr22
 


# ********************************
# Summarise results in R
# ********************************
# Load GWAS results
meta=read.table("Height.meta_chr22",h=T)

# Load clump results
clump_meta=read.table("Height_meta.clumped", h=T)

# Load COJO results
cojo_meta=read.table("Height_meta.jma.cojo", h=T)

# Assess number of GWS associations on chr 22
nGWS_meta=nrow(meta[meta$P<5e-8,])

# Assess number of clumps formed
nClumps_meta=length(unique(clump_meta$SNP))

# Assess number of independent variants identified with COJO
nCojo_meta=length(unique(cojo_meta$SNP))


paste0("In the meta-analysis we have: ", nGWS_meta, " GWS associations (P<5e-8); ", nClumps_meta, " clumps; and ", nCojo_meta, " independent associations identified with COJO.")

In the meta-analysis we have:

  • 933 GWS associations (P<5e-8)
  • 102 clumps
  • 96 independent associations identified with COJO

As we can see, the two methods identify a similar number of independent loci. With clumping, however, the user can change settings such as the LD r2 threshold used to group variants, which makes the definition of independent loci more or less stringent. For example, if you clumped with an LD r2 threshold of 0.01, would you expect more or fewer independent associations? You can test this by changing --clump-r2 in the PLINK command above.



Locus plot

Now we will zoom in to see what the associations look like at particular loci. We will do this with a LocusZoom-style R script, which plots associations along with useful information about the locus, such as the genes it includes and linkage disequilibrium coefficients. The R script can be downloaded from functions/locus_zoom.R in this GitHub repository. For the purpose of this tutorial, a copy of this script is already in /data/module1/8_independentLociPrac/locus_zoom.R.


As an example, we will focus on the two loci mentioned in the clump section:

  • rs226493: a large clump containing 65 other SNPs with P<0.0001 that are in moderate LD with the lead variant
  • rs7286215: a small clump consisting only of the lead SNP, i.e., no other SNPs in a 1-Mb region were in moderate LD with the lead variant and had P<0.0001


We will use the locus.zoom R function with following settings:

  • data <file_name>: The GWAS summary statistics. In this case, the meta-analysis results for chromosome 22.
  • snp <SNP_ID>: A SNP of interest to be annotated
  • ignore.lead T: This option is required when using the snp argument above
  • offset_bp 500000: To plot all variants within 500,000 bp of the SNP defined with the snp argument above
  • ld.file <file_name>: A data.frame with LD values relevant to the SNP specified with the snp argument. For this example, we calculate the LD between the lead variants that we will analyse and all variants within 500 kb, i.e., in a 1-Mb region centred on those SNPs.
  • genes.data <file_name>: A data.frame with gene locations to plot beneath the graph (requires the columns “Gene”, “Chrom”, “Start”, “End”, and “Coding”). Here we use a gene list provided with the practical data (/data/module1/8_independentLociPrac/UCSC_GRCh37_Genes_UniqueList.txt), similar to the gene list files available in the GitHub repository for this R function.
  • plot.title: a title to go above your plot
  • file.name: a filename for your plot to be saved to


# *********************
# Create LD files for the locus plots
# *********************
# NOTE: There is no need to run this step. We already have a copy of these files in /data/module1/8_independentLociPrac/

for snp in  rs226493 rs7286215
do
  ref=/data/module1/8_independentLociPrac/g1000_eur_chr22
  plink --bfile $ref \
        --r2 with-freqs \
        --ld-window-r2 0 \
        --ld-snp $snp \
        --ld-window 1000000 \
        --out $snp.LDr2
done 
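
Before plotting, it can help to peek at one of these pre-computed LD files, which will be passed to locus.zoom below (the columns are those written by plink --r2 and should include the index SNP as SNP_A, each nearby variant as SNP_B, and their R2):

# Quick look at the LD file for rs226493
ld_check <- read.table("/data/module1/8_independentLociPrac/rs226493.LDr2.ld", header = TRUE)
names(ld_check)     # should include SNP_A (index SNP), SNP_B and R2
head(ld_check, 3)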


# *********************
# Generate locus plot
# for rs226493
# *********************

# Load locuszoom function
source("/data/module1/8_independentLociPrac/locus_zoom.R")

# Load data
meta=read.table("Height.meta_chr22",h=T)
ld_ref=read.table("/data/module1/8_independentLociPrac/rs226493.LDr2.ld", h=T) 
gene_list=read.delim("/data/module1/8_independentLociPrac/UCSC_GRCh37_Genes_UniqueList.txt", stringsAsFactors=F, h=T)

# Format GWAS sumstats
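# (No extra formatting is applied here: we assume Height.meta_chr22 already contains
#  the column names that locus.zoom expects, e.g. SNP, CHR, BP and P. If your own
#  sumstats use different names, rename the columns at this step.)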


# Create LocusZoom-like plot - EUR
locus.zoom(data=meta,
           snp="rs226493", 
           ignore.lead=T,
           offset_bp = 500000,
           ld.file=ld_ref,
           genes.data=gene_list,
           plot.title = "Locus plot for rs226493",
           file.name="rs226493_locusPlot.jpg")

NOTE: The plots generated by this function will be automatically saved in your working directory, i.e., they will not be displayed like other R plots. You need to open them to see them.

We can see that the lead SNP of this clump (rs226493) is close to other SNPs with stronger associations (light blue dots). However, their LD with the lead variant (LD r2 < 0.4) was below the r2 > 0.5 threshold we used to group variants into the same clump. Clumping with a lower LD threshold (e.g. r2 > 0.1), i.e. a more stringent definition of independence, would likely have assigned all of these variants to a single clump.


# *********************
# Generate locus plot
# for rs7286215
# *********************

# Load locuszoom function
source("/data/module1/8_independentLociPrac/locus_zoom.R")

# Load data
meta=read.table("Height.meta_chr22",h=T)
ld_ref=read.table("/data/module1/8_independentLociPrac/rs7286215.LDr2.ld", h=T) 
gene_list=read.delim("/data/module1/8_independentLociPrac/UCSC_GRCh37_Genes_UniqueList.txt", stringsAsFactors=F, h=T)

# Format GWAS sumstats
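# (As above, no extra formatting is needed here if the expected columns are already present.)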


# Create LocusZoom-like plot - EUR
locus.zoom(data=meta,
           snp="rs7286215", 
           ignore.lead=T,
           offset_bp = 500000,
           ld.file=ld_ref,
           genes.data=gene_list,
           plot.title = "Locus plot for rs7286215",
           file.name="rs7286215_locusPlot.jpg")