GCTA-mBAT-combo: performs a set-based association analysis for human complex traits using summary-level data from GWAS and linkage disequilibrium (LD) data from a reference sample with individual-level genotypes. Please see Li et al. (doi: https://doi.org/10.1016/j.ajhg.2022.12.006 ) for details about the method.

In this practical we will run through the basic steps of performing a mBAT-combo gene-based analysis: Gene analysis and subsequent comparison of its results with traditional GWAS TOP-SNP analysis to find the novel genes, genes identified only by mBAT or mBAT-combo was called if it passed genome-wide significance threshold at gene level (around p-mBAT < 2.7 × 10−6) and there was no genome-wide significant SNP (around p-GWAS > 5 × 10−8) within 50kb of the gene.

mBAT-combo performs both mBAT and fastBAT tests and produces the combined p-value, which is implemented in GCTA latest version, https://yanglab.westlake.edu.cn/software/gcta/#Download, using the following command to download it. Note: Please do not run below, we have prepared the software.

wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip
unzip gcta-1.94.1-linux-kernel-3-x86_64.zip

The following bash command in Terminal outputs the P-values for all mBAT-combo, mBAT and fastBAT tests.

Run mBAT-combo in human height (Wood et al data) in Terminal using the following bash command to give a try. For time saving, if no errors report, please use the keyboard shortcuts Ctrl+Z to stop the running command. We have provided the final outputHeight_Wood_2014.gene.assoc.mbat in the directory: /data/module1/10_geneTests/

cd /data/module1/10_geneTests/
summary_data="/data/module1/10_geneTests/Height_Wood_2014.ma"
path="/data/module1/10_geneTests/gcta-1.94.1-linux-kernel-3-x86_64"
bfile_prefix="/data/module1/Ref_bfile/g1000_eur"
op_path="/home/username" # Or the path you want to save the output
# Output mBAT-combo p-values:
${path}/gcta-1.94.1 --bfile $bfile_prefix --mBAT-combo $summary_data --mBAT-gene-list hg19.pos.ensgid.gene.loc --mBAT-print-all-p --out $op_path/Height_Wood_2014

0.1 Options

–mBAT-combo summary_data

Invoke mBAT-combo test and input the summary-level statistics from a meta-analysis GWAS (or a single GWAS). The input file summary_data has columns including SNP ID, the effect allele, the other allele, frequency of the effect allele, effect size, standard error, p-value and sample size. The headers are not keywords and will be omitted by the program. Important: A1 needs to be the effect allele with A2 being the other allele and freq should be the frequency of A1.

–bfile test

Input the plink bfile as the reference set for LD estimation.

–mBAT-gene-list gene_list.txt

Input gene list with gene start and end positions. To gain the largest number of genes without duplicates, we used ensgid (ENSEMBL gene identifier) in our gene list for annoation.

We offer protein coding genes with unique gene ensgid based on the release from GENCODE (v40) in the genome build of hg19 and hg38 (https://www.gencodegenes.org/human/) (credit to Patrick F. Sullivan). The columns are chromosome ID, gene start position, gene end position, and ensgid.

You can use the gene ID convert files below to map ensgid to symbol gene names for the genome build of hg19 and hg38:

–mBAT-svd-gamma 0.9

The minimum proportion of the total variance in the LD matrix explained by the top principal components used in the mBAT test. The default value is 0.9.

–mBAT-wind 50

Used in conjunction with –mBAT-gene-list to define a gene region. By default, a gene region is defined as +-50kb of UTRs of a gene.

–diff-freq 0.2

Threshold for allele frequency difference at effect alleles (A1) between GWAS and reference data sets. SNPs with allele frequency difference greater than the threshold will be removed. The default value is 0.2. You can turn off the allele frequency difference filtering by setting this value to 1.

–fastBAT-ld-cutoff 0.9

Threshold for LD r-squared value for LD pruning in fastBAT only. The default value is 0.9. You can turn off LD pruning by setting this value to 1.

–mBAT-write-snpset

Write the sets of SNPs (and the corresponding Genes) included in the analysis.

0.1.1 Output file format

Height_Wood_2014.gene.assoc.mbat

# Please set directory firstly
# setwd("/data/module1/10_geneTests/")

library(dplyr)
library(data.table)
library(ggplot2)
library("CMplot")
library(ggvenn)
library(miamiplot)
requireNamespace("RColorBrewer", quietly = TRUE)
library(tidyverse)

res = fread("Height_Wood_2014.gene.assoc.mbat")
head(res)

When --mBAT-print-all-p is used, this output file will include the results from all mBAT, fastBAT, and mBAT-combo tests. Without this flag, output of .gene.assoc.mbat file will only give results of mBAT-combo.

GCTA-mBAT-combo perform QC as well and output the sets of bad SNPs in summary data that did not pass QC: .badsnps and .freq.badsnps

1 Gene-based test[mBAT-combo] is more powerful than the standard GWAS approach

1.1 Q: Why gene-based test has a higher power than the GWAS approach?

1.2 Pre-processing GWAS data: Keep only SNPs on chromosomes and have been kept in mBAT-combo analysis after QC

Codes and output given for time saving: summary_data_fig_pre.Rdata

# The following do not need to be run
#chr_snps = fread("g1000_eur.bim")

#Removed_snps1 = fread("Height_Wood_2014.freq.badsnps")
#Removed_snps2 = fread("Height_Wood_2014.badsnps")

#QCed_chr_snps = chr_snps %>% 
# filter(!V1 %in% Removed_snps1$SNP) %>% 
# filter(!V1 %in% Removed_snps2$SNP)
#names(QCed_chr_snps) = c("Chromosome","SNP","BP","Position","A1","A2")

#summary_data = fread("Height_Wood_2014.ma") %>% filter(SNP %in% QCed_chr_snps$SNP)

#summary_data_fig_pre = inner_join(summary_data,QCed_chr_snps %>% 
#                       select(Chromosome,SNP,Position))
#save(summary_data_fig_pre,file = "summary_data_fig_pre.Rdata")

1.3 Visualize SNP-level signals by standard GWAS by Manhattan plot

Show GWAS SNPs on 22 Chromosomes

Height, Wood et al, 2014 GWAS based Manhattan plot Chromosome density is shown on the bottom of Manhattan plot, density values shown in legend by color.

1.4 Visualize the gene-level signals based on results by gene-based test(mBAT-combo)

Sometimes we want to highlight genes(just use random genes as example below) that show significance by mBAT-combo with biologincal mechanisms.

# Load the pre-processed GWAS summary data
load("summary_data_fig_pre.Rdata")

# Calculate the Genome-wide significance threshold in SNP-level
threshold_sig = 0.05/nrow(summary_data_fig_pre)

# Calculate the significance threshold in gene-level
threshold_sig_gene = 0.05/nrow(res)

Highlight_genes = c("ENSG00000159640",
                    "ENSG00000187626")
CMplot(
  res %>% select(Gene, Chr, End, P_mBATcombo), 
  plot.type = "m",
  LOG10 = TRUE,
  highlight = Highlight_genes,
  highlight.text = Highlight_genes,
  highlight.col = "#FC4E07",
  highlight.cex = 1,
  highlight.pch = 19,
  threshold = c(threshold_sig_gene),
  threshold.lty = c(1,2),
  threshold.lwd = c(1,2),
  width = 23,
  height = 6,
  threshold.col = c("red","darkblue"),
  amplify = FALSE,
  dpi = 300,
  file.output = FALSE,
  verbose = TRUE
)
##  Rectangular Manhattan plotting P_mBATcombo.

## Compare power of mBAT-combo to the standard GWAS approach

The standard GWAS approach: signals based TOP-P value of SNP within each gene region vs Aggregating all SNPs’ signals within the gene by mBAT-combo

ggplot(res, aes(-log10(TopSNP_Pvalue), -log10(P_mBATcombo))) +
  geom_point() +
  geom_abline(slope = 1, linetype = "dashed") +
  geom_hline(yintercept = -log10(2e-6), linetype = "dotted", color = "red") +
  geom_vline(xintercept = -log10(5e-8), linetype = "dotted", color = "blue") +
  theme(
    axis.text.y = element_text(face = "bold"),
    legend.title = element_blank(),
    text = element_text(size = 14),
    strip.background = element_rect(color = "black", linetype = "solid"),
    strip.text.x = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black")
  )+ 
  labs(x = "-log(P) from TOP-P SNP",y="-log(P) from mBAT-combo")

The significance threshold: blue line is based on the genome-wide significance threshold of 5e-8(0.05/number of SNPs) and red line is based on Bonferroni correction at gene level (0.05/number of genes).

1.5 Overlap identified by Gene-based tests (mBAT-combo) vs. GWAS Top-P values within the gene

We could see that 72.2% significant genes, regardless methods, could be detected by both methods, using mBAT-combo have identified more genes than just using only Top-P value within the genes, which gains power from aggregating all SNPs’ signal within the gene region.

res_topSNP = res %>% select(Gene, Chr, End,TopSNP_Pvalue) %>% filter(TopSNP_Pvalue<threshold_sig)
res_mBAT_combo = res %>% select(Gene, Chr, End,P_mBATcombo) %>% filter(P_mBATcombo<threshold_sig_gene)

# Create a list of two sets of genes
gene_sets <- list(
  "GWAS Top-P value of SNP" = res_topSNP$Gene,
  "mBAT-combo" = res_mBAT_combo$Gene
)

# Generate the Venn diagram
ggvenn(gene_sets)

2 Detect genes with masking effects

Compare signals based on results by gene-based tests, mBAT and fastBAT ## Highlight genes that show large difference by the two methods

ggplot(res, aes( -log10(P_fastBAT),-log10(P_mBAT))) +
  geom_point() +
  geom_abline(slope = 1, linetype = "dashed") +
  geom_hline(yintercept = -log10(2e-6), linetype = "dotted", color = "red") +
  geom_vline(xintercept = -log10(2e-6), linetype = "dotted", color = "red") +
  theme(
    axis.text.y = element_text(face = "bold"),
    legend.title = element_blank(),
    text = element_text(size = 14),
    strip.background = element_rect(color = "black", linetype = "solid"),
    strip.text.x = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black")
  )+ 
  labs(x = "-log(P) from fastBAT",y="-log(P) from mBAT")

df <- res %>%
  pivot_longer(
    cols = c(P_mBAT, P_fastBAT),
    names_to = "method",
    values_to = "pval"
  ) %>%
  mutate(
    method = case_when(
      method == "P_mBAT" ~ "mBAT",
      method == "P_fastBAT" ~ "fastBAT",
      TRUE ~ method
    )
  ) %>%
  select(Gene,Chr,End,Chisq_mBAT,Chisq_fastBAT,No.Eigenvalues,pval,method)

names(df) = c("rsid","chr","pos","beta","se","stat","pval","study")
df = as.data.frame(df)
#Difg = res%>%filter(P_mBAT<1e-6, P_fastBAT>1e-6)

Big_dif_genes = "ENSG00000275052"
mycolors = grDevices::colorRampPalette(RColorBrewer::brewer.pal(12, "Paired"))(22)

plot_data <- prep_miami_data(df, split_by = "study", split_at = "mBAT", p = "pval")

mBAT_labels = plot_data$upper %>%
  filter(rsid%in%Big_dif_genes) %>%
  mutate(label = paste0(rsid, "\n", pval)) %>%
  select(rel_pos, logged_p, label) 

fastBAT_labels <- plot_data$lower %>%
  filter(rsid%in%Big_dif_genes) %>%
  mutate(label = paste0(rsid, "\n", pval)) %>%
  select(rel_pos, logged_p, label) 

ggmiami(df, split_by = "study", split_at = "mBAT", p = "pval", 
        upper_ylab = "mBAT", lower_ylab = "fastBAT", chr_colors = mycolors, 
        genome_line_color = "black", suggestive_line_color = "#A9A9A9",
        upper_labels_df = mBAT_labels, lower_labels_df = fastBAT_labels)

We could see that both the consistency and difference between the two methods. Contrasting p-values between mBAT and fastBAT tests can shed light on the presence of masking effects, because mBAT test is more powerful than fastBAT in this case. These genes are likely to have secondary signals with masking effects. Further investigation could be illustrated by COJO. mBAT unique results are more related to masking effect. For more details please check (doi: https://doi.org/10.1016/j.ajhg.2022.12.006).

We further checked one gene that showing big difference in results of mBAT and fastBAT with COJO.

COJO = fread("COJO_ENSG00000275052.tat")
sample = COJO %>% select(!c("Chr","bp","freq","b","se","n","freq_geno"))
sample
Masking = sample$bJ[1] * sample$bJ[2] * sample$LD_r[sample$LD_r != 0]

print(paste0("example masking gene: Effect_snp1 * LD[snp1,snp2] * Effect_snp2 = ",round(Masking, 6)," < 0"))
## [1] "example masking gene: Effect_snp1 * LD[snp1,snp2] * Effect_snp2 = -9.3e-05 < 0"

3 Q: How could we use the prioritized genes and test for enrichment of the set of genes in pre-defined pathways?