GCTA-mBAT-combo: performs a set-based association analysis for human complex traits using summary-level data from GWAS and linkage disequilibrium (LD) data from a reference sample with individual-level genotypes. Please see Li et al. (doi: https://doi.org/10.1016/j.ajhg.2022.12.006 ) for details about the method.
In this practical we will run through the basic steps of performing a mBAT-combo gene-based analysis: Gene analysis and subsequent comparison of its results with traditional GWAS TOP-SNP analysis to find the novel genes, genes identified only by mBAT or mBAT-combo was called if it passed genome-wide significance threshold at gene level (around p-mBAT < 2.7 × 10−6) and there was no genome-wide significant SNP (around p-GWAS > 5 × 10−8) within 50kb of the gene.
mBAT-combo performs both mBAT and fastBAT tests and produces the combined p-value, which is implemented in GCTA latest version, https://yanglab.westlake.edu.cn/software/gcta/#Download, using the following command to download it. Note: Please do not run below, we have prepared the software.
wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip
unzip gcta-1.94.1-linux-kernel-3-x86_64.zip
The following bash command in Terminal outputs the P-values for all mBAT-combo, mBAT and fastBAT tests.
Run mBAT-combo in human height (Wood et al data) in Terminal using
the following bash command to give a try. For time saving, if no errors
report, please use the keyboard shortcuts Ctrl+Z
to stop
the running command. We have provided the final
outputHeight_Wood_2014.gene.assoc.mbat
in the directory:
/data/module1/10_geneTests/
cd /data/module1/10_geneTests/
summary_data="/data/module1/10_geneTests/Height_Wood_2014.ma"
path="/data/module1/10_geneTests/gcta-1.94.1-linux-kernel-3-x86_64"
bfile_prefix="/data/module1/Ref_bfile/g1000_eur"
op_path="/home/username" # Or the path you want to save the output
# Output mBAT-combo p-values:
${path}/gcta-1.94.1 --bfile $bfile_prefix --mBAT-combo $summary_data --mBAT-gene-list hg19.pos.ensgid.gene.loc --mBAT-print-all-p --out $op_path/Height_Wood_2014
–mBAT-combo summary_data
Invoke mBAT-combo test and input the summary-level statistics from a
meta-analysis GWAS (or a single GWAS). The input file
summary_data
has columns including SNP ID, the effect
allele, the other allele, frequency of the effect allele, effect size,
standard error, p-value and sample size. The headers are not keywords
and will be omitted by the program. Important: A1 needs to be
the effect allele with A2 being the other allele and
freq should be the frequency of A1.
–bfile test
Input the plink bfile as the reference set for LD estimation.
–mBAT-gene-list gene_list.txt
Input gene list with gene start and end positions. To gain the largest number of genes without duplicates, we used ensgid (ENSEMBL gene identifier) in our gene list for annoation.
We offer protein coding genes with unique gene ensgid based on the release from GENCODE (v40) in the genome build of hg19 and hg38 (https://www.gencodegenes.org/human/) (credit to Patrick F. Sullivan). The columns are chromosome ID, gene start position, gene end position, and ensgid.
You can use the gene ID convert files below to map ensgid to symbol gene names for the genome build of hg19 and hg38:
–mBAT-svd-gamma 0.9
The minimum proportion of the total variance in the LD matrix explained by the top principal components used in the mBAT test. The default value is 0.9.
–mBAT-wind 50
Used in conjunction with –mBAT-gene-list to define a gene region. By default, a gene region is defined as +-50kb of UTRs of a gene.
–diff-freq 0.2
Threshold for allele frequency difference at effect alleles (A1) between GWAS and reference data sets. SNPs with allele frequency difference greater than the threshold will be removed. The default value is 0.2. You can turn off the allele frequency difference filtering by setting this value to 1.
–fastBAT-ld-cutoff 0.9
Threshold for LD r-squared value for LD pruning in fastBAT only. The default value is 0.9. You can turn off LD pruning by setting this value to 1.
–mBAT-write-snpset
Write the sets of SNPs (and the corresponding Genes) included in the analysis.
Height_Wood_2014.gene.assoc.mbat
# Please set directory firstly
# setwd("/data/module1/10_geneTests/")
library(dplyr)
library(data.table)
library(ggplot2)
library("CMplot")
library(ggvenn)
library(miamiplot)
requireNamespace("RColorBrewer", quietly = TRUE)
library(tidyverse)
res = fread("Height_Wood_2014.gene.assoc.mbat")
head(res)
When --mBAT-print-all-p
is used, this output file will
include the results from all mBAT, fastBAT, and mBAT-combo tests.
Without this flag, output of .gene.assoc.mbat file will only give
results of mBAT-combo.
GCTA-mBAT-combo perform QC as well and output the sets of bad SNPs in
summary data that did not pass QC: .badsnps
and
.freq.badsnps
Codes and output given for time saving: summary_data_fig_pre.Rdata
# The following do not need to be run
#chr_snps = fread("g1000_eur.bim")
#Removed_snps1 = fread("Height_Wood_2014.freq.badsnps")
#Removed_snps2 = fread("Height_Wood_2014.badsnps")
#QCed_chr_snps = chr_snps %>%
# filter(!V1 %in% Removed_snps1$SNP) %>%
# filter(!V1 %in% Removed_snps2$SNP)
#names(QCed_chr_snps) = c("Chromosome","SNP","BP","Position","A1","A2")
#summary_data = fread("Height_Wood_2014.ma") %>% filter(SNP %in% QCed_chr_snps$SNP)
#summary_data_fig_pre = inner_join(summary_data,QCed_chr_snps %>%
# select(Chromosome,SNP,Position))
#save(summary_data_fig_pre,file = "summary_data_fig_pre.Rdata")
Show GWAS SNPs on 22 Chromosomes
Chromosome density is shown on the bottom of Manhattan plot, density values shown in legend by color.
Sometimes we want to highlight genes(just use random genes as example below) that show significance by mBAT-combo with biologincal mechanisms.
# Load the pre-processed GWAS summary data
load("summary_data_fig_pre.Rdata")
# Calculate the Genome-wide significance threshold in SNP-level
threshold_sig = 0.05/nrow(summary_data_fig_pre)
# Calculate the significance threshold in gene-level
threshold_sig_gene = 0.05/nrow(res)
Highlight_genes = c("ENSG00000159640",
"ENSG00000187626")
CMplot(
res %>% select(Gene, Chr, End, P_mBATcombo),
plot.type = "m",
LOG10 = TRUE,
highlight = Highlight_genes,
highlight.text = Highlight_genes,
highlight.col = "#FC4E07",
highlight.cex = 1,
highlight.pch = 19,
threshold = c(threshold_sig_gene),
threshold.lty = c(1,2),
threshold.lwd = c(1,2),
width = 23,
height = 6,
threshold.col = c("red","darkblue"),
amplify = FALSE,
dpi = 300,
file.output = FALSE,
verbose = TRUE
)
## Rectangular Manhattan plotting P_mBATcombo.
## Compare power of mBAT-combo to the standard GWAS approach
The standard GWAS approach: signals based TOP-P value of SNP within each gene region vs Aggregating all SNPs’ signals within the gene by mBAT-combo
ggplot(res, aes(-log10(TopSNP_Pvalue), -log10(P_mBATcombo))) +
geom_point() +
geom_abline(slope = 1, linetype = "dashed") +
geom_hline(yintercept = -log10(2e-6), linetype = "dotted", color = "red") +
geom_vline(xintercept = -log10(5e-8), linetype = "dotted", color = "blue") +
theme(
axis.text.y = element_text(face = "bold"),
legend.title = element_blank(),
text = element_text(size = 14),
strip.background = element_rect(color = "black", linetype = "solid"),
strip.text.x = element_text(face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")
)+
labs(x = "-log(P) from TOP-P SNP",y="-log(P) from mBAT-combo")
The significance threshold: blue line is based on the genome-wide significance threshold of 5e-8(0.05/number of SNPs) and red line is based on Bonferroni correction at gene level (0.05/number of genes).
We could see that 72.2% significant genes, regardless methods, could be detected by both methods, using mBAT-combo have identified more genes than just using only Top-P value within the genes, which gains power from aggregating all SNPs’ signal within the gene region.
res_topSNP = res %>% select(Gene, Chr, End,TopSNP_Pvalue) %>% filter(TopSNP_Pvalue<threshold_sig)
res_mBAT_combo = res %>% select(Gene, Chr, End,P_mBATcombo) %>% filter(P_mBATcombo<threshold_sig_gene)
# Create a list of two sets of genes
gene_sets <- list(
"GWAS Top-P value of SNP" = res_topSNP$Gene,
"mBAT-combo" = res_mBAT_combo$Gene
)
# Generate the Venn diagram
ggvenn(gene_sets)
Compare signals based on results by gene-based tests, mBAT and fastBAT ## Highlight genes that show large difference by the two methods
ggplot(res, aes( -log10(P_fastBAT),-log10(P_mBAT))) +
geom_point() +
geom_abline(slope = 1, linetype = "dashed") +
geom_hline(yintercept = -log10(2e-6), linetype = "dotted", color = "red") +
geom_vline(xintercept = -log10(2e-6), linetype = "dotted", color = "red") +
theme(
axis.text.y = element_text(face = "bold"),
legend.title = element_blank(),
text = element_text(size = 14),
strip.background = element_rect(color = "black", linetype = "solid"),
strip.text.x = element_text(face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")
)+
labs(x = "-log(P) from fastBAT",y="-log(P) from mBAT")
df <- res %>%
pivot_longer(
cols = c(P_mBAT, P_fastBAT),
names_to = "method",
values_to = "pval"
) %>%
mutate(
method = case_when(
method == "P_mBAT" ~ "mBAT",
method == "P_fastBAT" ~ "fastBAT",
TRUE ~ method
)
) %>%
select(Gene,Chr,End,Chisq_mBAT,Chisq_fastBAT,No.Eigenvalues,pval,method)
names(df) = c("rsid","chr","pos","beta","se","stat","pval","study")
df = as.data.frame(df)
#Difg = res%>%filter(P_mBAT<1e-6, P_fastBAT>1e-6)
Big_dif_genes = "ENSG00000275052"
mycolors = grDevices::colorRampPalette(RColorBrewer::brewer.pal(12, "Paired"))(22)
plot_data <- prep_miami_data(df, split_by = "study", split_at = "mBAT", p = "pval")
mBAT_labels = plot_data$upper %>%
filter(rsid%in%Big_dif_genes) %>%
mutate(label = paste0(rsid, "\n", pval)) %>%
select(rel_pos, logged_p, label)
fastBAT_labels <- plot_data$lower %>%
filter(rsid%in%Big_dif_genes) %>%
mutate(label = paste0(rsid, "\n", pval)) %>%
select(rel_pos, logged_p, label)
ggmiami(df, split_by = "study", split_at = "mBAT", p = "pval",
upper_ylab = "mBAT", lower_ylab = "fastBAT", chr_colors = mycolors,
genome_line_color = "black", suggestive_line_color = "#A9A9A9",
upper_labels_df = mBAT_labels, lower_labels_df = fastBAT_labels)
We could see that both the consistency and difference between the two methods. Contrasting p-values between mBAT and fastBAT tests can shed light on the presence of masking effects, because mBAT test is more powerful than fastBAT in this case. These genes are likely to have secondary signals with masking effects. Further investigation could be illustrated by COJO. mBAT unique results are more related to masking effect. For more details please check (doi: https://doi.org/10.1016/j.ajhg.2022.12.006).
We further checked one gene that showing big difference in results of mBAT and fastBAT with COJO.
COJO = fread("COJO_ENSG00000275052.tat")
sample = COJO %>% select(!c("Chr","bp","freq","b","se","n","freq_geno"))
sample
Masking = sample$bJ[1] * sample$bJ[2] * sample$LD_r[sample$LD_r != 0]
print(paste0("example masking gene: Effect_snp1 * LD[snp1,snp2] * Effect_snp2 = ",round(Masking, 6)," < 0"))
## [1] "example masking gene: Effect_snp1 * LD[snp1,snp2] * Effect_snp2 = -9.3e-05 < 0"