Summary-level data sharing is standard for GWAS publications, but formats vary widely. For downstream analysis such as SBayesRC or COJO, properly formatted input is essential.
This practice covers formatting summary statistics to COJO format with basic quality control.
Example data: GWAS summary statistics from Wightman et al., 2021 on Alzheimer’s disease.
Download from: https://cncr.nl/research/summary_statistics/
library(data.table)
# read in the data
file.name = "PGCALZ2sumstatsExcluding23andMe.txt"
inputpath = "/data/module5/alz/"
gwas = data.frame(fread(paste0(inputpath, file.name)))
head(gwas)
## chr PosGRCh37 testedAllele otherAllele z p N
## 1 1 13668 A G -0.33742041 0.7358 74004
## 2 1 14506 A G -0.51121574 0.6092 74004
## 3 1 14773 T C 1.37124049 0.1703 74004
## 4 1 14860 T G -1.27587418 0.2020 74004
## 5 1 17101 C G -0.47105704 0.6376 74004
## 6 1 17147 A G 0.08532879 0.9320 74004
Start with an empty COJO format table. Column names and order are crucial for a lot of softwares.
# make a new table in standard format
n.snp = nrow(gwas)
formatted.gwas = data.frame(
matrix(NA, nrow = n.snp, ncol = 8)
)
colnames(formatted.gwas) = c("SNP", "A1", "A2", "freq", "b", "se", "p", "N")
The commonly used SNP ID is dbSNP IDs, while sometimes there are only physical position of the SNPs and alleles reported, just like the example data we are using here.
There are a lot of dbSNP resources, such as in NCBI, Resource bundle from GATK, 1000Genome study, UK Biobank resource, or HRC imputation panel.
Here we will borrow the snp.info file from an eigen-decomposition data of LD matrices (HapMap3 SNPs only, 1154522 SNPs).
Several LD matrices are available to download from GCTB website.
# read in a reference data for missing information
freq.file="/data/module5/alz/snp.info"
ref.freq = data.frame(fread(freq.file))
head(ref.freq)
## Chrom ID GenPos PhysPos A1 A2 A1Freq N blk
## 1 1 rs4970383 0 838555 A C 0.247625 20000 1
## 2 1 rs4475691 0 846808 T C 0.199300 20000 1
## 3 1 rs1806509 0 853954 A C 0.609000 20000 1
## 4 1 rs7537756 0 854250 G A 0.209375 20000 1
## 5 1 rs13302982 0 861808 G A 0.980450 20000 1
## 6 1 rs28576697 0 870645 C T 0.290475 20000 1
The scripts in this practice are designed for handling common situations. What we need to do is only to input the column names used in raw data.
colname.SNP = "missing"
colname.chr = "chr"
colname.pos = "PosGRCh37"
colname.A1 = "testedAllele"
colname.A2 = "otherAllele"
By matching position and alleles, we will get the dbSNP ID for the SNPs in our predictor, and put it in the formatted data frame.
if( (colname.SNP != "missing") ){
## if dbSNP ID have been provided, just take it
formatted.gwas$SNP = gwas[,colname.SNP]
}else{
## if not, match with position and alleles
gwas$A = pmin(toupper(gwas[,colname.A1]), toupper(gwas[,colname.A2]))
gwas$B = pmax(toupper(gwas[,colname.A1]), toupper(gwas[,colname.A2]))
gwas$chrbpAB = paste0(gwas[,colname.chr], "_", gwas[,colname.pos], "_", gwas$A, "_", gwas$B)
ref.freq$A = with(ref.freq, pmin(A1, A2))
ref.freq$B = with(ref.freq, pmax(A1, A2))
ref.freq$chrbpAB = paste0(ref.freq$Chrom, "_", ref.freq$PhysPos, "_", ref.freq$A, "_", ref.freq$B)
gwas$matched_RSID = ref.freq[match(gwas$chrbpAB , ref.freq$chrbpAB), "ID"]
gwas[ !(gwas$chrbpAB %in%ref.freq$chrbpAB), "matched_RSID"] = gwas[ !(gwas$chrbpAB %in%ref.freq$chrbpAB), "chrbpAB" ]
formatted.gwas$SNP = gwas$matched_RSID
}
For the SNPs existing in our LD reference, their dbSNP ID are matched successfully.
head(gwas[(grep("rs", gwas$matched_RSID)),])
## chr PosGRCh37 testedAllele otherAllele z p N A B
## 657 1 838555 A C 1.7094917 0.08735992 746978 A C
## 695 1 846808 T C 1.2342346 0.21711548 761686 C T
## 748 1 853954 C A 0.1018936 0.91884112 509430 A C
## 749 1 854250 G A 1.2301642 0.21863562 724514 A G
## 796 1 861808 A G 0.6042566 0.54567306 661050 A G
## 863 1 870645 C T 0.6176414 0.53681172 168217 C T
## chrbpAB matched_RSID
## 657 1_838555_A_C rs4970383
## 695 1_846808_C_T rs4475691
## 748 1_853954_A_C rs1806509
## 749 1_854250_A_G rs7537756
## 796 1_861808_A_G rs13302982
## 863 1_870645_C_T rs28576697
head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
## SNP A1 A2 freq b se p N
## 657 rs4970383 NA NA NA NA NA NA NA
## 695 rs4475691 NA NA NA NA NA NA NA
## 748 rs1806509 NA NA NA NA NA NA NA
## 749 rs7537756 NA NA NA NA NA NA NA
## 796 rs13302982 NA NA NA NA NA NA NA
## 863 rs28576697 NA NA NA NA NA NA NA
Both the target data and reference data used here are from human genome build 37. Make sure you also use a reference data from the same build as your GWAS sumstat.
Naming of alleles can be various. They could be named as “EffectAllele” vs “NonEffectAllele”, or “A1” vs. “A2”, or “ReferenceAllele” vs. “AlternativeAllele”, etc. We need to make it clear which allele is the effect allele, by reading the README files or online instructions for the data.
In COJO format GWAS file, we always have effect allele in the column A1.
Ensure alleles are uppercase.
formatted.gwas$A1= toupper(gwas[,colname.A1])
If non-effect allele (A2) is missing, refill from reference:
if(colname.A2 != "missing"){
formatted.gwas$A2 = toupper(gwas[,colname.A2])
}else{
ref.freq$A1A2= paste0(ref.freq$A1, ref.freq$A2)
gwas$A1A2 = ref.freq[match(gwas[,colname.SNP] , ref.freq$ID),"A1A2"]
gwas$A1A2 = as.character(gwas$A1A2)
gwas[,colname.A1] = as.character(gwas[,colname.A1])
# Create new column 'other_allele'
gwas$other_allele <- mapply(function(a1a2, test_allele) {
paste0(setdiff(strsplit(a1a2, "")[[1]], test_allele), collapse = "")
}, gwas$A1A2, gwas[,colname.A1])
formatted.gwas$A2 = gwas$other_allele
}
P value is usually reported in the summary data. When p value gets very small values, it is often in scientific format. You may need to convert it if your softwares would recognize it as characters.
colname.p = "p"
formatted.gwas$p = as.numeric(gwas[,colname.p])
head(formatted.gwas[grep("rs", formatted.gwas$SNP),])
## SNP A1 A2 freq b se p N
## 657 rs4970383 A C NA NA NA 0.08735992 NA
## 695 rs4475691 T C NA NA NA 0.21711548 NA
## 748 rs1806509 C A NA NA NA 0.91884112 NA
## 749 rs7537756 G A NA NA NA 0.21863562 NA
## 796 rs13302982 A G NA NA NA 0.54567306 NA
## 863 rs28576697 C T NA NA NA 0.53681172 NA
If p value column is in -log10 scale, we need convert it back to p, and be careful with the values of “inf”. Although this situation is very rare.
Per SNP sample size is often included as a column named “N”. If it is missing in the summary data, we will have to find the sample size of the study from the publication.
Sometimes disease data includes the sample size as two columns, with cases and controls separately. If that’s the case, we will add them up. Be aware that, it’s not recommended to use effective sample size in SBayesRC analysis.
If it is NCHROBS (number chromosomes observed) reported in summary data, we will divide the number by 2.
colname.N = "N"
sample.size.names=unlist(strsplit(colname.N, ","))
if(length(sample.size.names)==1){
if(is.na(as.numeric(colname.N) )== T) {
if(colname.N == "NCHROBS"){
formatted.gwas$N = 0.5* (gwas[, colname.N])
}else{
formatted.gwas$N = gwas[, colname.N]
}
}else{
formatted.gwas$N = as.numeric(colname.N)
}
}else{
formatted.gwas$N= gwas[, sample.size.names[1]] + gwas[,sample.size.names[2]]
}
head(formatted.gwas[grep("rs", formatted.gwas$SNP),] )
## SNP A1 A2 freq b se p N
## 657 rs4970383 A C NA NA NA 0.08735992 746978
## 695 rs4475691 T C NA NA NA 0.21711548 761686
## 748 rs1806509 C A NA NA NA 0.91884112 509430
## 749 rs7537756 G A NA NA NA 0.21863562 724514
## 796 rs13302982 A G NA NA NA 0.54567306 661050
## 863 rs28576697 C T NA NA NA 0.53681172 168217
Per SNP sample size can be very different when the data is from a meta analysis.
hist(formatted.gwas[grep("rs", formatted.gwas$SNP),]$N, breaks = 100, main = "", xlab = "N")
We can do a QC by excluding SNPs with too outlying sample size. Since the distribution is different for each study, it can be arbitrary to give it a range of sample size, or a specified number of SD (such as 3 SD).
# example code to use SD as threshold.
a = mean(formatted.gwas$N)
b = sd(formatted.gwas$N)
N.outlier.snps = formatted.gwas[which(formatted.gwas$N > (a+1*b) | formatted.gwas$N < (a-1*b) ),"SNP"]
Another way of QC is to exclude the SNPs with very low number of sample size. At the same time, we don’t want to lose too many SNPs.
a = mean(formatted.gwas$N)
b = sd(formatted.gwas$N)
N.outlier.snps = formatted.gwas[which(formatted.gwas$N < quantile(formatted.gwas$N, 0.1) ),"SNP"]
We will exclude the SNPs with bottom 10% quantile of sample size.
Allele frequency is sometimes separately included for cases and controls. If that’s the case, we will do a calculation with the number of cases and controls.
If allele frequency is available, we can compare it to the reference data in a plot, and remove the outliers.
If allele frequency is missing, we can borrow that information from a reference data. This is what happened in our toy data.
colname.freq = "missing"
allele.frequency.names=unlist(strsplit(colname.freq, ","))
This script is designed to solve all the situations.
library(ggplot2)
ref.freq$gwasA1 = formatted.gwas[match(ref.freq$ID, formatted.gwas$SNP),"A1"]
ref.freq$sign = sign(( as.numeric(ref.freq$gwasA1 == ref.freq$A1))-0.5)
ref.freq$gwasA1freq= abs(as.numeric(ref.freq$gwasA1 != ref.freq$A1) - ref.freq$A1Freq)
if(length(allele.frequency.names) == 2) {
## Allele frequency is sometimes separately included for cases and controls.
freq.case = allele.frequency.names[1]
freq.control = allele.frequency.names[2]
N.case = sample.size.names[1]
N.control = sample.size.names[2]
formatted.gwas$freq = ( (gwas[,freq.case] * gwas[,N.case]) + (gwas[,freq.control] * gwas[,N.control]) ) /(gwas[,N.case] + gwas[,N.control])
ref.freq$freq.in.gwas = formatted.gwas[match(ref.freq$ID, formatted.gwas$SNP) , "freq"]
## after calculation, we can compare it with reference data and make a plot
freq.plot = ggplot(data = ref.freq, aes(x = gwasA1freq, y =freq.in.gwas )) + geom_point(size = 0.2) + xlab("AF in reference data") + ylab("AF in GWAS summary statistic")
ggsave(paste0(file.name, "_AF_plot.png"), freq.plot, height = 8, width = 8)
}else{
## allele frequency is sometimes missing
if(colname.freq == "missing"){
formatted.gwas$freq = ref.freq[match( formatted.gwas$SNP, ref.freq$ID),"gwasA1freq"]
}else{
# if freq is available, we will compare it with reference data and make a plot
formatted.gwas$freq = as.numeric( gwas[,colname.freq] )
ref.freq$freq.in.gwas = formatted.gwas[match(ref.freq$ID, formatted.gwas$SNP) , "freq"]
freq.plot = ggplot(data = ref.freq, aes(x = gwasA1freq, y =freq.in.gwas )) +
geom_point(size = 0.2) +
xlab("AF in reference data") +
ylab("AF in GWAS summary statistic")
ggsave(paste0(file.name, "_AF_plot.png"), freq.plot, height = 8, width = 8)
}
}
head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
## SNP A1 A2 freq b se p N
## 657 rs4970383 A C 0.247625 NA NA 0.08735992 746978
## 695 rs4475691 T C 0.199300 NA NA 0.21711548 761686
## 748 rs1806509 C A 0.391000 NA NA 0.91884112 509430
## 749 rs7537756 G A 0.209375 NA NA 0.21863562 724514
## 796 rs13302982 A G 0.019550 NA NA 0.54567306 661050
## 863 rs28576697 C T 0.290475 NA NA 0.53681172 168217
Here is an example plot from another data. We can use a threshold (such as 0.2) to exclude outlying SNPs.
This is not the case of our practical data, so I’m just giving an example code here. No need to run it.
freq.outlier = ref.freq[which(abs(ref.freq$gwasA1freq - ref.freq$freq.in.gwas) > 0.2), "ID"]
Effect size is usually named as “effect” or “b”.
If it’s reported as OR (odds ratio), we will do a log transformation.
If effect size is missing, but z score is supplied, we will calculate effect size with z score, allele frequency and sample size.
where:
- z is the z-score.
- f is the allele frequency.
- N is the sample size.
colname.b = "missing"
colname.z = "z"
if(colname.b == "missing" & colname.z != "missing"){
## calculate if missing
formatted.gwas$z = gwas[,colname.z]
formatted.gwas$b = ( formatted.gwas$z )/( ((2 *(formatted.gwas$freq) * (1 - formatted.gwas$freq) * ( formatted.gwas$N + (formatted.gwas$z)^2)))^0.5 )
} else if(colname.b == "OR" | colname.b == "odds_ratio"){
## convert if OR is provided
formatted.gwas$b = log(gwas[,colname.b])
}else{
## take it if provided
formatted.gwas$b = gwas[,colname.b]
}
head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
## SNP A1 A2 freq b se p N z
## 657 rs4970383 A C 0.247625 0.0032402820 NA 0.08735992 746978 1.7094917
## 695 rs4475691 T C 0.199300 0.0025032588 NA 0.21711548 761686 1.2342346
## 748 rs1806509 C A 0.391000 0.0002068677 NA 0.91884112 509430 0.1018936
## 749 rs7537756 G A 0.209375 0.0025117509 NA 0.21863562 724514 1.2301642
## 796 rs13302982 A G 0.019550 0.0037957974 NA 0.54567306 661050 0.6042566
## 863 rs28576697 C T 0.290475 0.0023455666 NA 0.53681172 168217 0.6176414
If se is missing, we calculate it from effect size, allele frequency and sample size using the formula:
where:
- z is the z-score.
- f is the allele frequency.
- N is the sample size.
colname.se = "missing"
if(colname.se == "missing"){
## calculate if missing
formatted.gwas$z = sign(formatted.gwas$b) * abs( qnorm(formatted.gwas$p/2) )
formatted.gwas$se =1/sqrt(2 * formatted.gwas$freq *(1 - formatted.gwas$freq) *(formatted.gwas$N + (formatted.gwas$z ^ 2) ) )
}else{
## take it if provided
formatted.gwas$se = gwas[,colname.se]
}
head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
## SNP A1 A2 freq b se p N
## 657 rs4970383 A C 0.247625 0.0032402820 0.001895465 0.08735992 746978
## 695 rs4475691 T C 0.199300 0.0025032588 0.002028187 0.21711548 761686
## 748 rs1806509 C A 0.391000 0.0002068677 0.002030232 0.91884112 509430
## 749 rs7537756 G A 0.209375 0.0025117509 0.002041801 0.21863562 724514
## 796 rs13302982 A G 0.019550 0.0037957974 0.006281764 0.54567306 661050
## 863 rs28576697 C T 0.290475 0.0023455666 0.003797619 0.53681172 168217
## z
## 657 1.7094917
## 695 1.2342346
## 748 0.1018936
## 749 1.2301642
## 796 0.6042566
## 863 0.6176414
We can exclude the SNPs with missing information, or outlying sample size and allele frequency, before write it into a COJO file.
We usually use .ma as the extension name for COJO files.
## remove SNPs missing allele frequency (here they miss the dbSNP IDs too)
formatted.gwas = formatted.gwas[which(is.na(formatted.gwas$freq) == F) ,]
## remove sample size outliers
formatted.gwas = formatted.gwas[which(!formatted.gwas$SNP %in% N.outlier.snps) ,]
## keep the useful columns only
formatted.gwas = formatted.gwas[,1:8]
## output the COJO file
output = paste0(file.name, ".ma")
write.table(formatted.gwas, file=output, quote = F, sep ="\t", row.names = F)
Now we have the data in COJO format.
head(formatted.gwas)
## SNP A1 A2 freq b se p N
## 657 rs4970383 A C 0.247625 0.0032402820 0.001895465 0.08735992 746978
## 695 rs4475691 T C 0.199300 0.0025032588 0.002028187 0.21711548 761686
## 748 rs1806509 C A 0.391000 0.0002068677 0.002030232 0.91884112 509430
## 749 rs7537756 G A 0.209375 0.0025117509 0.002041801 0.21863562 724514
## 796 rs13302982 A G 0.019550 0.0037957974 0.006281764 0.54567306 661050
## 863 rs28576697 C T 0.290475 0.0023455666 0.003797619 0.53681172 168217
For automation, use cojo_format.R
with column name
arguments as demonstrated. We didn’t include QC in this tool since
SBayesRC in GCTB has included it.
## input data path
input="/data/module5/alz/PGCALZ2sumstatsExcluding23andMe.txt"
info="/data/module5/alz/snp.info"
exedir="/data/module5/alz/"
## output
output="PGCALZ2sumstatsExcluding23andMe.txt.ma"
## input column names
A1="testedAllele"
A2="otherAllele"
PVAL="p"
Nsize="N"
## these columns are missing, so we declare them as missing.
SNP="missing"
BETA="missing"
SE="missing"
AF="missing"
## these parameters and their flag in the code should be skipped if they are not provided or needed
Z="z"
chr="chr"
pos="PosGRCh37"
## run the code
Rscript ${exedir}/cojo_format.R \
--file ${input} \
--out ${output} \
--SNP $SNP \
--chr $chr \
--pos $pos \
--A1 $A1 \
--A2 $A2 \
--freq $AF \
--pvalue $PVAL \
--beta $BETA \
--se $SE \
--z $Z \
--samplesize $Nsize \
--ref ${info}