1 Objectives

Summary-level data sharing is standard for GWAS publications, but formats vary widely. For downstream analysis such as SBayesRC or COJO, properly formatted input is essential.

This practice covers formatting summary statistics to COJO format with basic quality control.


2 Data

Example data: GWAS summary statistics from Wightman et al., 2021 on Alzheimer’s disease.

Download from: https://cncr.nl/research/summary_statistics/

library(data.table)
# read in the data
file.name = "PGCALZ2sumstatsExcluding23andMe.txt"
inputpath = "/data/module5/alz/"
gwas = data.frame(fread(paste0(inputpath, file.name)))
head(gwas)
##   chr PosGRCh37 testedAllele otherAllele           z      p     N
## 1   1     13668            A           G -0.33742041 0.7358 74004
## 2   1     14506            A           G -0.51121574 0.6092 74004
## 3   1     14773            T           C  1.37124049 0.1703 74004
## 4   1     14860            T           G -1.27587418 0.2020 74004
## 5   1     17101            C           G -0.47105704 0.6376 74004
## 6   1     17147            A           G  0.08532879 0.9320 74004


3 Create COJO Format Table

Start with an empty COJO format table. Column names and order are crucial for a lot of softwares.

# make a new table in standard format
n.snp = nrow(gwas)
formatted.gwas = data.frame(
  matrix(NA, nrow = n.snp, ncol = 8)
)
colnames(formatted.gwas) = c("SNP", "A1", "A2", "freq", "b", "se", "p", "N")


4 SNP ID

The commonly used SNP ID is dbSNP IDs, while sometimes there are only physical position of the SNPs and alleles reported, just like the example data we are using here.

There are a lot of dbSNP resources, such as in NCBI, Resource bundle from GATK, 1000Genome study, UK Biobank resource, or HRC imputation panel.

Here we will borrow the snp.info file from an eigen-decomposition data of LD matrices (HapMap3 SNPs only, 1154522 SNPs).

Several LD matrices are available to download from GCTB website.

# read in a reference data for missing information
freq.file="/data/module5/alz/snp.info"
ref.freq = data.frame(fread(freq.file))
head(ref.freq)
##   Chrom         ID GenPos PhysPos A1 A2   A1Freq     N blk
## 1     1  rs4970383      0  838555  A  C 0.247625 20000   1
## 2     1  rs4475691      0  846808  T  C 0.199300 20000   1
## 3     1  rs1806509      0  853954  A  C 0.609000 20000   1
## 4     1  rs7537756      0  854250  G  A 0.209375 20000   1
## 5     1 rs13302982      0  861808  G  A 0.980450 20000   1
## 6     1 rs28576697      0  870645  C  T 0.290475 20000   1

The scripts in this practice are designed for handling common situations. What we need to do is only to input the column names used in raw data.

colname.SNP = "missing"

colname.chr = "chr"
colname.pos = "PosGRCh37"
colname.A1 = "testedAllele"
colname.A2 = "otherAllele"

By matching position and alleles, we will get the dbSNP ID for the SNPs in our predictor, and put it in the formatted data frame.

if(   (colname.SNP != "missing")  ){
  ## if dbSNP ID have been provided, just take it
  formatted.gwas$SNP = gwas[,colname.SNP]
  
}else{
  ## if not, match with position and alleles
  
  gwas$A = pmin(toupper(gwas[,colname.A1]), toupper(gwas[,colname.A2]))
  gwas$B = pmax(toupper(gwas[,colname.A1]), toupper(gwas[,colname.A2]))
  gwas$chrbpAB = paste0(gwas[,colname.chr], "_", gwas[,colname.pos], "_", gwas$A, "_", gwas$B)
  
  ref.freq$A = with(ref.freq, pmin(A1, A2))
  ref.freq$B = with(ref.freq, pmax(A1, A2))
  ref.freq$chrbpAB = paste0(ref.freq$Chrom, "_", ref.freq$PhysPos, "_", ref.freq$A, "_", ref.freq$B)
  
  gwas$matched_RSID = ref.freq[match(gwas$chrbpAB , ref.freq$chrbpAB), "ID"] 
  gwas[ !(gwas$chrbpAB %in%ref.freq$chrbpAB), "matched_RSID"]  =  gwas[ !(gwas$chrbpAB %in%ref.freq$chrbpAB), "chrbpAB" ]  
  
  formatted.gwas$SNP = gwas$matched_RSID
  
}

For the SNPs existing in our LD reference, their dbSNP ID are matched successfully.

head(gwas[(grep("rs", gwas$matched_RSID)),])
##     chr PosGRCh37 testedAllele otherAllele         z          p      N A B
## 657   1    838555            A           C 1.7094917 0.08735992 746978 A C
## 695   1    846808            T           C 1.2342346 0.21711548 761686 C T
## 748   1    853954            C           A 0.1018936 0.91884112 509430 A C
## 749   1    854250            G           A 1.2301642 0.21863562 724514 A G
## 796   1    861808            A           G 0.6042566 0.54567306 661050 A G
## 863   1    870645            C           T 0.6176414 0.53681172 168217 C T
##          chrbpAB matched_RSID
## 657 1_838555_A_C    rs4970383
## 695 1_846808_C_T    rs4475691
## 748 1_853954_A_C    rs1806509
## 749 1_854250_A_G    rs7537756
## 796 1_861808_A_G   rs13302982
## 863 1_870645_C_T   rs28576697
head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
##            SNP A1 A2 freq  b se  p  N
## 657  rs4970383 NA NA   NA NA NA NA NA
## 695  rs4475691 NA NA   NA NA NA NA NA
## 748  rs1806509 NA NA   NA NA NA NA NA
## 749  rs7537756 NA NA   NA NA NA NA NA
## 796 rs13302982 NA NA   NA NA NA NA NA
## 863 rs28576697 NA NA   NA NA NA NA NA

Both the target data and reference data used here are from human genome build 37. Make sure you also use a reference data from the same build as your GWAS sumstat.


5 Alleles

Naming of alleles can be various. They could be named as “EffectAllele” vs “NonEffectAllele”, or “A1” vs. “A2”, or “ReferenceAllele” vs. “AlternativeAllele”, etc. We need to make it clear which allele is the effect allele, by reading the README files or online instructions for the data.

In COJO format GWAS file, we always have effect allele in the column A1.

Ensure alleles are uppercase.

formatted.gwas$A1= toupper(gwas[,colname.A1])

If non-effect allele (A2) is missing, refill from reference:

if(colname.A2 != "missing"){
  
    formatted.gwas$A2 = toupper(gwas[,colname.A2])
  
}else{
    ref.freq$A1A2= paste0(ref.freq$A1, ref.freq$A2)
    gwas$A1A2 = ref.freq[match(gwas[,colname.SNP] , ref.freq$ID),"A1A2"]
    gwas$A1A2 = as.character(gwas$A1A2)
    gwas[,colname.A1] = as.character(gwas[,colname.A1])
    
    # Create new column 'other_allele'
    gwas$other_allele <- mapply(function(a1a2, test_allele) {
      paste0(setdiff(strsplit(a1a2, "")[[1]], test_allele), collapse = "")
    }, gwas$A1A2, gwas[,colname.A1])
    
    formatted.gwas$A2 = gwas$other_allele

}


6 P value

P value is usually reported in the summary data. When p value gets very small values, it is often in scientific format. You may need to convert it if your softwares would recognize it as characters.

colname.p = "p"
formatted.gwas$p = as.numeric(gwas[,colname.p])

head(formatted.gwas[grep("rs", formatted.gwas$SNP),])
##            SNP A1 A2 freq  b se          p  N
## 657  rs4970383  A  C   NA NA NA 0.08735992 NA
## 695  rs4475691  T  C   NA NA NA 0.21711548 NA
## 748  rs1806509  C  A   NA NA NA 0.91884112 NA
## 749  rs7537756  G  A   NA NA NA 0.21863562 NA
## 796 rs13302982  A  G   NA NA NA 0.54567306 NA
## 863 rs28576697  C  T   NA NA NA 0.53681172 NA

If p value column is in -log10 scale, we need convert it back to p, and be careful with the values of “inf”. Although this situation is very rare.


7 Sample Size

Per SNP sample size is often included as a column named “N”. If it is missing in the summary data, we will have to find the sample size of the study from the publication.

Sometimes disease data includes the sample size as two columns, with cases and controls separately. If that’s the case, we will add them up. Be aware that, it’s not recommended to use effective sample size in SBayesRC analysis.

If it is NCHROBS (number chromosomes observed) reported in summary data, we will divide the number by 2.

colname.N = "N"

sample.size.names=unlist(strsplit(colname.N, ","))

if(length(sample.size.names)==1){
  if(is.na(as.numeric(colname.N) )== T) {
    if(colname.N == "NCHROBS"){
      formatted.gwas$N = 0.5* (gwas[, colname.N])
    }else{
      formatted.gwas$N = gwas[, colname.N]
    }
  }else{
    formatted.gwas$N = as.numeric(colname.N)
  }
}else{
  formatted.gwas$N= gwas[, sample.size.names[1]] + gwas[,sample.size.names[2]]
}

head(formatted.gwas[grep("rs", formatted.gwas$SNP),] )
##            SNP A1 A2 freq  b se          p      N
## 657  rs4970383  A  C   NA NA NA 0.08735992 746978
## 695  rs4475691  T  C   NA NA NA 0.21711548 761686
## 748  rs1806509  C  A   NA NA NA 0.91884112 509430
## 749  rs7537756  G  A   NA NA NA 0.21863562 724514
## 796 rs13302982  A  G   NA NA NA 0.54567306 661050
## 863 rs28576697  C  T   NA NA NA 0.53681172 168217


Per SNP sample size can be very different when the data is from a meta analysis.

hist(formatted.gwas[grep("rs", formatted.gwas$SNP),]$N, breaks = 100, main = "", xlab = "N")

We can do a QC by excluding SNPs with too outlying sample size. Since the distribution is different for each study, it can be arbitrary to give it a range of sample size, or a specified number of SD (such as 3 SD).

# example code to use SD as threshold. 

a = mean(formatted.gwas$N)
b = sd(formatted.gwas$N)

N.outlier.snps = formatted.gwas[which(formatted.gwas$N > (a+1*b) | formatted.gwas$N < (a-1*b) ),"SNP"]


Another way of QC is to exclude the SNPs with very low number of sample size. At the same time, we don’t want to lose too many SNPs.

a = mean(formatted.gwas$N)
b = sd(formatted.gwas$N)

N.outlier.snps = formatted.gwas[which(formatted.gwas$N < quantile(formatted.gwas$N, 0.1) ),"SNP"]

We will exclude the SNPs with bottom 10% quantile of sample size.


8 Allele Frequency

Allele frequency is sometimes separately included for cases and controls. If that’s the case, we will do a calculation with the number of cases and controls.

If allele frequency is available, we can compare it to the reference data in a plot, and remove the outliers.

If allele frequency is missing, we can borrow that information from a reference data. This is what happened in our toy data.

colname.freq = "missing"

allele.frequency.names=unlist(strsplit(colname.freq, ","))

This script is designed to solve all the situations.

library(ggplot2)

ref.freq$gwasA1 = formatted.gwas[match(ref.freq$ID, formatted.gwas$SNP),"A1"]
ref.freq$sign = sign(( as.numeric(ref.freq$gwasA1  == ref.freq$A1))-0.5)
ref.freq$gwasA1freq= abs(as.numeric(ref.freq$gwasA1  != ref.freq$A1) - ref.freq$A1Freq)

if(length(allele.frequency.names) == 2) {
  
  ## Allele frequency is sometimes separately included for cases and controls. 
  freq.case = allele.frequency.names[1]
  freq.control = allele.frequency.names[2]
  N.case = sample.size.names[1]
  N.control = sample.size.names[2]
  
  formatted.gwas$freq = ( (gwas[,freq.case] * gwas[,N.case]) + (gwas[,freq.control] * gwas[,N.control]) ) /(gwas[,N.case] + gwas[,N.control])
  ref.freq$freq.in.gwas = formatted.gwas[match(ref.freq$ID, formatted.gwas$SNP) , "freq"]
  
  ## after calculation, we can compare it with reference data and make a plot
  freq.plot = ggplot(data = ref.freq, aes(x = gwasA1freq, y =freq.in.gwas )) + geom_point(size = 0.2) + xlab("AF in reference data") + ylab("AF in GWAS summary statistic")
  ggsave(paste0(file.name, "_AF_plot.png"), freq.plot, height = 8, width = 8)
  
}else{
  
  ## allele frequency is sometimes missing
  if(colname.freq == "missing"){
    formatted.gwas$freq = ref.freq[match( formatted.gwas$SNP, ref.freq$ID),"gwasA1freq"]
    
  }else{
    
    # if freq is available, we will compare it with reference data and make a plot
    formatted.gwas$freq = as.numeric( gwas[,colname.freq] )
    ref.freq$freq.in.gwas = formatted.gwas[match(ref.freq$ID, formatted.gwas$SNP) , "freq"]
    freq.plot = ggplot(data = ref.freq, aes(x = gwasA1freq, y =freq.in.gwas )) + 
      geom_point(size = 0.2) + 
      xlab("AF in reference data") + 
      ylab("AF in GWAS summary statistic")
    ggsave(paste0(file.name, "_AF_plot.png"), freq.plot, height = 8, width = 8)
    
  }
}

head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
##            SNP A1 A2     freq  b se          p      N
## 657  rs4970383  A  C 0.247625 NA NA 0.08735992 746978
## 695  rs4475691  T  C 0.199300 NA NA 0.21711548 761686
## 748  rs1806509  C  A 0.391000 NA NA 0.91884112 509430
## 749  rs7537756  G  A 0.209375 NA NA 0.21863562 724514
## 796 rs13302982  A  G 0.019550 NA NA 0.54567306 661050
## 863 rs28576697  C  T 0.290475 NA NA 0.53681172 168217


Here is an example plot from another data. We can use a threshold (such as 0.2) to exclude outlying SNPs.

This is not the case of our practical data, so I’m just giving an example code here. No need to run it.

freq.outlier = ref.freq[which(abs(ref.freq$gwasA1freq - ref.freq$freq.in.gwas) > 0.2), "ID"]


9 Effect Size

Effect size is usually named as “effect” or “b”.

If it’s reported as OR (odds ratio), we will do a log transformation.

If effect size is missing, but z score is supplied, we will calculate effect size with z score, allele frequency and sample size.

where:
- z is the z-score.
- f is the allele frequency.
- N is the sample size.

colname.b = "missing" 
colname.z = "z"


if(colname.b == "missing" & colname.z != "missing"){
  ## calculate if missing
  formatted.gwas$z = gwas[,colname.z]
  formatted.gwas$b = ( formatted.gwas$z )/( ((2 *(formatted.gwas$freq) * (1 - formatted.gwas$freq) * ( formatted.gwas$N + (formatted.gwas$z)^2)))^0.5 )
}  else if(colname.b == "OR" | colname.b == "odds_ratio"){
  ## convert if OR is provided
  formatted.gwas$b = log(gwas[,colname.b])
}else{
  ## take it if provided
  formatted.gwas$b = gwas[,colname.b]
}

head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
##            SNP A1 A2     freq            b se          p      N         z
## 657  rs4970383  A  C 0.247625 0.0032402820 NA 0.08735992 746978 1.7094917
## 695  rs4475691  T  C 0.199300 0.0025032588 NA 0.21711548 761686 1.2342346
## 748  rs1806509  C  A 0.391000 0.0002068677 NA 0.91884112 509430 0.1018936
## 749  rs7537756  G  A 0.209375 0.0025117509 NA 0.21863562 724514 1.2301642
## 796 rs13302982  A  G 0.019550 0.0037957974 NA 0.54567306 661050 0.6042566
## 863 rs28576697  C  T 0.290475 0.0023455666 NA 0.53681172 168217 0.6176414


10 Standard error

If se is missing, we calculate it from effect size, allele frequency and sample size using the formula:

where:
- z is the z-score.
- f is the allele frequency.
- N is the sample size.

colname.se = "missing"

if(colname.se == "missing"){
  ## calculate if missing
  formatted.gwas$z = sign(formatted.gwas$b) * abs( qnorm(formatted.gwas$p/2) )
  formatted.gwas$se =1/sqrt(2 * formatted.gwas$freq *(1 - formatted.gwas$freq) *(formatted.gwas$N +  (formatted.gwas$z ^ 2)  )  )
  
}else{
  ## take it if provided
  formatted.gwas$se = gwas[,colname.se]
}

head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
##            SNP A1 A2     freq            b          se          p      N
## 657  rs4970383  A  C 0.247625 0.0032402820 0.001895465 0.08735992 746978
## 695  rs4475691  T  C 0.199300 0.0025032588 0.002028187 0.21711548 761686
## 748  rs1806509  C  A 0.391000 0.0002068677 0.002030232 0.91884112 509430
## 749  rs7537756  G  A 0.209375 0.0025117509 0.002041801 0.21863562 724514
## 796 rs13302982  A  G 0.019550 0.0037957974 0.006281764 0.54567306 661050
## 863 rs28576697  C  T 0.290475 0.0023455666 0.003797619 0.53681172 168217
##             z
## 657 1.7094917
## 695 1.2342346
## 748 0.1018936
## 749 1.2301642
## 796 0.6042566
## 863 0.6176414


11 Output

We can exclude the SNPs with missing information, or outlying sample size and allele frequency, before write it into a COJO file.

We usually use .ma as the extension name for COJO files.

## remove SNPs missing allele frequency (here they miss the dbSNP IDs too)
formatted.gwas = formatted.gwas[which(is.na(formatted.gwas$freq) == F) ,]

## remove sample size outliers
formatted.gwas = formatted.gwas[which(!formatted.gwas$SNP %in% N.outlier.snps) ,]

## keep the useful columns only
formatted.gwas = formatted.gwas[,1:8]

## output the COJO file
output = paste0(file.name, ".ma")
write.table(formatted.gwas, file=output, quote = F, sep ="\t", row.names = F)

Now we have the data in COJO format.

head(formatted.gwas)
##            SNP A1 A2     freq            b          se          p      N
## 657  rs4970383  A  C 0.247625 0.0032402820 0.001895465 0.08735992 746978
## 695  rs4475691  T  C 0.199300 0.0025032588 0.002028187 0.21711548 761686
## 748  rs1806509  C  A 0.391000 0.0002068677 0.002030232 0.91884112 509430
## 749  rs7537756  G  A 0.209375 0.0025117509 0.002041801 0.21863562 724514
## 796 rs13302982  A  G 0.019550 0.0037957974 0.006281764 0.54567306 661050
## 863 rs28576697  C  T 0.290475 0.0023455666 0.003797619 0.53681172 168217


12 Final Notes

For automation, use cojo_format.R with column name arguments as demonstrated. We didn’t include QC in this tool since SBayesRC in GCTB has included it.


## input data path
input="/data/module5/alz/PGCALZ2sumstatsExcluding23andMe.txt"
info="/data/module5/alz/snp.info"
exedir="/data/module5/alz/"

## output 
output="PGCALZ2sumstatsExcluding23andMe.txt.ma"

## input column names
A1="testedAllele"
A2="otherAllele"
PVAL="p"
Nsize="N"

## these columns are missing, so we declare them as missing.
SNP="missing"
BETA="missing"
SE="missing"
AF="missing"

## these parameters and their flag in the code should be skipped if they are not provided or needed
Z="z"      
chr="chr"
pos="PosGRCh37"

## run the code
Rscript  ${exedir}/cojo_format.R  \
  --file  ${input}  \
  --out  ${output}   \
  --SNP  $SNP   \
  --chr $chr \
  --pos $pos \
  --A1  $A1    \
  --A2   $A2    \
  --freq  $AF   \
  --pvalue  $PVAL  \
  --beta  $BETA  \
  --se  $SE   \
  --z $Z  \
  --samplesize  $Nsize  \
  --ref  ${info}