1 Objective

It has become a standard to share and make publicly available the summary-level data when publishing a GWAS study. However, there is not a unified format of the summay-level data. To make a gwas summary data into a proper input file for downstream analysis, such as SBayesRC and COJO, we need to format it and check the quality.

In this practice, we will go through all the aspects in formatting, QC and solving missing information. .

2 Data

The example data used here is the summary statistics from A genome-wide association study with 1,126,563 individuals identifies new risk loci for Alzheimer’s disease from Wightman et al., 2021.

It’s available to download from https://cncr.nl/research/summary_statistics/. For the ease of computation, I only kept chromosome 22 in the practice.

# read in the data
file.name = "/data/module5/alz/chr22_from_PGCALZ2sumstatsExcluding23andMe.txt"
gwas = data.frame(fread(file.name))
head(gwas)
##   chr PosGRCh37 testedAllele otherAllele           z         p     N
## 1  22  16054713            T           C -0.02557039 0.9796000 63926
## 2  22  16055122            T           G  0.84605261 0.3975234  4305
## 3  22  16199921            T           C  0.22625898 0.8210000  1070
## 4  22  16208755            T           C  2.34046003 0.0192600 63926
## 5  22  16281208            T           G  0.04312737 0.9656000 63926
## 6  22  16289869            C           G  0.46420623 0.6425000 63926



We can then have a check at how many SNPs are in the data. The example data is chr22 only. You may see 1 million to 20 million SNPs in a full GWAS data, but it really varies. If your data shows only thousands of SNPs, it’s likely to contain genome-wide significant SNPs only. You can also find it out from the publication or a README file.

nrow(gwas)
## [1] 173882

We will start a new data frame for the formatted data. We will transfer the corresponding information from raw data to each column in the following sections.

# make a new table in stardard format
n.snp = nrow(gwas)
formatted.gwas = data.frame(
  matrix(NA, nrow = n.snp, ncol = 8)
)
colnames(formatted.gwas) = c("SNP", "A1", "A2", "freq", "b", "se", "p", "N")

3 SNP ID

The commonly used SNP ID is dbSNP IDs, while sometimes there are only physical position of the SNPs and alleles, such as the example data we are using here.

There are a lot of resources for dbSNP information, such as in NCBI, Resource bundle from GATK, 1000Genome study or Haplotype Reference Consortium.

The example data used here are from human genome build 37, so we will use a reference data from the same build to infer the IDs.

If the gwas data you are working are is using build 38, make sure you also use a reference data from build 38.

For convenience, I will borrow the snp.info file from eigen-decomposed LD reference to refill the dbSNP IDs. Some of the SNPs in the summary data don’t exist in this reference data, so their information can’t be refilled in this practice, but it doesn’t matter to my further analysis of profiling PRS using the SBayesRC predictors.

# read in a reference data for missing information
freq.file="/data/module5/alz/chr22_snp.info"
ref.freq = data.frame(fread(freq.file))
head(ref.freq)
##   Chrom          ID GenPos  PhysPos A1 A2   A1Freq     N blk
## 1    22   rs4010437      0 16924779  C  T 0.010000 20000 582
## 2    22 rs139656795      0 16926925  C  G 0.013300 20000 582
## 3    22  rs62042126      0 16929771  G  C 0.020775 20000 582
## 4    22   rs4010372      0 16940346  T  G 0.010000 20000 582
## 5    22 rs143291223      0 16959721  T  C 0.010000 20000 582
## 6    22  rs75959072      0 17011656  C  G 0.012725 20000 582

By matching chromosome, bp position and allelele, we will get the dbSNP ID for the SNPs in our predictor, and put it in the formatted data frame.
Since the two alleles of a SNP can be regarded as effect allele or reference allele by chance, we will do a bit of trick in the matching by aligning them in character sequence.

colname.SNP = "missing"

colname.chr = "chr"
colname.pos = "PosGRCh37"
colname.A1 = "testedAllele"
colname.A2 = "otherAllele"
if(   (colname.SNP != "missing")  ){
  
  formatted.gwas$SNP = gwas[,colname.SNP]
  
}else{
  
  gwas$A = pmin(toupper(gwas[,colname.A1]), toupper(gwas[,colname.A2]))
  gwas$B = pmax(toupper(gwas[,colname.A1]), toupper(gwas[,colname.A2]))
  gwas$chrbpAB = paste0(gwas[,colname.chr], "_", gwas[,colname.pos], "_", gwas$A, "_", gwas$B)
  
  ref.freq$A = with(ref.freq, pmin(A1, A2))
  ref.freq$B = with(ref.freq, pmax(A1, A2))
  ref.freq$chrbpAB = paste0(ref.freq$Chrom, "_", ref.freq$PhysPos, "_", ref.freq$A, "_", ref.freq$B)
  
  gwas$matched_RSID = ref.freq[match(gwas$chrbpAB , ref.freq$chrbpAB), "ID"] 
  gwas[ !(gwas$chrbpAB %in%ref.freq$chrbpAB), "matched_RSID"]  =  gwas[ !(gwas$chrbpAB %in%ref.freq$chrbpAB), "chrbpAB" ]  
  
  formatted.gwas$SNP = gwas$matched_RSID
  
}

head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
##             SNP A1 A2 freq  b se  p  N
## 664   rs4010437 NA NA   NA NA NA NA NA
## 672 rs139656795 NA NA   NA NA NA NA NA
## 683  rs62042126 NA NA   NA NA NA NA NA
## 751   rs4010372 NA NA   NA NA NA NA NA
## 840 rs143291223 NA NA   NA NA NA NA NA
## 891  rs75959072 NA NA   NA NA NA NA NA

4 Alleles

Naming of alleles can be various. They could be named as “EffectAllele” vs “NonEffectAllele”, or “A1” vs. “A2”, or “ReferenceAllele” vs. “AlternativeAllele”, etc. We need to make it clear which allele is the effect allele, by reading the README files or online instructions for the data.

In cojo format gwas file, A1 is always the effect allele.

If the alleles are in lower case, make sure you make them uppercase in the formattted data.

formatted.gwas$A1= toupper(gwas[,colname.A1])
formatted.gwas$A2 = toupper(gwas[,colname.A2])

Alleles in a SNP must match up to the alleles in LD reference data. Most softwares do the allele match up for you, but you can also give it a check to make sure the same dbSNP ID in your data have the exactly same two alleles as in the LD reference data.

5 p value

P value is commonly reported in the summary data. We can just take the column.
The p value can get very small values, and it is in scientific format. Sometimes some softwares would recognize it as characters. We can use as.numeric to force it as numeric.

colname.p = "p"
formatted.gwas$p = as.numeric(gwas[,colname.p])

head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
##             SNP A1 A2 freq  b se         p  N
## 664   rs4010437  C  T   NA NA NA 0.0557100 NA
## 672 rs139656795  C  G   NA NA NA 0.4451000 NA
## 683  rs62042126  G  C   NA NA NA 0.8018050 NA
## 751   rs4010372  T  G   NA NA NA 0.4560000 NA
## 840 rs143291223  T  C   NA NA NA 0.4588000 NA
## 891  rs75959072  C  G   NA NA NA 0.9132982 NA

6 Sample Size

Sample size is sometimes missing in the summary data, when we will have to find it out from the publication.

Sometimes disease data would like to include the sample size as two columns, as cases and controls separately. If that’s the case, we will add them up, or calculate effective sample size. Be aware that, it’s recommended to use the real sample size (case + control) in SBayesRC method.

If it is NCHROBS (number chromosomes observed) reported in summary data, we will divide the number by 2.

colname.N = "N"

sample.size.names=unlist(strsplit(colname.N, ","))

if(length(sample.size.names)==1){
  if(is.na(as.numeric(colname.N) )== T) {
    if(colname.N == "NCHROBS"){
      formatted.gwas$N = 0.5* (gwas[, colname.N])
    }else{
      formatted.gwas$N = gwas[, colname.N]
    }
  }else{
    formatted.gwas$N = as.numeric(colname.N)
  }
}else{
  formatted.gwas$N= gwas[, sample.size.names[1]] + gwas[,sample.size.names[2]]
}

head(formatted.gwas)
##               SNP A1 A2 freq  b se         p     N
## 1 22_16054713_C_T  T  C   NA NA NA 0.9796000 63926
## 2 22_16055122_G_T  T  G   NA NA NA 0.3975234  4305
## 3 22_16199921_C_T  T  C   NA NA NA 0.8210000  1070
## 4 22_16208755_C_T  T  C   NA NA NA 0.0192600 63926
## 5 22_16281208_G_T  T  G   NA NA NA 0.9656000 63926
## 6 22_16289869_C_G  C  G   NA NA NA 0.6425000 63926



Per SNP sample size can be very different when the data is from a study of meta analysis.

hist(formatted.gwas$N, breaks = 200)

We can do a QC by excluding SNPs with too different sample size. Since the distribution is different for each study, it can be arbitrary to give it a cutline of sample size, or a specified number of SD.

In the example data, SD of sample size is very big, and the distribution is skewed. It’s not a good idea to use SD to defined the threshold.

# example code to use SD as threshold. 
a = mean(formatted.gwas$N)
b = sd(formatted.gwas$N)
N.outlier.snps = formatted.gwas[which(formatted.gwas$N > (a+1*b) | formatted.gwas$N < (a-1*b) ),"SNP"]



Another way of QC is to exclude the SNPs with very low number of sample size. At the same time, we don’t want to lose too many SNPs.

N.outlier.snps.1 = formatted.gwas[which(formatted.gwas$N < quantile(formatted.gwas$N, 0.1) ),"SNP"]
N.outlier.snps.2 = formatted.gwas[which(formatted.gwas$N < quantile(formatted.gwas$N, 0.3) ),"SNP"]

By excluding the bottom 10% quantile, we will exclude 15080, and the lowest sample size will be 74004. The threshold is marked in blue in the figure below.

By excluding the bottom 30% quantile, we will exclude 52163, and the lowest sample size will be 262579. As shown in red in the figure below.

ggplot(data = formatted.gwas, aes(x= N)) + 
  geom_histogram() + 
  geom_vline(xintercept = quantile(formatted.gwas$N, 0.1), color = "blue") + 
  geom_vline(xintercept = quantile(formatted.gwas$N, 0.3), color = "red") 

There is not a standardized way of sample size QC, and the decision also depends on the downstream analysis. By checking the distribution of it, we can get a clue of what to fix when you get a killed job in the next step.

7 Allele Frequency

The two alleles in reference data and in gwas summary data could be swapped. We will firstly calculate the effect allele frequency in the reference data.

ref.freq$gwasA1 = formatted.gwas[match(ref.freq$ID, formatted.gwas$SNP),"A1"]
ref.freq$sign = sign(( as.numeric(ref.freq$gwasA1  == ref.freq$A1))-0.5)
ref.freq$gwasA1freq= abs(as.numeric(ref.freq$gwasA1  != ref.freq$A1) - ref.freq$A1Freq)



Allele frequency is sometimes separately included for cases and controls. If that’s the case, we will do a calculation with the number of cases and controls.

\[ f = \frac{N_{\text{case}} f_{\text{case}} + N_{\text{ctrl}} f_{\text{ctrl}}}{N_{\text{case}} + N_{\text{ctrl}}} \] - f is the allele frequency.
- N is the sample size

If allele frequency is available, we can compare it to the reference data in a plot, and remove the outliers.

If allele frequency is missing, we can borrow that information from a reference data. This is what happened in our toy data.

colname.freq = "missing"

allele.frequency.names=unlist(strsplit(colname.freq, ","))

if(length(allele.frequency.names) == 2) {
  
  ## Allele frequency is sometimes separately included for cases and controls. 
  freq.case = allele.frequency.names[1]
  freq.control = allele.frequency.names[2]
  N.case = sample.size.names[1]
  N.control = sample.size.names[2]
  
  formatted.gwas$freq = ( (gwas[,freq.case] * gwas[,N.case]) + (gwas[,freq.control] * gwas[,N.control]) ) /(gwas[,N.case] + gwas[,N.control])
  ref.freq$freq.in.gwas = formatted.gwas[match(ref.freq$ID, formatted.gwas$SNP) , "freq"]
  freq.plot = ggplot(data = ref.freq, aes(x = gwasA1freq, y =freq.in.gwas )) + geom_point(size = 0.2) + xlab("AF in reference data") + ylab("AF in GWAS summary statistic")
  ggsave(paste0(file.name, "_AF_plot.png"), freq.plot, height = 8, width = 8)
  
}else{
  
  ## allele frequency is sometimes missing
  if(colname.freq == "missing"){
    formatted.gwas$freq = ref.freq[match( formatted.gwas$SNP, ref.freq$ID),"gwasA1freq"]
    
  }else{
    
    # if freq is available, we will compare it with reference data and make a plot
    formatted.gwas$freq = as.numeric( gwas[,colname.freq] )
    ref.freq$freq.in.gwas = formatted.gwas[match(ref.freq$ID, formatted.gwas$SNP) , "freq"]
    freq.plot = ggplot(data = ref.freq, aes(x = gwasA1freq, y =freq.in.gwas )) + 
      geom_point(size = 0.2) + 
      xlab("AF in reference data") + 
      ylab("AF in GWAS summary statistic")
    ggsave(paste0(file.name, "_AF_plot.png"), freq.plot, height = 8, width = 8)
    
  }
}

head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
##             SNP A1 A2     freq  b se         p      N
## 664   rs4010437  C  T 0.010000 NA NA 0.0557100   7409
## 672 rs139656795  C  G 0.013300 NA NA 0.4451000  74004
## 683  rs62042126  G  C 0.020775 NA NA 0.8018050 145339
## 751   rs4010372  T  G 0.010000 NA NA 0.4560000  74004
## 840 rs143291223  T  C 0.010000 NA NA 0.4588000  74004
## 891  rs75959072  C  G 0.012725 NA NA 0.9132982 145339



Here is an example plot from another data. You can use a difference threshold 0.2 to exclude outlying SNPs.

This is not the case of our practical data, so I’m just giving an example code here:

## example command. Don't run it.
freq.outlier = ref.freq[which(abs(ref.freq$gwasA1freq - ref.freq$freq.in.gwas) > 0.2), "ID"]

8 Effect Size

Effect size is usually named as “effect” or “b”. If it’s OR, odds ratio, we will do a log transformation.

If effect size is missing, and only z score is supplied, we will calculate effect size with z score, allele frequency and sample size, assuming phenotypic variance is equal to 1.

\[ b = \frac{z}{\sqrt{2 \cdot f \cdot (1 - f) \cdot (N + z^2)}} \]

where:
- z is the z-score.
- f is the allele frequency.
- N is the sample size.

colname.b = "missing" 
colname.z = "z"


if(colname.b == "missing" & colname.z != "missing"){
  formatted.gwas$z = gwas[,colname.z]
  formatted.gwas$b = ( formatted.gwas$z )/( ((2 *(formatted.gwas$freq) * (1 - formatted.gwas$freq) * ( formatted.gwas$N + (formatted.gwas$z)^2)))^0.5 )
}  else if(colname.b == "OR" | colname.b == "odds_ratio"){
  formatted.gwas$b = log(gwas[,colname.b])
}else{
  formatted.gwas$b = gwas[,colname.b]
}

head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
##             SNP A1 A2     freq            b se         p      N          z
## 664   rs4010437  C  T 0.010000  0.157929336 NA 0.0557100   7409  1.9132973
## 672 rs139656795  C  G 0.013300 -0.017326411 NA 0.4451000  74004 -0.7636095
## 683  rs62042126  G  C 0.020775  0.003264196 NA 0.8018050 145339  0.2510118
## 751   rs4010372  T  G 0.010000 -0.019474057 NA 0.4560000  74004 -0.7454495
## 840 rs143291223  T  C 0.010000 -0.019353227 NA 0.4588000  74004 -0.7408243
## 891  rs75959072  C  G 0.012725  0.001801737 NA 0.9132982 145339  0.1088793

9 Standard error of effect size

If se is missing, we calculate it from effect size, allele frequency and sample size using the formula, assuming phenotypic variance is equal to 1.

\[ \text{se} = \frac{1}{\sqrt{2 \cdot f \cdot (1 - f) \cdot (N + z^2)}} \]

where:
- z is the z-score.
- f is the allele frequency.
- N is the sample size.

colname.se = "missing"

if(colname.se == "missing"){
  
  formatted.gwas$z = sign(formatted.gwas$b) * abs( qnorm(formatted.gwas$p/2) )
  formatted.gwas$se =1/sqrt(2 * formatted.gwas$freq *(1 - formatted.gwas$freq) *(formatted.gwas$N +  (formatted.gwas$z ^ 2)  )  )
  
}else{
  
  formatted.gwas$se = gwas[,colname.se]
  
}

head(formatted.gwas[(grep("rs", formatted.gwas$SNP)),])
##             SNP A1 A2     freq            b         se         p      N
## 664   rs4010437  C  T 0.010000  0.157929336 0.08254302 0.0557100   7409
## 672 rs139656795  C  G 0.013300 -0.017326411 0.02269015 0.4451000  74004
## 683  rs62042126  G  C 0.020775  0.003264196 0.01300415 0.8018050 145339
## 751   rs4010372  T  G 0.010000 -0.019474057 0.02612391 0.4560000  74004
## 840 rs143291223  T  C 0.010000 -0.019353227 0.02612391 0.4588000  74004
## 891  rs75959072  C  G 0.012725  0.001801737 0.01654802 0.9132982 145339
##              z
## 664  1.9132973
## 672 -0.7636095
## 683  0.2510118
## 751 -0.7454495
## 840 -0.7408243
## 891  0.1088793

10 output into cojo format

We can exclude the SNPs with missing information, and outliers in sample size and allele frequency, before write it into a cojo file.

output = paste0(file.name, ".ma")

formatted.gwas = formatted.gwas[which(is.na(formatted.gwas$freq) == F) ,]

formatted.gwas = formatted.gwas[which(!formatted.gwas$SNP %in% N.outlier.snps) ,]

formatted.gwas = formatted.gwas[,1:8]

write.table(formatted.gwas, file=output, quote = F, sep ="\t", row.names = F)

This is the first few lines of the formatted and QCed data:

head(formatted.gwas)
##              SNP A1 A2     freq             b          se         p      N
## 1080 rs111578332  G  T 0.033525 -0.0038004768 0.005508167 0.4902123 508623
## 1084   rs7287620  G  A 0.033650 -0.0034497079 0.005498126 0.5303752 508652
## 1096 rs117062864  A  T 0.033125 -0.0026175227 0.005540070 0.6365906 508643
## 1115   rs5746647  T  G 0.947300  0.0007610451 0.003623959 0.8336647 762615
## 1188   rs5747988  G  A 0.927300  0.0014918604 0.004318165 0.7297299 397756
## 1199   rs5747999  A  C 0.831600  0.0001294686 0.003000802 0.9655862 396496



We usually do extra QC and impute the missing SNPs before doing SBayesRC analysis, but both the R version and GCTB version of SBayesRC provide functions to do them for you.