## Overview

### About

GCTB is a software tool that comprises a family of Bayesian linear mixed models for complex trait analyses using genome-wide SNPs. It was developed to simultaneously estimate the joint effects of all SNPs and the genetic architecture parameters for a complex trait, including SNP-based heritability, polygenicity and the joint distribution of effect sizes and minor allele frequencies.

Version 2.0 of the GCTB software includes summary-data-based versions of the individual-level data Bayesian linear mixed models previsouly implemented.

### Credits

Jian Zeng developed the software with supports from Jian Yang, Futao Zhang and Zhili Zheng.

Part of the code are adopted from *GCTA* and *GenSel*.

Luke Lloyd-Jones contributed to the BayesR module with supports from Jian Zeng and Mike Goddard.

Version 2.0 of the software was developed by Jian Zeng with contributions from Luke Lloyd-Jones.

### Questions and Help Requests

If you have any bug reports or questions, please send an email to Jian Zeng (j.zeng@uq.edu.au), Luke Lloyd-Jones (luke.lloydjones@uqconnect.edu.au) or Jian Yang (jian.yang@uq.edu.au).

### Citation

Zeng et al. (2018) Signatures of negative selection in the genetic architecture of human complex traits.
*Nature Genetics*, doi: 10.1038/s41588-018-0101-4.

#### SBayesR

Lloyd-Jones, Zeng et al. (2019) Improved polygenic prediction by Bayesian multiple regression on summary statistics. *Nature Communications*, doi: 10.1038/s41467-019-12653-0.

#### SBayesS

Zeng et al. (2021) Bayesian analysis of GWAS summary data reveals differential signatures of natural selection across human complex traits and functional genomic categories. *Nature Communications*, dio: 10.1038/s41467-021-21446-3.

## Download

### Executable files

gctb_2.03beta_Linux.zip (*Lastest version*)

### Tutorial data

### LD matrices

The following LD matrices were computed based on 1.1 million common SNPs in a random sample of 50K unrelated individuals of European ancestry in UK Biobank dataset unless otherwise noted.

In the shrunk sparse matrices, described in Lloyd-Jones et al. (2019), the observed LD correlations computed from a reference sample were shrunk toward the expected values defined by a genetic map, following the algorithm in Wen and Stephens (2010). After shrinkage, LD correlations smaller than a threshold (default 1e-5) were set to be zero to give a sparse format, which is more efficient in storage and computation.

The sparse matrices described in Zeng et al. (2021) were computed by setting the likely chance LD to zero based on a chi-squared test (default threshold at chi-squared test statistic of 10).

While the shrunk sparse matrices were used in our original SBayesR paper, Prive et al. (2021) found that using a banded matrix with a window size of 3 cM per SNP can improve prediction accuracy. Therefore, we have created such a LD matrix in GCTB format for SBayesR analysis.

### Source code

GCTB 2.03beta standard version

The MPI version implements a distributed computing strategy that scales the analysis to very large sample sizes. A significant improvement in computing time is expected for a sample size > 10,000. The MPI version needs to be compiled on user’s machine. See README.html in the tarball for instructions of compilation and usage. A testing dataset is also included in each tarball.

### Update log

**1.** 1 Dec, 2017: first release.

**2.** 8 Feb, 2019: version 2.0 includes summary-data-based Bayesian methods.

**3.** 31 Jul, 2020: version 2.02 reports a message for convergence issue.

**4.** 15 Oct, 2021: version 2.03beta includes a robust parameterisation that attempts to address convergence issue.

## Basic options

### Input and output

**--bfile** test

Input PLINK binary PED files, e.g. test.fam, test.bim and test.bed (see PLINK user manual for details).

**--pheno** test.phen

Input phenotype data from a plain text file, e.g. test.phen.

**--out** test

Specify output root filename.

### Data management

**--keep** test.indi.list

Specify a list of individuals to be included in the analysis.

**--chr** 1

Include SNPs on a specific chromosome in the analysis, e.g. chromosome 1.

**--extract** test.snplist

Specify a list of SNPs to be included in the analysis.

**--exclude** test.snplist

Specify a list of SNPs to be excluded from the analysis.

**--mpheno** 2

If the phenotype file contains more than one trait, by default, GCTB takes the first trait for analysis (the third column of the file) unless this option is specified. For example, **--mpheno** 2 tells GCTB to take the second trait for analysis (the fourth column of the file).

**--covar** test.qcovar

Input quantitative covariates from a plain text file, e.g. test.qcovar. Each quantitative covariate is recognized as a continuous variable.

### MCMC settings

**--seed** 123

Specify the seed for random number generation, e.g. 123. Note that giving the same seed value would result in exactly the same results between two runs.

**--chain-length** 21000

Specify the total number of iterations in MCMC, e.g. 21000 (default).

**--burn-in** 1000

Specify the number of iterations to be discarded, e.g. 1000 (default).

**--out-freq** 100

Display the intermediate results for every 100 iterations (default).

**--thin** 10

Output the sampled values for SNP effects and genetic architecture parameters for every 10 iterations (default). Only non-zero sampled values of SNP effects are written into a binary file.

**--no-mcmc-bin**

Suppress the output of MCMC samples of SNP effects.

## Bayesian alphabet

**--bayes** S

Specify the Bayesian alphabet for the analysis, e.g. `S`. Different alphabet launch different models, which differ in the prior specification for the SNP effects. The available alphabet include

B: Each SNP effect is assumed to have an i.i.d. mixture prior of a t-distribution `t(0, \tau^2, \nu)` with a probability `\pi` and a point mass at zero with a probability `1-\pi`.

C: Each SNP effect is assumed to have an i.i.d. mixture prior of a normal distribution `N(0, \sigma^2)` with a probability `\pi` and a point mass at zero with a probability `1-\pi`.

S: Similar to C but the variance of SNP effects is related to minor allele frequency (`p`) through a parameter `S`, i.e. `\sigma_j^2 = [2p_j(1-p_j)]^S \sigma^2`.

N: nested BayesC. SNPs within a 0.2 Mb non-overlapping genomic region are collectively considered as a window (specify the distance by

**--wind**0.2). This nested approach speeds up the analysis by skipping over windows with zero effect.NS: nested BayesS.

R: BayesR. Each SNP effect is assumed to have an i.i.d. mixture prior of multiple normal distributions `N(0, \gamma_k \sigma_k^2)` with a probability `\pi_k` and a point mass at zero with a probability `1-\sum_k \pi_k`, where `\gamma_k` is a given constant.

**--fix-pi**

An option to fix `\pi` to a constant (the value is specified by the option --pi below). The default setting is to treat π as random and estimate it from the data.

**--pi** 0.05

A starting value for the sampling of π when it is estimated from the data, or a given value for π when it is fixed. The default value is 0.05. When BayesR is used, it is a string seperated by comma where the number of values defines the number of mixture components and each value defines the starting value for each component (the first value is reserved for the zero component); the default values are 0.95,0.03,0.01,0.01.

**--gamma** 0,0.01,0.1,1

When BayesR is used, this speficies the gamma values seperated by comma, each representing the scaling factor for the variance of a mixture component. Note that the number of values should match that in --pi.

**--hsq** 0.5

A starting value for the sampling of SNP-based heritability, which may improve the mixing of MCMC algorithm if it starts with a good estimate. The default value is 0.5.

**--S** 0

A starting value for the sampling of the parameter S (relationship between MAF and variance of SNP effects) in BayesS, which may improve the mixing of MCMC algorithm if it starts with a good estimate. The default value is 0.

**--wind** 0.2

Specify the window width in Mb for the non-overlapping windows in the nested models, e.g. 0.2 Mb. The default value is 1 Mb.

Examples

Standard version of gctb:

```
gctb --bfile test --pheno test.phen --bayes S --pi 0.1 --hsq 0.5 --chain-length 25000 --burn-in 5000 --out test > test.log 2>&1
```

MPI version of gctb (when using intelMPI libraries and two nodes):

```
mpirun -f $PBS_NODEFILE -np 2 gctb_mpi --bfile test --pheno test.phen --bayes S --pi 0.1 --hsq 0.5 --chain-length 25000 --burn-in 5000 --out test > test.log 2>&1
```

The output files include:

**test.log**: a text file of running status, intermediate output and final results;

**test.snpRes**: a text file of posterior statistics of SNP effects;

**test.covRes**: a text file of posterior statistics of covariates;

**test.parRes**: a text file of posterior statistics of key model parameters;

**test.mcmcsamples.CovEffects**: a text file of MCMC samples for the covariates fitted in the model;

**test.mcmcsamples.SnpEffects**: a binary file of MCMC samples for the SNP effects;

**test.mcmcsamples.Par**: a text file of MCMC samples for the key model parameters;

Citations

**GCTB software and BayesS method**

Zeng et al. (2018) Signatures of negative selection in the genetic architecture of human complex traits.
*Nature Genetics*, doi: 10.1038/s41588-018-0101-4.

**BayesR**

Moser et al. (2015) Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. *PLoS Genetics*, 11: e1004969.

**BayesN**

Zeng et al. (2018) A nested mixture model for genomic prediction using whole-genome SNP genotypes. *PLoS One*, 13: e0194683.

**BayesC`\pi`**

Habier et al. (2011) Extension of the Bayesian alphabet for genomic selection. *BMC Bioinformatics*, 12: 186.

**BayesB**

Meuwissen et al. (2001) Prediction of total genetic value using genome-wide dense marker maps. *Genetics*, 157: 1819-1829.

## Summary Bayesian Alphabet

### Options

**--sbayes** R

Available models are R, S and C.

**--ldm** test.ldm.sparse

Input the LD matrix generated by **--make-sparse-ldm** option. This option tells GCTB to read two files: a text file that contains SNP information (test.ldm.sparse.info) and a binary file that contains the sparse LD matrix data (test.ldm.sparse.bin). The suffix ".sparse" tells GCTB the format of the LD matrix. The valid suffixes include ".full", ".band", ".shrunk" and ".sparse", which are generated from the corresponding options that make the LD matrix in sush a format, e.g. **--make-full-ldm** for ".full" files.

**--mldm** mldm.list

Input a list of LD matrices, e.g. from chromosome 1 to 22. The LD matrices need to be in the correct order in mldm.list.

**--make-full-ldm**

Compute a full LD matrix from the genotype data provided by **--bfile** option. This option will generate two files: a text file that contains SNP information (test.ldm.full.info) and a binary file that contains the full LD matrix data (test.ldm.full.bin). The full LD matrix contains all pairwise LD correlations between SNPs. The memory required is approximately the size of the n-by-m genotype matrix (4 * n * m)/1e9 GB + the size of the m-by-m LD matrix (4 * m^2)/1e9 GB (n is the number of individuals and m is the number of SNPs). For a more efficient computation, it is recommended to use **--snp** option to compute the matrix by parts.

**--part** 5,1

Work only with --make-full-ldm. Compute a full LD matrix part by part to save memory. The first number before comma indicates the total parts to divide, the second number indicates the current part to calculate. Run gctb with the --mldm output.mldm to merge, after all parts are calculated successfully.

**--make-sparse-ldm**

Compute a sparse LD matrix from the genotype data provided by **--bfile** option or from the dense type LD matrix, such as full or shrunk LD matrix provided by **--ldm test.ldm.full** or **--ldm test.ldm.shrunk** option. If the genotype data is provided, GCTB will first compute the chromosome-wide full LD matrix, then set the chance LD (small LD correlations due to sampling) to be zero and output the sparse matrix. The chance LD is detected by a chi-squared test with the threshold defined by **--chisq**. The chi-squared threshold value is 10 by default. The memory usage is the same as computing a full LD matrix.

**--make-shrunk-ldm**

Compute a shrunk LD matrix from the genotype data provided by **--bfile** option or from the full LD matrix provided by **--ldm test.ldm.full** option. It will use the method proposed by Wen and Stephens (2010) to shrunk the off-diagonal entries of the sample LD matrix toward zero based on a genetic map provided by **--gen-map** option (see Tutorial for the format of the genetic map). Other parameters required by the shrinkage method include the effective population size provided by **--ne** option (default value is 11400), the genetic map sample size provided by **--genmap-n** option (default value is 183) and the shrinkage cutoff threshold provided by **--shrunk-cutoff** option (default value is 1e-5). The output format of the shrunk LD matrix is ".shrunk", which contains all the zero elements, so it will take a similar storage space as the full LD matrix. We recommend to further make the shrunk LD matrix parse by eliminating elements equal to zero for efficient storage and computation, which can be done by reading the generated shrunk matrix and making it sparse using **--make-sparse-ldm --chisq 0** options. The memory usage is the same as computing a full LD matrix.

**--make-band-ldm**

Compute a banded LD matrix from the genotype data provided by **--bfile** option or from the full LD matrix provided by **--ldm test.ldm.full** option. GCTB will only compute and include the LD correlations between all paris of SNPs within a sliding window which has a window width defined by **--wind** option (in the unit of megabase).

**--snp** 1-5000

Compute the LD matrix by rows (or columns as it is a symmetric square matrix). When this option is invoked, GCTB will read in the genotype matrix from the bed file and compute a part of the LD matrix corresponding to the range of SNPs specified. GCTB will append `.snp1-5000`

, for example, to the output files. Once all parts have been computed individually, make sure to include all the files as a list in the ascending order as a separate file and use **--mldm** option to merge the parts into a single matrix. See Toturial for more detail.

**--gwas-summary** test.ma

Input the GWAS summary statistics in the GCTA-COJO `.ma`

format.

```
SNP A1 A2 freq b se p N
rs1001 A G 0.8493 0.0024 0.0055 0.6653 129850
rs1002 C G 0.0306 0.0034 0.0115 0.7659 129799
rs1003 A C 0.5128 0.0045 0.0038 0.2319 129830
```

Columns are SNP identifier, the effect allele, the other allele, frequency of the effect allele, effect size, standard error, p-value and sample size. The headers are not keywords and will be omitted by the program. Important: "A1" needs to be the effect allele with "A2" being the other allele and "freq" should be the frequency of "A1". See Tutorial for more information.

**--chisq** 10

Specify the chi-squared test threshold for computing the sparse LD matrix. The LD correlations that are less than the threshold (default value is 10) is regarded due to sampling variation and therefore removed from the LD matrix. It is equivalent to setting the LD rsq smaller than (chi-squared threshold)/(LD reference sample size).

**--ld** 0.01

Specify the absolute value of LD correlation threshold for computing the sparse LD matrix. Any absolute value of the LD correlation below the threshold will be set to be zero. It cannot be used together with **--chisq**.

**--write-ldm-txt**

Output the LD matrix in a text file.

**--annot** annotation.txt

Input the genomic annotation data and launch the stratified SBayes analysis (SBayesS-strat). The format of the annotation file is as below.

```
SNP Coding Conserved DHS Promoter
rs1001 0 1 0 0 0
rs1002 1 0 0 0 0
rs1003 0 0 0 1 1
```

The first column is the SNP identifier, followed by annotation categories. The first row is the header line that contains the names of the annotation categories. The current version only allows binary annotations (i.e. 0/1) and only available for SBayesS model (i.e. together with **--sbayes S** option), which allows the SNP-heritability, polygenicity and the S parameter (relationship between effect size and MAF) to be different across annotation categories.

**--exclude-mhc**

This option will remove SNPs in the MHC regions (chr6:28-34 Mb) from the analysis.

**--ambiguous-snp**

This option will remove SNPs with ambiguous nucleotides, i.e. A/T or G/C.

**--maf** 0.01

This option will remove SNPs with minor allele frequency from the GWAS sample smaller than the specified value.

**--overlap**

This option tells GCTB that the LD reference sample is in complete overlap with the GWAS sample. That is, the LD matrix is computed from the GWAS sample itself. This will suppress the modeling of sampling variance of LD correlations in the Sbayes analysis.

**--impute-n**

This option will impute per-SNP sample size in GWAS based on the GWAS effect estimates, SE and allele frequencies in the `.ma`

file, and remove SNPs with the imputed sample size greater/smaller than 3 standard deviations from the median value.

**--num-chains** 4

This option invokes a multi-chain MCMC process, e.g. 4 chains. GCTB will run multiple chains in sequence each with the starting values of the model parameters randomly sampled from their prior distributions. When the MCMC chains are completed, it will calculate the Gelman-Rubin convergence diagnostic (also known as potential scale reduction factor) for each of the model parameters and report them in the `.parRes`

file. For example,

```
Posterior statistics from multiple chains:
Mean SD R_GelmanRubin
Pi 0.044129 0.001628 1.005
NnzSnp 49875.335938 1825.153442 1.005
SigmaSq 0.000009 0.000000 1.007
S -0.545913 0.029914 1.003
ResVar 0.727659 0.002897 1.000
GenVar 0.257742 0.002645 1.000
SigmaSqG 0.258967 0.003865 1.000
hsq 0.261560 0.002507 1.000
```

In the current verion, the multi-chain MCMC is only available for SBayesC, SBayesS and SBayesS-strat analysis.

**--rsq** 0.9

For a pair of SNPs with the squared LD correlation greater than the specified value, e.g. 0.9, based on the provided LD matrix, this option will remove the SNP with the higher GWAS P value and keep the other SNP in the analysis.

**--p-value** 0.4

This option will remove SNPs with GWAS P value greater than the specified value, e.g. 0.4, from the analysis.

**--robust**

This option will invoke the following parameterisation for the common factor of SNP effect variance in SBayesR, i.e., \$\sigma_{\beta}^2 = \frac{\sigma_g^2 }{ m {\boldsymbol \gamma}'{\boldsymbol \pi} }\$ where \$\sigma_g^2 = {\boldsymbol \beta} ' {\bf R} {\boldsymbol \beta}\$ is the sampled total genetic variance in each MCMC iteration. Comparing to the original model that assumes \$\sigma_{\beta}^2\$ following a scaled-inverse chi-square distribution, this parameterisation is observed to be more robust to the potential convergence issue which sometimes occurs when the GWAS summary statistics are from a meta-analysis. In GCTB 2.03 beta version, the program will start with the original model but if a convergence issue is detected (indicated by a negative sampled value of residual variance) it will restart the MCMC process with this more robust parameterisation. Note that this parameterisation is identical to LDpred (Vilhjálmsson et al. 2015) when using a point-normal prior (i.e., two-component SBayesR or SBayesC).

### Tutorial

This updated version of the GCTB software (version 2.0) includes summary-data based versions of the individual-data Bayesian linear mixed models previously implemented. These methods require summary statistics from genome-wide association studies, which typically include the estimated univariate effect for each single nucleotide polymorphism (SNP), the standard error of the effect, the sample size used to estimate the effect for each SNP, the allele frequency of an allele, and an estimate of LD among SNPs, all of which are easily accessible from public databases.

This tutorial will outline how to use the summary data based methods and accompanies the manuscript Improved polygenic prediction by Bayesian multiple regression on summary statistics. To recreate the tutorial you will need the PLINK 2 software and the updated version of the GCTB software compiled from source or the binary executable in your path. The data for the tutorial are available at Download. The tutorial is designed so that each part can be reconstructed or picked up at any point with the following directory structure.

```
tutorial
gctb # Static binary executable for Linux 64-bit systems
data # 1000 Genomes data in PLINK format
pheno_generation # R scripts for generating simulated phenotypes
pheno # Simulated phenotypes
gwas # PLINK GWAS summary statistics
ldm # LD correlation matrices and scripts to calculate them
ma # Summary statistics in GCTB compatible format
sbayesr # Scripts for running and SBayesR analysis using GCTB and subsequent results
```

#### Data

The tutorial will run through all aspects of the software using genotype data from chromosome 22 of Phase 3 of the 1000 Genomes Project (1000G). The genotype data have been filtered to exclude variants with minor allele frequency (MAF) < 0.01, which left 15,938 SNPs available for analysis. These data include a subset of 378 individuals of European ancestry from the CEU, TSI, GBR and FIN populations.

N.B. The results generated from this example tutorial using data from the 1000G are designed to be lightweight and should be only used to explore how to run summary data based analyses using GCTB. The results do not make the best use of GCTB capabilities given the small sample size \$(N=378)\$ in the example.

#### Phenotype simulation and GWAS

Using these genotypes, phenotypes were generated under the multiple regression model \$y_i = \sum_{j=1}^pw_{ij}\beta_j + \epsilon_i\$, where \$w_{ij} = (x_{ij} − 2q_j)/ \sqrt{2q_j(1 − q_j)}\$
with \$x_{ij}\$ being the reference allele count for the \$i\$th individual at the \$j\$th SNP, \$q_j\$ the allele
frequency of the \$j\$th SNP and \$\epsilon_i\$ was sampled from a normal distribution with mean 0 and variance
\$\text{Var}({\bf W}\boldsymbol\beta)(1/h_{SNP}^2 − 1)\$ such that \$h_{SNP}^2 = 0.1\$ for each of 20 simulation replicates,
which is much larger than the contribution to the genome-wide SNP-based heritability (\$h^2_{SNP}\$) estimate for chromosome 22
for most quantitative traits. All phenotypes were generated using the R programming language with scripts used available in the
`pheno_generation`

subdirectory.

For each scenario replicate, we randomly sampled a new set of 1,500 causal variants. The genetic architecture simulated contained two causal variants of large effect explaining 3% and 2% of the phenotypic variance respectively and a polygenic tail of 1,498 causal variants sampled from a N(0, 0.05/1, 498) distribution such that the expected total genetic variance explained by all variants was 0.1.

For each of the 20 simulation replicates, simple linear regression for each variant was run using the PLINK 2 software to generate summary statistics. Code for running the GWAS and the output is available in the `gwas`

subdirectory.

#### GCTB summary statistics input format

The GCTB summary-based methods have inherited the GCTA-COJO `.ma`

format.

```
SNP A1 A2 freq b se p N
rs1001 A G 0.8493 0.0024 0.0055 0.6653 129850
rs1002 C G 0.0306 0.0034 0.0115 0.7659 129799
rs1003 A C 0.5128 0.0045 0.0038 0.2319 129830
```

Columns are SNP identifier, the effect allele, the other allele, frequency of the effect allele, effect size, standard error, p-value and sample size. The headers are not keywords and will be omitted by the program. Important: "A1" needs to be the effect allele with "A2" being the other allele and "freq" should be the frequency of "A1".

The transformation of your summary statistics file to the `.ma`

format will depend on the software used to analyse your data. The
`ma`

subdirectory contains a simple R script for constructing `.ma`

files from PLINK 2 `--linear`

output. There is no need to filter
you summary statistics to match the LD reference as GCTB will perform this data management for you.

#### GCTB linkage disequilibrium (LD) correlation matrix

The construction of an LD correlation matrix is a key component of the summary-data based methods in GCTB. Theoretically, the summary-data based methods are equivalent to the individual data methods if the LD matrix is constructed from all variants and the individuals used in the GWAS. Calculating and storing LD matrices that contain all pairwise entries is computationally prohibitive and thus a subset of the matrix entries are stored. Typically, all inter-chromosomal LD is ignored and a within chromosome block diagonal matrix structure is used. Futhermore, the individual level data from the GWAS is not available, which is often the case, and a reference population that is similar in ancestry to the GWAS cohort is used to construct the LD correlation matrix. The use of a subset of the correlation matrix and the construction of the LD matrix from a reference leads to an approximation to the individual data mode. The level of deterioration in the model parameter estimates introduced by these approximations is assessed empirically through simulation, real data analysis and comparison with individual-data method estimates.

*Full chromosome-wise LD matrices*

Although possible to create genome-wide LD matrices in GCTB, we recommend to only
calculate matrices for each chromosome. The following is the basic execution of
GCTB LD matrix calculation. The `ldm`

subdirectory has example BASH scripts to
perform this on a HPC system. N.B. the HPC commands may not be compatible with your HPC scheduler
and will require alteration.

```
# Build full LD matrix for chromosome 22
gctb --bfile 1000G_eur_chr22 --make-full-ldm --out 1000G_eur_chr22
```

GCTB produces a `.bin`

file, which contains the matrix elements stored in binary, and a `.info`

file, which looks like

```
Chrom ID GenPos PhysPos A1 A2 A2Freq Index WindStart WindEnd WindSize WindWidth N SamplVar LDsum
22 rs131538 0 16871137 A G 0.066138 0 0 15937 15938 34347869 378 41.92552 10.18005
22 rs9605903 0 17054720 C T 0.264550 1 0 15937 15938 34347869 378 41.95537 -33.15312
22 rs5746647 0 17057138 G T 0.052910 2 0 15937 15938 34347869 378 41.93396 15.90974
22 rs16980739 0 17058616 T C 0.148148 3 0 15937 15938 34347869 378 41.94403 -12.93685
```

The columns headings are chromosome, SNP identifier, genetic position, physical base pair position, SNP allele 1, SNP allele 2, frequency of the A2 allele, variant index, for this SNP (row) the first column that is non-zero in the LD matrix, the last column that is non-zero, how many non-zero elements for this row, difference in base pair position between WindStart and WindEnd, the sample size used to calculate the correlation, sum of samping variance of the LD correlation estimate for the SNP and all other SNPs on the chromosome and the sum of the non-zero LD correlation estimates between this SNP and all other SNPs.

*Full chromosome-wise LD matrices using multiple CPUs*

To reduce the computational burden of computing LD correlation matrices, GCTB can partition the genotype file into user-specified chunks.

```
# Build full LD matrix for chromosome 22 for the first 5000 variants
gctb --bfile 1000G_eur_chr22 --make-full-ldm --snp 1-5000 --out 1000G_eur_chr22
```

When the `--snp`

option is invoked the GCTB software will append `.snp1-5000`

, for example, to the output files
so no separate extension to `--out`

is required.

For the 1000G chromsome 22 example the following files are generated

```
1000G_eur_chr22.snp1-5000.ldm.full.bin
1000G_eur_chr22.snp1-5000.ldm.full.info
1000G_eur_chr22.snp5001-10000.ldm.full.bin
1000G_eur_chr22.snp5001-10000.ldm.full.info
1000G_eur_chr22.snp10001-15000.ldm.full.bin
1000G_eur_chr22.snp10001-15000.ldm.full.info
1000G_eur_chr22.snp15001-20000.ldm.full.bin
1000G_eur_chr22.snp15001-20000.ldm.full.info
```

To merge the files, a separate file with the list of files to be joined is required. For our example is looks like

```
cat 1000G_eur_chr22.mldmlist
1000G_eur_chr22.snp1-5000.ldm.full
1000G_eur_chr22.snp5001-10000.ldm.full
1000G_eur_chr22.snp10001-15000.ldm.full
1000G_eur_chr22.snp15001-20000.ldm.full
```

```
# Merge the LD matrix chunks into a single file
gctb --mldm 1000G_eur_chr22.mldmlist --make-full-ldm --out 1000G_eur_chr22
```

*Shrunk LD matrix*

The shrinkage estimator for the LD correlation matrix was originally proposed by Wen and Stephens (2010). The estimator shrinks the off-diagonal entries of the sample LD matrix toward zero. Zhu and Stephens (2017) used the shrinkage estimator in their Regression with Summary Statistics (RSS) methodology and showed empirically that it can provide improved inference. The shrinkage estimator overcomes some of the issues arising from approximating the full LD matrix in the summary-data based model using a subset of the full LD matrix and constructing the matrix from a reference. The GCTB implementation is a C++ port from that provided with the RSS software and has been adapted for use with the GCTB software.

The calculation of the LD correlation matrix shrinkage estimate requires a genetic map. The genetic map files can be provided by the user but the tutorial uses interpolated map positions for the CEU population generated from the 1000G OMNI arrays that were downloaded from the from https://github.com/joepickrell/1000-genomes-genetic-maps. The file format looks like

```
rs149201999 16050408 0.0
rs146752890 16050612 0.0
rs139377059 16050678 0.0
rs188945759 16050984 0.0
rs6518357 16051107 0.0
rs62224609 16051249 0.0
rs62224610 16051347 0.0
rs143503259 16051453 0.0
rs192339082 16051477 0.0
rs79725552 16051480 0.0
```

and contains SNP identifier, physical base pair position and genetic position (cumulative). GCTB can use any genetic map provided it adheres to this format.

In this tutorial, the calculation of the shrunk LD matrix requires the effective population sample size, which we set to be 11,400 (as in Zhu and Stephens (2017)), the sample size of the genetic map reference, which corresponds to the 183 individuals from the CEU cohort of the 1000G used to create the genetic map, and the hard threshold on the shrinkage value, which we set to \$10^{−5}\$ and is the default value. These three parameters can be changed with

```
--ne # Effective population size. Default is 11,400
--genmap-n # Genetic map sample size. Default is 183
--shrunk-cutoff # Shrinkage hard threshold. Default is 10^-5
```

The constuction of the shrunk LD matrix is similar to that for the full LD matrix

```
# Build shrunk LD matrix for chromosome 22
gctb --bfile 1000G_eur_chr22 \
--make-shrunk-ldm \
--gen-map chr22.OMNI.interpolated_genetic_map \
--out 1000G_eur_chr22
```

The GCTB can also perform shrunk matrix calculation using multiple CPUs with a similar process to the above calculation of the full LD matrix.

```
# Build shrunk LD matrix for chromosome 22 for the first 5000 variants
gctb --bfile 1000G_eur_chr22 \
--make-shrunk-ldm \
--gen-map chr22.OMNI.interpolated_genetic_map \
--snp 1-5000 \
--out 1000G_eur_chr22
```

Please see the `/ldm`

subdirectory for example BASH scripts of performing these tasks
both locally and on a HPC.

*Sparse LD matrix*

Motivated by the computational practicality of not storing the genome-wide LD correlation matrix we wish to set a subset of elements of the LD correlation to zero. The calculation of an LD correlation matrix from genotype data is an estimate of the true LD correlation matrix for a population. It therefore contains a mix of true LD and those values that differ from zero due to sampling variation, which we wish to ignore. By default we ignore interchromsomal LD between markers. Within chromosome, we set elements to zero if their chi-squared statistic under the sampling distribution of the correlation coefficient does not exceed a user-chosen threshold.

Specifically, from \citet{lynch1998genetics} the sampling variation of the correlation coefficient \$r = \frac{\text{cov}(x, y)}{[\text{var}(x)\text{var}(y)]^{1/2}}\$ is
\$\sigma^2(r) \approx \frac{(1-\rho^2)^2}{n}\$, which under the null hypothesis that \$\rho = 0\$ is just \$\frac{1}{n}\$. Given this we can
generate a chi-squared statistic for each element of the LD matrix by \$(\frac{r}{\sqrt{1/n}})^2\$. These statistics should be \$\chi_1^2\$ distributed
and thus for each LD matrix element we set it to zero if the statistic does not exceed a user-specified arbitrary value. The chi-squared
statistic can be altered with `--chisq`

. The default value is 10, which we have found empirically to be a good balance balance between
matrix sparsity and retaining true positive LD values.

```
# Build sparse LD matrix for chromosome 22
gctb --bfile 1000G_eur_chr22 --make-sparse-ldm --out 1000G_eur_chr22
# Sparse matrices can also be built using multiple CPUs
gctb --bfile 1000G_eur_chr22 --make-sparse-ldm --snp 1-5000 --out 1000G_eur_chr22
```

GCTB can alter current LD matrices, for example, make a full matrix sparse.

```
# Make a full LD matrix sparse
gctb --ldm 1000G_eur_chr22.ldm.full --make-sparse-ldm --out 1000G_eur_chr22
```

We recommend that the shrunk LD matrices be stored in sparse format for efficient storage and computation. The shrunk LD correlation matrix estimate will have zeroes which can be eliminated as follows

```
# Make a shrunk LD matrix sparse by eliminating elements equal to 0. Output sparse
gctb --ldm 1000G_eur_chr22.ldm.shrunk --make-sparse-ldm --chisq 0 --out 1000G_eur_chr22
```

*Extra options for LD matrix calculation*

```
# For checking or comparison with other methods you can write the LD correlation out in
# text format.
--write-ldm-txt # Write LD matrix as text file as well as binary. Will generate large files.
# Examples
gctb --bfile 1000G_eur_chr22 --make-shrunk-ldm --write-ldm-txt --out 1000G_eur_chr22
gctb --bfile 1000G_eur_chr22 --make-full-ldm --write-ldm-txt --out 1000G_eur_chr22
# Read a list of LD matrix files. This would be used to read the list of chromosome wise LD
# matrices when performing genome-wide analyses. Also used for merging matrix chunks for
# multiple CPU matrix building
--mldm
```

#### Running an SBayesR analysis

With the GWAS summary statistics in `.ma`

format and the LD correlation matrix calculated, an SBayesR analysis can be performed by

```
# Calling SBayesR
gctb --sbayes R
--ldm ../ldm/sparse/chr22/1000G_eur_chr22.ldm.sparse
--pi 0.95,0.02,0.02,0.01
--gamma 0.0,0.01,0.1,1
--gwas-summary ../ma/sim_1.ma
--chain-length 10000
--burn-in 2000
--out-freq 10
--out sim_1
```

`--sbayes R`

Specify the summary data based Bayesian alphabet model for the analysis, e.g., `R`

. Different letters of the alphabet launch
different models, which differ in the prior specification for the SNP effects. `R`

and `C`

are currently available.

`--pi 0.95,0.02,0.02,0.01`

Specific to the `R`

model. A comma seperated string where the number of values defines the number of mixture components and each value defines the starting value for each component (the first value is reserved for the zero component).
These must sum to 1.
The length of the \$\pi\$ and \$\gamma\$ vectors stipulates how many components are included in the finite mixture of normals distribution prior
for the genetic effects. The default values are as specified in the example.

`--gamma 0,0.01,0.1,1`

Specific to the `R`

model. Speficies the gamma values seperated by comma, each representing the scaling factor for the
the variance of the distribution of genetic effects \$\sigma_{\beta}^2\$ for each distribution. Note that the vector length should match
that in `--pi`

. The default values are as specified in the example. These differ from the original BayesR model as the
weights are for the variance of the distribution of genetic effects \$\sigma_{\beta}^2\$ rather than the genetic variance \$\sigma_g^2\$.

`--gwas-summary`

Path to the GWAS summary statistics in `.ma`

format.

The other options are inherited from the individual-data GCTB models.

*Standard output*

Upon execution of GCTB, preliminary data management information will be outputted to standard out. A key summary of the summary is also displayed, for example,

```
Data summary:
mean sd
GWAS SNP Phenotypic variance 1.002 0.052
GWAS SNP sample size 378 0
GWAS SNP effect 0.000 0.117
GWAS SNP SE 0.106 0.049
MME left-hand-side diagonals 379.478 19.774
MME right-hand-side 0.130 19.983
LD sampling variance 1.298 0.076
LD score 17.359 15.140
```

```
```

These statistics can be used to diagnose errors between individual runs of the program.
For each row the mean and standard deviation (sd) of the parameter distribution are displayed. The
acronym MME represents mixed model equations. The left-hand-side diagonals
are the elements of \${\bf D}=\text{diag}({\bf X'}{\bf X})\$, where \$D_j=2p_jq_jn_j\$
under Hardy-Weinberg equilibrium and \$n_j\$ is the sample
size of variant \$j\$. The right-hand-side is equal to \${\bf Db}\$ where \${\bf b}\$ is the \$p\times 1\$ vector of marginal
effect estimates from GWAS. LD sampling variance and LD score are
as described in the `.info`

file.

Post launch of a summary data model the sampled values of key model parameters are printed to standard out in real time. For example, during an SBayesR analysis the following snapshot shows what is being printed

```
Iter Pi1 Pi2 Pi3 Pi4 NnzSnp SigmaSq ResVar GenVar hsq Rounding TimeLeft
10 0.9513 0.0162 0.0170 0.0155 795 0.0001 0.8905 0.0260 0.0284 0.0000 0:16:39
20 0.9507 0.0112 0.0172 0.0208 761 0.0001 0.9421 0.0213 0.0221 0.0000 0:8:19
30 0.9479 0.0041 0.0201 0.0279 761 0.0000 0.9747 0.0249 0.0249 0.0000 0:5:32
40 0.9448 0.0046 0.0187 0.0319 865 0.0000 0.9117 0.0203 0.0218 0.0000 0:4:9
50 0.9445 0.0054 0.0228 0.0272 893 0.0000 1.0019 0.0223 0.0218 0.0000 0:6:38
60 0.9409 0.0019 0.0223 0.0349 930 0.0000 1.0481 0.0265 0.0247 0.0000 0:5:31
```

The parameters printed are iteration number, which can be altered with `--out-freq`

, proportion
of variants in distribution one (zero component) to four (can be altered by changing the \$\pi\$
and \$\gamma\$ vector lengths), number of non-zero effects in the model at iteration \$i\$,
the variance of the distribution of genetic effects \$\sigma_{\beta}^2\$, the residual variance
\$\sigma_{\epsilon}^2\$, the genetic variance \$\sigma_{g}^2\$ SNP-based heritability, rounding accumulation (a value
substantially greater than zero indicates insufficent machine precision and the user should consider terminating the
program), and an estimate of
the time left to complete the analysis. Upon completion of the analysis a summary of the
posterior mean and standard deviation of the key models parameters is reported to standard
out

```
Posterior statistics from MCMC samples:
Mean SD
Pi1 0.942564 0.006791
Pi2 0.056959 0.006728
Pi3 0.000305 0.000274
Pi4 0.000172 0.000094
NnzSnp 1726.798706 199.570724
SigmaSq 0.002898 0.000375
ResVar 0.900596 0.004205
GenVar 0.103104 0.002216
hsq 0.102722 0.002039
```

The output files are the same as those from the individual data models and include:

```
test.log: # Text file of running status, intermediate output and final results
test.snpRes: # Text file of posterior statistics of SNP effects
test.parRes: # Text file of posterior statistics of key model parameters
test.mcmcsamples.SnpEffects: # Binary file of MCMC samples for the SNP effects
test.mcmcsamples.Par: # Text file of MCMC samples for the key model parameters
```

*Extra commands for running SBayes analyses*

```
--unscale-genotype # Run analyses assuming genotypes are not scaled to unit variance. Default is scaled genotypes
--exclude-mhc # Exclude variants in the Major Histocompatibility Complex from the analysis. Highly recommended
# when performing genome-wide analyses
--exclude-region # Path to file where each line had three entries: chromosome, starting base pair position of
# region and end base pair of region
--impute-n # Let GCTB impute the sample size for each variant using an algoirithm if you cannot trust
# the sample size reported. Filters variants with a sample size further than 3 SD from the median of
# the sample size distribution. The method assumes each SNP has been calculated from the same set
# of individuals.
```

#### Summary

The method implemented in GCTB assumes certain ideal data constraints such as summary data computed from a single set of individuals at fully observed genotypes as well as minimal imputation error and data processing errors such as allele coding and frequency mismatch. Summary data in the public domain often substantially deviate from these ideals and can contain residual population stratification, which is not accounted for in this model. Practical solutions to these ideal data deviations include the use of data that are imputed and the restriction of analyses to variants that are known to be imputed with high accuracy. We found that the simple filtering of SNPs with sample sizes that deviate substantially from the mean across all variants from an analysis, when using summary statistics from the public domain substantially improved model convergence. However, even the reported sample size cannot be trusted in some summary data sets.

We explored LD pruning of variants to remove variants in very high LD (\$R^2>0.99\$) but found that this did not substantially improve model convergence or parameter estimates although this was not formally assessed. However, removal of high LD regions, such as the MHC region improved model convergence for real traits. High LD regions are expected to have the potential to be extreme sources of model misspecification with the model expecting summary data in to be very similar for variants in high LD. Small deviations due to data error not expected in the model likelihood at these loci thus have high potential to lead to model divergence (see Zhu and Stephens\citep{zhu2017bayesian} for further discussion). We strongly recommend reading the current manuscript and that of Zhu and Stephens (2017) as a preliminary to running the summary data based methods in GCTB.

### FAQ

#### I obtained a poor prediction accuracy using SBayesR. Why is that and can it be improve?

If the prediction accuracy is unreasonably low using SBayesR, it is likely that a convergence problem has occurred for the trait analysed. Poor convergence will cause the SNP effect sizes to gradually "blow up" (keep increasing toward infinity) and eventually impair prediction accuracy. If you were using GCTB v2.01 or a lower version, you will usually find hsq (i.e., SNP-based heritability) increases to one and SigmaSq (i.e., the variance of SNP effects) keep increasing along with the MCMC iterations, as shown below for example.

```
Iter NumSnp1 NumSnp2 NumSnp3 NumSnp4 SigmaSq ResVar GenVar hsq Rounding
100 893956 29937 15607 927 0.0003 0.4961 0.5314 0.5172 0.0000
200 915615 23804 3 1005 0.0050 0.0024 0.7879 0.9970 0.0024
300 927659 11282 1 1485 0.0188 0.0000 0.8639 1.0000 0.0017
400 932231 6640 3 1553 0.0551 0.0000 0.9383 1.0000 0.0014
500 934298 4592 16 1521 0.1187 0.0000 1.0284 1.0000 0.0013
600 935284 3699 154 1290 0.2155 0.0000 1.1305 1.0000 0.0015
700 935920 3162 121 1224 0.3423 0.0000 1.2169 1.0000 0.0018
800 936400 2722 148 1157 0.4792 0.0000 1.2991 1.0000 0.0020
900 936671 2431 429 896 0.7588 0.0000 1.3778 1.0000 0.0023
1000 936891 2281 369 886 0.9321 0.0000 1.4472 1.0000 0.0025
```

At some stage, the program may stop and report the following error message because no SNP can be fitted due to the "blown up".

```
Error: Zero SNP effect in the model for 100 cycles of sampling
```

In GCTB v2.02, we flag the convergence issue by the following message when hsq hits 1 (or equivalently, the residual variance < 0):

```
Error: Residual variance is negative. This may indicate that effect sizes are "blowing up" likely due to a convergence problem. If SigmaSq variable is increasing with MCMC iterations, then this further indicates MCMC may not converge. Please refer to our website (https://cnsgenomics.com/software/gctb) for further information on this problem.
```

Sometimes, even though the MCMC appears to be completed, the posterior means of SNP effects can still be severely biased due to the convergence problem. It is highly recommeded to visualise the scatter plot of the SBayesR estimated SNP effects versus the GWAS marginal effects if you suspect poor convergence. The figure on the left-hand side shows such a scatter plot when MCMC has converged normally. The figure on the right-hand side shows the case when the convergence problem occurred, where some of the SBayesR effect estimates are orders of magnitude larger than the corresponding GWAS marginal effect estimates.

Further plotting the SBayesR effect estimates against the physical positions reveals that some SNPs that are in proximity tend to have their effects "blown up" in opposite directions:

The fundamental cause of the failure in convergence is not exactly clear, but it is most likely because of the inconsistency between the LD reference data and GWAS summary statistics or/and errors in these datasets. **We suggest to try the following steps to avoid the convergence problem. Please also see this doc for a detailed description of processing and running of SBayesR for Locke et al. 2015 and Wood et al. 2014.**

*1. Remove SNPs with extreme GWAS sample sizes.*

It is implicitly assumed in SBayesR that all SNPs have the same GWAS sample size; otherwise, the residual covariance stucture is misspecified. This assumption is often violated when some SNPs have large genotype missing rates or when the GWAS summary statistics are from a meta-analysis where not all SNPs are genotyped in all cohorts. Below is an example of a highly skewed distribution of per-SNP sample size from a meta-analysis.

In this case, removing the SNPs in the lowest 10% quantile led to a converged result. It is highly recommended to examine the distribution of per-SNP sample size prior to the analysis and remove the outliers with extreme sample sizes. Sometimes, the average sample size is given for all the SNPs. In such a case, one can use the **--impute-n** option in GCTB when running SBayesR, which will impute the per-SNP sample size based on the beta values, SE and allele frquenes and exclude SNPs that have the imputed sample sizes 3 standard deviations apart from the median value.

*2. Use the shrunk LD matrix.*

While in principle any type of LD matrix can be used in SBayesR, using the shurnk LD matrix can lead to the most robust result for some traits, such as height. This is likely because the shrunk LD matrix, which is much less sparse than the sparse LD matrix with the default chi-squared threshold, has a better capacity to capture true LD correlations of very small values. Ignoring the SNPs in small LD correlations with the causal variants can collectively cause a bias in the mean of the full conditional distribution of some SNP effects, and subsequently cause the convergence problem. We have provided two shrunk LD matrices based on a random sample of 50K individuals of European ancestry in UK Biobank data, one with about 1.1 million HapMap3 common SNPs and the other with about 2.8 million common SNPs using a genetic map in the public domain.

*3. Use DENTIST to detect and remove SNPs with errors in GWAS or LD reference and heterogeneity between the two.*

*DENTIST* is a software that leverages LD among genetic variants to detect and eliminate errors in GWAS or LD reference and heterogeneity between the two. It could be helpful to run DENTIST prior to SBayesR to identify SNPs that have inconsistent LD properties between GWAS and reference samples, and then use **--exclude** option in GCTB to exclude those SNPs in the SBayesR analysis.

*4. Remove null SNPs based on the GWAS P-values.*

It can be seen from the above plot of SBayesR effect against GWAS effect estimates that the "blowing up" occures more frequently to the SNPs with GWAS effects close to zero. We have also observed a significant improvement in SNP-based heritability estimation and prediction accuracy when removing the null SNPs based on the GWAS P-values for some traits. For example, as shown in the figure below, removing SNPs with GWAS P-value greater than 0.4 (or lower) in the Wood et al. (2014 Nat Genet) summary statistics data (after a quality control step on the per-SNP sample size) appeared to fix the convergence problem and led to a substantially improved SNP-based heritability estimates and a higher prediction accuracy. In GCTB, one can use **--p-value** option to specify the upper bound of the P-values of the SNPs to be included in the analysis.

*5. Filter SNPs based on the LD R-squared.*

When the convergence problem occurred, two SNPs in high LD will sometimes blow up effect sizes in opposite directions, as shown in the figure above. Removing one of the SNPs in high LD could be helpful to mitigate the problem. We have implemented **--rsq** option in GCTB to detect every pair of SNPs with LD R-square greater than the specified threshold and then remove the SNP with higher GWAS P-value.