## Overview

### About

GCTB is a software tool that comprises a family of Bayesian linear mixed models for complex trait analyses using genome-wide SNPs. It was developed to simultaneously estimate the joint effects of all SNPs and the genetic architecture parameters for a complex trait, including SNP-based heritability, polygenicity and the joint distribution of effect sizes and minor allele frequencies.

Version 2.0 of the GCTB software includes summary-data-based versions of the individual-level data Bayesian linear mixed models previously implemented.

### Credits

Jian Zeng developed the software with supports from Jian Yang, Futao Zhang and Zhili Zheng.

Part of the code are adopted from *GCTA* and *GenSel*.

Luke Lloyd-Jones contributed to the BayesR module with supports from Jian Zeng and Mike Goddard.

Version 2.0 of the software was developed by Jian Zeng with contributions from Luke Lloyd-Jones.

### Questions and Help Requests

If you have any bug reports or questions, please send an email to Jian Zeng (j.zeng@uq.edu.au), Luke Lloyd-Jones (luke.lloydjones@uq.connect.edu.au) or Jian Yang (jian.yang@uq.edu.au).

### Citation

Zeng et al. (2018) Signatures of negative selection in the genetic architecture of human complex traits.
*Nature Genetics*, doi: 10.1038/s41588-018-0101-4.

#### SBayesR

Lloyd-Jones, Zeng et al. (2019) Improved polygenic prediction by Bayesian multiple regression on summary statistics. *bioRxiv*, doi: 10.1101/522961.

## Download

### Executable files

### Tutorial data

### Source code

The MPI version implements a distributed computing strategy that scales the analysis to very large sample sizes. A significant improvement in computing time is expected for a sample size > 10,000. The MPI version needs to be compiled on user’s machine. See README.html in the tarball for instructions of compilation and usage. A testing dataset is also included in each tarball.

### Update log

**1.** 1 Dec, 2017: first release.
**2.** 8 Feb, 2019: version 2.0 includes summary-data-based Bayesian methods.

## Basic options

### Input and output

**--bfile** test

Input PLINK binary PED files, e.g. test.fam, test.bim and test.bed (see PLINK user manual for details).

**--pheno** test.phen

Input phenotype data from a plain text file, e.g. test.phen.

**--out** test

Specify output root filename.

### Data management

**--keep** test.indi.list

Specify a list of individuals to be included in the analysis.

**--chr** 1

Include SNPs on a specific chromosome in the analysis, e.g. chromosome 1.

**--extract** test.snplist

Specify a list of SNPs to be included in the analysis.

**--exclude** test.snplist

Specify a list of SNPs to be excluded from the analysis.

**--mpheno** 2

If the phenotype file contains more than one trait, by default, GCTB takes the first trait for analysis (the third column of the file) unless this option is specified. For example, **--mpheno** 2 tells GCTB to take the second trait for analysis (the fourth column of the file).

**--covar** test.qcovar

Input quantitative covariates from a plain text file, e.g. test.qcovar. Each quantitative covariate is recognized as a continuous variable.

### MCMC settings

**--seed** 123

Specify the seed for random number generation, e.g. 123. Note that giving the same seed value would result in exactly the same results between two runs.

**--chain-length** 21000

Specify the total number of iterations in MCMC, e.g. 21000 (default).

**--burn-in** 1000

Specify the number of iterations to be discarded, e.g. 1000 (default).

**--out-freq** 100

Display the intermediate results for every 100 iterations (default).

**--thin** 10

Output the sampled values for SNP effects and genetic architecture parameters for every 10 iterations (default). Only non-zero sampled values of SNP effects are written into a binary file.

**--no-mcmc-bin**

Suppress the output of MCMC samples of SNP effects.

## Bayesian alphabet

**--bayes** S

Specify the Bayesian alphabet for the analysis, e.g. \$S\$. Different alphabet launch different models, which differ in the prior specification for the SNP effects. The available alphabet include

B: Each SNP effect is assumed to have an i.i.d. mixture prior of a t-distribution \$t(0, \tau^2, \nu)\$ with a probability \$\pi\$ and a point mass at zero with a probability \$1-\pi\$.

C: Each SNP effect is assumed to have an i.i.d. mixture prior of a normal distribution \$N(0, \sigma^2)\$ with a probability \$\pi\$ and a point mass at zero with a probability \$1-\pi\$.

S: Similar to C but the variance of SNP effects is related to minor allele frequency (\$p\$) through a parameter \$S\$, i.e. \$\sigma_j^2 = [2p_j(1-p_j)]^S \sigma^2\$.

N: nested BayesC. SNPs within a 0.2 Mb non-overlapping genomic region are collectively considered as a window (specify the distance by

**--wind**0.2). This nested approach speeds up the analysis by skipping over windows with zero effect.NS: nested BayesS.

R: BayesR. Each SNP effect is assumed to have an i.i.d. mixture prior of multiple normal distributions \$N(0, \gamma_k \sigma_k^2)\$ with a probability \$\pi_k\$ and a point mass at zero with a probability \$1-\sum_k \pi_k\$, where \$\gamma_k\$ is a given constant.

**--fix-pi**

An option to fix \$\pi\$ to a constant (the value is specified by the option --pi below). The default setting is to treat π as random and estimate it from the data.

**--pi** 0.05

A starting value for the sampling of π when it is estimated from the data, or a given value for π when it is fixed. The default value is 0.05. When BayesR is used, it is a string seperated by comma where the number of values defines the number of mixture components and each value defines the starting value for each component (the first value is reserved for the zero component); the default values are 0.95,0.03,0.01,0.01.

**--gamma** 0,0.01,0.1,1

When BayesR is used, this speficies the gamma values seperated by comma, each representing the scaling factor for the variance of a mixture component. Note that the number of values should match that in --pi.

**--hsq** 0.5

A starting value for the sampling of SNP-based heritability, which may improve the mixing of MCMC algorithm if it starts with a good estimate. The default value is 0.5.

**--S** 0

A starting value for the sampling of the parameter S (relationship between MAF and variance of SNP effects) in BayesS, which may improve the mixing of MCMC algorithm if it starts with a good estimate. The default value is 0.

**--wind** 0.2

Specify the window width in Mb for the non-overlapping windows in the nested models, e.g. 0.2 Mb. The default value is 1 Mb.

Examples

Standard version of gctb:

```
gctb --bfile test --pheno test.phen --bayes S --pi 0.1 --hsq 0.5 --chain-length 25000 --burn-in 5000 --out test > test.log 2>&1
```

MPI version of gctb (when using intelMPI libraries and two nodes):

```
mpirun -f $PBS_NODEFILE -np 2 gctb_mpi --bfile test --pheno test.phen --bayes S --pi 0.1 --hsq 0.5 --chain-length 25000 --burn-in 5000 --out test > test.log 2>&1
```

The output files include:

**test.log**: a text file of running status, intermediate output and final results;

**test.snpRes**: a text file of posterior statistics of SNP effects;

**test.covRes**: a text file of posterior statistics of covariates;

**test.parRes**: a text file of posterior statistics of key model parameters;

**test.mcmcsamples.CovEffects**: a text file of MCMC samples for the covariates fitted in the model;

**test.mcmcsamples.SnpEffects**: a binary file of MCMC samples for the SNP effects;

**test.mcmcsamples.Par**: a text file of MCMC samples for the key model parameters;

Citations

**GCTB software and BayesS method**

Zeng et al. (2018) Signatures of negative selection in the genetic architecture of human complex traits.
*Nature Genetics*, doi: 10.1038/s41588-018-0101-4.

**BayesR**

Moser et al. (2015) Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. *PLoS Genetics*, 11: e1004969.

**BayesN**

Zeng et al. (2018) A nested mixture model for genomic prediction using whole-genome SNP genotypes. *PLoS One*, 13: e0194683.

**BayesC\$\pi\$**

Habier et al. (2011) Extension of the Bayesian alphabet for genomic selection. *BMC Bioinformatics*, 12: 186.

**BayesB**

Meuwissen et al. (2001) Prediction of total genetic value using genome-wide dense marker maps. *Genetics*, 157: 1819-1829.

## Summary Bayesian Alphabet

### Tutorial

This updated version of the GCTB software (version 2.0) includes summary data based versions of the individual-data Bayesian linear mixed models previously implemented. These methods require summary statistics from genome-wide association studies, which typically include the estimated univariate effect for each single nucleotide polymorphism (SNP), the standard error of the effect, the sample size used to estimate the effect for each SNP, the allele frequency of an allele, and an estimate of LD among SNPs, all of which are easily accessible from public databases.

This tutorial will outline how to use the summary data based methods and accompanies the manuscript Improved polygenic prediction by Bayesian multiple regression on summary statistics. To recreate the tutorial you will need the PLINK 2 software and the updated version of the GCTB software compiled from source or the binary executable in your path. The data for the tutorial are available at Download. The tutorial is designed so that each part can be reconstructed or picked up at any point with the following directory structure.

```
tutorial
gctb # Static binary executable for Linux 64-bit systems
data # 1000 Genomes data in PLINK format
pheno_generation # R scripts for generating simulated phenotypes
pheno # Simulated phenotypes
gwas # PLINK GWAS summary statistics
ldm # LD correlation matrices and scripts to calculate them
ma # Summary statistics in GCTB compatible format
sbayesr # Scripts for running and SBayesR analysis using GCTB and subsequent results
```

#### Data

The tutorial will run through all aspects of the software using genotype data from chromosome 22 of Phase 3 of the 1000 Genomes Project (1000G). The genotype data have been filtered to exclude variants with minor allele frequency (MAF) < 0.01, which left 15,938 SNPs available for analysis. These data include a subset of 378 individuals of European ancestry from the CEU, TSI, GBR and FIN populations.

N.B. The results generated from this example tutorial using data from the 1000G are designed to be lightweight and should be only used to explore how to run summary data based analyses using GCTB. The results do not make the best use of GCTB capabilities given the small sample size \$(N=378)\$ in the example.

#### Phenotype simulation and GWAS

Using these genotypes, phenotypes were generated under the multiple regression model \$y_i = \sum_{j=1}^pw_{ij}\beta_j + \epsilon_i\$, where \$w_{ij} = (x_{ij} − 2q_j)/ \sqrt{2q_j(1 − q_j)}\$
with \$x_{ij}\$ being the reference allele count for the \$i\$th individual at the \$j\$th SNP, \$q_j\$ the allele
frequency of the \$j\$th SNP and \$\epsilon_i\$ was sampled from a normal distribution with mean 0 and variance
\$\text{Var}({\bf W}\boldsymbol\beta)(1/h_{SNP}^2 − 1)\$ such that \$h_{SNP}^2 = 0.1\$ for each of 20 simulation replicates,
which is much larger than the contribution to the genome-wide SNP-based heritability (\$h^2_{SNP}\$) estimate for chromosome 22
for most quantitative traits. All phenotypes were generated using the R programming language with scripts used available in the
`pheno_generation`

subdirectory.

For each scenario replicate, we randomly sampled a new set of 1,500 causal variants. The genetic architecture simulated contained two causal variants of large effect explaining 3% and 2% of the phenotypic variance respectively and a polygenic tail of 1,498 causal variants sampled from a N(0, 0.05/1, 498) distribution such that the expected total genetic variance explained by all variants was 0.1.

For each of the 20 simulation replicates, simple linear regression for each variant was run using the PLINK 2 software to generate summary statistics. Code for running the GWAS and the output is available in the `gwas`

subdirectory.

#### GCTB summary statistics input format

The GCTB summary-based methods have inherited the GCTA-COJO `.ma`

format.

```
SNP A1 A2 freq b se p N
rs1001 A G 0.8493 0.0024 0.0055 0.6653 129850
rs1002 C G 0.0306 0.0034 0.0115 0.7659 129799
rs1003 A C 0.5128 0.0045 0.0038 0.2319 129830
```

Columns are SNP identifier, the effect allele, the other allele, frequency of the effect allele, effect size, standard error, p-value and sample size. The headers are not keywords and will be omitted by the program. Important: "A1" needs to be the effect allele with "A2" being the other allele and "freq" should be the frequency of "A1".

The transformation of your summary statistics file to the `.ma`

format will depend on the software used to analyse your data. The
`ma`

subdirectory contains a simple R script for constructing `.ma`

files from PLINK 2 `--linear`

output. There is no need to filter
you summary statistics to match the LD reference as GCTB will perform this data management for you.

#### GCTB linkage disequilibrium (LD) correlation matrix

The construction of an LD correlation matrix is a key component of the summary data based methods in GCTB. Theoretically, the summary data based methods are equivalent to the individual data methods if the LD matrix is constructed from all variants and the individuals used in the GWAS. Calculating and storing LD matrices that contain all pairwise entries is computationally prohibitive and thus a subset of the matrix entries are stored. Typically, all inter-chromosomal LD is ignored and a within chromosome block diagonal matrix structure is used. Furthermore, the individual level data from the GWAS is not available, which is often the case, and a reference population that is similar in ancestry to the GWAS cohort is used to construct the LD correlation matrix. The use of a subset of the correlation matrix and the construction of the LD matrix from a reference leads to an approximation to the individual data model. The level of deterioration in the model parameter estimates introduced by these approximations is assessed empirically through simulation, real data analysis and comparison with individual-data method estimates.

*Full chromosome-wise LD matrices*

Although possible to create genome-wide LD matrices in GCTB, we recommend to only
calculate matrices for each chromosome. The following is the basic execution of
GCTB LD matrix calculation. The `ldm`

subdirectory has example BASH scripts to
perform this on a HPC system. N.B. the HPC commands may not be compatible with your HPC scheduler
and will require alteration.

```
# Build full LD matrix for chromosome 22
gctb --bfile 1000G_eur_chr22 --make-full-ldm --out 1000G_eur_chr22
```

GCTB produces a `.bin`

file, which contains the matrix elements stored in binary, and a `.info`

file, which looks like

```
Chrom ID GenPos PhysPos A1 A2 A2Freq Index WindStart WindEnd WindSize WindWidth N SamplVar LDsum
22 rs131538 0 16871137 A G 0.066138 0 0 15937 15938 34347869 378 41.92552 10.18005
22 rs9605903 0 17054720 C T 0.264550 1 0 15937 15938 34347869 378 41.95537 -33.15312
22 rs5746647 0 17057138 G T 0.052910 2 0 15937 15938 34347869 378 41.93396 15.90974
22 rs16980739 0 17058616 T C 0.148148 3 0 15937 15938 34347869 378 41.94403 -12.93685
```

The columns headings are chromosome, SNP identifier, genetic position, physical base pair position, SNP allele 1, SNP allele 2, frequency of the A2 allele, variant index, for this SNP (row) the first column that is non-zero in the LD matrix, the last column that is non-zero, how many non-zero elements for this row, difference in base pair position between WindStart and WindEnd, the sample size used to calculate the correlation, sum of sampling variance of the LD correlation estimate for the SNP and all other SNPs on the chromosome and the sum of the non-zero LD correlation estimates between this SNP and all other SNPs.

*Full chromosome-wise LD matrices using multiple CPUs*

To reduce the computational burden of computing LD correlation matrices, GCTB can partition the genotype file into user-specified chunks.

```
# Build full LD matrix for chromosome 22 for the first 5000 variants
gctb --bfile 1000G_eur_chr22 --make-full-ldm --snp 1-5000 --out 1000G_eur_chr22
```

When the `--snp`

option is invoked the GCTB software will append `.snp1-5000`

, for example, to the output files
so no separate extension to `--out`

is required.

For the 1000G chromosome 22 example the following files are generated

```
1000G_eur_chr22.snp1-5000.ldm.full.bin
1000G_eur_chr22.snp1-5000.ldm.full.info
1000G_eur_chr22.snp5001-10000.ldm.full.bin
1000G_eur_chr22.snp5001-10000.ldm.full.info
1000G_eur_chr22.snp10001-15000.ldm.full.bin
1000G_eur_chr22.snp10001-15000.ldm.full.info
1000G_eur_chr22.snp15001-20000.ldm.full.bin
1000G_eur_chr22.snp15001-20000.ldm.full.info
```

To merge the files, a separate file with the list of files to be joined is required. For our example is looks like

```
cat 1000G_eur_chr22.mldmlist
1000G_eur_chr22.snp1-5000.ldm.full
1000G_eur_chr22.snp5001-10000.ldm.full
1000G_eur_chr22.snp10001-15000.ldm.full
1000G_eur_chr22.snp15001-20000.ldm.full
```

```
# Merge the LD matrix chunks into a single file
gctb --mldm 1000G_eur_chr22.mldmlist --make-full-ldm --out 1000G_eur_chr22
```

*Shrunk LD matrix*

The shrinkage estimator for the LD correlation matrix was originally proposed by Wen and Stephens (2010). The estimator shrinks the off-diagonal entries of the sample LD matrix toward zero. Zhu and Stephens (2017) used the shrinkage estimator in their Regression with Summary Statistics (RSS) methodology and showed empirically that it can provide improved inference. The shrinkage estimator overcomes some of the issues arising from approximating the full LD matrix in the summary data based model using a subset of the full LD matrix and constructing the matrix from a reference. The GCTB implementation is a C++ port from that provided with the RSS software and has been adapted for use with the GCTB software.

The calculation of the LD correlation matrix shrinkage estimate requires a genetic map. The genetic map files can be provided by the user but the tutorial uses interpolated map positions for the CEU population generated from the 1000G OMNI arrays that were downloaded from the from https://github.com/joepickrell/1000-genomes-genetic-maps. The file format looks like

```
rs149201999 16050408 0.0
rs146752890 16050612 0.0
rs139377059 16050678 0.0
rs188945759 16050984 0.0
rs6518357 16051107 0.0
rs62224609 16051249 0.0
rs62224610 16051347 0.0
rs143503259 16051453 0.0
rs192339082 16051477 0.0
rs79725552 16051480 0.0
```

and contains SNP identifier, physical base pair position and genetic position (cumulative). GCTB can use any genetic map provided it adheres to this format.

In this tutorial, the calculation of the shrunk LD matrix requires the effective population sample size, which we set to be 11,400 (as in Zhu and Stephens (2017)), the sample size of the genetic map reference, which corresponds to the 183 individuals from the CEU cohort of the 1000G used to create the genetic map, and the hard threshold on the shrinkage value, which we set to \$10^{−5}\$ and is the default value. These three parameters can be changed with

```
--ne # Effective population size. Default is 11,400
--genmap-n # Genetic map sample size. Default is 183
--shrunk-cutoff # Shrinkage hard threshold. Default is 10^-5
```

The construction of the shrunk LD matrix is similar to that for the full LD matrix

```
# Build shrunk LD matrix for chromosome 22
gctb --bfile 1000G_eur_chr22 \
--make-shrunk-ldm \
--gen-map chr22.OMNI.interpolated_genetic_map \
--out 1000G_eur_chr22
```

The GCTB can also perform shrunk matrix calculation using multiple CPUs with a similar process to the above calculation of the full LD matrix.

```
# Build shrunk LD matrix for chromosome 22 for the first 5000 variants
gctb --bfile 1000G_eur_chr22 \
--make-shrunk-ldm \
--gen-map chr22.OMNI.interpolated_genetic_map \
--snp 1-5000 \
--out 1000G_eur_chr22
```

Please see the `/ldm`

subdirectory for example BASH scripts of performing these tasks
both locally and on a HPC.

*Sparse LD matrix*

Motivated by the computational practicality of not storing the genome-wide LD correlation matrix we wish to set a subset of elements of the LD correlation to zero. The calculation of an LD correlation matrix from genotype data is an estimate of the true LD correlation matrix for a population. It therefore contains a mix of true LD and those values that differ from zero due to sampling variation, which we wish to ignore. By default we ignore interchromsomal LD between markers. Within chromosome, we set elements to zero if their chi-squared statistic under the sampling distribution of the correlation coefficient does not exceed a user-chosen threshold.

Specifically, from Lynch and Walsh (1998) the sampling variation of the correlation coefficient \$r = \frac{\text{cov}(x, y)}{[\text{var}(x)\text{var}(y)]^{1/2}}\$ is
\$\sigma^2(r) \approx \frac{(1-\rho^2)^2}{n}\$, which under the null hypothesis that \$\rho = 0\$ is just \$\frac{1}{n}\$. Given this we can
generate a chi-squared statistic for each element of the LD matrix by \$(\frac{r}{\sqrt{1/n}})^2\$. These statistics should be \$\chi_1^2\$ distributed
and thus for each LD matrix element we set it to zero if the statistic does not exceed a user-specified arbitrary value. The chi-squared
statistic can be altered with `--chisq`

. The default value is 10, which we have found empirically to be a good balance between
matrix sparsity and retaining true positive LD values.

```
# Build sparse LD matrix for chromosome 22
gctb --bfile 1000G_eur_chr22 --make-sparse-ldm --out 1000G_eur_chr22
# Sparse matrices can also be built using multiple CPUs
gctb --bfile 1000G_eur_chr22 --make-sparse-ldm --snp 1-5000 --out 1000G_eur_chr22
```

GCTB can alter current LD matrices, for example, make a full matrix sparse.

```
# Make a full LD matrix sparse
gctb --ldm 1000G_eur_chr22.ldm.full --make-sparse-ldm --out 1000G_eur_chr22
```

We recommend that the shrunk LD matrices be stored in sparse format for efficient storage and computation. The shrunk LD correlation matrix estimate will have zeroes which can be eliminated as follows

```
# Make a shrunk LD matrix sparse by eliminating elements equal to 0. Output sparse
gctb --ldm 1000G_eur_chr22.ldm.shrunk --make-sparse-ldm --chisq 0 --out 1000G_eur_chr22
```

*Extra options for LD matrix calculation*

```
# For checking or comparison with other methods you can write the LD correlation out in
# text format.
--write-ldm-txt # Write LD matrix as text file as well as binary. Will generate large files.
# Examples
gctb --bfile 1000G_eur_chr22 --make-shrunk-ldm --write-ldm-txt --out 1000G_eur_chr22
gctb --bfile 1000G_eur_chr22 --make-full-ldm --write-ldm-txt --out 1000G_eur_chr22
# Read a list of LD matrix files. This would be used to read the list of chromosome wise LD
# matrices when performing genome-wide analyses. Also used for merging matrix chunks for
# multiple CPU matrix building
--mldm
```

#### Running an SBayesR analysis

With the GWAS summary statistics in `.ma`

format and the LD correlation matrix calculated, an SBayesR analysis can be performed by

```
# Calling SBayesR
gctb --sbayes R
--ldm ../ldm/sparse/chr22/1000G_eur_chr22.ldm.sparse
--pi 0.95,0.02,0.02,0.01
--gamma 0.0,0.01,0.1,1
--gwas-summary ../ma/sim_1.ma
--chain-length 10000
--burn-in 2000
--out-freq 10
--out sim_1
```

`--sbayes R`

Specify the summary data based Bayesian alphabet model for the analysis, e.g., `R`

. Different letters of the alphabet launch
different models, which differ in the prior specification for the SNP effects. `R`

and `C`

are currently available.

`--pi 0.95,0.02,0.02,0.01`

Specific to the `R`

model. A comma separated string where the number of values defines the number of mixture components and each value defines the starting value for each component (the first value is reserved for the zero component).
These must sum to 1.
The length of the \$\pi\$ and \$\gamma\$ vectors stipulates how many components are included in the finite mixture of normals distribution prior
for the genetic effects. The default values are as specified in the example.

`--gamma 0,0.01,0.1,1`

Specific to the `R`

model. Specifies the gamma values separated by comma, each representing the scaling factor for the
the variance of the distribution of genetic effects \$\sigma_{\beta}^2\$ for each distribution. Note that the vector length should match
that in `--pi`

. The default values are as specified in the example. These differ from the original BayesR model as the
weights are for the variance of the distribution of genetic effects \$\sigma_{\beta}^2\$ rather than the genetic variance \$\sigma_g^2\$.

`--gwas-summary`

Path to the GWAS summary statistics in `.ma`

format.

The other options are inherited from the individual-data GCTB models.

*Standard output*

Upon execution of GCTB, preliminary data management information will be outputted to standard out. A key summary of the data input is also displayed, for example,

```
Data summary:
mean sd
GWAS SNP Phenotypic variance 1.002 0.052
GWAS SNP sample size 378 0
GWAS SNP effect 0.000 0.117
GWAS SNP SE 0.106 0.049
MME left-hand-side diagonals 379.478 19.774
MME right-hand-side 0.130 19.983
LD sampling variance 1.298 0.076
LD score 17.359 15.140
```

```
```

These statistics can be used to diagnose errors between individual runs of the program.
For each row the mean and standard deviation (sd) of the parameter distribution are displayed. The
acronym MME represents mixed model equations. The left-hand-side diagonals
are the elements of \${\bf D}=\text{diag}({\bf X'}{\bf X})\$, where \$D_j=2p_jq_jn_j\$
under Hardy-Weinberg equilibrium and \$n_j\$ is the sample
size of variant \$j\$. The right-hand-side is equal to \${\bf Db}\$ where \${\bf b}\$ is the \$p\times 1\$ vector of marginal
effect estimates from GWAS. LD sampling variance and LD score are
as described in the `.info`

file.

Post launch of a summary data model the sampled values of key model parameters are printed to standard out in real time. For example, during an SBayesR analysis the following snapshot shows what is being printed

```
Iter Pi1 Pi2 Pi3 Pi4 NnzSnp SigmaSq ResVar GenVar hsq Rounding TimeLeft
10 0.9513 0.0162 0.0170 0.0155 795 0.0001 0.8905 0.0260 0.0284 0.0000 0:16:39
20 0.9507 0.0112 0.0172 0.0208 761 0.0001 0.9421 0.0213 0.0221 0.0000 0:8:19
30 0.9479 0.0041 0.0201 0.0279 761 0.0000 0.9747 0.0249 0.0249 0.0000 0:5:32
40 0.9448 0.0046 0.0187 0.0319 865 0.0000 0.9117 0.0203 0.0218 0.0000 0:4:9
50 0.9445 0.0054 0.0228 0.0272 893 0.0000 1.0019 0.0223 0.0218 0.0000 0:6:38
60 0.9409 0.0019 0.0223 0.0349 930 0.0000 1.0481 0.0265 0.0247 0.0000 0:5:31
```

The parameters printed are iteration number, which can be altered with `--out-freq`

, proportion
of variants in distribution one (zero component) to four (can be altered by changing the \$\pi\$
and \$\gamma\$ vector lengths), number of non-zero effects in the model at iteration \$i\$,
the variance of the distribution of genetic effects \$\sigma_{\beta}^2\$, the residual variance
\$\sigma_{\epsilon}^2\$, the genetic variance \$\sigma_{g}^2\$ SNP-based heritability, rounding accumulation (a value
substantially greater than zero indicates insufficient machine precision and the user should consider terminating the
program), and an estimate of
the time left to complete the analysis. Upon completion of the analysis a summary of the
posterior mean and standard deviation of the key models parameters is reported to standard
out

```
Posterior statistics from MCMC samples:
Mean SD
Pi1 0.942564 0.006791
Pi2 0.056959 0.006728
Pi3 0.000305 0.000274
Pi4 0.000172 0.000094
NnzSnp 1726.798706 199.570724
SigmaSq 0.002898 0.000375
ResVar 0.900596 0.004205
GenVar 0.103104 0.002216
hsq 0.102722 0.002039
```

The output files are the same as those from the individual data models and include:

```
test.log: # Text file of running status, intermediate output and final results
test.snpRes: # Text file of posterior statistics of SNP effects
test.parRes: # Text file of posterior statistics of key model parameters
test.mcmcsamples.SnpEffects: # Binary file of MCMC samples for the SNP effects
test.mcmcsamples.Par: # Text file of MCMC samples for the key model parameters
```

*Extra commands for running SBayes analyses*

```
--unscale-genotype # Run analyses assuming genotypes are not scaled to unit variance. Default is scaled genotypes
--exclude-mhc # Exclude variants in the Major Histocompatibility Complex from the analysis. Highly recommended
# when performing genome-wide analyses
--exclude-region # Path to file where each line had three entries: chromosome, starting base pair position of
# region and end base pair of region
--impute-n # Let GCTB impute the sample size for each variant using an algorithm if you cannot trust
# the sample size reported. Filters variants with a sample size further than 3 SD from the median of
# the sample size distribution. The method assumes each SNP has been calculated from the same set
# of individuals.
```

#### Summary

The method implemented in GCTB assumes certain ideal data constraints such as summary data computed from a single set of individuals at fully observed genotypes as well as minimal imputation error and data processing errors such as allele coding and frequency mismatch. Summary data in the public domain often substantially deviate from these ideals and can contain residual population stratification, which is not accounted for in this model. Practical solutions to these ideal data deviations include the use of data that are imputed and the restriction of analyses to variants that are known to be imputed with high accuracy. We found that the simple filtering of SNPs with sample sizes that deviate substantially from the mean across all variants from an analysis, when using summary statistics from the public domain substantially improved model convergence. However, even the reported sample size cannot be trusted in some summary data sets.

We explored LD pruning of variants to remove variants in very high LD (\$R^2>0.99\$) but found that this did not substantially improve model convergence or parameter estimates although this was not formally assessed. However, removal of high LD regions, such as the MHC region improved model convergence for real traits. High LD regions are expected to have the potential to be extreme sources of model misspecification with the model expecting summary data in to be very similar for variants in high LD. Small deviations due to data error not expected in the model likelihood at these loci thus have high potential to lead to model divergence (see Zhu and Stephens (2017) for further discussion). We strongly recommend reading the current manuscript and that of Zhu and Stephens (2017) as a preliminary to running the summary data based methods in GCTB.