Genome wide complex trait analysis (GCTA)
GCTA software is a command line tool that has the most support on the
Linux operating system. The GCTA program uses an argument interface
similar to PLINK. It is advised that you visit GCTA’s website http://cnsgenomics.com/software/gcta/ and familiarise
yourself a little with the software. GCTA has many functions but one of
its primary uses is variance component estimation via restricted maximum
likelihood (REML). GCTA is available for Linux, MAC and Windows
environments.
Similarly to previous practicals, you will most likely use GCTA in a
Linux environment (e.g. HPC), thus all command lines in this material
are for Linux users.
PART I: Estimation of non-additive genetic variance from a sample of
unrelated individuals
In this part, you will perform a multi-component GREML analysis to
jointly estimate the additive and dominance genetic variance of a trait
from a sample of unrelated individuals. The different steps to perform
the analysis are detailed below.
Since you will most likely use GCTA in a Linux environment
(e.g. HPC), all command lines in this material are for Linux users.
Because of time constraints, we already performed steps 1 to
4 and you should only run the commands from step 5.
Step 1: Data QC
Standard Data QC (see Module 1: GWAS) must be done. The data used in
this practical are already QCed and consist of 11,780 individuals
genotyped at 273,469 SNPs.
Step 5: GREML analysis and joint estimation of the proportion of
phenotypic variance explained by the additive and dominance genetic
variance
mkdir prac10 # Create a directory where you will save the analysis output
cd prac10
mgrm="/data/module4/prac10/GRMs/mgrm.txt "
pheno="/data/module4/prac10/Data/ToyExample.pheno"
gcta --mgrm ${mgrm} --pheno ${pheno} --reml --reml-no-constrain --threads 3 --out analysis_output
This GREML analysis will generate two files:
analysis_output.log: the log file with all information about the
GREML analysis that you performed
analysis_output.hsq: rows are:
- header line;
- name of genetic variance, estimate and standard error (SE);
- residual variance, estimate and SE;
- phenotypic variance, estimate and SE;
- ratio of genetic variance to phenotypic variance, estimate
and SE;
- log-likelihood;
- sample size.
If there are multiple GRMs included in the GREML analysis, there will
be multiple rows for the genetic variance (as well as their ratios to
phenotypic variance) with the names of V(1), V(2), … .
Questions
Question 1: Given the sample size N=11,730 and the
observed GRMs, what are the theoretical expectations of the standard
errors (SEs) of the estimates for \(\widehat{h_{SNP}^2}\) and \(\widehat{\delta_{SNP}^2}\)?
Note: You can read the GRM binary file in R using the following R
function:
# R script to read the GRM binary file
ReadGRMBin=function(prefix, AllN=F, size=4){
sum_i=function(i){
return(sum(1:i))
}
BinFileName=paste(prefix,".grm.bin",sep="")
NFileName=paste(prefix,".grm.N.bin",sep="")
IDFileName=paste(prefix,".grm.id",sep="")
id = read.table(IDFileName)
n=dim(id)[1]
BinFile=file(BinFileName, "rb");
grm=readBin(BinFile, n=n*(n+1)/2, what=numeric(0), size=size)
NFile=file(NFileName, "rb");
if(AllN==T){
N=readBin(NFile, n=n*(n+1)/2, what=numeric(0), size=size)
}
else N=readBin(NFile, n=1, what=numeric(0), size=size)
i=sapply(1:n, sum_i)
return(list(diag=grm[i], off=grm[-i], id=id, N=N))
}
The function return a list of 4 objects:
diag (the diagonal elements of the GRM)
off (the off-diagonal elements of the GRM)
id (individual and family IDs)
N (the number of markers used in the GRM)
Question 2: What are the estimated \(h_{SNP}^2\) and \(\delta_{SNP}^2\) from the GREML
analysis?
Question 3: Are the reported SEs of the estimates in
accordance with theoretical expectations? What are the 95% confidence
intervals (CIs) of the estimates?
Answers
Question 1: On the server, open R and load the GRMs
binary files and compute the variance of the off-diagonal elements.
Theoretical standard errors of the narrow-sense and dominance
heritability can be computed given these values and the sample size
N=11,730:
#In R: load the function ReadGRMBin.
ReadGRMBin=function(prefix, AllN=F, size=4){
sum_i=function(i){
return(sum(1:i))
}
BinFileName=paste(prefix,".grm.bin",sep="")
NFileName=paste(prefix,".grm.N.bin",sep="")
IDFileName=paste(prefix,".grm.id",sep="")
id = read.table(IDFileName)
n=dim(id)[1]
BinFile=file(BinFileName, "rb");
grm=readBin(BinFile, n=n*(n+1)/2, what=numeric(0), size=size)
NFile=file(NFileName, "rb");
if(AllN==T){
N=readBin(NFile, n=n*(n+1)/2, what=numeric(0), size=size)
}
else N=readBin(NFile, n=1, what=numeric(0), size=size)
i=sapply(1:n, sum_i)
return(list(diag=grm[i], off=grm[-i], id=id, N=N))
}
#Load the two GRMs
GRMA=ReadGRMBin(prefix="/data/module4/prac10/GRMs/ToyExample_GRM05")
GRMD=ReadGRMBin(prefix="/data/module4/prac10/GRMs/ToyExample_GRM05.d")
#Calculate the variance of the off-diagonal elements of the two GRMs
var_GRMA=var(GRMA$off)
var_GRMD=var(GRMD$off)
#Compute the theoretical standard error of the estimates given these values and the sample size N=11,730
N=11730
theo_se_h2=sqrt(2/(var_GRMA * N^2)) #0.02824041
theo_se_d2=sqrt(2/(var_GRMD * N^2)) #0.04019642
Theoretical expectation of the standard error of the estimates are
0.028 for \(SE(\widehat{h^2_{SNP}})\)
and 0.040 for \(SE(\widehat{\delta^2_{SNP}})\).
Question 2: the estimated SNP-based heritability
from GCTA is 0.46 and dominance heritability is 0.13.
Question 3: The reported standard errors of the
narrow sense and dominance heritability are respectively 0.032 and
0.040, in accordance with their theoretical expectations. The 95% CIs
are computed as \(estimate +- 1.96\times
SE\), resulting in \(CI95
~\widehat{h_{SNP}^2}=[0.41;0.53]\) and \(CI95 ~
\widehat{\delta_{SNP}^2}=[0.05;0.20]\).
PART II: Estimation of genetic correlation between two traits:
Bivariate-GREML
In this part, you will perform a bivariate GREML analysis with GCTA
to estimate the genetic correlation between two simulated traits using a
sample of unrelated individuals.
Repeat step 1 and 2 from PART I
Questions
Question 1: How many individuals are included in the
analysis for traits 1 and 2?
Question 2: What are the estimated SNP-based
heritability and genetic correlation of the two traits? What are the 95%
confidence intervals?
Answers
Question 1: From the gcta log file, 3000 individuals
are phenotyped for traits 1 and 2 with a complete sample overlap.
Question 2: The estimated SNP-based heritability are
0.59 and 0.39 for trait 1 and 2 respectively. The estimated genetic
correlation between the two traits is 0.66. The 95% confidence intervals
are \(CI95
~\widehat{h_{SNP\_Trait1}^2}=[0.36;0.83]\), \(CI95 ~
\widehat{h_{SNP\_Trait2}^2}=[0.15;0.62]\) and \(CI95 ~ rg=[0.44;0.88]\)