Practical 3: Using GCTA to estimate heritability

This runs on the GGWS server.

Preparation

First login

ssh yourlogincredentials@203.101.229.143

We provide a R script named simulate-data-practical3.R, which will generate the data (PLINK files: mydata.ped and mydata.map; two phenotypes in mydata.phen) needed for this practical.

First, use the terminal create a folder (in the /scratch/module3/) where the data will be stored (e.g., myFolderPrac3 but you can use another name of course).

cd /scratch/module3/
mkdir yourFolder
cd yourFolder/
mkdir prac3

Move in the folder

cd prac3

Copy the script to generate the data (the “.” at the end of the command is mandatory to specify that the files are to be copied in the current folder)

cp /data/module3/Prac3_GCTA/simulate-data-practical3.R .

Generate the data using the following Rscript command (takes ~1 minute). Don’t forget to copy the “.” at the end of the command.

Rscript simulate-data-practical3.R .

Check that the data have been correctly generated

ls -l

Convert PLINK files from *.ped format to *.bed format

plink --file mydata --make-bed --out mydata

Part 1. Using realised relationship based on Identity-By-State (IBS)

Exercise 1: Calculate the Genetic Relationship Matrix (GRM) using all SNPs

gcta --bfile mydata --make-grm-bin --out mydata_allSNPs

Questions:

1) How many individuals/SNPs do you have in the sample?

2) What is the mean of the diagonal elements of the GRM? Is that expected?

3) What is the variance of the off-diagonal elements? The inverse of that variance gives an estimate of the effective number of independent markers. Calculate that effective number of markers. Why can it be different from the actual number of SNPs used to build the GRM?

Exercise 2: Calculate GRM using SNPs with MAF>0.05 and MAF<0.05

gcta --bfile mydata --maf 0.05 --make-grm-bin --out mydata_maf_above_0.05

gcta --bfile mydata --max-maf 0.05 --make-grm-bin --out mydata_maf_below_0.05

Questions:

1) How many SNPs have a MAF<0.05?

2) Why is the variance of diagonal elements larger in the GRM based on SNPs with MAF<0.05?

Some more data processing…

## Creating mgrm.txt file (with the "prefix" of the MAF stratified GRMs)
echo Creating mgrm.txt file
echo mydata_maf_below_0.05 > mgrm.txt
echo mydata_maf_above_0.05 >> mgrm.txt

[Note: in general it is preferable to use the full path for defining the mgrm.txt file]

Exercise 3: Identify (genetically) unrelated individuals

gcta --grm mydata_allSNPs --grm-singleton 0.05 --out unrelated

Question: How many individuals are unrelated in this sample? GRM cut-off 0.05? cut-off 0.025?

Exercise 4: GREML estimation of heritability with and without relatives

gcta --grm mydata_allSNPs --pheno mydata.phen --mpheno 1 \
--reml --out trait1_with_relatives

gcta --grm mydata_allSNPs --pheno mydata.phen --mpheno 1 \
--reml --out trait1_without_relatives --grm-cutoff 0.05

Questions: Compare heritability estimates from these two analyses. What could explain the difference that you observe?

Exercise 5: MAF-stratified heritability estimation (Note: this is using a different phenotype than in Exercise 4)

gcta --grm mydata_allSNPs --pheno mydata.phen --mpheno 2 \
--reml --keep unrelated.singleton.txt --out trait2_allSNPs

gcta --mgrm mgrm.txt --pheno mydata.phen --mpheno 2 \
--reml --keep unrelated.singleton.txt --out trait2_maf_stratified

Questions: Compare heritability estimates from these two analyses

1) What could explain the difference that you observe? 2) Which one of these estimates should you trust more? How can you be sure?

Part 2. Using realised relationship based on Identity-By-Descent (IBD)

In the first part of the practical we used Identity-By-State (IBS) between (conventionally) unrelated individuals to estimate heritability. Here, we will estimate heritability by leveraging the fact that within-family segregation generates an observable variation of realized relatedness (i.e., proportion of the genome shared Identical By Descent - IBD) around their expected value given the pedigree relationship.

We provide a dataset named sib.RData, which contains 100,000 sibling pairs. The phenotype of siblings are denoted PhenoSib1 and PhenoSib2 respectively. We also provide the proportion of their genome that is IBD.

Load the data in R using the following command

load("/data/module3/Prac3_GCTA/sib.RData")

The command above will load an R data.frame object named sibdt. Here is an overview the data

head(sibdt,4)

Question 1. Calculate the mean and standard deviation of the IBD proportion (10th column). Are these results (at least the mean) expected?

Question 2. Calculate the phenotypic correlation between siblings. Under what assumption(s) is twice that correlation an unbiased estimator of the narrow sense heritability (Hint: You may refer to Practical 2 to answer the second part of the question)?

We now use Haseman-Elston (HE) regression to estimate heritability. Run the following command

summary( lm(I(PhenoSib1 * PhenoSib2) ~ IBD, data=sibdt) )$coefficients

Question 3. What is the heritability of the trait of interest? Is the result different from that obtained for Question 2? What could be the interpretation of the intercept?

We now select the first 5000 pairs to estimate heritability using the Restricted Maximum Likelihood (REML) implemented in GCTA. We only use 5000 pairs during the practical to reduce computational time. Analyzing 100,000 pairs would take days of calculations.

First run the following R command to convert the data into GCTA’s GRM format.

prefix <- "sibdt5k"
n      <- 5000
N      <- 2*n
G      <- diag(N)
iid <- do.call("c",lapply(1:n,function(i){ 
  as.character( sibdt[i,c("IID1","IID2")])
} ) )
phn <- do.call("c",lapply(1:n,function(i) {
  as.numeric( sibdt[i,c("PhenoSib1","PhenoSib2")])}
) )
fid <- rep(1:n,each=2)

pheno10ksib <- cbind.data.frame(fid,iid,phn)
write.table(pheno10ksib,paste0(prefix,".phen"),quote=F,row.names = F,col.names = F)

for(i in 1:n){
  G[2*(i-1)+1,2*i] <- G[2*i,2*(i-1)+1] <- sibdt[i,"IBD"]
}

size <- 4 #defines how many bytes of memory will be used to represent GRM values
collapsed.grm <- G[upper.tri(G,diag = TRUE)] #Create a vector of GRM values
write.table(cbind(fid,iid),paste0(prefix,".grm.id"),quote=F,row.names=F,col.names = F)

BinFileName <- paste0(prefix,".grm.bin") #Name the GRM
BinFile     <- file(BinFileName, "wb")   #Create the file
writeBin(con = BinFile, collapsed.grm, size = size) #Write the file, size = 4
close(BinFile)

Now run GCTA with the --reml-wfam option, which performs REML assuming (in this case knowing) that all families are independent. This is much faster than using the standard --reml flag. For simplicity, we will call GCTA from R using the system() function. Type the following command in your R console.

Output from GCTA will simply be displayed in the console but actual results are still stored in the separate files.

system("gcta --grm sibdt5k --pheno sibdt5k.phen --reml-wfam --out sibdt5k")

Question 4. What is the REML heritability estimates and its standard error? Compare your results with those from HE regression above and from HE regression using the first 5000 pairs. For the same sample size (n=5000 sibling pairs), how much larger are standard errors from HE regression relative to that of REML?

summary( lm(I(PhenoSib1 * PhenoSib2) ~ IBD, data=sibdt[1:5000,]) )$coefficients

Practical 3: Using GCTA to estimate heritability

Genetics and Genomics Winter School (Module 3)

Preparation

Part 1. Using realised relationship based on Identity-By-State (IBS)

Part 2. Using realised relationship based on Identity-By-Descent (IBD)