This runs on the GGWS server.
First login
ssh yourlogincredentials@203.101.229.143
We provide a R
script named
simulate-data-practical3.R
, which will generate the data
(PLINK files: mydata.ped
and mydata.map
; two
phenotypes in mydata.phen
) needed for this practical.
First, use the terminal create a folder (in the
/scratch/module3/
) where the data will be stored (e.g.,
myFolderPrac3
but you can use another name of course).
cd /scratch/module3/
mkdir yourFolder
cd yourFolder/
mkdir prac3
Move in the folder
cd prac3
Copy the script to generate the data (the “.” at the end of the command is mandatory to specify that the files are to be copied in the current folder)
cp /data/module3/Prac3_GCTA/simulate-data-practical3.R .
Generate the data using the following Rscript
command
(takes ~1 minute). Don’t forget to copy
the “.” at the end of the command.
Rscript simulate-data-practical3.R .
Check that the data have been correctly generated
ls -l
Convert PLINK files from *.ped
format to
*.bed
format
plink --file mydata --make-bed --out mydata
Exercise 1: Calculate the Genetic Relationship Matrix (GRM) using all SNPs
gcta --bfile mydata --make-grm-bin --out mydata_allSNPs
Questions:
1) How many individuals/SNPs do you have in the sample?
2) What is the mean of the diagonal elements of the GRM? Is that expected?
3) What is the variance of the off-diagonal elements? The inverse of that variance gives an estimate of the effective number of independent markers. Calculate that effective number of markers. Why can it be different from the actual number of SNPs used to build the GRM?
Exercise 2: Calculate GRM using SNPs with MAF>0.05 and MAF<0.05
gcta --bfile mydata --maf 0.05 --make-grm-bin --out mydata_maf_above_0.05
gcta --bfile mydata --max-maf 0.05 --make-grm-bin --out mydata_maf_below_0.05
Questions:
1) How many SNPs have a MAF<0.05?
2) Why is the variance of diagonal elements larger in the GRM based on SNPs with MAF<0.05?
Some more data processing…
## Creating mgrm.txt file (with the "prefix" of the MAF stratified GRMs)
echo Creating mgrm.txt file
echo mydata_maf_below_0.05 > mgrm.txt
echo mydata_maf_above_0.05 >> mgrm.txt
[Note: in general it is preferable to use the full path for defining
the mgrm.txt
file]
Exercise 3: Identify (genetically) unrelated individuals
gcta --grm mydata_allSNPs --grm-singleton 0.05 --out unrelated
Question: How many individuals are unrelated in this sample? GRM cut-off 0.05? cut-off 0.025?
Exercise 4: GREML estimation of heritability with and without relatives
gcta --grm mydata_allSNPs --pheno mydata.phen --mpheno 1 \
--reml --out trait1_with_relatives
gcta --grm mydata_allSNPs --pheno mydata.phen --mpheno 1 \
--reml --out trait1_without_relatives --grm-cutoff 0.05
Questions: Compare heritability estimates from these two analyses. What could explain the difference that you observe?
Exercise 5: MAF-stratified heritability estimation (Note: this is using a different phenotype than in Exercise 4)
gcta --grm mydata_allSNPs --pheno mydata.phen --mpheno 2 \
--reml --keep unrelated.singleton.txt --out trait2_allSNPs
gcta --mgrm mgrm.txt --pheno mydata.phen --mpheno 2 \
--reml --keep unrelated.singleton.txt --out trait2_maf_stratified
Questions: Compare heritability estimates from these two analyses
1) What could explain the difference that you observe? 2) Which one of these estimates should you trust more? How can you be sure?
In the first part of the practical we used Identity-By-State (IBS) between (conventionally) unrelated individuals to estimate heritability. Here, we will estimate heritability by leveraging the fact that within-family segregation generates an observable variation of realized relatedness (i.e., proportion of the genome shared Identical By Descent - IBD) around their expected value given the pedigree relationship.
We provide a dataset named sib.RData
, which contains
100,000 sibling pairs. The phenotype of siblings are denoted
PhenoSib1
and PhenoSib2
respectively. We also
provide the proportion of their genome that is IBD.
Load the data in R using the following command
load("/data/module3/Prac3_GCTA/sib.RData")
The command above will load an R data.frame
object named
sibdt
. Here is an overview the data
head(sibdt,4)
Question 1. Calculate the mean and standard deviation of the IBD proportion (10th column). Are these results (at least the mean) expected?
Question 2. Calculate the phenotypic correlation between siblings. Under what assumption(s) is twice that correlation an unbiased estimator of the narrow sense heritability (Hint: You may refer to Practical 2 to answer the second part of the question)?
We now use Haseman-Elston (HE) regression to estimate heritability. Run the following command
summary( lm(I(PhenoSib1 * PhenoSib2) ~ IBD, data=sibdt) )$coefficients
Question 3. What is the heritability of the trait of interest? Is the result different from that obtained for Question 2? What could be the interpretation of the intercept?
We now select the first 5000 pairs to estimate heritability using the Restricted Maximum Likelihood (REML) implemented in GCTA. We only use 5000 pairs during the practical to reduce computational time. Analyzing 100,000 pairs would take days of calculations.
First run the following R
command to convert the data
into GCTA
’s GRM format.
prefix <- "sibdt5k"
n <- 5000
N <- 2*n
G <- diag(N)
iid <- do.call("c",lapply(1:n,function(i){
as.character( sibdt[i,c("IID1","IID2")])
} ) )
phn <- do.call("c",lapply(1:n,function(i) {
as.numeric( sibdt[i,c("PhenoSib1","PhenoSib2")])}
) )
fid <- rep(1:n,each=2)
pheno10ksib <- cbind.data.frame(fid,iid,phn)
write.table(pheno10ksib,paste0(prefix,".phen"),quote=F,row.names = F,col.names = F)
for(i in 1:n){
G[2*(i-1)+1,2*i] <- G[2*i,2*(i-1)+1] <- sibdt[i,"IBD"]
}
size <- 4 #defines how many bytes of memory will be used to represent GRM values
collapsed.grm <- G[upper.tri(G,diag = TRUE)] #Create a vector of GRM values
write.table(cbind(fid,iid),paste0(prefix,".grm.id"),quote=F,row.names=F,col.names = F)
BinFileName <- paste0(prefix,".grm.bin") #Name the GRM
BinFile <- file(BinFileName, "wb") #Create the file
writeBin(con = BinFile, collapsed.grm, size = size) #Write the file, size = 4
close(BinFile)
Now run GCTA
with the --reml-wfam
option,
which performs REML assuming (in this case knowing) that all families
are independent. This is much faster than using the standard
--reml
flag. For simplicity, we will call GCTA
from R
using the system()
function. Type the
following command in your R
console.
Output from GCTA will simply be displayed in the console but actual results are still stored in the separate files.
system("gcta --grm sibdt5k --pheno sibdt5k.phen --reml-wfam --out sibdt5k")
Question 4. What is the REML heritability estimates and its standard error? Compare your results with those from HE regression above and from HE regression using the first 5000 pairs. For the same sample size (n=5000 sibling pairs), how much larger are standard errors from HE regression relative to that of REML?
summary( lm(I(PhenoSib1 * PhenoSib2) ~ IBD, data=sibdt[1:5000,]) )$coefficients