Practical 2: Correlation between relatives and estimation of (co)variance components

This practical can be run on your own computer (if R is installed) or on the server.

Within-family variance

This part of the practical aims to illustrate how much genetic (hence potential phenotypic) variance exists within a family.

The rest of Practical 2 addresses how to estimate the genetic variance using a pedigree-based design across multiple families. Here, we will assume that a single family can have an arbitrary large number of offspring such that we could directly observed the variance of phenotypes across all offspring. The code below defines a R function that simulates the phenotypes of members of a nuclear family with two parents and nOffsprings offspring (e.g., nOffsprings = 1000 or more).

Step 1: Copy/Execute the R code below.

set.seed(12345678)
simPhenoFamily <- function(
    nOffsprings, # number of children
    Vg,          # (Additive) Genetic variance in the population
    Ve,          # Environmental variance
    M            # number of causal SNPs
    ){
  
  ## Allele Frequencies are simulated using a uniform distribution between 0.05 and 0.95
  AlleleFrequencies   <- runif(M,min=0.05,max=0.95)
  
  # Expected heterozygosity under Hardy-Weinberg Equilibrium
  ExpectHeteHWE       <- 2 * AlleleFrequencies * (1 - AlleleFrequencies)
  
  # SNP effects are simulated using a normal distribution with mean 0 and variance
  SNPEffects          <- rnorm(M,mean=0,sd=sqrt(Vg/(M * ExpectHeteHWE)))
  
  # Expected Population mean
  PopulationMeanPheno <- sum(2 * AlleleFrequencies * SNPEffects) 
  
  ## Simulate SNPs in derived populations
  genoParent1  <- sapply(1:M,function(j) rbinom(1,size=2,prob=AlleleFrequencies[j]))
  genoParent2  <- sapply(1:M,function(j) rbinom(1,size=2,prob=AlleleFrequencies[j]))
  
  
  ## Breeding values in the derived population
  geneticValueParent1 <- c(genoParent1%*%SNPEffects)
  geneticValueParent2 <- c(genoParent2%*%SNPEffects)
  
  ## Residuals
  NonGeneticValueParent1  <- rnorm(1,mean=0,sd=sqrt(Ve))
  NonGeneticValueParent2  <- rnorm(1,mean=0,sd=sqrt(Ve))
  
  ## Phenotypes
  PhenoParent1  <- geneticValueParent1 + NonGeneticValueParent1
  PhenoParent2  <- geneticValueParent2 + NonGeneticValueParent2
  
  ## Simulate Offspring
  PhenoOffspring <- rep(NA,nOffsprings)
  for(idOffspring in 1:nOffsprings){
    AlleleTransmittedByParent1 <- rbinom(M,size=1,prob = genoParent1/2)
    AlleleTransmittedByParent2 <- rbinom(M,size=1,prob = genoParent2/2)
    genoOffspring <- AlleleTransmittedByParent1 + AlleleTransmittedByParent2
    geneticValueOffspring        <- c(genoOffspring%*%SNPEffects)
    NonGeneticValueOffspring     <- rnorm(1,mean=0,sd=sqrt(Ve))
    PhenoOffspring[idOffspring]  <- geneticValueOffspring + NonGeneticValueOffspring
  }
  return(c(PhenoParent1,PhenoParent2,PhenoOffspring) - PopulationMeanPheno )
}

Step 2 Visualize within family variance.

In order to appreciate the range of variance that exists within a family, run the command below using a large number of offspring (e.g., nOffsprings = 1000 as in the example below) and also using different values for the genetic variance Vg, the environmental variance Ve.

Vg      <- 1
Ve      <- 0
Vp      <- Vg + Ve
h2      <- Vg / Vp
famData <- simPhenoFamily(nOffsprings = 1000,Vg=Vg,Ve=Ve,M=1000)
meanPar <- mean(famData[c(1,2)])  # mean of parent phenotypes
meanOff <- mean(famData[-c(1,2)]) # mean of offspring phenotypes
varOff  <- var(famData[-c(1,2)])  # variance of offspring phenotypes
cat(paste0("Parent Mean = ",round(meanPar,3),
           " - Offspring Mean = ",round(meanOff,3),
           " - Offspring Variance = ",round(varOff,3),".\n"))

hist(famData[-c(1,2)],xlim= 4 * c(-1,1) * sqrt(Vp),
     main="",xlab="Within-family Phenotype distribution")
abline(v=meanPar,col="dodgerblue",lwd=2,lty=1)
abline(v=meanOff,col="orange",lwd=2,lty=2)
legend("topleft",legend=c("Parent Average","Offspring Average"),
       box.lty=0,lty=1:2,col=c("dodgerblue","orange"))

Question 1. Is the distribution offspring phenotypes centered around the average of parent phenotypes [Hint: you can increase the number of offspring to be sure]? How does this observation vary as you increase the heritability? Is this expected?

Step 3: Determine the relationship between the genetic variance (in the population) and the within-family phenotypic variance.

We provide below a R function named withinFamVar, which gets the genetic variance in the population Vg as input and returns the within-family phenotypic variance varOff. We vary Vg between 0 and 10 (R code: Vgset <- seq(0,10,by=0.1)).

Vgset <- seq(0,10,by=0.1)
withinFamVar <- function(Vg){
  famData <- simPhenoFamily(nOffsprings = 1000,Vg = Vg, Ve = 0, M=1000)
  varOff  <- var(famData[-c(1,2)] - mean(famData[c(1,2)]) )
  return(varOff)
}

## applies withinFamVar to all values of Vg contained in Vgset
withinFamVarset <- sapply(Vgset,withinFamVar)

Question 2. Plot the relationship between Vg and withinFamVar. What relationship do you see? Repeat the plot using different values of the environmental variance (Ve). What do you notice?

plot(Vgset,withinFamVarset,pch=19)

Question 3. Use the code below to regress the within-family variance onto the genetic variance in the population.

summary( lm(withinFamVarset~Vgset) )

What is the slope and in the intercept? Repeat the plot using different values of the environmental variance (Ve). What do you notice? Can you postulate a formal relationship between Vg and withinFamVar?

Twin models

The classic twin study begins from assessing the variance of a phenotype in a population, and attempts to estimate how much of this is due to: genetic effects (heritability); shared environment - events that happen to both twins, affecting them in the same way; unshared, or unique, environment - events that occur to one twin but not the other, or events that affect either twin in a different way. These three components can be modelled as a composition of additive genetics (A), common environment or family effect (C), and unique environment (E), or the ACE model. It is also possible to examine non-additive genetics effects (for example, dominance and the ADE model). The ACE model indicates what proportion of variance in a trait is heritable, versus the proportion due to shared environment or unshared environment.

We can express the phenotype \(y_{ij}\) of twin \(j\) in family \(i\) as

\[\begin{equation} y_{ij} = \mu + c_i + a_{ij} + e_{ij} \end{equation}\]

where \(\mu\) is the population average, \(c_i\) a common environtal effect shared by twins in family \(i\), \(a_ij\) the (additive) genetic value of twin \(j\) in family \(i\) and \(e_{ij}\) an environmental effect specific to that twin. It follows that we can decompose the phenotypic variance as

\[\begin{equation} \text{var}(y) = \text{var}(c) + \text{var}(a) + \text{var}(e) = \sigma^2_{c} + \sigma^2_{a} + \sigma^2_{e} \end{equation}\]

and

\[\begin{equation} \text{cov}(y_{ij},y_{ik}) = \sigma^2_{c} + \pi_{a(jk)} \sigma^2_{a} \end{equation}\]

where \(\pi_{a(jk)}\) is the expected coefficient of co-ancestry (a.k.a coefficient of relatedness) between twin \(j\) and twin \(k\) in family \(i\). Therefore, for monozygotic and dizygotic twins

\[\begin{align} \text{cov}_{MZ}(y_{ij},y_{ik}) &= \sigma^2_{c} + \sigma^2_{a} \\ \text{cov}_{DZ}(y_{ij},y_{ik}) &= \sigma^2_{c} + \frac{ \sigma^2_{a} } {2}, \end{align}\]

which implies

\[\begin{equation} \text{cov}_{MZ}(y_{ij},y_{ik}) - \text{cov}_{DZ}(y_{ij},y_{ik}) = \sigma^2_{c} + \sigma^2_{a} - \sigma^2_{c} - \frac{ \sigma^2_{a} } {2} \Longrightarrow \sigma^2_{a} = 2\left[ \text{cov}_{MZ}(y_{ij},y_{ik}) - \text{cov}_{DZ}(y_{ij},y_{ik}) \right] \end{equation}\]

Twin Data

Load the data in R using the following command. The twin.RData file is available on Slack in the main module channel. Download the file and put it your practical directory.

load("/data/module3/Prac2_family_designs/twin.RData")

The command above will load an R data.frame object named twindt. Here is an overview the data

head(twindt)

The dataset contains 9128 rows and 10 columns. Each row represents a family (first column is the family ID) with a consisting of one pair of monozygotic (MZ) or dizygotic (DZ) twin. Other information provided are the year of birth (YOB) and sex of each twin.

Using the principles introduced above and in the lectures, we are interested in answering the following questions with these data (twindt)

What is the phenotypic correlation for both height and BMI for the dizygotic (DZ) and monozygotic (MZ) twins?
Is the phenotypic correlation for height and BMI the same for both males and females?
What is the estimate of the heritability for height and BMI (narrow sense) given we have estimated these correlations?
After adjusting for sex do we get the same estimate for the heritability for height and BMI?
Is there evidence for a common environment?

It is expected that you try to answer these questions using the R functions introduced above. The below code is a partial solution to the above questions and uses some key functions that you may want to use to answer the above questions.

Q1. What is the phenotypic correlation for both height and BMI for the dizygotic (DZ) and monozygotic (MZ) twins?

We can calculate the correlation between MZ and DZ twins for height using the code below. Modify that code to get answer the question for BMI.

## For height...
whoIsMZ   <- which(twindt[,"TWIN"]=="MZ")
whoIsDZ   <- which(twindt[,"TWIN"]=="DZ")
corMZ_ht  <- cor(twindt[whoIsMZ,"HT_t1"],twindt[whoIsMZ,"HT_t2"])
corDZ_ht  <- cor(twindt[whoIsDZ,"HT_t1"],twindt[whoIsDZ,"HT_t2"])
corMZ_ht
corDZ_ht
2*(corMZ_ht - corDZ_ht)

Q2. Is the phenotypic correlation for height and BMI the same for both males and females?

We can calculate the correlation between MZ and DZ twins for height (in males, SEX==1) using the code below. Modify the code to answer the question for females (SEX==2) and for both BMI in both sexes.

## For height...
whoIsMZ1   <- which(twindt[,"TWIN"]=="MZ" & twindt[,"SEX_t1"]==1 & twindt[,"SEX_t2"]==1)
whoIsDZ1   <- which(twindt[,"TWIN"]=="DZ" & twindt[,"SEX_t1"]==1 & twindt[,"SEX_t2"]==1)
corMZ_ht1  <- cor(twindt[whoIsMZ1,"HT_t1"],twindt[whoIsMZ1,"HT_t2"])
corDZ_ht1  <- cor(twindt[whoIsDZ1,"HT_t1"],twindt[whoIsDZ1,"HT_t2"])
corMZ_ht1
corDZ_ht1
2*(corMZ_ht1 - corDZ_ht1)

Q4. Re-estimate heritability after adjusting height and BMI for sex. You could use the code below to generate sex-adjusted height / BMI.

We can use the code below to adjust height for sex. Modify the code below to answer the question for BMI.

## For height...
twin1Sex1 <- which(twindt[,"SEX_t1"]==1)
twin1Sex2 <- which(twindt[,"SEX_t1"]==2)

twin2Sex1 <- which(twindt[,"SEX_t2"]==1)
twin2Sex2 <- which(twindt[,"SEX_t2"]==2)

meanHTtwin1sex1 <- mean( twindt[twin1Sex1,"HT_t1"] )
meanHTtwin1sex2 <- mean( twindt[twin1Sex2,"HT_t1"] )

meanHTtwin2sex1 <- mean( twindt[twin2Sex1,"HT_t2"] )
meanHTtwin2sex2 <- mean( twindt[twin2Sex2,"HT_t2"] )

twindt$HTadjSex_t1 <- NA
twindt[twin1Sex1,"HTadjSex_t1"] <- twindt[twin1Sex1,"HT_t1"] - meanHTtwin1sex1
twindt[twin1Sex2,"HTadjSex_t1"] <- twindt[twin1Sex2,"HT_t1"] - meanHTtwin1sex2

twindt$HTadjSex_t2 <- NA
twindt[twin2Sex1,"HTadjSex_t2"] <- twindt[twin2Sex1,"HT_t2"] - meanHTtwin2sex1
twindt[twin2Sex2,"HTadjSex_t2"] <- twindt[twin2Sex2,"HT_t2"] - meanHTtwin2sex2

whoIsMZ   <- which(twindt[,"TWIN"]=="MZ")
whoIsDZ   <- which(twindt[,"TWIN"]=="DZ")

corMZ  <- cor(twindt[whoIsMZ,"HTadjSex_t1"],twindt[whoIsMZ,"HTadjSex_t2"])
corDZ  <- cor(twindt[whoIsDZ,"HTadjSex_t1"],twindt[whoIsDZ,"HTadjSex_t2"])
corMZ
corDZ
2*(corMZ - corDZ)

Practical 2: Correlation between relatives and estimation of (co)variance components

Genetics and Genomics Winter School (Module 3)

Within-family variance

Twin models

Twin Data