This practical can be run on your own computer (if R is installed) or on the server.
This part of the practical aims to illustrate how much genetic (hence potential phenotypic) variance exists within a family.
The rest of Practical 2 addresses how to estimate
the genetic variance using a pedigree-based design across multiple
families. Here, we will assume that a single family can have an
arbitrary large number of offspring such that we could directly observed
the variance of phenotypes across all offspring. The code below defines
a R function that simulates the phenotypes of members of a nuclear
family with two parents and nOffsprings
offspring (e.g.,
nOffsprings = 1000
or more).
Step 1: Copy/Execute the R code below.
set.seed(12345678)
simPhenoFamily <- function(
nOffsprings, # number of children
Vg, # (Additive) Genetic variance in the population
Ve, # Environmental variance
M # number of causal SNPs
){
## Allele Frequencies are simulated using a uniform distribution between 0.05 and 0.95
AlleleFrequencies <- runif(M,min=0.05,max=0.95)
# Expected heterozygosity under Hardy-Weinberg Equilibrium
ExpectHeteHWE <- 2 * AlleleFrequencies * (1 - AlleleFrequencies)
# SNP effects are simulated using a normal distribution with mean 0 and variance
SNPEffects <- rnorm(M,mean=0,sd=sqrt(Vg/(M * ExpectHeteHWE)))
# Expected Population mean
PopulationMeanPheno <- sum(2 * AlleleFrequencies * SNPEffects)
## Simulate SNPs in derived populations
genoParent1 <- sapply(1:M,function(j) rbinom(1,size=2,prob=AlleleFrequencies[j]))
genoParent2 <- sapply(1:M,function(j) rbinom(1,size=2,prob=AlleleFrequencies[j]))
## Breeding values in the derived population
geneticValueParent1 <- c(genoParent1%*%SNPEffects)
geneticValueParent2 <- c(genoParent2%*%SNPEffects)
## Residuals
NonGeneticValueParent1 <- rnorm(1,mean=0,sd=sqrt(Ve))
NonGeneticValueParent2 <- rnorm(1,mean=0,sd=sqrt(Ve))
## Phenotypes
PhenoParent1 <- geneticValueParent1 + NonGeneticValueParent1
PhenoParent2 <- geneticValueParent2 + NonGeneticValueParent2
## Simulate Offspring
PhenoOffspring <- rep(NA,nOffsprings)
for(idOffspring in 1:nOffsprings){
AlleleTransmittedByParent1 <- rbinom(M,size=1,prob = genoParent1/2)
AlleleTransmittedByParent2 <- rbinom(M,size=1,prob = genoParent2/2)
genoOffspring <- AlleleTransmittedByParent1 + AlleleTransmittedByParent2
geneticValueOffspring <- c(genoOffspring%*%SNPEffects)
NonGeneticValueOffspring <- rnorm(1,mean=0,sd=sqrt(Ve))
PhenoOffspring[idOffspring] <- geneticValueOffspring + NonGeneticValueOffspring
}
return(c(PhenoParent1,PhenoParent2,PhenoOffspring) - PopulationMeanPheno )
}
Step 2 Visualize within family variance.
In order to appreciate the range of variance that exists within a
family, run the command below using a large number of offspring (e.g.,
nOffsprings = 1000
as in the example below) and also using
different values for the genetic variance Vg
, the
environmental variance Ve
.
Vg <- 1
Ve <- 0
Vp <- Vg + Ve
h2 <- Vg / Vp
famData <- simPhenoFamily(nOffsprings = 1000,Vg=Vg,Ve=Ve,M=1000)
meanPar <- mean(famData[c(1,2)]) # mean of parent phenotypes
meanOff <- mean(famData[-c(1,2)]) # mean of offspring phenotypes
varOff <- var(famData[-c(1,2)]) # variance of offspring phenotypes
cat(paste0("Parent Mean = ",round(meanPar,3),
" - Offspring Mean = ",round(meanOff,3),
" - Offspring Variance = ",round(varOff,3),".\n"))
hist(famData[-c(1,2)],xlim= 4 * c(-1,1) * sqrt(Vp),
main="",xlab="Within-family Phenotype distribution")
abline(v=meanPar,col="dodgerblue",lwd=2,lty=1)
abline(v=meanOff,col="orange",lwd=2,lty=2)
legend("topleft",legend=c("Parent Average","Offspring Average"),
box.lty=0,lty=1:2,col=c("dodgerblue","orange"))
Question 1. Is the distribution offspring phenotypes centered around the average of parent phenotypes [Hint: you can increase the number of offspring to be sure]? How does this observation vary as you increase the heritability? Is this expected?
Step 3: Determine the relationship between the genetic variance (in the population) and the within-family phenotypic variance.
We provide below a R function named withinFamVar
, which
gets the genetic variance in the population Vg
as input and
returns the within-family phenotypic variance varOff
. We
vary Vg
between 0 and 10 (R code:
Vgset <- seq(0,10,by=0.1)
).
Vgset <- seq(0,10,by=0.1)
withinFamVar <- function(Vg){
famData <- simPhenoFamily(nOffsprings = 1000,Vg = Vg, Ve = 0, M=1000)
varOff <- var(famData[-c(1,2)] - mean(famData[c(1,2)]) )
return(varOff)
}
## applies withinFamVar to all values of Vg contained in Vgset
withinFamVarset <- sapply(Vgset,withinFamVar)
Question 2. Plot the relationship between Vg
and
withinFamVar
. What relationship do you see? Repeat the plot
using different values of the environmental variance (Ve
).
What do you notice?
plot(Vgset,withinFamVarset,pch=19)
Question 3. Use the code below to regress the within-family variance onto the genetic variance in the population.
summary( lm(withinFamVarset~Vgset) )
What is the slope and in the intercept? Repeat the plot using
different values of the environmental variance (Ve
). What
do you notice? Can you postulate a formal relationship between
Vg
and withinFamVar
?
The classic twin study begins from assessing the variance of a phenotype in a population, and attempts to estimate how much of this is due to: genetic effects (heritability); shared environment - events that happen to both twins, affecting them in the same way; unshared, or unique, environment - events that occur to one twin but not the other, or events that affect either twin in a different way. These three components can be modelled as a composition of additive genetics (A), common environment or family effect (C), and unique environment (E), or the ACE model. It is also possible to examine non-additive genetics effects (for example, dominance and the ADE model). The ACE model indicates what proportion of variance in a trait is heritable, versus the proportion due to shared environment or unshared environment.
We can express the phenotype \(y_{ij}\) of twin \(j\) in family \(i\) as
\[\begin{equation} y_{ij} = \mu + c_i + a_{ij} + e_{ij} \end{equation}\]where \(\mu\) is the population average, \(c_i\) a common environtal effect shared by twins in family \(i\), \(a_ij\) the (additive) genetic value of twin \(j\) in family \(i\) and \(e_{ij}\) an environmental effect specific to that twin. It follows that we can decompose the phenotypic variance as
\[\begin{equation} \text{var}(y) = \text{var}(c) + \text{var}(a) + \text{var}(e) = \sigma^2_{c} + \sigma^2_{a} + \sigma^2_{e} \end{equation}\]and
\[\begin{equation} \text{cov}(y_{ij},y_{ik}) = \sigma^2_{c} + \pi_{a(jk)} \sigma^2_{a} \end{equation}\]where \(\pi_{a(jk)}\) is the expected coefficient of co-ancestry (a.k.a coefficient of relatedness) between twin \(j\) and twin \(k\) in family \(i\). Therefore, for monozygotic and dizygotic twins
\[\begin{align} \text{cov}_{MZ}(y_{ij},y_{ik}) &= \sigma^2_{c} + \sigma^2_{a} \\ \text{cov}_{DZ}(y_{ij},y_{ik}) &= \sigma^2_{c} + \frac{ \sigma^2_{a} } {2}, \end{align}\]which implies
\[\begin{equation} \text{cov}_{MZ}(y_{ij},y_{ik}) - \text{cov}_{DZ}(y_{ij},y_{ik}) = \sigma^2_{c} + \sigma^2_{a} - \sigma^2_{c} - \frac{ \sigma^2_{a} } {2} \Longrightarrow \sigma^2_{a} = 2\left[ \text{cov}_{MZ}(y_{ij},y_{ik}) - \text{cov}_{DZ}(y_{ij},y_{ik}) \right] \end{equation}\]Load the data in R using the following command. The
twin.RData
file is available on Slack in the main module
channel. Download the file and put it your practical
directory.
load("/data/module3/Prac2_family_designs/twin.RData")
The command above will load an R data.frame
object named
twindt
. Here is an overview the data
head(twindt)
The dataset contains 9128 rows and 10 columns. Each row represents a family (first column is the family ID) with a consisting of one pair of monozygotic (MZ) or dizygotic (DZ) twin. Other information provided are the year of birth (YOB) and sex of each twin.
Using the principles introduced above and in the lectures, we are
interested in answering the following questions with these data
(twindt
)
It is expected that you try to answer these questions using the R functions introduced above. The below code is a partial solution to the above questions and uses some key functions that you may want to use to answer the above questions.
Q1. What is the phenotypic correlation for both height and BMI for the dizygotic (DZ) and monozygotic (MZ) twins?
We can calculate the correlation between MZ and DZ twins for height using the code below. Modify that code to get answer the question for BMI.
## For height...
whoIsMZ <- which(twindt[,"TWIN"]=="MZ")
whoIsDZ <- which(twindt[,"TWIN"]=="DZ")
corMZ_ht <- cor(twindt[whoIsMZ,"HT_t1"],twindt[whoIsMZ,"HT_t2"])
corDZ_ht <- cor(twindt[whoIsDZ,"HT_t1"],twindt[whoIsDZ,"HT_t2"])
corMZ_ht
corDZ_ht
2*(corMZ_ht - corDZ_ht)
Q2. Is the phenotypic correlation for height and BMI the same for both males and females?
We can calculate the correlation between MZ and DZ twins for height
(in males, SEX==1
) using the code below. Modify the code to
answer the question for females (SEX==2
) and for both BMI
in both sexes.
## For height...
whoIsMZ1 <- which(twindt[,"TWIN"]=="MZ" & twindt[,"SEX_t1"]==1 & twindt[,"SEX_t2"]==1)
whoIsDZ1 <- which(twindt[,"TWIN"]=="DZ" & twindt[,"SEX_t1"]==1 & twindt[,"SEX_t2"]==1)
corMZ_ht1 <- cor(twindt[whoIsMZ1,"HT_t1"],twindt[whoIsMZ1,"HT_t2"])
corDZ_ht1 <- cor(twindt[whoIsDZ1,"HT_t1"],twindt[whoIsDZ1,"HT_t2"])
corMZ_ht1
corDZ_ht1
2*(corMZ_ht1 - corDZ_ht1)
Q4. Re-estimate heritability after adjusting height and BMI for sex. You could use the code below to generate sex-adjusted height / BMI.
We can use the code below to adjust height for sex. Modify the code below to answer the question for BMI.
## For height...
twin1Sex1 <- which(twindt[,"SEX_t1"]==1)
twin1Sex2 <- which(twindt[,"SEX_t1"]==2)
twin2Sex1 <- which(twindt[,"SEX_t2"]==1)
twin2Sex2 <- which(twindt[,"SEX_t2"]==2)
meanHTtwin1sex1 <- mean( twindt[twin1Sex1,"HT_t1"] )
meanHTtwin1sex2 <- mean( twindt[twin1Sex2,"HT_t1"] )
meanHTtwin2sex1 <- mean( twindt[twin2Sex1,"HT_t2"] )
meanHTtwin2sex2 <- mean( twindt[twin2Sex2,"HT_t2"] )
twindt$HTadjSex_t1 <- NA
twindt[twin1Sex1,"HTadjSex_t1"] <- twindt[twin1Sex1,"HT_t1"] - meanHTtwin1sex1
twindt[twin1Sex2,"HTadjSex_t1"] <- twindt[twin1Sex2,"HT_t1"] - meanHTtwin1sex2
twindt$HTadjSex_t2 <- NA
twindt[twin2Sex1,"HTadjSex_t2"] <- twindt[twin2Sex1,"HT_t2"] - meanHTtwin2sex1
twindt[twin2Sex2,"HTadjSex_t2"] <- twindt[twin2Sex2,"HT_t2"] - meanHTtwin2sex2
whoIsMZ <- which(twindt[,"TWIN"]=="MZ")
whoIsDZ <- which(twindt[,"TWIN"]=="DZ")
corMZ <- cor(twindt[whoIsMZ,"HTadjSex_t1"],twindt[whoIsMZ,"HTadjSex_t2"])
corDZ <- cor(twindt[whoIsDZ,"HTadjSex_t1"],twindt[whoIsDZ,"HTadjSex_t2"])
corMZ
corDZ
2*(corMZ - corDZ)