Expression Quantitative Trait Loci Mapping

Author

Dr Solal Chauquet

Introduction

Expression Quantitative Trait loci (eQTL, plural eQTLs) are genetic loci (single nucleotide polymorphisms, SNP) whose alleles are associated with different expression levels of a specific gene. Different alleles can be associated with a decrease or increase in gene expression. Figure 1 highlights how a SNP (in red) can be associated with gene expression. In this example, an A allele increases gene expression in a dose-response manner, with A homozygotes displaying higher levels of expression of a gene compared to G homozygotes. Additionally, alleles are not expressed uniformly, with some variants rarer than others, as represented by the relative frequency of different genotypes on the y-axis.

Figure 1. Representation of an eQTL effect on a gene. Figure taken from Nica, A. C. & Dermitzakis, E. T. Expression quantitative trait loci: present and future. Philos. Trans. R. Soc. London. Ser. B, Biol. Sci. 368, 20120362 (2013).

Most eQTLs are found outside of coding regions1 and can be divided into two categories:

Cis-eQTLs are found on the same chromosome as the gene they influence, usually within a 100,000 base pair window around it. Cis-eQTLs are thought to act on the gene directly through the regulation of enhancers, silencers, promoter regions or other regulatory elements of the gene.
Trans-eQTLs can be found anywhere in the genome, further away from the gene they influence or even on other chromosomes. They are thought to influence gene expression through regulation of biological pathways.

During this practical, we will investigate how to identify eQTLs. To better understand eQTL analysis, we will start by simulating both genotype and expression data. This simulation approach will allow us to understand the structure of the data used for eQTL mapping as well as investigate information relevant to QTL mapping. After performing the simulation, we will investigate the GTEx website and see how eQTLs can be used to investigate genome-wide association study (GWAS) results in real life.

Part 1: eQTL simulation

eQTL mapping

During the lecture, we saw that QTL mapping can be performed using a simple linear regression of the following form:

\[ y=\beta_{0}+\beta_{1}x+\epsilon \]

With:

\(y\): Gene expression for the different individuals measured
\(\beta_{0}\): Intercept (mean gene expression across alleles).
\(\beta_{1}\): Effect of an allele on the gene expression.
\(x\): Genotype value of the different individual measured (0,1 or 2).
\(\epsilon\): Error term of the model

To perform a QTL mapping, we will have to simulate the gene expression \(y\) as well as the genotype \(x\) that will be used to infer the other parameters (\(\beta_0\), \(\beta_1\)).

Genetic data

Before simulating genetic data, we will first gain a better understanding of how genotype data are represented.

Question 1:

Think about a single genetic locus where the allele can either be A or T. How would you represent four individuals whose genotype at this locus are respectively:

Note: there are two possible sets of answers.

Genetic expression is usually represented based on the number of alleles an individual carries at a specific locus. However, the allele used as a reference is arbitrary. For the previous question, we can represent the individuals based on the number of A they carry: [2 1 1 0] or based on the number of T: [0 1 1 2].

Notice that we observe no difference for homozygous individuals as they have have the same number of A and T. Similarly, the strand carrying the allele does not matter (for this kind of analysis), A/T and T/A are both represented as 1.

To simulate genetic data, we therefore have to create several vectors containing 0, 1 or 2 representing the genotype of a single individual at different loci. Those vectors are then arranged into a matrix of the following form:

\[ \begin{bmatrix}2 & 1 & 1 & 0 \\1 & 0 & 0 & 0 \\1 & 2 & 1 & 2 \\1 & 2 & 1 & 0\end{bmatrix} \]

With the columns of the matrix representing the different individuals and the rows a specific genetic locus.

Connection to the cluster

To simulate the data, we will need to connect to the HPC cluster set for this class. You can reference the introduction to computing document to find how to connect to the HPC.

Alternatively, you can run the analysis on your local machine. Keep in mind that the simulation might take a while.

Simulation of genotypes

First, we will set up the HPC folders to keep your analyses organised. The following bash script allows you to create three folders.

cd ~
mkdir eQTLPrac
cd eQTLPrac
mkdir Genotype
mkdir Expression
mkdir eQTL

Files created during the practical will be stored in those folders.

The following R code will now perform the simulation of genetic data. Start your R session and copy paste the following into R.

# Load all libraries needed for the practical:
library(tidyverse)
library(MASS)
library(cowplot)
library(MetBrewer)
set.seed(6543456)
frequency <- 0.5
SNP <- rbinom(5000, size = 2, frequency)
SNP_number <- 1000
indv_number <- 500
p <- runif(SNP_number, min = 0, max = 1)
genotypes <- replicate(indv_number, rbinom(SNP_number, 2, p))
rownames(genotypes) <- paste0('SNP', seq(1, nrow(genotypes)))
colnames(genotypes) <- paste0('Indv', seq(1, ncol(genotypes)))
print(nrow(genotypes))
print(ncol(genotypes))
print(genotypes[1:10,1:10])

Note: the set.seed function allows the code to be reproducible by fixing random processes (such as random sampling for simulation). A different seed would change the results of the downstream analysis.

Question 2:

How many SNPs were simulated?
How many individuals were simulated?
Given that the SNP3 reference allele is G and the alternate allele C, what is the gneotype of individual 5?

Exploration of allele frequency

Now that we simulated genotype data, we can calculate the frequency of the alleles simulated with the following code:

maf = rowMeans(genotypes)/2
maf <- pmin(maf, 1-maf)
jpeg('~/eQTLPrac/Genotype/HistogramMAFsimulated.jpeg',width = 21, height = 12, res = 300, units = 'cm', type = 'cairo')
truehist(maf, main = "Histogram of minor allele frequency", col = "light grey", nbins=100)
lines(density(maf), lty = 2, col = "dark red", lwd = 3)
dev.off()

You can download the plot that you just created by using the following command on your local machine (Opening a new terminal window for downloading plot is always a great idea!):

scp <username>@203.101.xxx.xxx:~/eQTLPrac/Genotype/HistogramMAFsimulated.jpeg .

Question 3:

Look at the allele frequency of the genotype data you simulated:

What is the allele frequency occurring the most?
In your opinion, why is allele frequency important for eQTL analysis? - Ask a tutor if you’re not sure!
What does the x-axis represent? - Why is it limited to 0.5?

Allele frequency is an important parameter during eQTL analysis due to the possible lack of representation of some genotypes. For QTL analyses to be possible, we need to ideally include individuals with all genotype groups or to have at least two genotype groups present (0,1 or 1,2).

Question 4:

Fill the table below using the genotype frequency derived from the Hardy-Weinberg principle for an allele A with a frequency p=0.99.

What is the property of a linear regression that allows us to perform eQTL mapping when only 2 genotype groups are present?
Out of the three different populations in the table, which one(s) could be used to perform an eQTL mapping for alleles with a frequency of 1%?

Proportion of the possible genotypes for a genetic loci with a minor allele frequency q of 0.01 (Calculation based on the Hardy–Weinberg principle).
	Population genotype frequency:	1,000 Individuals	10,000 Individuals	100,000 Individuals
AA Frequency: \(p^2\)
AT Frequency: \(2pq\)
TT: Frequency: \(q^2\)

To identify eQTLs with a minor allele frequency of 1%, we would need a population larger than 10,000 individuals where we have access to two genotypes. With 198 heterozygous individuals expected in the population we can therefore perform eQTL mapping. given that the eQTL effect is expected to be additive (the effect of two alleles is twice the effect of one allele), the eQTL mapping can be performed with only 2 genotype groups.

Currently the largest eQTL study, the eQTLgen consortium contains 31,684 individuals, highlighting that eQTL mapping for rare variants (MAF < 1%) is currently not possible. Allele frequency always needs to be kept in mind when performing eQTL mapping; exclusion of rare variants is currently always performed.

Simulation of gene expression data

Now that we simulated genetic data, we need to create matching gene expression data. While gene expression is not normally distribution (RNA-sequencing and read-based sequencing technology generate discrete data that usually follow a negative binomial or Poisson distribution), most analyses will start by normalising the data. Simulating gene expression data can be performed either at the discrete level or at the normalised level.

For simplicity, this practical will simulate normally distributed data.

genesTotal <- 1000
geneswithQTL <- 50
geneswithoutQTL <- genesTotal - geneswithQTL
# Select the SNPs associated with each of the gene:
SNPs <- rownames(genotypes)
SNPswithQTL <- sample(SNPs, size = geneswithQTL)
SNPswithoutQTL <- SNPs[-which(SNPs %in% SNPswithQTL)]

# Bulk RNA-seq data comprises multiple cell-types (e.g. whole blood contains multiple different white blood cell-types), 
# so we will be simulating gene expression from 3 different cell-types per gene 
# Then combining them into a 'bulk' sample expression (i.e. the sum of the three cell-types for that gene)
expMatrixAssociated <- do.call(cbind, lapply(SNPswithQTL,
 function(i) {
 #Simulate expression of three different cellTypes:
 meanCT1 <- c(rnorm(mean = rnorm(mean = 3, n = 1), n = 1, ),
 rnorm(mean = rnorm(mean = 5, n = 1), n = 1),
 rnorm(mean = rnorm(mean = 7, n = 1), n = 1))
 meanCT2 <- c(rnorm(mean = 3, n = 1),
 rnorm(mean = 3, n = 1),
 rnorm(mean = 3, n = 1))
 meanCT3 <- c(rnorm(mean = rnorm(mean = 4, n = 1), n = 1),
 rnorm(mean = rnorm(mean = 2, n = 1), n = 1),
 rnorm(mean = rnorm(mean = 0, n = 1), n = 1))
 yWithQTLCT1 <- rnorm(indv_number, meanCT1[factor(genotypes[i,])])
 yWithQTLCT2 <- rnorm(indv_number, meanCT2[factor(genotypes[i,])])
 yWithQTLCT3 <- rnorm(indv_number, meanCT3[factor(genotypes[i,])])
 df <- data.frame(ct1 = yWithQTLCT1,
 ct2 = yWithQTLCT2,
 ct3 = yWithQTLCT3)
# Create a bulk dataframe as the sum of the expression of the three cell types:
 df$bulk <- rowSums(df, na.rm = T)
return(df)
}))
# Add columns to the matrix:
colnames(expMatrixAssociated) <- paste0('Gene',
 rep(1:geneswithQTL, each = 4),
'_',
colnames(expMatrixAssociated))
# Simulate genes without QTLs:
expMatrixNotAssociated <- do.call(cbind, lapply(SNPswithoutQTL,
 function(i) {
 meanForAlleles <- c(rnorm(1,10))
 yWithQTLCT1 <- rnorm(indv_number, meanForAlleles)
 yWithQTLCT2 <- rnorm(indv_number, meanForAlleles)
 yWithQTLCT3 <- rnorm(indv_number, meanForAlleles)
 df <- data.frame(ct1 = yWithQTLCT1,
 ct2 = yWithQTLCT2,
 ct3 = yWithQTLCT3)
 df$bulk <- rowSums(df, na.rm = T)
 return(df)
}))
# Add colunms to the matrix:
colnames(expMatrixNotAssociated) <- paste0('Gene',
 rep(1:geneswithoutQTL, each = 4),
'_',
colnames(expMatrixNotAssociated))

# Separate the simulated expression data into a bulk RNA-seq dataframe and a cell-type specific dataframe
# Bulk df
expMatrix <- cbind(
  (expMatrixNotAssociated %>% dplyr::select(contains('bulk'))),
  (expMatrixAssociated %>% dplyr::select(contains('bulk')))
  )
# Add colnames 
colnames(expMatrix) <- paste0('Gene', 1:ncol(expMatrix))
# Cell-type specific df
expMatrixCT <- cbind(
  expMatrixAssociated %>% dplyr::select(!contains('bulk')),
  expMatrixNotAssociated %>% dplyr::select(!contains('bulk'))
)
# Add colnames 
colnames(expMatrixCT) <- paste0('Gene', rep(1:1000, each = 3), '_', c('ct1', 'ct2', 'ct3'))

Question 5:

Run the code above and answer the following questions:

How many genes were simulated?
How many of those genes were associated with SNPs?
How many cell types were simulated?
How was the bulk expression created?
What omics technology does the bulk data correspond to? Cell-type data?

The following code will generate plots showing the bulk expression level of one gene against the genotypes of several SNPs. You can change the code (by changing the name of the SNPs and genes) to visually inspect the association between different SNPs and genes.

### Test for SNPs:
SNPassociationPlot <- function(expMatrix, SNPID, GeneID) {
 ggplot(data.frame(snp = genotypes[SNPID,], y = expMatrix[,GeneID]),
 aes(x = factor(snp), y = y)) +
 ggtitle(paste0('Association between ', GeneID, ' and ', SNPID)) +
 geom_boxplot(fill = 'dark red') + geom_point(col = 'dark grey') +
 xlab("Reference allele count") +
 theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
}

p1 <- SNPassociationPlot(expMatrix, SNPID = 'SNP182', GeneID = 'Gene2')
p2 <- SNPassociationPlot(expMatrix, SNPID = 'SNP243', GeneID = 'Gene2')
p3 <- SNPassociationPlot(expMatrix, SNPID = 'SNP921', GeneID = 'Gene2')
p4 <- SNPassociationPlot(expMatrix, SNPID = 'SNP564', GeneID = 'Gene2')
p <- cowplot::plot_grid(p1,p2,p3,p4)
ggsave(p, filename = '~/eQTLPrac/Expression/GenotypeVsExpressionPlot.jpeg', width = 14, height=14, dpi =
300)

Download the plot created using the following command:

scp <username>@203.101.xxx.xxx:~/eQTLPrac/Expression/GenotypeVsExpressionPlot.jpeg .

Question 6:

Based on the plot that you generated answer the following questions:

What is the mean expression of the gene you simulated?
What is the unit of gene expression?
Which SNP (if any) is associated with the expression of Gene 5?
How would you identify SNP statistically associated with gene expression?

While identifying SNP and gene expression pairs visually is already time-consuming, the human genome is composed of 3.2 billion base pairs and roughly 20,000 genes, rendering it impossible to do manually. We need to use the linear regression that we previously described and filter the results based on significance.

eQTL mapping, simple linear regression

Now that we have simulated both gene expression and genotype, we will use linear regression to identify significant eQTLs.

Note that the plot function below will not work until you change the SNPID value of the function

# Expression:
GeneID='Gene1'
# Set the first test:
Association <- summary(lm(expMatrix[,GeneID]~genotypes['SNP1',]))
Association <- as.data.frame(Association$coefficients)[2,]
rownames(Association) <- 'SNP1'
Association <- data.frame()
for(SNPID in rownames(genotypes)){
 test <- summary(lm(expMatrix[,GeneID]~genotypes[SNPID,]))
 test <- as.data.frame(test$coefficients)[2,]
 rownames(test) <- SNPID
 Association <- rbind(Association, test)
}
colnames(Association) <- c("Estimate", "Std.Error", "t_value", "P")
Association %>% arrange(P) %>% head() %>% print()
# Change the SNP in the following code:
signifQTL <- SNPassociationPlot(expMatrix, SNPID = 'ChangeSNP', GeneID = 'Gene1')
ggsave(signifQTL,
 filename = '~/eQTLPrac/Expression/significantBulkQTL.jpeg', width = 7, height=7, dpi = 300)

Question 7:

What SNP (if any) is significantly associated with Gene 1?
Fill the gap in the code (SNPID) with the significantly associated SNP and investigate the association visually.
Modify the previous code to investigate associations with gene 982.

The association test between the gene and 1,000 SNPs for 500 individuals that we just performed took only a few seconds. However, this toy example does not represent the scale of the human genome with its more than 3 billion base pairs. eQTL analyses quickly result in prohibitive computation time as we increase the number of SNPs and individuals tested.

You can modify the variables SNP_number and indv_number and rerun the code up to this point to simulate a larger number of individuals and SNPs. However, increasing the number of individuals to 2,000 will already heavily increase computational time needed to run this (poorly optimised) script.

Software such as MatrixQTL and fastQTL have been developed to decrease the computational resources and time necessary for eQTL analyses. While we will not go into details on the inner working here, the underlying mechanisms of that software remain similar to the analysis performed within this practical (linear regression - the backbone of genetics). Methodology used to improve computational efficiency ranges from limiting the SNPs tested for a gene to the closest SNPs to developing mathematical approximations to computationally heavy calculations.

QTL mapping with interaction

Our simulation of gene expression data was based on the presence of three different cell types measured, which is representative of bulk-RNA sequencing. We will now see what those QTLs look like when we decompose them across cell type.

Question 8:

Modify and run the code below to visually inspect Gene 1 with the SNP that you previously found to be significantly associated.
Does the eQTL identified previously represent a specific cell type?

set.seed(58944)
# Using the cell-type specific expression matrix 
expMatrixCT <- expMatrixCT %>% mutate(Indv = paste0('Indv', seq(1, indv_number)))
expMatrixCTlonger <- expMatrixCT %>% pivot_longer(cols = -Indv,
                                                  names_to = 'geneID',
                                                  values_to = 'expression') %>%
  mutate(
    gene = str_split(geneID, '_', simplify = T)[, 1],
    cellType = str_split(geneID, '_', simplify = T)[, 2]
  )
genotypeToMerge <- t(genotypes) %>% as.data.frame() %>%
  mutate(Indv = rownames(.))
expMatrixCTlonger <- left_join(expMatrixCTlonger, genotypeToMerge, by = 'Indv')


# Plot gene 1:
# Change the code in Red to inspect the gene of interest:
ggplot(
  expMatrixCTlonger %>% filter(gene == 'Gene1'),
  aes(x = factor(SNP1), y = expression, fill = cellType)
) +
  geom_point(position = position_jitterdodge()) +
  facet_wrap( ~ cellType) +
  geom_boxplot() +
  theme_minimal() +
  scale_fill_manual(values = met.brewer(n = 3, 'Hokusai1')) -> interacionPlot
ggsave(
  interacionPlot,
  filename = '~/eQTLPrac/Expression/interacionQTL.jpeg',
  width = 7,
  height = 7,
  dpi = 300
)

As we can see, the eQTL observed previously was produced by a different expression in gene between the different cell types. We do not observe any eQTL when the cell type information is known.

We will now investigate eQTLs when different cell types are included. our previous linear regression can be written with interaction between cell type and genotype as follow:

\[ y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{1}x_{2} \]

With:

\(y\): Gene expression for the different individuals measured
\(\beta_{0}\): Intercept (mean gene expression accross alleles).
\(\beta_{1}\): Effect of an allele on the gene expression.
\(x_1\): Genotype value of the different individual measured (0,1 or 2).
\(\beta_2\): Overall effect of cell type on gene expression
\(x_2\): Cell type of origin
\(\beta_3\): Effect of genotype on gene expression within different cell type
\(\epsilon\): Error term of the model

Using this model, our term of interest will be \(\beta_3\), representing the effect of one cell type compared to the overall effect of the genotype.

Question 9:

Run the code below and modify the plotting function to output a significant interaction for gene 999.

What is the interaction observed between cell type, your significant SNP and gene 999?
What does the bulk QTL data look like for your significant SNP and gene 999, is it a significant eQTL?

interactionResults <- data.frame()
for (snp in SNPs) {
 test <- expMatrixCTlonger %>% filter(gene == 'Gene985') %>%
 dplyr::select(expression, gene, cellType, snp) %>% mutate(variant = snp)
 colnames(test) <- c('expression', 'gene', 'cellType', 'genotype', 'variant')

 lmTest <- broom::tidy(summary(lm(expression ~ genotype +
 cellType + genotype*cellType,
 data = test)))
 lmTest$SNP <- snp
 interactionResults <- rbind(interactionResults, lmTest)
}
interactionResults %>%
 filter(str_detect(term, ':')) %>%
 arrange(p.value) %>% head()
# Change the SNP value in Red to the most significant SNP:
ggplot(expMatrixCTlonger %>% filter(gene == 'Gene985'),
 aes(x = factor(SNP), y = expression, fill = cellType)) +
 geom_point(position = position_jitterdodge()) +
 facet_wrap(~cellType) +
 geom_boxplot() +
 theme_minimal() +
 scale_fill_manual(values = met.brewer( n = 3, 'Hokusai1')) -> signifInteracionQLT
ggsave(signifInteracionQLT, filename = '~/eQTLPrac/Expression/significantInteractionQTL.jpeg', width = 7, height=7, dpi = 300)

eQTL associations can be driven by biological factors such as cell type proportions or environmental factors. Accurate mapping of eQTL using interaction allows for a better investigation of GWAS results or of causality between phenotypes (see SMR practicals).

Part 2: Real world eQTL

Genotype-Tissue Expression (eQTL)

We will now investigate real-world eQTL data. If you do not have the time to go through the whole document, an answer sheet will be provided at the end of the practicals to go through in your own time.

For this part, we will go to the GTEx website. You can access it through this link.

The GTEx consortium collected post-mortem samples for 948 donors. We know that eQTLs are dynamic and evolve over time and with exposure to the environment. Characteristics such as sex, age or disease status can influence eQTL associations and are therefore important.

Question 10:

On the GTEx website, look for the sample characteristics that could influence an eQTL association study.
Hint: Navigate to the Tissue & Sample statistics page.

eQTLs are influenced by both age, sex and ancestries; the observed unbalanced number of male and females as well as a largely white and aging (84.6% white, 68.1% of samples older than 50) cohort, therefore, need to be taken into account when performing eQTL analysis. Additionally, the cohort can be split in half with younger donors succumbing to traumatic injury while older donors displaying non-traumatic pathologies.

Sample characteristics, therefore, need to be considered when performing QTL mapping. You can read the landmark GTEx publication to observe which sample characteristics were corrected for when testing for QTL associations.

Investigation of GWAS signal

We will now investigate a real example of an eQTL association. For this, we will start by looking at a genome-wide association study of lipids published in 2013 in nature genetics:

Discovery and refinement of loci associated with lipid levels

Question 11:

Read the abstract of the GWAS paper, what is the goal of this paper?

This paper aimed to identify the genetic control of blood lipid levels. As such, they identified associations between SNPs and blood lipid levels. They then mapped those SNPs to the closest genes, concluding that this means these genes had a role in blood lipid levels.

We will investigate how eQTL analysis can give us more information regarding the genetic control of blood low-density lipoprotein (LDL) cholesterol.

Open the supplementary figures from the paper and go to the supplementary table 3.

Question 12:

Find the gene with the strongest negative effect on LDL blood levels.
What is the impact of each alternate allele?
If the average person has an LDL blood level of 209.7mg/dL, what would be the expected LDL level of an individual with a genotype of GG at locus rs6511720?

Let’s investigate the effect of rs6511720, the snp associated with the highest decrease in LDL blood levels. Search the GTEx website for rs6511720 and answer the following questions.

Question 13:

With which genes is rs6511720 associated?
In which tissues are those association located?
Do you think that a change in gene expression could be responsible for the association observed between LDL levels and rs6511720?

We will now look at genetic loci associated with LDL cholesterol levels. rs12916 is associated with HMGCR, a gene coding for HMG-CoA reductase an enzyme playing a central role in cholesterol synthesis. Let’s investigate eQTL associated with rs12916, search the GTEx website for rs12916.

Question 14:

In which tissue is rs12916 associated with HMGCR?
Where does the SNP fall? (hint: open the IGV browser)
Do you think that a change in gene expression could be responsible for the association observed between LDL levels and rs12916?

In conclusion, eQTL can help interpreting the functional significance of GWAS signals. They can provide biological interpretation of non-coding variants helping to hint at the mechanisms underlying complex trait and diseases.

Part 3: Extension

This part is completely optional.

Here are a few extension questions to continue to help understand what QTLs are and how we can investigate them.

Extension Question 1:

The code used to generate bulk RNA-seq at the start of this practical assumes an equal mixture of all three cell types.

Do you think that it is realistic?
Does the same mixture of cell type applies to all genes?

This underlying assumption is not biologically accurate. We made it here since the goal of this practical is to gain a better understanding of how QTL studies work, not to generate biologically accurate data.

Extension Question 2:

Modify the code generating bulk data to include reads coming from different proportions of cell types.
How does this influence the identification of QTL at the bulk or single-cell level?

Good luck, this one is harder!