library(ggplot2)
library(reshape2)
In this practical we will take a basic look at some genotype files from the 1000 genomes project. There are two parts:
Part 1 will investigate \(r^2\) as a measure of LD between SNPs and create a LD-matrix plot using LDlink
Part 2 will construct a small toy example of a genomic-relationship matrix (GRM) and also look at some properties of a GRM computed with GCTA.
The data for the prac can be found in the following directory:
/data/module1/3_genotypes/
dir="/data/module1/3_genotypes/"
LD (linkage disequilbrium) is the non-random association between genomic markers. It is created by population demographic events and recombination. In this first part of the practical, we will use European and African ancestry genomes from the 1000 genomes project to investigate LD in a small region on chromosome 2.
The two genotype files are 1kg_chr2v3_EUR.raw
(Europeans) and 1kg_chr2v3_AFR2.raw
(Africans). I have coverted the binary PLINK format into text file with the --recode A
option so that the alleles are coded as 0, 1 or 2 copies of the reference allele. There is a corresponding map.txt
file for the chromosome, marker name (rsID) and position in base pairs.
Read in the files and take a look at the format.
# read in the EUR genotypes & take a look
EUR = read.table(paste0(dir,"1kg_chr2v3_EUR.raw"), header=T)
head(EUR[,1:10])
## FID IID PAT MAT SEX PHENOTYPE rs17022634_C rs3087385_C rs17763718_T
## 1 HG00096 HG00096 0 0 0 -9 0 0 0
## 2 HG00097 HG00097 0 0 0 -9 1 0 0
## 3 HG00099 HG00099 0 0 0 -9 0 0 1
## 4 HG00100 HG00100 0 0 0 -9 0 0 0
## 5 HG00101 HG00101 0 0 0 -9 1 0 0
## 6 HG00102 HG00102 0 0 0 -9 1 0 0
## rs3087395_C
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
# we don't need information on individuals so we'll trim these columns
EUR <- EUR[,-1:-6]
head(EUR[,1:10])
## rs17022634_C rs3087385_C rs17763718_T rs3087395_C rs7602535_C rs3087399_C
## 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 0 0 0 0
## 5 1 0 0 0 0 0
## 6 1 0 0 0 0 1
## rs3087403_T rs28369955_G rs28369946_G rs10191001_C
## 1 1 0 0 0
## 2 1 0 0 0
## 3 0 0 0 0
## 4 2 0 0 0
## 5 1 0 0 0
## 6 0 0 0 0
dim(EUR) # 503 rows = number of individuals, 9801 columns = number of SNPs
## [1] 503 9801
map <- read.table(paste0("map.txt"),header=T)
dim(map) # 9801 rows = number of SNPs, matches above yay!
## [1] 9801 5
AFR = read.table(paste0(dir,"1kg_chr2v3_AFR2.raw"),header=T)
AFR <- AFR[,-1:-6] # trim again
dim(AFR) # there are 660 individuals, and 9,801 SNPs
## [1] 660 9801
Now we'll check the allele frequencies in the samples. Make sure you understand the code. If necessary run subsections of the code and use head()
or dim()
to understand what the functions are doing.
# first calculate the frequency at a single SNP, e.g. the first one
snp = EUR[,1]
snp # check the format, coded 0, 1 and 2
## [1] 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0
## [38] 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 2
## [75] 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1
## [112] 1 0 0 0 0 0 0 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0
## [149] 1 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
## [223] 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1
## [260] 0 1 0 1 1 0 1 2 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0
## [297] 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0
## [334] 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 1 1 1 0 0 0 0 0 0 1 0
## [371] 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0
## [408] 0 0 0 1 0 0 0 1 1 1 0 0 2 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0
## [445] 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
## [482] 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 2 1 0 0 0 0
freq = mean(snp)/2 # why divide by 2?
freq
## [1] 0.1431412
# now all SNPs
EURp = colMeans(EUR)/2
AFRp = colMeans(AFR)/2
plot(AFRp ~ EURp, pch = 20, col="dark blue",
xlab = "allele frequency, EUR", ylab = "allele frequency, AFR")
# we can also see the data has been filtered for MAF 0.01 in EUR, but not in AFR
range(EURp)
## [1] 0.01093439 0.50000000
range(AFRp)
## [1] 0.0000000 0.9810606
mafSNP = AFRp>0.01
# add the frequencies as columns to the map file - for convenience
map = cbind(map, EURp, AFRp)
We are going to calculate LD for a marker with all surrounding markers, either in the EUR or AFR groups. I have chosen rs77732099
, which is a marker at about 135 Mbp. I will calculate the \(r^2\) or squared correlation with the marker rs77732099
and all other markers within a 4Mbp region.
marker = "rs77732099"
SNP = which(map$rsID==marker)
map[SNP,] # check the SNP is not monomorphic
## chr rsID NA. position refEUR EURp AFRp
## rs77732099_G 2 rs77732099 0 135866750 G 0.1023857 0.05227273
# identify SNPs in 4Mbp region
start = map[SNP,"position"] - 2*10^6
stop = map[SNP,"position"] + 2*10^6
region = map$position>start & map$position < stop
testSNPs = which (region & mafSNP)
# write a small loop to calculate LD with all SNP in the region
LD_EUR = NULL ; LD_AFR = NULL
for (i in testSNPs) {
LD_EUR = c(LD_EUR,cor(EUR[,SNP],EUR[,i])^2)
LD_AFR = c(LD_AFR,cor(AFR[,SNP],AFR[,i])^2)
}
# plot the result
map$mbp = map$pos/10^6
{ plot(LD_EUR ~ map$mbp[testSNPs], pch = 20, col="orange",
xlab = "linkage disequilbrium, chr2 Mbp",
ylim = range(c(LD_EUR,LD_AFR)))
points(LD_AFR ~ map$mbp[testSNPs], pch=20, col="dark blue")
legend("topright", legend = c("EUR","AFR"), fill = c("orange","dark blue"))
abline(v=map$mbp[SNP], col="red", lty = 2) }
We can see extensive LD in this region. The region was not a random choice, it is near the LCT (or lactase) gene, which conferrs lactase persistance in humans. Lactase persistance is where lactose (the primary carbohydrate found in cows milk) can be digested into adulthood. It has been under selection. A single SNP (rs4988235
) conferrs lactase persistance in European-descent populations, with it being one of several SNPs conferring lactase persistance in african populations.
We will now use a web-based tool LDlink to investigate LD in this region using an LD-matrix plot. From the home page, click on the 'documentation' page and then scroll down to the 'LDmatrix' module for the requirements. The input is a list of up to 300 rsID numbers. Use R to select up to 300 SNP to input into the tool and create your LD-matrix plot. It should only take a few moments to produce the plot.
The LDlink tool has many modules and tools for GWAS which may be useful. There is also an R package called LDlinkR.
length(testSNPs) # too many, select about 300 in the middle
## [1] 661
testSNPs = testSNPs[201:500]
write.table(file="rsIDs.txt",map[testSNPs,"rsID"],quote=FALSE,row.names=FALSE,col.names=FALSE)
In case that your plot doesn't work in real time, here is one I generated using all the EUR populations (CEU, TSI, FIN, GBR, IBS).
The concept of a genomic relationship matrix (GRM) is useful when dealing with relatives and more complex models in GWAS. It is a square matrix that, in it's most basic form, is calculated by:
\[ GRM = WW'/n \] where \(W\) is a matrix of scaled genotypes (scaled so that each SNP has mean 0 and variance 1), and \(n\) is the number of markers.
We will make a simple GRM and plot it in unrelated individuals. First read in the data, and check MAF > 0.01. why is it important to ensure that MAF > 0.01?
# read in the EUR genotypes
EUR2 = read.table(paste0(dir,"1kg_rand_EUR.raw"), header=T)
dim(EUR2) # 503 individuals, 457 SNP
## [1] 503 457
head(EUR2[,1:10]) # same format as above
## FID IID PAT MAT SEX PHENOTYPE rs4908438_G rs10737914_T rs6704445_A
## 1 HG00096 HG00096 0 0 0 -9 0 1 1
## 2 HG00097 HG00097 0 0 0 -9 0 2 1
## 3 HG00099 HG00099 0 0 0 -9 0 0 1
## 4 HG00100 HG00100 0 0 0 -9 0 1 0
## 5 HG00101 HG00101 0 0 0 -9 0 2 1
## 6 HG00102 HG00102 0 0 0 -9 0 1 1
## rs34824071_A
## 1 1
## 2 1
## 3 1
## 4 0
## 5 0
## 6 2
EUR2 = EUR2[,-1:-6] # remove ids
EUR2p = colMeans(EUR2)/2
range(EUR2p) # filtered for maf
## [1] 0.01093439 0.49900596
Making the GRM is simple on this small scale. However, as we have only used very few SNPs (n = 451), the properties of the GRM are more variable than when using large numbers of variants. Constructing the GRM can be very time comsuming for large numbers of individuals with large numbers of SNPs (why?)
# scale the genotypes
W = apply(EUR2,2,scale)
# WW'/n
GRM = W%*%t(W)/length(EUR2p)
dim(GRM) # number of individuals by number of individuals
## [1] 503 503
# plot, first get in small subset into 'long' format then plot
data = melt(GRM[1:20,1:20])
ggplot(data, aes(Var1, Var2, fill= value)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
coord_fixed()
Mostly, genomic relationship matrices are made with custom software such as GCTA. Usually they are stored in binary format but the following is in .grm.gz
format which can be easily read into R.
There are two paired files, one called families.grm.gz
which contains the lower triangle elements of the GRM (remember the GRM is symmetrical) plus the diagonal elements. The columns in the families.grm.gz
file are row index, column index, number of SNPs and \(\pi\) or the relationship value. The file is in .gz
format which can be viewed in unix using cat file.gz | head
for example.
The corresponding id file (familes.grm.id
) contains the family ID and individual ID as columsn 1 and 2. Individuals require unique individual IDs but may have the same family ID number.
# read in the GRM
GRM = read.table(paste0(dir,"families.grm.gz")) # lower triangle
head(GRM) ; tail(GRM) # 815x815 square matrix, lower triangle provided
## V1 V2 V3 V4
## 1 1 1 273621 0.995302200
## 2 2 1 271903 0.002727741
## 3 2 2 271910 1.015097000
## 4 3 1 273565 -0.005177113
## 5 3 2 271856 0.006702893
## 6 3 3 273572 1.007067000
## V1 V2 V3 V4
## 332515 815 810 273536 0.002251385
## 332516 815 811 273539 -0.007548287
## 332517 815 812 273349 0.001597272
## 332518 815 813 273149 0.004846184
## 332519 815 814 273535 -0.004921155
## 332520 815 815 273551 0.987163100
IDs = read.table(paste0(dir,"families.grm.id")) # ids, n = 815 matching .grm.gz file
head(IDs)
## V1 V2
## 1 2661 1012
## 2 3046 1034
## 3 3917 1062
## 4 163 4853
## 5 3007 4852
## 6 3920 337
dim(IDs)
## [1] 815 2
We are going to take a look at the GRM, and plot the histogram of the diagonal and off-diagonal elements. We can see that this GRM has a few close relatives (larger values visable in plot below which are slightly off the diagonal), but most of the individuals are what is usually defined as 'unrelated' (\(\pi < 0.05\)).
Note that:
* mean of the diagonal elements, is approx. 1
* mean of the off-diagonal elements, is approx. 0
For a GWAS study, there are two options for handling relatives. Either remove one member of the relative pair (probably appropiate in this case as there are only 20 pairs) or a mixed model approach could be used. We look at using both of these approaches in a later prac.
# small subset & name columns to plot
names(GRM) = c("ID1","ID2","nSNPs","value")
GRM2 = GRM[GRM$ID1<101&GRM$ID2<101,]
ggplot(GRM2, aes(ID1, ID2, fill= value)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
coord_fixed()
# diagonal elements, relationship each individual with themselves
diags = GRM[GRM$ID1==GRM$ID2,"value"]
mean(diags) # mean diagonals ~ 1
## [1] 1.001106
hist(diags,breaks=50)
# off-diagonal elements, relationship each individual with each other
offdiags = GRM[GRM$ID1!=GRM$ID2,"value"]
mean(offdiags) # mean off-diag ~ 0
## [1] -0.0001509315
hist(offdiags,breaks=50)
length(offdiags[offdiags>0.05]) # there are about 20 relative pairs, pi>0.05
## [1] 20
hist(offdiags[offdiags>0.05],breaks=50) # mostly look like full-sib pairs, pi ~ 0.5