library(ggplot2)
library(reshape2)
In this practical we will take a look at some genotype files from the 1000 genomes project. You can choose part 1 or 2:
Part 1 will investigate \(r^2\) as a measure of LD between SNPs and
create a LD-matrix plot using LDlink
Part 2 will construct a small toy example of a
genomic-relationship matrix (GRM) and also look at some properties of a
GRM computed with GCTA.
You may download the data files from the 1000G directly to your laptops if you wish. There is some data in Part 2 where you will do the analysis on the server, save your plots and download to view.
The data for the prac can be found in the following
directories:
/data/module1/downloads/2_genotypesPrac/
and
/data/module1/2_genotypesPrac/
Use the following commands in R on the cluster to define the path to
the data. If you have downloaded the data you will need to update the
path for dir1
.
dir1="/data/module1/downloads/2_genotypesPrac/"
dir2="/data/module1/2_genotypesPrac/"
Again, if you are working on the cluster you will need to save the
plots below using the same commands in the first prac, i.e. if the plot
is made using plot()
then:
jpeg(file="myPlot.jpg")
# open the jpeg.
plot(...)
# call the plot.
dev.off()
# close the device.
OR, if the plot is made using ggplot()
then:
savePlot = ggplot() # save the plot as an object.
ggsave(savePlot,file=“myPlot.jpg”) # use ggsave()
to
save the object.
Data used in this part 1 of the practical are:
1kg_chr2v3_EUR.raw
→ PLINK (text) file for 1000G
european ancestry individuals.
1kg_chr2v3_AFR2.raw
→ PLINK (text) file for 1000G
african ancestry individuals.
map.txt
→ corresponding map for the above data;
chromosome, marker name (rsID) and position in base pairs.
LD (linkage disequilbrium) is the non-random association between genomic markers. It is created by population demographic events and recombination. In this first part of the practical, we will use European and African ancestry genomes from the 1000 genomes project to investigate LD in a small region on chromosome 2.
The two genotype files are 1kg_chr2v3_EUR.raw
(Europeans) and 1kg_chr2v3_AFR2.raw
(Africans). I have
coverted the binary PLINK format (which cannot easily be read into R or
interpreted without using software) into a text file format with the
--recode A
option so that the alleles are coded as 0, 1 or
2 copies of the reference allele.
Read in the files and take a look at the format.
# read in the EUR genotypes & take a look
EUR = read.table(paste0(dir1,"1kg_chr2v3_EUR.raw"), header=T)
head(EUR[,1:10])
## FID IID PAT MAT SEX PHENOTYPE rs17022634_C rs3087385_C rs17763718_T
## 1 HG00096 HG00096 0 0 0 -9 0 0 0
## 2 HG00097 HG00097 0 0 0 -9 1 0 0
## 3 HG00099 HG00099 0 0 0 -9 0 0 1
## 4 HG00100 HG00100 0 0 0 -9 0 0 0
## 5 HG00101 HG00101 0 0 0 -9 1 0 0
## 6 HG00102 HG00102 0 0 0 -9 1 0 0
## rs3087395_C
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
# we don't need information on individuals so we'll trim these columns
EUR <- EUR[,-1:-6]
head(EUR[,1:10])
## rs17022634_C rs3087385_C rs17763718_T rs3087395_C rs7602535_C rs3087399_C
## 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 0 0 0 0
## 5 1 0 0 0 0 0
## 6 1 0 0 0 0 1
## rs3087403_T rs28369955_G rs28369946_G rs10191001_C
## 1 1 0 0 0
## 2 1 0 0 0
## 3 0 0 0 0
## 4 2 0 0 0
## 5 1 0 0 0
## 6 0 0 0 0
dim(EUR) # 503 rows = number of individuals, 9801 columns = number of SNPs
## [1] 503 9801
map <- read.table(paste0(dir1,"map.txt"),header=T)
dim(map) # 9801 rows = number of SNPs, matches above yay!
## [1] 9801 5
AFR = read.table(paste0(dir1,"1kg_chr2v3_AFR2.raw"),header=T)
AFR <- AFR[,-1:-6] # trim again
dim(AFR) # there are 660 individuals, and 9,801 SNPs
## [1] 660 9801
Now we’ll check the allele frequencies in the samples. Make sure you
understand the code. If necessary run subsections of the code and use
head()
or dim()
to understand what the
functions are doing.
# first calculate the frequency at a single SNP, e.g. the first one
snp = EUR[,1]
snp # check the format, coded 0, 1 and 2
## [1] 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0
## [38] 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 2
## [75] 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1
## [112] 1 0 0 0 0 0 0 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0
## [149] 1 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
## [223] 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1
## [260] 0 1 0 1 1 0 1 2 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0
## [297] 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0
## [334] 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 1 1 1 0 0 0 0 0 0 1 0
## [371] 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0
## [408] 0 0 0 1 0 0 0 1 1 1 0 0 2 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0
## [445] 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
## [482] 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 2 1 0 0 0 0
freq = mean(snp)/2 # why divide by 2?
freq
## [1] 0.1431412
# now all SNPs
EURp = colMeans(EUR)/2
AFRp = colMeans(AFR)/2
plot(AFRp ~ EURp, pch = 20, col="dark blue",
xlab = "allele frequency, EUR", ylab = "allele frequency, AFR")
# we can also see the data has been filtered for MAF 0.01 in EUR, but not in AFR
range(EURp)
## [1] 0.01093439 0.50000000
range(AFRp)
## [1] 0.0000000 0.9810606
mafSNP = AFRp>0.01
# add the frequencies as columns to the map file - for convenience
map = cbind(map, EURp, AFRp)
We are going to calculate LD for a marker with all surrounding
markers, either in the EUR or AFR ancestry groups. I have chosen
rs77732099
, which is a marker at about 135 Mbp. I will
calculate the \(r^2\) or squared
correlation with the marker rs77732099
and all other
markers within a 4Mbp region.
marker = "rs77732099"
SNP = which(map$rsID==marker)
map[SNP,] # check the SNP is not monomorphic
## chr rsID NA. position refEUR EURp AFRp
## rs77732099_G 2 rs77732099 0 135866750 G 0.1023857 0.05227273
# identify SNPs in 4Mbp region
start = map[SNP,"position"] - 2*10^6
stop = map[SNP,"position"] + 2*10^6
region = map$position>start & map$position < stop
testSNPs = which (region & mafSNP)
# write a small loop to calculate LD with all SNP in the region
LD_EUR = NULL ; LD_AFR = NULL
for (i in testSNPs) {
LD_EUR = c(LD_EUR,cor(EUR[,SNP],EUR[,i])^2)
LD_AFR = c(LD_AFR,cor(AFR[,SNP],AFR[,i])^2)
}
# plot the result
map$mbp = map$pos/10^6
{ plot(LD_EUR ~ map$mbp[testSNPs], pch = 20, col="orange",
xlab = "linkage disequilbrium, chr2 Mbp",
ylim = range(c(LD_EUR,LD_AFR)))
points(LD_AFR ~ map$mbp[testSNPs], pch=20, col="dark blue")
legend("topright", legend = c("EUR","AFR"), fill = c("orange","dark blue"))
abline(v=map$mbp[SNP], col="red", lty = 2) }
In practice, it is very easy in PLINK
to calculate
allele frequencies and/or LD for genotype files in binary format. We
will look at doing this in the afternoon. However, its useful to see how
to do it manually - sometimes this is needed when trouble shooting or
checking the output from PLINK
.
We can see extensive LD in this region. The region was not a random
choice, it is near the LCT (or lactase) gene, which confers lactase
persistence in humans. Lactase persistence is where lactose (the primary
carbohydrate found in cows milk) can be digested into adulthood. It has
been under selection. A single SNP (rs4988235
) confers
lactase persistence in European-descent populations, with it being one
of several SNPs conferring lactase persistence in African
ancestries.
We will now use a web-based tool LDlink to investigate LD in this region using an LD-matrix plot. From the home page, click on the ‘documentation’ page and then scroll down to the ‘LDmatrix’ module for the requirements. The input is a list of up to 300 rsID numbers. Use R to select up to 300 SNP to input into the tool. Go back to the homepage and navigate to the LDmatrix tool and create your LD-matrix plot. It should only take a few moments to produce the plot.
The LDlink tool has many modules and tools for GWAS which may be useful. There is also an R package called LDlinkR.
length(testSNPs) # too many, select about 300 in the middle
## [1] 661
testSNPs = testSNPs[201:500]
write.table(file="rsIDs.txt",map[testSNPs,"rsID"],quote=FALSE,row.names=FALSE,col.names=FALSE)
In case that your plot doesn’t work in real time, here is one I generated using all the EUR populations (CEU, TSI, FIN, GBR, IBS).
Data used in Part 2 of the practical are:
1kg_rand_EUR.raw
. → same file as Part1, european
ancestry genotypes from 1000G.
families.grm.gz.
→ a GCTA
genomic
relationship matrix (GRM) file in g-zipped format.
familes.grm.id.
→ a GCTA
file of sample IDs
corresponding to the GRM above.
The concept of a genomic relationship matrix (GRM) is useful when
dealing with relatives and more complex GWAS models. It is a square
matrix that, in it’s most basic form, is calculated by:
\[ GRM = WW'/n \] where \(W\) is a matrix of scaled genotypes (scaled
so that each SNP has mean 0 and variance 1), and \(n\) is the number of markers.
We will make a simple GRM and plot it in unrelated individuals. First read in the data, and check MAF > 0.01. why is it important to ensure that MAF > 0.01?
# read in the EUR genotypes
EUR2 = read.table(paste0(dir1,"1kg_rand_EUR.raw"), header=T)
dim(EUR2) # 503 individuals, 457 SNP
## [1] 503 457
head(EUR2[,1:10]) # same format as above
## FID IID PAT MAT SEX PHENOTYPE rs4908438_G rs10737914_T rs6704445_A
## 1 HG00096 HG00096 0 0 0 -9 0 1 1
## 2 HG00097 HG00097 0 0 0 -9 0 2 1
## 3 HG00099 HG00099 0 0 0 -9 0 0 1
## 4 HG00100 HG00100 0 0 0 -9 0 1 0
## 5 HG00101 HG00101 0 0 0 -9 0 2 1
## 6 HG00102 HG00102 0 0 0 -9 0 1 1
## rs34824071_A
## 1 1
## 2 1
## 3 1
## 4 0
## 5 0
## 6 2
EUR2 = EUR2[,-1:-6] # remove ids
EUR2p = colMeans(EUR2)/2
range(EUR2p) # filtered for maf
## [1] 0.01093439 0.49900596
Making the GRM is simple on this small scale. However, as we have only used very few SNPs (n = 451), the properties of the GRM are more variable than when using large numbers of variants. Constructing the GRM can be very time consuming for large numbers of individuals with large numbers of SNPs (why?)
# scale the genotypes
W = apply(EUR2,2,scale)
# WW'/n
GRM = W%*%t(W)/length(EUR2p)
dim(GRM) # number of individuals by number of individuals
## [1] 503 503
# plot, first get in small subset into 'long' format then plot
data = melt(GRM[1:20,1:20])
ggplot(data, aes(Var1, Var2, fill= value)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
coord_fixed()
Mostly, genomic relationship matrices are made with software such as
GCTA
. Please see equation 3 in the main GCTA
paper for details on how the GRM is calculated in this program. Usually
GRMs are stored in (non-human readable) binary format but the following
is in g-zipped .grm.gz
format which can be easily read into
R.
There are two paired files, one called families.grm.gz
which contains the lower triangle elements of the GRM (remember the GRM
is symmetrical) plus the diagonal elements. The columns in the
families.grm.gz
file are row index, column index, number of
SNPs and \(\pi\) or the relationship
value. The file is in .gz
format which can be viewed in
unix using cat file.gz | head
for example.
The corresponding id file (familes.grm.id
) contains the
family ID and individual ID as columns 1 and 2. Individuals require
unique individual IDs but may have the same family ID number.
# read in the GRM
GRM = read.table(paste0(dir2,"families.grm.gz")) # lower triangle
head(GRM) ; tail(GRM) # 815x815 square matrix, lower triangle provided
## V1 V2 V3 V4
## 1 1 1 273621 0.995302200
## 2 2 1 271903 0.002727741
## 3 2 2 271910 1.015097000
## 4 3 1 273565 -0.005177113
## 5 3 2 271856 0.006702893
## 6 3 3 273572 1.007067000
## V1 V2 V3 V4
## 332515 815 810 273536 0.002251385
## 332516 815 811 273539 -0.007548287
## 332517 815 812 273349 0.001597272
## 332518 815 813 273149 0.004846184
## 332519 815 814 273535 -0.004921155
## 332520 815 815 273551 0.987163100
IDs = read.table(paste0(dir2,"families.grm.id")) # ids, n = 815 matching .grm.gz file
head(IDs)
## V1 V2
## 1 2661 1012
## 2 3046 1034
## 3 3917 1062
## 4 163 4853
## 5 3007 4852
## 6 3920 337
dim(IDs)
## [1] 815 2
We are going to take a look at the GRM, and plot the histogram of the diagonal and off-diagonal elements. We can see that this GRM has a few close relatives (larger values in plot below which are not on the diagonal), but most of the individuals are what is usually defined as ‘unrelated’ (\(\pi < 0.05\)) in human genetics.
Note that:
* mean of the diagonal elements, is approx. 1
* mean of the off-diagonal elements, is approx. 0
For a GWAS study, there are two options for handling relatives. Either remove one member of the relative pair (probably appropriate in this case as there are only 20 pairs of relatives) or use a mixed model approach could be used. We look at using both of these approaches in the pracs this afternoon.
# small subset & name columns to plot
names(GRM) = c("ID1","ID2","nSNPs","value")
GRM2 = GRM[GRM$ID1<101&GRM$ID2<101,]
ggplot(GRM2, aes(ID1, ID2, fill= value)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
coord_fixed()
# diagonal elements, relationship each individual with themselves
diags = GRM[GRM$ID1==GRM$ID2,"value"]
mean(diags) # mean diagonals ~ 1
## [1] 1.001106
hist(diags,breaks=50)
# off-diagonal elements, relationship each individual with each other
offdiags = GRM[GRM$ID1!=GRM$ID2,"value"]
mean(offdiags) # mean off-diag ~ 0
## [1] -0.0001509315
hist(offdiags,breaks=50)
length(offdiags[offdiags>0.05]) # there are about 20 relative pairs, pi>0.05
## [1] 20
hist(offdiags[offdiags>0.05],breaks=50) # mostly look like full-sib pairs, pi ~ 0.5