library(ggplot2)
library(reshape2)

Objective

In this practical we will take a look at some genotype files from the 1000 genomes project. You can choose part 1 or 2:

Part 1 will investigate \(r^2\) as a measure of LD between SNPs and create a LD-matrix plot using LDlink
Part 2 will construct a small toy example of a genomic-relationship matrix (GRM) and also look at some properties of a GRM computed with GCTA.

You may download the data files from the 1000G directly to your laptops if you wish. There is some data in Part 2 where you will do the analysis on the server, save your plots and download to view.

The data for the prac can be found in the following directories:
/data/module1/downloads/2_genotypesPrac/ and /data/module1/2_genotypesPrac/

Use the following commands in R on the cluster to define the path to the data. If you have downloaded the data you will need to update the path for dir1.

dir1="/data/module1/downloads/2_genotypesPrac/"
dir2="/data/module1/2_genotypesPrac/"

Again, if you are working on the cluster you will need to save the plots below using the same commands in the first prac, i.e. if the plot is made using plot() then:

jpeg(file="myPlot.jpg") # open the jpeg.

plot(...) # call the plot.

dev.off() # close the device.

OR, if the plot is made using ggplot() then:

savePlot = ggplot() # save the plot as an object.

ggsave(savePlot,file=“myPlot.jpg”) # use ggsave() to save the object.


Part 1: LD between loci

Data used in this part 1 of the practical are:

1kg_chr2v3_EUR.raw → PLINK (text) file for 1000G european ancestry individuals.

1kg_chr2v3_AFR2.raw → PLINK (text) file for 1000G african ancestry individuals.

map.txt → corresponding map for the above data; chromosome, marker name (rsID) and position in base pairs.

LD (linkage disequilbrium) is the non-random association between genomic markers. It is created by population demographic events and recombination. In this first part of the practical, we will use European and African ancestry genomes from the 1000 genomes project to investigate LD in a small region on chromosome 2.

The two genotype files are 1kg_chr2v3_EUR.raw (Europeans) and 1kg_chr2v3_AFR2.raw (Africans). I have coverted the binary PLINK format (which cannot easily be read into R or interpreted without using software) into a text file format with the --recode A option so that the alleles are coded as 0, 1 or 2 copies of the reference allele.

Read in the files and take a look at the format.

  # read in the EUR genotypes & take a look
  EUR = read.table(paste0(dir1,"1kg_chr2v3_EUR.raw"), header=T)
  head(EUR[,1:10])
##       FID     IID PAT MAT SEX PHENOTYPE rs17022634_C rs3087385_C rs17763718_T
## 1 HG00096 HG00096   0   0   0        -9            0           0            0
## 2 HG00097 HG00097   0   0   0        -9            1           0            0
## 3 HG00099 HG00099   0   0   0        -9            0           0            1
## 4 HG00100 HG00100   0   0   0        -9            0           0            0
## 5 HG00101 HG00101   0   0   0        -9            1           0            0
## 6 HG00102 HG00102   0   0   0        -9            1           0            0
##   rs3087395_C
## 1           0
## 2           0
## 3           0
## 4           0
## 5           0
## 6           0
  # we don't need information on individuals so we'll trim these columns
  EUR <- EUR[,-1:-6]
  head(EUR[,1:10])
##   rs17022634_C rs3087385_C rs17763718_T rs3087395_C rs7602535_C rs3087399_C
## 1            0           0            0           0           0           0
## 2            1           0            0           0           0           0
## 3            0           0            1           0           0           0
## 4            0           0            0           0           0           0
## 5            1           0            0           0           0           0
## 6            1           0            0           0           0           1
##   rs3087403_T rs28369955_G rs28369946_G rs10191001_C
## 1           1            0            0            0
## 2           1            0            0            0
## 3           0            0            0            0
## 4           2            0            0            0
## 5           1            0            0            0
## 6           0            0            0            0
  dim(EUR)            # 503 rows = number of individuals, 9801 columns = number of SNPs
## [1]  503 9801
  map  <- read.table(paste0(dir1,"map.txt"),header=T)
  dim(map)            # 9801 rows = number of SNPs, matches above yay!
## [1] 9801    5
  AFR = read.table(paste0(dir1,"1kg_chr2v3_AFR2.raw"),header=T)
  AFR <- AFR[,-1:-6]  # trim again
  dim(AFR)            # there are 660 individuals, and 9,801 SNPs
## [1]  660 9801

Calculating allele frequencies

Now we’ll check the allele frequencies in the samples. Make sure you understand the code. If necessary run subsections of the code and use head() or dim() to understand what the functions are doing.

   # first calculate the frequency at a single SNP, e.g. the first one
   snp = EUR[,1]
   snp                  # check the format, coded 0, 1 and 2 
##   [1] 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0
##  [38] 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 2
##  [75] 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1
## [112] 1 0 0 0 0 0 0 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0
## [149] 1 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
## [223] 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1
## [260] 0 1 0 1 1 0 1 2 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0
## [297] 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0
## [334] 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 1 1 1 0 0 0 0 0 0 1 0
## [371] 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0
## [408] 0 0 0 1 0 0 0 1 1 1 0 0 2 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0
## [445] 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
## [482] 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 2 1 0 0 0 0
   freq = mean(snp)/2   # why divide by 2?
   freq
## [1] 0.1431412
   # now all SNPs
   EURp = colMeans(EUR)/2
   AFRp = colMeans(AFR)/2
   plot(AFRp ~ EURp, pch = 20, col="dark blue", 
      xlab = "allele frequency, EUR", ylab = "allele frequency, AFR")

   # we can also see the data has been filtered for MAF 0.01 in EUR, but not in AFR
   range(EURp)
## [1] 0.01093439 0.50000000
   range(AFRp)
## [1] 0.0000000 0.9810606
   mafSNP = AFRp>0.01
   
   # add the frequencies as columns to the map file - for convenience
   map = cbind(map, EURp, AFRp)

LD between loci

We are going to calculate LD for a marker with all surrounding markers, either in the EUR or AFR ancestry groups. I have chosen rs77732099, which is a marker at about 135 Mbp. I will calculate the \(r^2\) or squared correlation with the marker rs77732099 and all other markers within a 4Mbp region.

   marker = "rs77732099"
   SNP = which(map$rsID==marker)
   map[SNP,]      # check the SNP is not monomorphic
##              chr       rsID NA.  position refEUR      EURp       AFRp
## rs77732099_G   2 rs77732099   0 135866750      G 0.1023857 0.05227273
   # identify SNPs in 4Mbp region
   start = map[SNP,"position"] - 2*10^6
   stop = map[SNP,"position"] + 2*10^6
   region = map$position>start & map$position < stop
   testSNPs = which (region & mafSNP)
   
   # write a small loop to calculate LD with all SNP in the region
   LD_EUR = NULL ; LD_AFR = NULL
   for (i in testSNPs) {
    LD_EUR = c(LD_EUR,cor(EUR[,SNP],EUR[,i])^2)
    LD_AFR = c(LD_AFR,cor(AFR[,SNP],AFR[,i])^2)
   }
   
   # plot the result
   map$mbp = map$pos/10^6
   { plot(LD_EUR ~ map$mbp[testSNPs], pch = 20, col="orange", 
          xlab = "linkage disequilbrium, chr2 Mbp",
          ylim = range(c(LD_EUR,LD_AFR)))
     points(LD_AFR ~ map$mbp[testSNPs], pch=20, col="dark blue") 
     legend("topright", legend = c("EUR","AFR"), fill = c("orange","dark blue")) 
     abline(v=map$mbp[SNP], col="red", lty = 2) }

In practice, it is very easy in PLINK to calculate allele frequencies and/or LD for genotype files in binary format. We will look at doing this in the afternoon. However, its useful to see how to do it manually - sometimes this is needed when trouble shooting or checking the output from PLINK.

An LD-matrix plot

We can see extensive LD in this region. The region was not a random choice, it is near the LCT (or lactase) gene, which confers lactase persistence in humans. Lactase persistence is where lactose (the primary carbohydrate found in cows milk) can be digested into adulthood. It has been under selection. A single SNP (rs4988235) confers lactase persistence in European-descent populations, with it being one of several SNPs conferring lactase persistence in African ancestries.

We will now use a web-based tool LDlink to investigate LD in this region using an LD-matrix plot. From the home page, click on the ‘documentation’ page and then scroll down to the ‘LDmatrix’ module for the requirements. The input is a list of up to 300 rsID numbers. Use R to select up to 300 SNP to input into the tool. Go back to the homepage and navigate to the LDmatrix tool and create your LD-matrix plot. It should only take a few moments to produce the plot.

The LDlink tool has many modules and tools for GWAS which may be useful. There is also an R package called LDlinkR.

   length(testSNPs) # too many, select about 300 in the middle
## [1] 661
   testSNPs = testSNPs[201:500]
   write.table(file="rsIDs.txt",map[testSNPs,"rsID"],quote=FALSE,row.names=FALSE,col.names=FALSE)

In case that your plot doesn’t work in real time, here is one I generated using all the EUR populations (CEU, TSI, FIN, GBR, IBS).

An LD-matrix near LCT in European-ancestry samples
An LD-matrix near LCT in European-ancestry samples

Part 2: A simple relationship matrix

Data used in Part 2 of the practical are:

1kg_rand_EUR.raw. → same file as Part1, european ancestry genotypes from 1000G.

families.grm.gz. → a GCTA genomic relationship matrix (GRM) file in g-zipped format.

familes.grm.id. → a GCTA file of sample IDs corresponding to the GRM above.

The concept of a genomic relationship matrix (GRM) is useful when dealing with relatives and more complex GWAS models. It is a square matrix that, in it’s most basic form, is calculated by:
\[ GRM = WW'/n \] where \(W\) is a matrix of scaled genotypes (scaled so that each SNP has mean 0 and variance 1), and \(n\) is the number of markers.

making the GRM, toy example

We will make a simple GRM and plot it in unrelated individuals. First read in the data, and check MAF > 0.01. why is it important to ensure that MAF > 0.01?

  # read in the EUR genotypes
  EUR2 = read.table(paste0(dir1,"1kg_rand_EUR.raw"), header=T)
  dim(EUR2)           # 503 individuals, 457 SNP
## [1] 503 457
  head(EUR2[,1:10])   # same format as above
##       FID     IID PAT MAT SEX PHENOTYPE rs4908438_G rs10737914_T rs6704445_A
## 1 HG00096 HG00096   0   0   0        -9           0            1           1
## 2 HG00097 HG00097   0   0   0        -9           0            2           1
## 3 HG00099 HG00099   0   0   0        -9           0            0           1
## 4 HG00100 HG00100   0   0   0        -9           0            1           0
## 5 HG00101 HG00101   0   0   0        -9           0            2           1
## 6 HG00102 HG00102   0   0   0        -9           0            1           1
##   rs34824071_A
## 1            1
## 2            1
## 3            1
## 4            0
## 5            0
## 6            2
  EUR2 = EUR2[,-1:-6] # remove ids
  EUR2p = colMeans(EUR2)/2
  range(EUR2p)        # filtered for maf
## [1] 0.01093439 0.49900596

Making the GRM is simple on this small scale. However, as we have only used very few SNPs (n = 451), the properties of the GRM are more variable than when using large numbers of variants. Constructing the GRM can be very time consuming for large numbers of individuals with large numbers of SNPs (why?)

  # scale the genotypes
  W = apply(EUR2,2,scale)

  # WW'/n
  GRM = W%*%t(W)/length(EUR2p)
  dim(GRM)          # number of individuals by number of individuals
## [1] 503 503
  # plot, first get in small subset into 'long' format then plot
  data = melt(GRM[1:20,1:20])
  ggplot(data, aes(Var1, Var2, fill= value)) + 
    geom_tile() +
    scale_fill_gradient(low = "white", high = "red") +
    coord_fixed()

GRM from GCTA .gz format, with some relatives

Mostly, genomic relationship matrices are made with software such as GCTA. Please see equation 3 in the main GCTA paper for details on how the GRM is calculated in this program. Usually GRMs are stored in (non-human readable) binary format but the following is in g-zipped .grm.gz format which can be easily read into R.

There are two paired files, one called families.grm.gz which contains the lower triangle elements of the GRM (remember the GRM is symmetrical) plus the diagonal elements. The columns in the families.grm.gz file are row index, column index, number of SNPs and \(\pi\) or the relationship value. The file is in .gz format which can be viewed in unix using cat file.gz | head for example.

The corresponding id file (familes.grm.id) contains the family ID and individual ID as columns 1 and 2. Individuals require unique individual IDs but may have the same family ID number.

   # read in the GRM
   GRM = read.table(paste0(dir2,"families.grm.gz"))  # lower triangle
   head(GRM) ; tail(GRM)                # 815x815 square matrix, lower triangle provided
##   V1 V2     V3           V4
## 1  1  1 273621  0.995302200
## 2  2  1 271903  0.002727741
## 3  2  2 271910  1.015097000
## 4  3  1 273565 -0.005177113
## 5  3  2 271856  0.006702893
## 6  3  3 273572  1.007067000
##         V1  V2     V3           V4
## 332515 815 810 273536  0.002251385
## 332516 815 811 273539 -0.007548287
## 332517 815 812 273349  0.001597272
## 332518 815 813 273149  0.004846184
## 332519 815 814 273535 -0.004921155
## 332520 815 815 273551  0.987163100
   IDs = read.table(paste0(dir2,"families.grm.id"))  # ids, n = 815 matching .grm.gz file
   head(IDs)
##     V1   V2
## 1 2661 1012
## 2 3046 1034
## 3 3917 1062
## 4  163 4853
## 5 3007 4852
## 6 3920  337
   dim(IDs)
## [1] 815   2

We are going to take a look at the GRM, and plot the histogram of the diagonal and off-diagonal elements. We can see that this GRM has a few close relatives (larger values in plot below which are not on the diagonal), but most of the individuals are what is usually defined as ‘unrelated’ (\(\pi < 0.05\)) in human genetics.

Note that:
* mean of the diagonal elements, is approx. 1
* mean of the off-diagonal elements, is approx. 0

For a GWAS study, there are two options for handling relatives. Either remove one member of the relative pair (probably appropriate in this case as there are only 20 pairs of relatives) or use a mixed model approach could be used. We look at using both of these approaches in the pracs this afternoon.

  #  small subset & name columns to plot
   names(GRM) = c("ID1","ID2","nSNPs","value")
   GRM2 = GRM[GRM$ID1<101&GRM$ID2<101,]
   ggplot(GRM2, aes(ID1, ID2, fill= value)) + 
      geom_tile() +
      scale_fill_gradient(low = "white", high = "red") +
      coord_fixed()

   # diagonal elements, relationship each individual with themselves
   diags = GRM[GRM$ID1==GRM$ID2,"value"]
   mean(diags)                              # mean diagonals ~ 1
## [1] 1.001106
   hist(diags,breaks=50)

   # off-diagonal elements, relationship each individual with each other
   offdiags = GRM[GRM$ID1!=GRM$ID2,"value"]
   mean(offdiags)                           # mean off-diag ~ 0
## [1] -0.0001509315
   hist(offdiags,breaks=50)

   length(offdiags[offdiags>0.05])          # there are about 20 relative pairs, pi>0.05 
## [1] 20
   hist(offdiags[offdiags>0.05],breaks=50)  # mostly look like full-sib pairs, pi ~ 0.5