library(ggplot2)
library(reshape2)

Objective

In this practical we will take a basic look at some genotype files from the 1000 genomes project. There are two parts:

Part 1 will investigate \(r^2\) as a measure of LD between SNPs and create a LD-matrix plot using LDlink
Part 2 will construct a small toy example of a genomic-relationship matrix (GRM) and also look at some properties of a GRM computed with GCTA.

The data for the prac can be found in the following directory:
/data/module1/3_genotypes/

dir="/data/module1/3_genotypes/"

Part 1: LD between loci

LD (linkage disequilbrium) is the non-random association between genomic markers. It is created by population demographic events and recombination. In this first part of the practical, we will use European and African ancestry genomes from the 1000 genomes project to investigate LD in a small region on chromosome 2.

The two genotype files are 1kg_chr2v3_EUR.raw (Europeans) and 1kg_chr2v3_AFR2.raw (Africans). I have coverted the binary PLINK format into text file with the --recode A option so that the alleles are coded as 0, 1 or 2 copies of the reference allele. There is a corresponding map.txt file for the chromosome, marker name (rsID) and position in base pairs.

Read in the files and take a look at the format.

  # read in the EUR genotypes & take a look
  EUR = read.table(paste0(dir,"1kg_chr2v3_EUR.raw"), header=T)
  head(EUR[,1:10])
##       FID     IID PAT MAT SEX PHENOTYPE rs17022634_C rs3087385_C rs17763718_T
## 1 HG00096 HG00096   0   0   0        -9            0           0            0
## 2 HG00097 HG00097   0   0   0        -9            1           0            0
## 3 HG00099 HG00099   0   0   0        -9            0           0            1
## 4 HG00100 HG00100   0   0   0        -9            0           0            0
## 5 HG00101 HG00101   0   0   0        -9            1           0            0
## 6 HG00102 HG00102   0   0   0        -9            1           0            0
##   rs3087395_C
## 1           0
## 2           0
## 3           0
## 4           0
## 5           0
## 6           0
  # we don't need information on individuals so we'll trim these columns
  EUR <- EUR[,-1:-6]
  head(EUR[,1:10])
##   rs17022634_C rs3087385_C rs17763718_T rs3087395_C rs7602535_C rs3087399_C
## 1            0           0            0           0           0           0
## 2            1           0            0           0           0           0
## 3            0           0            1           0           0           0
## 4            0           0            0           0           0           0
## 5            1           0            0           0           0           0
## 6            1           0            0           0           0           1
##   rs3087403_T rs28369955_G rs28369946_G rs10191001_C
## 1           1            0            0            0
## 2           1            0            0            0
## 3           0            0            0            0
## 4           2            0            0            0
## 5           1            0            0            0
## 6           0            0            0            0
  dim(EUR)            # 503 rows = number of individuals, 9801 columns = number of SNPs
## [1]  503 9801
  map  <- read.table(paste0("map.txt"),header=T)
  dim(map)            # 9801 rows = number of SNPs, matches above yay!
## [1] 9801    5
  AFR = read.table(paste0(dir,"1kg_chr2v3_AFR2.raw"),header=T)
  AFR <- AFR[,-1:-6]  # trim again
  dim(AFR)            # there are 660 individuals, and 9,801 SNPs
## [1]  660 9801

Calculating allele frequencies

Now we'll check the allele frequencies in the samples. Make sure you understand the code. If necessary run subsections of the code and use head() or dim() to understand what the functions are doing.

   # first calculate the frequency at a single SNP, e.g. the first one
   snp = EUR[,1]
   snp                  # check the format, coded 0, 1 and 2 
##   [1] 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0
##  [38] 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 2
##  [75] 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1
## [112] 1 0 0 0 0 0 0 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0
## [149] 1 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
## [223] 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1
## [260] 0 1 0 1 1 0 1 2 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0
## [297] 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0
## [334] 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 1 1 1 0 0 0 0 0 0 1 0
## [371] 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0
## [408] 0 0 0 1 0 0 0 1 1 1 0 0 2 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0
## [445] 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
## [482] 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 2 1 0 0 0 0
   freq = mean(snp)/2   # why divide by 2?
   freq
## [1] 0.1431412
   # now all SNPs
   EURp = colMeans(EUR)/2
   AFRp = colMeans(AFR)/2
   plot(AFRp ~ EURp, pch = 20, col="dark blue", 
      xlab = "allele frequency, EUR", ylab = "allele frequency, AFR")

   # we can also see the data has been filtered for MAF 0.01 in EUR, but not in AFR
   range(EURp)
## [1] 0.01093439 0.50000000
   range(AFRp)
## [1] 0.0000000 0.9810606
   mafSNP = AFRp>0.01
   
   # add the frequencies as columns to the map file - for convenience
   map = cbind(map, EURp, AFRp)

LD between loci

We are going to calculate LD for a marker with all surrounding markers, either in the EUR or AFR groups. I have chosen rs77732099, which is a marker at about 135 Mbp. I will calculate the \(r^2\) or squared correlation with the marker rs77732099 and all other markers within a 4Mbp region.

   marker = "rs77732099"
   SNP = which(map$rsID==marker)
   map[SNP,]      # check the SNP is not monomorphic
##              chr       rsID NA.  position refEUR      EURp       AFRp
## rs77732099_G   2 rs77732099   0 135866750      G 0.1023857 0.05227273
   # identify SNPs in 4Mbp region
   start = map[SNP,"position"] - 2*10^6
   stop = map[SNP,"position"] + 2*10^6
   region = map$position>start & map$position < stop
   testSNPs = which (region & mafSNP)
   
   # write a small loop to calculate LD with all SNP in the region
   LD_EUR = NULL ; LD_AFR = NULL
   for (i in testSNPs) {
    LD_EUR = c(LD_EUR,cor(EUR[,SNP],EUR[,i])^2)
    LD_AFR = c(LD_AFR,cor(AFR[,SNP],AFR[,i])^2)
   }
   
   # plot the result
   map$mbp = map$pos/10^6
   { plot(LD_EUR ~ map$mbp[testSNPs], pch = 20, col="orange", 
          xlab = "linkage disequilbrium, chr2 Mbp",
          ylim = range(c(LD_EUR,LD_AFR)))
     points(LD_AFR ~ map$mbp[testSNPs], pch=20, col="dark blue") 
     legend("topright", legend = c("EUR","AFR"), fill = c("orange","dark blue")) 
     abline(v=map$mbp[SNP], col="red", lty = 2) }

An LD-matrix plot

We can see extensive LD in this region. The region was not a random choice, it is near the LCT (or lactase) gene, which conferrs lactase persistance in humans. Lactase persistance is where lactose (the primary carbohydrate found in cows milk) can be digested into adulthood. It has been under selection. A single SNP (rs4988235) conferrs lactase persistance in European-descent populations, with it being one of several SNPs conferring lactase persistance in african populations.

We will now use a web-based tool LDlink to investigate LD in this region using an LD-matrix plot. From the home page, click on the 'documentation' page and then scroll down to the 'LDmatrix' module for the requirements. The input is a list of up to 300 rsID numbers. Use R to select up to 300 SNP to input into the tool and create your LD-matrix plot. It should only take a few moments to produce the plot.

The LDlink tool has many modules and tools for GWAS which may be useful. There is also an R package called LDlinkR.

   length(testSNPs) # too many, select about 300 in the middle
## [1] 661
   testSNPs = testSNPs[201:500]
   write.table(file="rsIDs.txt",map[testSNPs,"rsID"],quote=FALSE,row.names=FALSE,col.names=FALSE)

In case that your plot doesn't work in real time, here is one I generated using all the EUR populations (CEU, TSI, FIN, GBR, IBS).

An LD-matrix near LCT in European-ancestry samples

An LD-matrix near LCT in European-ancestry samples


Part 2: A simple relationship matrix

The concept of a genomic relationship matrix (GRM) is useful when dealing with relatives and more complex models in GWAS. It is a square matrix that, in it's most basic form, is calculated by:
\[ GRM = WW'/n \] where \(W\) is a matrix of scaled genotypes (scaled so that each SNP has mean 0 and variance 1), and \(n\) is the number of markers.

making the GRM, toy example

We will make a simple GRM and plot it in unrelated individuals. First read in the data, and check MAF > 0.01. why is it important to ensure that MAF > 0.01?

  # read in the EUR genotypes
  EUR2 = read.table(paste0(dir,"1kg_rand_EUR.raw"), header=T)
  dim(EUR2)           # 503 individuals, 457 SNP
## [1] 503 457
  head(EUR2[,1:10])   # same format as above
##       FID     IID PAT MAT SEX PHENOTYPE rs4908438_G rs10737914_T rs6704445_A
## 1 HG00096 HG00096   0   0   0        -9           0            1           1
## 2 HG00097 HG00097   0   0   0        -9           0            2           1
## 3 HG00099 HG00099   0   0   0        -9           0            0           1
## 4 HG00100 HG00100   0   0   0        -9           0            1           0
## 5 HG00101 HG00101   0   0   0        -9           0            2           1
## 6 HG00102 HG00102   0   0   0        -9           0            1           1
##   rs34824071_A
## 1            1
## 2            1
## 3            1
## 4            0
## 5            0
## 6            2
  EUR2 = EUR2[,-1:-6] # remove ids
  EUR2p = colMeans(EUR2)/2
  range(EUR2p)        # filtered for maf
## [1] 0.01093439 0.49900596

Making the GRM is simple on this small scale. However, as we have only used very few SNPs (n = 451), the properties of the GRM are more variable than when using large numbers of variants. Constructing the GRM can be very time comsuming for large numbers of individuals with large numbers of SNPs (why?)

  # scale the genotypes
  W = apply(EUR2,2,scale)

  # WW'/n
  GRM = W%*%t(W)/length(EUR2p)
  dim(GRM)          # number of individuals by number of individuals
## [1] 503 503
  # plot, first get in small subset into 'long' format then plot
  data = melt(GRM[1:20,1:20])
  ggplot(data, aes(Var1, Var2, fill= value)) + 
    geom_tile() +
    scale_fill_gradient(low = "white", high = "red") +
    coord_fixed()

GRM from GCTA .gz format, with some relatives

Mostly, genomic relationship matrices are made with custom software such as GCTA. Usually they are stored in binary format but the following is in .grm.gz format which can be easily read into R.

There are two paired files, one called families.grm.gz which contains the lower triangle elements of the GRM (remember the GRM is symmetrical) plus the diagonal elements. The columns in the families.grm.gz file are row index, column index, number of SNPs and \(\pi\) or the relationship value. The file is in .gz format which can be viewed in unix using cat file.gz | head for example.

The corresponding id file (familes.grm.id) contains the family ID and individual ID as columsn 1 and 2. Individuals require unique individual IDs but may have the same family ID number.

   # read in the GRM
   GRM = read.table(paste0(dir,"families.grm.gz"))  # lower triangle
   head(GRM) ; tail(GRM)                # 815x815 square matrix, lower triangle provided
##   V1 V2     V3           V4
## 1  1  1 273621  0.995302200
## 2  2  1 271903  0.002727741
## 3  2  2 271910  1.015097000
## 4  3  1 273565 -0.005177113
## 5  3  2 271856  0.006702893
## 6  3  3 273572  1.007067000
##         V1  V2     V3           V4
## 332515 815 810 273536  0.002251385
## 332516 815 811 273539 -0.007548287
## 332517 815 812 273349  0.001597272
## 332518 815 813 273149  0.004846184
## 332519 815 814 273535 -0.004921155
## 332520 815 815 273551  0.987163100
   IDs = read.table(paste0(dir,"families.grm.id"))  # ids, n = 815 matching .grm.gz file
   head(IDs)
##     V1   V2
## 1 2661 1012
## 2 3046 1034
## 3 3917 1062
## 4  163 4853
## 5 3007 4852
## 6 3920  337
   dim(IDs)
## [1] 815   2

We are going to take a look at the GRM, and plot the histogram of the diagonal and off-diagonal elements. We can see that this GRM has a few close relatives (larger values visable in plot below which are slightly off the diagonal), but most of the individuals are what is usually defined as 'unrelated' (\(\pi < 0.05\)).

Note that:
* mean of the diagonal elements, is approx. 1
* mean of the off-diagonal elements, is approx. 0

For a GWAS study, there are two options for handling relatives. Either remove one member of the relative pair (probably appropiate in this case as there are only 20 pairs) or a mixed model approach could be used. We look at using both of these approaches in a later prac.

  #  small subset & name columns to plot
   names(GRM) = c("ID1","ID2","nSNPs","value")
   GRM2 = GRM[GRM$ID1<101&GRM$ID2<101,]
   ggplot(GRM2, aes(ID1, ID2, fill= value)) + 
      geom_tile() +
      scale_fill_gradient(low = "white", high = "red") +
      coord_fixed()

   # diagonal elements, relationship each individual with themselves
   diags = GRM[GRM$ID1==GRM$ID2,"value"]
   mean(diags)                              # mean diagonals ~ 1
## [1] 1.001106
   hist(diags,breaks=50)

   # off-diagonal elements, relationship each individual with each other
   offdiags = GRM[GRM$ID1!=GRM$ID2,"value"]
   mean(offdiags)                           # mean off-diag ~ 0
## [1] -0.0001509315
   hist(offdiags,breaks=50)

   length(offdiags[offdiags>0.05])          # there are about 20 relative pairs, pi>0.05 
## [1] 20
   hist(offdiags[offdiags>0.05],breaks=50)  # mostly look like full-sib pairs, pi ~ 0.5