library(ggplot2)

Objective

In this practical session you will learn some basic concepts associated with phenotype QC and the generation of principal components (in R). Can choose either Part 1 or 2 to complete.
Part 1 will pre-adjust quantitative phenotypes for covariates.
Part 2 will conduct a simple PCA (Principal Component Analysis, i.e. an eigenvalue decomposition).

The directories for the data are given below. Please run your analysis on the cluster in your /scratch/username/ directory and use the paths below to access the data. Once you are in your scratch directory, open R by typing R at the command line. If your in any doubt, you can use getwd() command in R to see where the files and plots you generate will be saved.


Part 1: Phenotype QC

You have been supplied with 10 quantitative phenotypes, and a file of covariates. We will perform basic QC and standardise on one of these phenotypes. The phenotypes are:

  1. Body mass index (BMI)
  2. fasting glucose
  3. fasting insulin
  4. ferritin
  5. height
  6. neuroticisim
  7. sleep duration
  8. smoking pack years
  9. systolic blood pressure
  10. waist:hip ratio

Choose a phenotype to work with. For this demonstration we will use BMI as an example.

You must work on the server for this part of the practical, save your plots and download the plots to your laptop view your results. Do not download the phenotype data

The data is located in the directory /data/module1/1_phenotypesPrac/. Define the path to the data, which will be used below, in R as follows:

dir="/data/module1/1_phenotypesPrac/"

The files used in this part 1 of the prac are:

BMI.phen → raw phenotypes file.

covariateFiltered.cov → covariates file.

Do the values make sense?

First read in the data and examine the distribution of the trait, what do you expect - do the values make sense?

Use the R commands dim() and head() to examine the data structure. The phenotype file has 3 columns: family ID, individual ID and the phenotype value. You can use hist(), part of the R Base Graphics, to quickly plot and visualise the distribution of the data.

  data = read.table(paste0(dir,"BMI.phen"))
  dim(data)   # there are 11,793 rows, one record per row
## [1] 11793     3
  head(data)
##        V1      V2    V3
## 1 7653762 7653762 24.08
## 2 8144519 8144519 24.47
## 3 2337680 2337680 24.37
## 4 5219864 5219864 25.83
## 5 1417721 1417721 28.59
## 6 2371103 2371103 26.80
  hist(data[,3], breaks = 100)            # view for html

  jpeg(file="raw_BMI_distribution1.jpg")  # make file for download
  hist(data[,3], breaks = 100)
  dev.off()
## quartz_off_screen 
##                 2

There seems to be negative values for BMI. This is not plausible as \(BMI = weight/height^2\) and usually ranges from about 15 - 40. A negative value could be a missing values which was encoded as an unrealistic number. Let us remove the negative values and look at the distribution again.

   data = data[data[,3]>0,]
   hist(data[,3], breaks = 100)           # view for html

   jpeg(file="raw_BMI_distribution2.jpg")  # make file for download
   hist(data[,3], breaks = 100)
   dev.off()
## quartz_off_screen 
##                 2
   dim(data)  # there are 11,783 records remaining, 10 removed
## [1] 11783     3

Better!

Now let us read in the covariates. The covariate file (covariateFiltered.cov) is located in the same directory. The covariates colunms are age, sex and the first 5 principal components.

   cov = read.table(paste0(dir,"covariateFiltered.cov"), header = T)  
   dim(cov) # there are 11,780 records, 1 row per record
## [1] 11780     9
   colnames(cov)[3:9] = c("age","sex",paste0("PC",1:5))
   head(cov)
##       FID     IID age sex         PC1         PC2         PC3          PC4
## 1 7653762 7653762  39   2  0.01008060 -0.00665342 -0.01391810  1.19919e-02
## 2 8144519 8144519  66   2 -0.01998500 -0.00328824 -0.00324179 -9.56271e-03
## 3 2337680 2337680  71   2 -0.00254482 -0.00613635  0.00328573  4.11204e-04
## 4 5219864 5219864  47   1 -0.00693638 -0.00569925 -0.00311431 -3.31955e-03
## 5 1417721 1417721  43   1 -0.01165670 -0.00672607 -0.00211596 -2.10093e-03
## 6 2371103 2371103  38   2 -0.00396343  0.02634790 -0.00540908 -8.12653e-05
##            PC5
## 1  0.009173280
## 2 -0.002555470
## 3 -0.000540739
## 4 -0.001726890
## 5  0.014064800
## 6 -0.003650660

We need to be careful of matching the covariate file with the phenotype file correctly. Once that’s done have a look at the ranges/counts of the covariates.

Another useful basic function in R is table(). This returns the count of each unique number or item in the supplied vector.

   # first match covariate and phenotype files
   s = match(cov[,1],data[,1])          # s gives the order of data to match cov
   sum(data[s,1]==cov[,1],na.rm=T)      # check
## [1] 11770
   data = data[s,]                      # reorder the file
   data2 = data.frame(cov,bmi=data[,3]) # combine and make a dataframe for ggplot
   data2 = data2[!is.na(data2$bmi),]    # remove records with missing phenotypes
   table(cov$age)                       # range of ages for trait is 37 to 76
## 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56 
## 121 237 258 259 258 225 239 223 241 239 285 262 259 265 301 306 363 348 388 428 
##  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76 
## 413 462 452 529 488 467 418 382 406 335 333 298 309 283 289 272 135   1   2   1
   table(cov$sex)                       # 1 = males, 2 = females
## 
##    1    2 
## 5346 6434
   data2$sex = as.factor(data2$sex)     # interpret 1 and 2 as factors rather than integers

We are now going to plot a the distribution of the data using the library(ggplot2) and some code that I adapted from here.

   mu = aggregate(data2$bmi~data2$sex,FUN=mean)
   names(mu) = c("sex","bmi")
   mu
##   sex      bmi
## 1   1 29.24613
## 2   2 25.88712
   savePlot = ggplot(data2, aes(x=bmi, color=sex)) +
     geom_histogram(fill="white", position="dodge", bins = 50)+
     geom_vline(data=mu,aes(xintercept=bmi,color=sex), linetype="dashed")
   savePlot                     # view for html

   ggsave(file="bmiSex.jpg")  # save for download
## Saving 7 x 5 in image

Pre-adjusting the phenotypes

Now we will correct the phenotype for sex and age effects, as well as the first 5 principal components by using a linear model. In studies in human genetics, correcting for sex and age effects is the minimum requirement. Often further covariates (such as genotyping batch) or more complex models (for example with sex-specific effects) are used.

Pre-adjusting phenotypes slightly reduces GWAS power but makes the analysis much faster. Many program internally pre-adjust the phenotypes for covariates, even if they are supplied at the time of analysis.

Since we are testing for linear effects of age, we need to center age (i.e. correct for the mean) to help with interpretation of the co-efficients from the model. Sometimes it is worth checking for non-linear age effects as well, so we will add \(age^2\) to the model.

   data2$age = data2$age - mean(data2$age)
   data2$age2 = data2$age^2
   lm0 = lm(bmi ~ sex + age + age2 + PC1 + PC2 + PC3 + PC4 + PC5, data=data2)
   summary(lm0)
## 
## Call:
## lm(formula = bmi ~ sex + age + age2 + PC1 + PC2 + PC3 + PC4 + 
##     PC5, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5297  -2.8330  -0.0456   2.8226  29.0687 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.2908006  0.0694176 421.950   <2e-16 ***
## sex2        -3.4111714  0.0786246 -43.386   <2e-16 ***
## age          0.1686041  0.0041958  40.184   <2e-16 ***
## age2        -0.0001749  0.0004193  -0.417   0.6767    
## PC1          9.0083273  4.2459643   2.122   0.0339 *  
## PC2          1.7252037  4.2469427   0.406   0.6846    
## PC3         -5.3130620  4.2464166  -1.251   0.2109    
## PC4          3.7630622  4.2473917   0.886   0.3757    
## PC5          3.1973422  4.2466482   0.753   0.4515    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.245 on 11761 degrees of freedom
## Multiple R-squared:  0.2323, Adjusted R-squared:  0.2318 
## F-statistic: 444.8 on 8 and 11761 DF,  p-value: < 2.2e-16

The effect estimated by the model for the intercept is the average BMI for males (sex = 1) at the mean age. Thus the average BMI for males in the data is about \(29.3 kg/m^2\). Is this what you expect?

The next line is the sex effect (sex2) which is the difference between the average of males and females at the mean age. What is the mean BMI for females according the the linear model, what is the raw mean for females?

The mean BMI for females from the linear model \(29.3 - 3.4 = 25.9 kg/m^2\). This is very close to the raw sex averages calculated above.

  • what does the significant effect of age mean? How do you interpret the coefficient?
  • the age2 (\(age^2\)) effect is not significant and can be dropped.
  • the PCs are also not significant but are usually fitted regardless of significance.

We can now extract the residuals from the model, i.e. this is the phenotype (\(y\)) minus the variables fitted in the model (overall mean, sex effect - if relevant, linear effects of age and PC1-PC5). Note here that the residuals from the model are in the same order as the input file.

   lm0 = lm(bmi ~ sex + age + PC1 + PC2 + PC3 + PC4 + PC5, data=data2)
   data2$resid = lm0$residuals

Final checks

After pre-adjusting the phenotypes, there are a number of checks normally performed. These include:

  • the distribution of the adjusted trait is approximately normal
  • remove the outliers (i.e. records > 5 standard deviations from the mean)
  • check the variance is similar between males and females

Traits with skewed distributions may require a log10(x) transformation to achieve normality. A RINT (rank-based inverse-normal) transformation is also possible, although the SNP effects for RINT transformations are difficult to interpret. A RINT transformation can be performed with the function below.

  InvNorm=function(x) {
    return(qnorm((rank(x, na.last="keep")-0.5)/sum(!is.na(x))))
  }

Try out the RINT function above on the data, what does it do and what types of values does it produce?

Often pre-adjusted phenotypes are standardised to a unit normal \(N(0,1)\) so that effect sizes can be compared across traits. Normally, the SNP effects from a linear model for a quantitative trait are in trait units, what are the trait units if the trait has been standardised to \(N(0,1)\)?

   data2$std = scale(data2$resid)     # standardise N(0,1)
   sum(abs(data2$std)>5)              # how many outliers are more than 5 SD from mean?
## [1] 9
   data2 = data2[abs(data2$std)<5,]   # remove outliers
   
   # plot final QC'd phenotype
   savePlot = ggplot(data2, aes(x=std, color=sex)) +
      geom_histogram(fill="white", position="dodge", bins = 50)
   savePlot                           # veiw for html

   ggsave(file="bmiStd.jpg")            # save for download
## Saving 7 x 5 in image
   # write file for later
   write.table(file="bmiStd.phen",data2[,c("FID","IID","bmi","std")],quote=FALSE,row.names=FALSE,col.names=FALSE)

Part 2: PCA Analysis

PCA analysis is an approach often used to identify population structure in genomic data. You will be provide with a pre-computed genomic relationship matrix for a small subset of the 1000 Genomes project data. Further information on the 1000 Genomes Project can be found [here] (https://www.internationalgenome.org/).

The genomic relationship matrix (GRM) in this example has been computed for 50 individuals identified from one of 5 ancestry groups. It was computed from about 1.3M SNP spread evenly throughout the human genome using standard methods in GCTA. We will make a simple GRM in the next practical.

If you need an introduction in Principal Component Analysis (PCA) there is a great 20minute youtube clip called PCA… Step-by-Step by StatQuest. StatQuest also has many other useful bite-size tutorials on topics related to genetics.

You can download the 1000G data directly to your laptop if you wish. Otherwise complete the practical on the server, save your plots and download them to your laptop.

The files used in this part 2 of the prac are:

1kg_winterSchool.grm.gz → pre-made genomic relationship matrix for the 1000G data.

1kg_winterSchool_superPop.txt → file defining the continental populations of the 1000G samples.

the data is located in the directory /data/module1/downloads/1_phenotypesPrac/.

Define your path to the data in R, which will be used below, for the cluster using the following code in R. If you have downloaded the data, you will need to update this path to where you saved the data.

dir="/data/module1/downloads/1_phenotypesPrac/"

First we will read in the data, and have a quick look at the format of the file.

  # read in the data
  x=read.table(paste0(dir,"1kg_winterSchool.grm.gz"))
  labels=read.table(paste0(dir,"1kg_winterSchool_superPop.txt"))[,1]

  # columns are row number, column number, number of SNP, genomic relationship
  head(x)
##   V1 V2      V3         V4
## 1  1  1 1365790 0.95116320
## 2  2  1 1365790 0.06590147
## 3  2  2 1365790 0.92498110
## 4  3  1 1365790 0.05928111
## 5  3  2 1365790 0.07948754
## 6  3  3 1365790 0.92168420
  tail(x)
##      V1 V2      V3          V4
## 1270 50 45 1365790 0.001001333
## 1271 50 46 1365790 0.046047610
## 1272 50 47 1365790 0.034500270
## 1273 50 48 1365790 0.049708950
## 1274 50 49 1365790 0.048485100
## 1275 50 50 1365790 0.878313700

The GRM is a square symmetrical matrix. The data read in are the lower triangular and diagonal elements, as indicated by the row and column number. How many elements do you expect in this file for a 50x50 GRM?

In the first column there will be \(n\) elements, \(n-1\) in the 2nd column, \(n-2\) in the 3rd, etc. until \(1\) element in the 50th column. Check this calculation against the dimensions of the GRM provided.

  # n = number of individuals
  n=50

  # expected number of elements in lower-triangular matrix, inc. diagonals
  N=sum(n:1) ; N
## [1] 1275
  # how many elements read in by R for the lower-triangle + diagonal.
  dim(x)
## [1] 1275    4

Each of the 1000G participants has a ancestry label assigned to it. Use the table() function to count the labels, note that AFR = African, AMR = admixed American, EAS = east Asian, SAS = south Asian, EUR = European.

  table(labels)
## labels
## AFR AMR EAS EUR SAS 
##   7   8  12  11  12

Now construct the GRM as a square matrix so that we can use the eigen() function in R to do the eigenvalue decomposition. The relationship values are in the 4th column of x. We can access the 4th column by using the notation object[rows,columns], where x[,4] will access all rows of the 4th column or x[1,4] will access a single element from the 1st row and 4th column.

This example also uses a for(i in array) {action} loop in R. If you are not familiar with loops in R, try a basic loop such as for (i in 1:10) {print(i)} to test it out. Make sure you understand the code!

  # define the matrix using NA, then fill the matrix
  GRM=matrix(NA,nrow=n,ncol=n)
  for (k in 1:N) {
    i=x[k,1] ; j=x[k,2]
    GRM[i,j]=x[k,4]
    GRM[j,i]=x[k,4]
  }

Now conduct the PCA analysis using the eigen() function. Save the output and check the structure of the returned object using str(). If the matrix is full-rank, we expect 50 eigenvalues with values > 0, and 50 corresponding vectors of length 50 (i.e. a 50x50 matrix). We’ll plot the first 2 PCs with the labels provided and detailing how much variation the PCs capture.

If you are working on the cluster you will need to save the plots below using the same commands as above, i.e. 

jpeg(file="myPlot.jpg") # open the jpeg.

plot(...) # call the plot.

dev.off() # close the device.

  #eigen value decomposition
   PCA=eigen(GRM)

  #eigenvalues (vector), eigenvectors (50x50 matrix, 1 column per eigenvector)
  str(PCA) 
## List of 2
##  $ values : num [1:50] 4.78 2.97 1.46 1.43 1.39 ...
##  $ vectors: num [1:50, 1:50] 0.0406 0.0471 0.049 0.0482 0.1006 ...
##  - attr(*, "class")= chr "eigen"
  # proportion of variation explained by PCs
   total=sum(PCA$values)
   plot(PCA$values[1:10]/total,
        las=1, type="b", lwd=1.2, col="blue", xlab="principal component",
        ylab="Proportion of total variance", pch=20)

  # plot 
  labels=as.factor(labels)
  PC1=PCA$vectors[,1] ; PC2=PCA$vectors[,2]
  plot(PC2~PC1, col=labels, pch=20, las=1, 
       xlab="PC1, 9.2% of variation", ylab="PC2, 5.7% of variation")
  legend("bottomleft",legend=levels(labels),fill=1:5)

In practice, we would use software such as GCTA or PLINK at the command line to (1) calculate the GRM, and (2) conduct the PCA analysis. These programs are much more efficient compared to R. However, we have done it in R here as it is useful for understanding what’s going on.