Objective

In this practical session you will learn some basic concepts associated with performing a GWAS.

This practical is split into three parts that each introduce you to a different type of GWAS you might want to perform.

Part 1: Performing a GWAS in unrelated individuals with a quantitative trait using PLINK

Part 2: Performing a GWAS in unrelated individuals with a binary trait using PLINK

Part 3: Performing a GWAS in related individuals using GCTA


You will need to run your analyses on the server, save your plots there, and then download the plots to view them. Do not download the data used in this practical.


We assume here that we are using QC’d genotype and phenotype files. In each part we will run the GWAS and examine the output by generating Manhattan plots and QQ plots and by calculating the genomic inflation factor.

For this practical we will use commands both in unix (at the command line) and in R.

Blue code chunks are used to denote command line code
Grey chunks are used to denote R code

The data can be found in the directory /data/module1/5_gwasPrac/ on the cluster.

For more information on the software we are using:

PLINK website https://www.cog-genomics.org/plink/

GCTA website https://yanglab.westlake.edu.cn/software/gcta/#Overview



Part 3: Including relatives in GCTA for a quantitative trait

Up until now we have been using PLINK and (nominally) unrelated individuals for our GWAS. Any closely related individuals were excluded from the analysis. In circumstances where there are too many close relatives in the dataset and excluding individuals is not desirable (i.e. it would substantially reduce the power of the GWAS), we can use a different GWAS model in GCTA. The objective of this part of the practical is to run a GWAS using a sample where relatives are included. The model we will fit is:
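A sketch of this model in the form given in the GCTA fastGWA documentation is shown here. The symbols are assumptions, since they are not defined elsewhere in this practical:

\[
\mathbf{y} = \mathbf{x}_{snp}\,\beta_{snp} + \mathbf{X}_c\,\boldsymbol{\beta}_c + \mathbf{g} + \mathbf{e},
\qquad \mathbf{g} \sim N(\mathbf{0},\, \boldsymbol{\pi}\,\sigma_g^2),
\qquad \mathbf{e} \sim N(\mathbf{0},\, \mathbf{I}\,\sigma_e^2)
\]

Here \(\mathbf{y}\) is the vector of phenotypes, \(\mathbf{x}_{snp}\) the genotypes at the SNP being tested with effect \(\beta_{snp}\), \(\mathbf{X}_c\) any covariates with effects \(\boldsymbol{\beta}_c\), \(\mathbf{g}\) the total genetic effects whose covariance is proportional to the sparse GRM \(\boldsymbol{\pi}\), and \(\mathbf{e}\) the residuals.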


GCTA

We will be using GCTA to run our GWAS with related individuals.

Similar to PLINK, the basic command is: gcta64 --bfile <data prefix> --command

The GWAS will be performed for a quantitative trait; however, there are options for binary traits as well.

We will use GCTA to generate a sparse GRM that accounts for the covariance between ‘close’ relatives (i.e. pairs with \(\pi > 0.05\)).


Data

Data for this practical is found in the directory: /data/module1/5_gwasPrac/

The genotype & phenotype files are:

data.bed → binary file containing all genotypes

data.bim → information about SNP markers

data.fam → information about individuals

simData2.phen → phenotype file

Let's look at the GWAS data. Q: How many individuals & SNPs are in the dataset?

Use the following commands at the command line.

head /data/module1/5_gwasPrac/data.fam
head /data/module1/5_gwasPrac/data.bim
wc -l /data/module1/5_gwasPrac/data.fam
wc -l /data/module1/5_gwasPrac/data.bim

GRM

We created a GRM for these individuals using the --make-grm-bin option in GCTA. It produces a GRM in binary format, stored in the following files:

data2.grm.bin → binary file containing genomic relationship matrix

data2.grm.N.bin → binary file with number of SNP markers used in GRM

data2.grm.id → individual IDs corresponding to the GRM files

We looked at the gzipped version of these files this morning. Q: How many individuals are included in the GRM? Does this match the number of individuals in the bfiles?

e.g. at the command line

wc -l  /data/module1/5_gwasPrac/data2.grm.id 
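Matching line counts do not guarantee that the two files list the same individuals. The sketch below compares the IDs directly. It assumes that both files hold the family ID and individual ID in their first two columns; compare_ids is a hypothetical helper name, not a GCTA or PLINK command.

```shell
# Compare the individuals in a PLINK .fam file with those in a GRM .id file.
# Assumption: both files hold FID and IID in their first two columns.
compare_ids() {
    fam_file=$1
    grm_id_file=$2
    tmp_fam=$(mktemp)
    tmp_grm=$(mktemp)
    # Keep only FID and IID, tab-separated, sorted and de-duplicated
    awk '{print $1 "\t" $2}' "$fam_file" | sort -u > "$tmp_fam"
    awk '{print $1 "\t" $2}' "$grm_id_file" | sort -u > "$tmp_grm"
    # comm needs sorted input: -12 = in both, -23 = first only, -13 = second only
    echo "in_both $(comm -12 "$tmp_fam" "$tmp_grm" | wc -l | tr -d ' ')"
    echo "fam_only $(comm -23 "$tmp_fam" "$tmp_grm" | wc -l | tr -d ' ')"
    echo "grm_only $(comm -13 "$tmp_fam" "$tmp_grm" | wc -l | tr -d ' ')"
    rm -f "$tmp_fam" "$tmp_grm"
}

# Run on the practical data if it is available on this machine
dir=/data/module1/5_gwasPrac
if [ -f "$dir/data.fam" ] && [ -f "$dir/data2.grm.id" ]; then
    compare_ids "$dir/data.fam" "$dir/data2.grm.id"
fi
```

If fam_only and grm_only are both 0, the two files describe the same set of individuals.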


To get a better understanding of the GRM, we are going to use it to identify a ‘relative’ set and an ‘unrelated’ set of individuals. We can do this using GCTA at the command line with the --grm-singleton flag, e.g.

gcta64 --grm /data/module1/5_gwasPrac/data2 --grm-singleton 0.05 --out relatives

The relationship threshold in this example is 0.05, which is the value commonly used in human genetics to define ‘unrelated’ individuals. The GCTA option produces three files:

relatives.family.txt → all relative pairs and their relationship value

relatives.singleton.txt → all ‘singletons’, individuals with no relatives in the dataset

relatives.log → log file

We are running this command just to see what the data is and how many relatives we have in our dataset. Have a look at the output.

Q: How many individuals do you have in each set?
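One way to answer this at the command line is sketched below. The column layout is an assumption (relatives.family.txt with one pair per line as FID1 IID1 FID2 IID2 relationship and no header, relatives.singleton.txt with one FID IID per line), so check it with head first. count_related and count_singletons are hypothetical helper names.

```shell
# Count distinct individuals that appear in at least one relative pair.
# Assumption: columns are FID1 IID1 FID2 IID2 relationship, with no header line.
count_related() {
    awk '{print $1, $2; print $3, $4}' "$1" | sort -u | wc -l | tr -d ' '
}

# Count singletons. Assumption: one individual (FID IID) per non-empty line.
count_singletons() {
    awk 'NF > 0' "$1" | wc -l | tr -d ' '
}

# Run on the GCTA output if it is in the current directory
if [ -f relatives.family.txt ] && [ -f relatives.singleton.txt ]; then
    echo "relative pairs: $(awk 'NF > 0' relatives.family.txt | wc -l | tr -d ' ')"
    echo "individuals with relatives: $(count_related relatives.family.txt)"
    echo "singletons: $(count_singletons relatives.singleton.txt)"
fi
```

Note that the number of pairs and the number of individuals with relatives differ, because one individual can appear in several pairs.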


Making a sparse GRM

Use GCTA at the command line with the --make-bK-sparse option to make the sparse GRM. A ‘sparse GRM’ stores only the relationship values \(> 0.05\); GCTA assumes all the other values (i.e. those \(< 0.05\)) are zero.

gcta64 --grm /data/module1/5_gwasPrac/data2 --make-bK-sparse 0.05 --out data2_sparse

Three files are produced:

data2_sparse.grm.sp → index and relationships over 0.05 from GRM

data2_sparse.grm.id → corresponding ID file

data2_sparse.log → log file

Use R and unix to investigate your output.

Q: Why is the number of lines in the sparse GRM different from the number in the relatives.family.txt file obtained previously?
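A possible starting point for this investigation is sketched below. It assumes that each line of the .grm.sp file holds two individual indices followed by a relationship value; summarise_sparse_grm is a hypothetical helper name.

```shell
# Split the sparse GRM entries into self-pairs and between-individual pairs.
# Assumption: each line is "index1 index2 value", with no header line.
summarise_sparse_grm() {
    awk '$1 == $2 { diag++ }
         $1 != $2 { off++ }
         END { print "diagonal", diag + 0; print "off_diagonal", off + 0 }' "$1"
}

# Run on the practical output if it is in the current directory
if [ -f data2_sparse.grm.sp ]; then
    head -n 5 data2_sparse.grm.sp
    summarise_sparse_grm data2_sparse.grm.sp
    if [ -f relatives.family.txt ]; then
        wc -l relatives.family.txt
    fi
fi
```

Comparing the two counts with the number of lines in relatives.family.txt should help you answer the question.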


Running fastGWA

Use GCTA at the command line with the --fastGWA-mlm and --grm-sparse flags to run the GWAS, e.g.

gcta64 --bfile /data/module1/5_gwasPrac/data --fastGWA-mlm --grm-sparse data2_sparse --pheno /data/module1/5_gwasPrac/simData2.phen --out assocSparse

This writes the association statistics to assocSparse.fastGWA.

Using the R code provided previously as a guide, can you create Manhattan and QQ plots for this GWAS?
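Before plotting in R, a quick command-line check of the results can be useful. The sketch below counts genome-wide significant SNPs and estimates the genomic inflation factor as the median of (BETA/SE)^2 divided by 0.4549364, the median of a chi-square distribution with 1 degree of freedom. It assumes the results file has a header line containing columns named BETA, SE and P; gwas_lambda and count_significant are hypothetical helper names.

```shell
# Genomic inflation factor: median((BETA/SE)^2) / 0.4549364.
# Assumption: the file has a header line with columns named BETA and SE.
gwas_lambda() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) { if ($i == "BETA") b = i; if ($i == "SE") s = i }
                   if (!b || !s) exit 2
                   next }
         $s + 0 > 0 { z = $b / $s; printf "%.12f\n", z * z }' "$1" |
    LC_ALL=C sort -n |
    awk '{ x[NR] = $1 }
         END { if (NR == 0) exit 1
               m = (NR % 2) ? x[(NR + 1) / 2] : (x[NR / 2] + x[NR / 2 + 1]) / 2
               printf "%.3f\n", m / 0.4549364 }'
}

# Count SNPs with P below a threshold (default 5e-8).
# Assumption: the file has a header line with a column named P.
count_significant() {
    awk -v thr="${2:-5e-8}" '
         NR == 1 { for (i = 1; i <= NF; i++) if ($i == "P") p = i; next }
         p && $p ~ /^[0-9.]/ && $p + 0 < thr + 0 { n++ }
         END { print n + 0 }' "$1"
}

# Run on the practical output if it is in the current directory
if [ -f assocSparse.fastGWA ]; then
    echo "lambda: $(gwas_lambda assocSparse.fastGWA)"
    echo "SNPs with P < 5e-8: $(count_significant assocSparse.fastGWA)"
fi
```

A lambda close to 1 suggests little inflation of the test statistics. The plots themselves are still best made in R.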