Objective
In this practical session you will learn some basic concepts
associated with performing a GWAS.
This practical is split into three parts that each introduce you to a
different type of GWAS you might want to perform.
Part 1: Performing a GWAS in unrelated individuals with a
quantitative trait using PLINK
Part 2: Performing a GWAS in unrelated individuals with a
binary trait using PLINK
Part 3: Performing a GWAS in related individual using
GCTA
.
You you will need to do your analysis on the server, save your plots
and download to view. Do not download the data used in this
practical
We are assuming here that we are using QC’d genotype and phenotype
files. In each part we will run the GWAS and examine the output by
generating Manhattan plots, qq-plots and calculate the genomic inflation
factor.
For this practical we will use commands both in unix (at the command
line) and in R
.
Blue code chunks are used to denote command line code
Grey chunks are used to denote R code
The data can be found in the directory
/data/module1/5_gwasPrac/
on the cluster.
For more information on the software we are using:
PLINK website https://www.cog-genomics.org/plink/
GCTA website https://yanglab.westlake.edu.cn/software/gcta/#Overview
Part 1: GWAS of a unrelated individuals in PLINK for a quantitative
trait
Data
The genetic data we will use to run our GWAS are contained in the
following 3 files:
gwasQC.bed
→ binary file containing all genotypes
gwasQC.bim
→ information about SNP markers
gwasQC.fam
→ information about individuals
These 3 files come as a set and are often collectively referred to as
bfiles. They are the input for the plink --bfile
option.
There are heaps of phenotypes to choose from → Fasting glucose,
fasting insulin, ferritin, height, neuroticism, sleep duration, smoking
(pack years), systolic blood pressure, waist-to-hip ratio.
Quantitative traits have the suffix _QC.phen
. Binary
traits have the suffix binary.phen
GWAS for a quantitative trait
To run a very simple GWAS in PLINK with a quantitative trait we use
the command --assoc
in PLINK
. This fits the
following model to each SNP, in turn:
The code in PLINK
is:
plink --bfile /data/module1/5_gwasPrac/gwasQC --assoc --pheno /data/module1/5_gwasPrac/BMI_QC.phen
Given the above command and a quantitative phenotype,
--assoc
writes regression statistic results to
plink.qassoc
.
If you want to change the output file prefix in PLINK
you can try out the --out fileNamePrefix
option. The output
file below used --out raw
and the file is
raw.qassoc
.
Combining these commands should produce a file like the one below,
do you understand what each column of output is?
head raw.qassoc
After running the GWAS there are several checks and diagnostic we can
perform:
Manhattan plot
QQ-plot
Genomic inflation factor
Q: How many genome-wide significant hits are there? (\(P<5x10^{-8}\)).
Q: Is there any evidence for genomic inflation in the test
statistic?
# Load the qqman package
library(qqman)
# Load our GWAS results
gwasResults <- read.table('raw.qassoc', header = T)
# Calculate genomic inflation factor
qchisq(1-median(gwasResults$P),1)/qchisq(0.5,1)
## [1] 1.014068
# The expected value of the genomic inflation factor is 1.0. Is there any evidence for inflation?
# We are going to plot the manhattan and QQ-plots and then save them to a pdf. This allows us to download the plots and open them locally when working on a cluster
pdf(file="GWAS_visualisation_quantitative_trait.pdf")
par(mfrow = c(2, 1))
# Plot the Manhattan plot
manhattan(gwasResults)
# Plot the qq-plot
qq(gwasResults$P)
dev.off()
## quartz_off_screen
## 2
We can then copy the plots locally to open them.
Do your plots look good?
Adding covariates
Covariates can help control for know sources of variation and bias in
GWAS. Examples of discrete covariates include sex and smoking status,
and quantitative covariates might include principle components. There
are several options for dealing with covariates.
There are commands to add covariates to our regression in when doing
our GWAS in PLINK
.
e.g. --linear --covar <file>
However this is SLOW! A LOT SLOWER IF YOU INCLUDE PCS AS WELL!
An alternative is to regress the phenotype against the covariates in
R
and create a new phenotype file with the residuals. This
morning we made a file called bmiStd.phen
where fitted a
linear model to BMI and took the residuals after fitting the covariates
to the data. We can use this file now. Regressing out covariates results
in some loss in power, but that is generally accepted as it makes
running the GWAS much more efficient.
The options are:
- fitting the covariates to every SNP, very slow. e.g.
plink --bfile /data/module1/5_gwasPrac/gwasQC --linear --covar /data/module1/5_gwasPrac/covariateFiltered.cov --pheno /data/module1/5_gwasPrac/BMI_QC.phen --out bmiCov
- pre-adjusting the phenotypes for covariates, e.g.
plink --bfile /data/module1/5_gwasPrac/gwasQC --assoc --pheno /data/module1/5_gwasPrac/bmiStd.phen --out bmiStd
You can try an analysis in PLINK
fitting covariates
to your data if you like
Part 2: GWAS of unrelated individuals in PLINK for binary trait
To run a simple GWAS in PLINK
with a binary
(case/control) trait we use the command: --assoc
.
This performs a 1 df chi-square test.
Case/control phenotypes are expected to be encoded as 1 = unaffected
(control), 2 = affected (case); 0 is accepted as an alternate missing
value encoding. If you use the --1
option, 0 is interpreted
as unaffected status instead, while 1 maps to affected. This also
forces phenotypes to be interpreted as case/control..
Before running the GWAS, have a look at the phenotype file. How
are the cases and controls coded?
Run your GWAS for a binary trait by modifying the example below:
plink --bfile /data/module1/5_gwasPrac/gwasQC --assoc --pheno /data/module1/5_gwasPrac/Fasting_Insulin_binary2.phen --out binary
Given a binary phenotype, --assoc
writes out regression
statistic results to plink.assoc
. We also used
--out binary
command to change the prefix of the output.
Use head
at the command line, as below, to see the top of
the output file. Make sure you understand each of the output
columns.
head binary.assoc
As before we can create a Manhattan plot, a QQ-plot and calculate the
genomic inflation factor
Q: How many genome-wide significant hits are there? (\(P<5x10^{-8}\)). If you use the same
phenotype for the quantitative trait GWAS, how does this
compare?.
The binary version of traits such as BMI were created by thresholding
the data into cases and controls (i.e. BMI values above a certain value
were cases, otherwise they were controls). What do you think about
thresholding quantitative traits into discrete categories?
# Load the qqman package
library(qqman)
# Load our GWAS results
gwasResults <- read.table('binary.assoc', head=T)
# Calculate genomic inflation factor
qchisq(1-median(gwasResults$P),1)/qchisq(0.5,1)
## [1] 1.013126
# The expected value of the genomic inflation factor is 1.0.
# We are going to plot the manhattan and QQ-plots and then save them to a png. This allows us to open the plots locally when working on a cluster
pdf("GWAS_visualisation_binary_trait.pdf")
par(mfrow = c(2, 1))
# Plot the manhattan plot
manhattan(gwasResults)
# Plot the qq-plot
qq(gwasResults$P)
dev.off()
## quartz_off_screen
## 2
We can download the plots locally and open them.
Adding covariates
If you wanted to include covariates in analyses using binary traits,
then you can use the --logistic
command to perform a
logistic regression. Example code is given below for reference, it takes
a long time to run.
plink --bfile /data/module1/5_gwasPrac/gwasQC --logistic --pheno /data/module1/5_gwasPrac/BMI_binary1.phen --covar /data/module1/5_gwasPrac/covariateFiltered.cov --1 --out logisticCovar
Part 3: Including relatives in GCTA for a quantitative trait
Up until now we have been using PLINK
and (nominally)
unrelated individuals for our GWAS. Any closely related individuals were
excluded from the analysis. In circumstances where there are too many
close relatives in the dataset and excluding individuals is not
desirable (i.e. it would substantially reduce the power of the GWAS), we
can use a different GWAS model in GCTA
. The objective of
this part of the practical is to run a GWAS using a sample where
relatives are included. The model we will fit is:
GCTA
We will be using GCTA
to run our GWAS with related
individuals.
Similar to PLINK
, the basic command is :
gcta64 --bfile <data prefix> --command
The GWAS will be performed for a quantitative trait however there are
options for binary traits as well.
We will generate a sparse-GRM implemented in GCTA to account for the
covariance between ‘close’ relatives (i.e. \(\pi > 0.05\)).
Data
Data for this practical is found in the directory:
/data/module1/5_gwasPrac/
The genotype & phenotype files are:
data.bed
→ binary file containing all genotypes
data.bim
→ information about SNP markers
data.fam
→ information about individuals
simData2.phen
→ phenotype file
Lets look at the GWAS data. Q: How many individuals & SNPs in
the dataset?
Use the following commands at the command line.
head /data/module1/5_gwasPrac/data.fam
head /data/module1/5_gwasPrac/data.bim
wc -l /data/module1/5_gwasPrac/data.fam
wc -l /data/module1/5_gwasPrac/data.bim
GRM
We created a GRM file using GCTA
for these individuals
using the --make-grm-bin
option in GCTA
. It
produces a GRM in binary format with the following files:
data2.grm.bin
→ binary file containing genomic
relationship matrix
data2.grm.N.bin
→ binary file with number of SNP markers
used in GRM
data2.grm.id
→ individual ID’s corresponding to grm
files
We looked at the g-zipped version of these files this morning. Q:
How many individuals are included in the GRM? Does this match the number
of individuals in the bfiles?
e.g. at the command line
wc -l /data/module1/5_gwasPrac/data2.grm.id
To get a better understanding of the grm we are going to use it to
identify ‘relative’ and an ‘unrelated’ set. We can do this using
GCTA
at the command line with the
--grm-singleton
flag, e.g.
gcta64 --grm /data/module1/5_gwasPrac/data2 --grm-singleton 0.05 --out relatives
The relationship threshold in this example is 0.05, which is the
value commonly used in human genetics to define ‘unrelated’ individuals.
The GCTA
option produces three files:
relatives.family.txt
→ all relative pairs and their
relationship value
relatives.singleton.txt
→ all ‘singletons’, individuals
with no relatives in the dataset
relatives.log
→ log file
We are running this command just to see what the data is and how many
relatives we have in our dataset. Have a look at the output.
Q: How many individuals do you have in each set?
Making a sparse GRM
Use GCTA
at the command line with the
--make-bK-sparse
option to make the sparse GRM. This
produces a ‘sparse GRM’, i.e. the sparse GRM only contains values \(> 0.05\). GCTA
assumes all
the other values (i.e. those \(<0.05\)) are zero.
gcta64 --grm /data/module1/5_gwasPrac/data2 --make-bK-sparse 0.05 --out data2_sparse
Three files produced:
data2_sparse.grm.sp
→ index and relationships over 0.05
from GRM
data2_sparse.grm.id
→ corresponding ID file
data2_sparse.grm.log
→ log file
Use R and unix to investigate your output.
Q: Why are the number of lines in the sparse GRM different from
the relatives.families.txt
file obtained
previously?
Running fastGWA
Use GCTA
at the command line with the
--fastGWA-mla
and --grm-sparse
flags to run
the GWAS, e.g.
gcta64 --bfile /data/module1/5_gwasPrac/data --fastGWA-mlm --grm-sparse data2_sparse --pheno /data/module1/5_gwasPrac/simData2.phen --out assocSparse
This writes regression statistic results to
assocSparse.fastGWA
Using the R
code provided previously as a guide, can you
create a Manhattan and QQ-plots for this GWAS?