Objective

In this practical session you will learn some basic concepts associated with performing a GWAS.

This practical is split into three parts that each introduce you to a different type of GWAS you might want to perform.

Part 1: Performing a GWAS in unrelated individuals with a quantitative trait using PLINK

Part 2: Performing a GWAS in unrelated individuals with a binary trait using PLINK

Part 3: Performing a GWAS in related individual using GCTA.

You you will need to do your analysis on the server, save your plots and download to view. Do not download the data used in this practical

We are assuming here that we are using QC’d genotype and phenotype files. In each part we will run the GWAS and examine the output by generating Manhattan plots, qq-plots and calculate the genomic inflation factor.

For this practical we will use commands both in unix (at the command line) and in R.

Blue code chunks are used to denote command line code

Grey chunks are used to denote R code

The data can be found in the directory /data/module1/5_gwasPrac/ on the cluster.

For more information on the software we are using:

PLINK website https://www.cog-genomics.org/plink/

GCTA website https://yanglab.westlake.edu.cn/software/gcta/#Overview

Part 1: GWAS of a unrelated individuals in PLINK for a quantitative trait

Data

The genetic data we will use to run our GWAS are contained in the following 3 files:

gwasQC.bed → binary file containing all genotypes

gwasQC.bim → information about SNP markers

gwasQC.fam → information about individuals

These 3 files come as a set and are often collectively referred to as bfiles. They are the input for the plink --bfile option.

There are heaps of phenotypes to choose from → Fasting glucose, fasting insulin, ferritin, height, neuroticism, sleep duration, smoking (pack years), systolic blood pressure, waist-to-hip ratio.

Quantitative traits have the suffix _QC.phen. Binary traits have the suffix binary.phen

GWAS for a quantitative trait

To run a very simple GWAS in PLINK with a quantitative trait we use the command --assoc in PLINK. This fits the following model to each SNP, in turn:

The code in PLINK is:

plink --bfile /data/module1/5_gwasPrac/gwasQC --assoc --pheno  /data/module1/5_gwasPrac/BMI_QC.phen

Given the above command and a quantitative phenotype, --assoc writes regression statistic results to plink.qassoc.

If you want to change the output file prefix in PLINK you can try out the --out fileNamePrefix option. The output file below used --out raw and the file is raw.qassoc.

Combining these commands should produce a file like the one below, do you understand what each column of output is?

head raw.qassoc

After running the GWAS there are several checks and diagnostic we can perform:

Manhattan plot
QQ-plot
Genomic inflation factor

Q: How many genome-wide significant hits are there? (\(P<5x10^{-8}\)).

Q: Is there any evidence for genomic inflation in the test statistic?

# Load the qqman package
library(qqman)

# Load our GWAS results
gwasResults <- read.table('raw.qassoc', header = T)

# Calculate genomic inflation factor
qchisq(1-median(gwasResults$P),1)/qchisq(0.5,1)

## [1] 1.014068

# The expected value of the genomic inflation factor is 1.0. Is there any evidence for inflation? 

# We are going to plot the manhattan and QQ-plots and then save them to a pdf. This allows us to download the plots and open them locally when working on a cluster
pdf(file="GWAS_visualisation_quantitative_trait.pdf")
par(mfrow = c(2, 1))

# Plot the Manhattan plot
manhattan(gwasResults)

# Plot the qq-plot
qq(gwasResults$P)
dev.off()

## quartz_off_screen 
##                 2

We can then copy the plots locally to open them.

Do your plots look good?

Adding covariates

Covariates can help control for know sources of variation and bias in GWAS. Examples of discrete covariates include sex and smoking status, and quantitative covariates might include principle components. There are several options for dealing with covariates.

There are commands to add covariates to our regression in when doing our GWAS in PLINK.

e.g. --linear --covar <file>

However this is SLOW! A LOT SLOWER IF YOU INCLUDE PCS AS WELL!

An alternative is to regress the phenotype against the covariates in R and create a new phenotype file with the residuals. This morning we made a file called bmiStd.phen where fitted a linear model to BMI and took the residuals after fitting the covariates to the data. We can use this file now. Regressing out covariates results in some loss in power, but that is generally accepted as it makes running the GWAS much more efficient.

The options are:

fitting the covariates to every SNP, very slow. e.g.

plink --bfile /data/module1/5_gwasPrac/gwasQC --linear --covar /data/module1/5_gwasPrac/covariateFiltered.cov --pheno /data/module1/5_gwasPrac/BMI_QC.phen --out bmiCov

pre-adjusting the phenotypes for covariates, e.g.

plink --bfile /data/module1/5_gwasPrac/gwasQC --assoc --pheno /data/module1/5_gwasPrac/bmiStd.phen --out bmiStd

You can try an analysis in PLINK fitting covariates to your data if you like

Part 2: GWAS of unrelated individuals in PLINK for binary trait

To run a simple GWAS in PLINK with a binary (case/control) trait we use the command: --assoc.

This performs a 1 df chi-square test.

Case/control phenotypes are expected to be encoded as 1 = unaffected (control), 2 = affected (case); 0 is accepted as an alternate missing value encoding. If you use the --1 option, 0 is interpreted as unaffected status instead, while 1 maps to affected. This also forces phenotypes to be interpreted as case/control..

Before running the GWAS, have a look at the phenotype file. How are the cases and controls coded?

Run your GWAS for a binary trait by modifying the example below:

plink --bfile /data/module1/5_gwasPrac/gwasQC --assoc --pheno /data/module1/5_gwasPrac/Fasting_Insulin_binary2.phen --out binary

Given a binary phenotype, --assoc writes out regression statistic results to plink.assoc. We also used --out binary command to change the prefix of the output. Use head at the command line, as below, to see the top of the output file. Make sure you understand each of the output columns.

head binary.assoc

As before we can create a Manhattan plot, a QQ-plot and calculate the genomic inflation factor

Q: How many genome-wide significant hits are there? (\(P<5x10^{-8}\)). If you use the same phenotype for the quantitative trait GWAS, how does this compare?.

The binary version of traits such as BMI were created by thresholding the data into cases and controls (i.e. BMI values above a certain value were cases, otherwise they were controls). What do you think about thresholding quantitative traits into discrete categories?

# Load the qqman package
library(qqman)

# Load our GWAS results
gwasResults <- read.table('binary.assoc', head=T)

# Calculate genomic inflation factor
qchisq(1-median(gwasResults$P),1)/qchisq(0.5,1)

## [1] 1.013126

# The expected value of the genomic inflation factor is 1.0.  

# We are going to plot the manhattan and QQ-plots and then save them to a png. This allows us to open the plots locally when working on a cluster
pdf("GWAS_visualisation_binary_trait.pdf")
par(mfrow = c(2, 1))

# Plot the manhattan plot
manhattan(gwasResults)

# Plot the qq-plot
qq(gwasResults$P)
dev.off()

## quartz_off_screen 
##                 2

We can download the plots locally and open them.

Adding covariates

If you wanted to include covariates in analyses using binary traits, then you can use the --logistic command to perform a logistic regression. Example code is given below for reference, it takes a long time to run.

plink --bfile /data/module1/5_gwasPrac/gwasQC --logistic --pheno /data/module1/5_gwasPrac/BMI_binary1.phen --covar /data/module1/5_gwasPrac/covariateFiltered.cov --1 --out logisticCovar

Part 3: Including relatives in GCTA for a quantitative trait

Up until now we have been using PLINK and (nominally) unrelated individuals for our GWAS. Any closely related individuals were excluded from the analysis. In circumstances where there are too many close relatives in the dataset and excluding individuals is not desirable (i.e. it would substantially reduce the power of the GWAS), we can use a different GWAS model in GCTA. The objective of this part of the practical is to run a GWAS using a sample where relatives are included. The model we will fit is:

GCTA

We will be using GCTA to run our GWAS with related individuals.

Similar to PLINK, the basic command is : gcta64 --bfile <data prefix> --command

The GWAS will be performed for a quantitative trait however there are options for binary traits as well.

We will generate a sparse-GRM implemented in GCTA to account for the covariance between ‘close’ relatives (i.e. \(\pi > 0.05\)).

Data

Data for this practical is found in the directory: /data/module1/5_gwasPrac/

The genotype & phenotype files are:

data.bed → binary file containing all genotypes

data.bim → information about SNP markers

data.fam → information about individuals

simData2.phen → phenotype file

Lets look at the GWAS data. Q: How many individuals & SNPs in the dataset?

Use the following commands at the command line.

head /data/module1/5_gwasPrac/data.fam
head /data/module1/5_gwasPrac/data.bim
wc -l /data/module1/5_gwasPrac/data.fam
wc -l /data/module1/5_gwasPrac/data.bim

GRM

We created a GRM file using GCTA for these individuals using the --make-grm-bin option in GCTA. It produces a GRM in binary format with the following files:

data2.grm.bin → binary file containing genomic relationship matrix

data2.grm.N.bin → binary file with number of SNP markers used in GRM

data2.grm.id → individual ID’s corresponding to grm files

We looked at the g-zipped version of these files this morning. Q: How many individuals are included in the GRM? Does this match the number of individuals in the bfiles?

e.g. at the command line

wc -l  /data/module1/5_gwasPrac/data2.grm.id

To get a better understanding of the grm we are going to use it to identify ‘relative’ and an ‘unrelated’ set. We can do this using GCTA at the command line with the --grm-singleton flag, e.g.

gcta64 --grm /data/module1/5_gwasPrac/data2 --grm-singleton 0.05 --out relatives

The relationship threshold in this example is 0.05, which is the value commonly used in human genetics to define ‘unrelated’ individuals. The GCTA option produces three files:

relatives.family.txt → all relative pairs and their relationship value

relatives.singleton.txt → all ‘singletons’, individuals with no relatives in the dataset

relatives.log → log file

We are running this command just to see what the data is and how many relatives we have in our dataset. Have a look at the output.

Q: How many individuals do you have in each set?

Making a sparse GRM

Use GCTA at the command line with the --make-bK-sparse option to make the sparse GRM. This produces a ‘sparse GRM’, i.e. the sparse GRM only contains values \(> 0.05\). GCTA assumes all the other values (i.e. those \(<0.05\)) are zero.

gcta64 --grm /data/module1/5_gwasPrac/data2 --make-bK-sparse 0.05 --out data2_sparse

Three files produced:

data2_sparse.grm.sp → index and relationships over 0.05 from GRM

data2_sparse.grm.id → corresponding ID file

data2_sparse.grm.log → log file

Use R and unix to investigate your output.

Q: Why are the number of lines in the sparse GRM different from the relatives.families.txt file obtained previously?

Running fastGWA

Use GCTA at the command line with the --fastGWA-mla and --grm-sparse flags to run the GWAS, e.g.

gcta64 --bfile /data/module1/5_gwasPrac/data --fastGWA-mlm --grm-sparse data2_sparse --pheno /data/module1/5_gwasPrac/simData2.phen --out assocSparse

This writes regression statistic results to assocSparse.fastGWA

Using the R code provided previously as a guide, can you create a Manhattan and QQ-plots for this GWAS?

UQWS Genetic Mapping, session 5 - GWAS

Kathryn Kemper and Alesha Hatton

18/6/2024

Objective

Part 1: GWAS of a unrelated individuals in PLINK for a quantitative trait

Data

GWAS for a quantitative trait

Adding covariates

Part 2: GWAS of unrelated individuals in PLINK for binary trait

Part 3: Including relatives in GCTA for a quantitative trait

GCTA

Data

GRM

Making a sparse GRM

Running fastGWA