In this practical session you will learn some basic concepts associated with preparing genotype data for GWAS. We divide the cleaning of genotype data into two parts.
Part 1: removing individuals with poor quality data.
Part 2: removing SNP markers that have substandard genotyping performance.
You you will need to do your analysis on the server, save your plots and download to view. Do not download the data used in this practical
We use statistical measures to detect poor quality data and remove it. Performing the per-individual steps first prevents individuals with poor quality genotypes having an undue influence on the removal of SNP markers in later steps.
The data can be found in the directory
/data/module1/4_cleaningPrac/
on the cluster. Throughout
the practical we will use PLINK
(v1.9) at the command line,
and investigate the PLINK
output using R
. For
more information on the PLINK
commands see https://www.cog-genomics.org/plink/.
Since we will use commands in both unix (at the command line) and in
R
, we have colour coded background of the the code chunks
to help you distinguish unix code from R code.
Blue code chunks are used to denote command line code
Grey chunks are used to denote R code
The genetic data is in PLINK
format and consists of:
gwas.bed
→ binary file containing all genotypes
gwas.bim
→ information about SNP markers
gwas.fam
→ information about individuals
These 3 files come as a set and are often collectively referred to as
bfiles. They are the input for the plink --bfile
option. These are the files we want to perform the QC on.
There are five steps to removing individuals with poor quality data
We will run each of these steps individually to examine what is
happening with our data. In practice, we often remove individuals that
fail QC steps 1-3 simultaneously. The code to do this and make a new
(individual QC’d) set of PLINK
files will be provided at
the end.
We want to identify any sample with “high” (~5%) missingness to remove from further analysis. High missingness generally reflects poor DNA quality
For a SNP array, large numbers of missing SNP calls for an individual indicate intensity measures are failing to fall in any genotype clusters
For sequencing data, large numbers of missing SNP calls indicate low number of reads covering regions of the genome
Samples with a high missingness also tend to have higher genotyping error in the genotypes that are called.
This is particularly important when using a case-control design - cases and controls are often recruited separately and this can cause ‘batch effects’ often confounded with case-control status; causing false-positive results.
We can investigate the rate of missingness using the following
PLINK
command, e.g.
plink --bfile /data/module1/4_cleaningPrac/gwas --missing
This will produce output on the screen similar to that shown below.
The output from this command is two files:
plink.imiss
→ file containing missingness per
individual.
plink.lmiss
→ file containing missingness per SNP, we will
come back to this.
Let’s investigate the plink.imiss
file using the
following unix command:
head plink.imiss
The columns are:
FID
→ family ID, usually this is not used and the same as
IID
.
IID
→ individual ID, individual identifier.
MISS_PHENO
→ is the phenotype missing in the
.fam
file? Y or N.
N_MISS
→ number of missing genotypes.
N_GENO
→ total number of genotypes.
F_MISS
→ missingness rate, i.e. N_MISS/N_GENO.
Now we can use R
to determine how many individuals have
>5% missingness.
# read in the data
miss = read.table("plink.imiss", header=T)
# make a plot and save it for later
jpeg(file="missingness.jpg")
hist(miss$F_MISS, breaks=100)
dev.off()
## quartz_off_screen
## 2
# How many individuals have missingness >0.05%?
sum(miss$F_MISS>0.05)
## [1] 18
You can download your plot to view it. Do you understand what the plot is showing?
Next we want to identify any sample with outlying heterozygosity to remove from further analysis
The proportion of homozygous (or inversely heterozygous) genotypes across an individual’s genome (excluding sex chromosomes) can detect several issues with genotyping, e.g.
Average heterozygosity correlates with genotype missingness such that samples with high missingness tend to have lower average heterozygosity. Lower-than-average heterozygosity can also reflect inbreeding (i.e. an individual that has 1st or 2nd degree relatives as parents) but this is not typical in human data, and we are also looking for extreme cases unlikely to be caused by inbreeding.
High heterozygosity can result from sample contamination, where multiple samples are accidentally genotyped on a single array.
Average heterozygosity depends on population and genotyping platform, therefore we need to determine a threshold for each sample. Typically, individuals whose rates are more than a few standard deviations (sd) from the mean heterozygosity rate are removed.
We can investigate the rates of homozygosity using the following
PLINK
command:
plink --bfile /data/module1/4_cleaningPrac/gwas --het
This produces a file plink.het
. Have a look at the file
using the head
command in unix.
head plink.het
The columns are:
FID
→ family ID, usually this is not used and the same as
IID
.
IID
→ individual ID, individual identifier.
O(HOM)
→ observed number of homozygous positions.
E(HOM)
→ (plink calculated) expected number of homozygous
positions.
N(NM)
→ number of non-missing positions.
F
→ (plink calculated) inbreeding coefficient.
We can now use R
to determine if there any outliers that
should be removed. We need to calculate the homozygosity
[i.e. O(HOM)/N(NM)] and define a threshold to remove individuals.
# read in the data and calculate homozygosity rate
het = read.table("plink.het", header = T)
het$hRate = het$O.HOM./het$N.NM.
# make a plot to download to your local machine
jpeg(file="homozygosity.jpg")
hist(het$hRate, breaks=100)
dev.off()
## quartz_off_screen
## 2
# normalise the homozygosity rate to N(0,1) and
# determine how many individuals have homozygosity >5SD from the mean
het$stdRate = scale(het$hRate)
sum(abs(het$stdRate)>5)
## [1] 22
# save the ids of these individuals for later
write.table(file="excessHomozygosity.list",het[abs(het$stdRate)>5,c("FID","IID")],quote=FALSE, row.names=FALSE, col.names=FALSE)
Next we will check if an individual is (genetically) male or female, and compare this to their self-reported sex. Determining sex is relatively straightforward from genotype data
Males have a single copy of the X chromosome so usually do not have heterozygous genotypes on the X chromosome.
Only a small number of heterozygotes may be attributable to genotyping errors.
We can use the --check-sex
option in PLINK
to check the genetic and reported sex. For example,
plink --bfile /data/module1/4_cleaningPrac/gwas --check-sex
What does the output say?
Error: --check-sex/--impute-sex requires at least one polymorphic X chromosome
Unfortunately there is no sex chromosome data in this dataset, and so we can not perform this check
We have examined whether there are any individuals that need to be removed. Next we will remove those from our bfiles. Remember specific thresholds you use when working with real data will depend on the data itself!
We will use the excessHomozygosity.list
created
previously with the --mind 0.05
command in
PLINK
to filter the individuals and make a new set of bed
files. For example, try the following code:
plink \
--make-bed \
--bfile /data/module1/4_cleaningPrac/gwas \
--remove excessHomozygosity.list \
--mind 0.05 --out gwas_cleanInd
Note also that we have made the PLINK
command easier to
read by using the \
. This is a unix command for line
continuation.
Q: How many individuals were removed in total? Does this match up with how many you expected to be removed?
How many individuals are left?
In GWAS we need to either remove or control for genetic relatedness.
Even distantly related samples can bias GWAS results if not properly accounted for.
Even this small bias is important when considering that GWAS typically have thousands or millions of statistical tests.
For any pair with a genomic relationship > 0.05, remove the
one with the lowest genotyping rate. Alternatively, if there are family
groups, a program such as GCTA
will remove selectively
removes individuals to maximize the remaining sample size rather than
doing it at random. See the --grm-cutoff 0.05
option on the
GCTA
website for details.
Here we will remove related individuals using PLINK
and
the option --rel-cutoff 0.05
.
We’re going to skip this step as it takes a LONG time to run (~30min). However, there is an example of output from this step below.
plink \
--bfile gwas_cleanInd \
--rel-cutoff 0.05 \
--out gwas_cleanInd_rel0.05
Look at the ouput. How many individuals were removed from the
analysis by PLINK
?. An alternative to removing related
individuals is to fit a GWAS model which accounts for related
individuals. We will look into fitting such models using
GCTA
in the next prac.
Typically we also remove individuals who are ancestry outliers. A standard approach is to:
perform a PCA on the genotypes of a diverse set of individuals with ‘known’ ancestry, such as the 1000Genomes dataset.
Project your samples onto PCs.
Remove outliers
The procedure to do this is described in the lecture notes and on the GCTA website. It requires a reference dataset such as the 1000G dataset. We’re going to skip this step also as it takes a LONG time to run.
The second stage of genotype cleaning involves looking at individual SNPs to determine genotype accuracy.
There are four basic steps to removing SNPs with poor quality data
First, we will walk though 1-3 individually to examine what is happening with our data. As before, these steps can be run simultaneously to remove SNPs from the bfile and the code to do this will be provided at the end.
Lastly, we will compare allele frequency to known values.
Aim: remove any SNP with > 5% missing data..
Missingness is an indicator of poor quality SNPs,
caused by poor separation of genotyping clusters on arrays, or low number of sequence reads over a region.
these conditions also make the error rate in the non-missing genotypes higher.
For example, below is an example of a SNP from a genotyping array with poorly separated genotype clusters. We can see the color intensities (x- and y-axes) overlap for the CC (red), CT (green) and TT (blue) genotypes, leading to a high number of missing genotypes (grey).
Run the --missing
option in PLINK
on your
cleaned up genotype file to calculate the missingness rate per SNP.
plink --bfile gwas_cleanInd --missing --out gwas_cleanInd
Examine the output file gwas_cleanInd.lmiss
head gwas_cleanInd.lmiss
F_MISS
is the data column we are interested in, it is
the rate of missingness (i.e. N_MISS/N_GENO). Use R
to
determine if there any SNPs that should be removed due to
missingness?.
Additional checks are important for case-control studies.
Be aware that:
Aim: remove SNPs that show a significant deviation from HWE (\(P<1x10^{-6}\)).
*Humans are a large population, so genotype frequencies tend to satisfy HWE…
We can use the --het
option in PLINK
to
calculate HWE statistics on all SNPs.
plink --bfile gwas_cleanInd --hardy
Examine the output file plink.hwe
head plink.hwe
Now use R
to determine how many SNP will be removed
with \(P_{HEW}<1x10^{-6}\)?
Aim: remove rare variants. We will use a MAF threshold of 1%.
For a SNP with minor allele frequency (\(q\)) of 1%, the frequency of the minor homozygote is \(q^2\) or 0.01% under HWE.
Thus we need 10,000 individuals if we expect to see 1 of the rare homozygous genotypes.
To calculate allele frequencies in PLINK
use the
--freq
option, e.g.
plink --bfile gwas_cleanInd --freq
The output from this command is plink.frq
. We can use
the head
command in unix to take a look at the first few
lines of the file.
head plink.frq
The relevant column in the output is MAF
or minor allele
frequency. This is the frequency of the A1 allele. Use R
to
determine the number of SNP with a minor allele frequency below
1%?.
Luckily, we can run all these filters on our bfiles using one command
in PLINK
. For example, the code below does the cleaning of
SNPs with low genotyping rate, HWE issues and low MAF.
# This cleaning step should be performed on the bfile that has already been cleaned for individuals
plink \
--bfile gwas_cleanInd \
--maf 0.01 --geno 0.05 --hwe 0.000001 \
--make-bed \
--out gwas_cleanInd_cleanSNP
Q: How many SNPs were removed? Does this match up with how many you expected to be removed ?
How many SNPs are left?
The final step is to compare allele frequencies from our sample with a reference population. We have good allele frequency estimates for genetic variants in a range of populations. Differences in allele frequency between populations can indicate poor quality genotypes, it can also detect strand alignment and other issues.
Reference allele frequencies are given in the file below.
/data/module1/downloads/4_cleaningPrac/reference_allele_frequencies.txt
.
Compare these to frequencies to those calculated in previous step in R
dir="/data/module1/downloads/4_cleaningPrac/"
# Read in reference data
reference = read.table(paste0(dir,'reference_allele_frequencies.txt'), header=T)
dim(reference) # 298,255 SNPs
## [1] 298255 4
head(reference)
## SNP A1 A2 MAF
## 1 rs3131972 1 2 0.158
## 2 rs3115850 1 2 0.142
## 3 rs12562034 1 2 0.100
## 4 rs4040617 1 2 0.871
## 5 rs4970383 1 2 0.249
## 6 rs950122 1 2 0.196
# Read in allele frequencies from our data
freq = read.table('plink.frq', header=T)
dim(freq) # 298,255 SNPs
## [1] 298255 6
head(freq)
## CHR SNP A1 A2 MAF NCHROBS
## 1 1 rs3131972 1 2 0.1568 23388
## 2 1 rs3115850 1 2 0.1390 21884
## 3 1 rs12562034 1 2 0.1025 23468
## 4 1 rs4040617 2 1 0.1277 23436
## 5 1 rs4970383 1 2 0.2474 23476
## 6 1 rs950122 1 2 0.1978 23228
# check the SNPs have been provided in the same order, and that reference always freq A1 allele
sum(freq$SNP==reference$SNP) # yes, can compare directly
## [1] 298255
sum(reference$A1==1) # yes, AF always for A1 allele
## [1] 298255
# We want to compare allele frequencies but first we need to align reference alleles
# confusingly, both files are labelled 'MAF'
# --> in fact both are calculating the AF of the 'A1' allele. PLINK does this, be warned!
# so in the reference file, allele frequency is always for the A1==1 allele
# if A1==1 in our sample file then no changes are needed
# but if A1==2 in our sample file then we take 1-MAF
freq$A1.frq = freq$MAF
freq$A1.frq[freq$A1=="2"] = 1 - freq$MAF[freq$A1=="2"]
# We are going to save a plot as a png. This allows us to open the plots locally when working on a cluster
jpeg(filename="Comparing_MAF.jpeg")
plot(reference$MAF ~ freq$A1.frq)
dev.off()
## quartz_off_screen
## 2
Typically SNPs that don’t match the reference would need to be excluded. Outlying SNPs are judged relative to the specific study, where SNPs whose allele frequency is more than a few standard deviations from the mean are excluded.
Once the SNPs to exclude have been identified, we can remove these
SNPs and make new bfiles using --exclude
command in
PLINK
.