Objective

In this practical session you will learn some basic concepts associated with preparing genotype data for GWAS. We divide the cleaning of genotype data into two parts.

Part 1: removing individuals with poor quality data.

Part 2: removing SNP markers that have substandard genotyping performance.

You you will need to do your analysis on the server, save your plots and download to view. Do not download the data used in this practical

We use statistical measures to detect poor quality data and remove it. Performing the per-individual steps first prevents individuals with poor quality genotypes having an undue influence on the removal of SNP markers in later steps.

The data can be found in the directory /data/module1/4_cleaningPrac/ on the cluster. Throughout the practical we will use PLINK (v1.9) at the command line, and investigate the PLINK output using R. For more information on the PLINK commands see https://www.cog-genomics.org/plink/.

Since we will use commands in both unix (at the command line) and in R, we have colour coded background of the the code chunks to help you distinguish unix code from R code.

Blue code chunks are used to denote command line code

Grey chunks are used to denote R code

Data

The genetic data is in PLINK format and consists of:

gwas.bed → binary file containing all genotypes

gwas.bim → information about SNP markers

gwas.fam → information about individuals

These 3 files come as a set and are often collectively referred to as bfiles. They are the input for the plink --bfile option. These are the files we want to perform the QC on.

Part 1: Per Individual QC

There are five steps to removing individuals with poor quality data

removal of individuals with excess missing genotypes
removal of individuals with excess homozygosity
remove of samples showing a discordant sex
removal of related of relatives or duplicate samples
removal of ancestry outliers

We will run each of these steps individually to examine what is happening with our data. In practice, we often remove individuals that fail QC steps 1-3 simultaneously. The code to do this and make a new (individual QC’d) set of PLINK files will be provided at the end.

Excessive Missing Genotypes

We want to identify any sample with “high” (~5%) missingness to remove from further analysis. High missingness generally reflects poor DNA quality

For a SNP array, large numbers of missing SNP calls for an individual indicate intensity measures are failing to fall in any genotype clusters
For sequencing data, large numbers of missing SNP calls indicate low number of reads covering regions of the genome
Samples with a high missingness also tend to have higher genotyping error in the genotypes that are called.
This is particularly important when using a case-control design - cases and controls are often recruited separately and this can cause ‘batch effects’ often confounded with case-control status; causing false-positive results.

We can investigate the rate of missingness using the following PLINK command, e.g.

plink --bfile /data/module1/4_cleaningPrac/gwas --missing

This will produce output on the screen similar to that shown below.

The output from this command is two files:

plink.imiss → file containing missingness per individual.
plink.lmiss → file containing missingness per SNP, we will come back to this.

Let’s investigate the plink.imiss file using the following unix command:

head plink.imiss

The columns are:
FID → family ID, usually this is not used and the same as IID.
IID → individual ID, individual identifier.
MISS_PHENO → is the phenotype missing in the .fam file? Y or N.
N_MISS → number of missing genotypes.
N_GENO → total number of genotypes.
F_MISS → missingness rate, i.e. N_MISS/N_GENO.

Now we can use R to determine how many individuals have >5% missingness.

# read in the data
miss = read.table("plink.imiss", header=T)

# make a plot and save it for later
jpeg(file="missingness.jpg")
hist(miss$F_MISS, breaks=100)
dev.off()

## quartz_off_screen 
##                 2

# How many individuals have missingness >0.05%?
sum(miss$F_MISS>0.05)

## [1] 18

You can download your plot to view it. Do you understand what the plot is showing?

Excessive homozygosity

Next we want to identify any sample with outlying heterozygosity to remove from further analysis

The proportion of homozygous (or inversely heterozygous) genotypes across an individual’s genome (excluding sex chromosomes) can detect several issues with genotyping, e.g.
- Average heterozygosity correlates with genotype missingness such that samples with high missingness tend to have lower average heterozygosity. Lower-than-average heterozygosity can also reflect inbreeding (i.e. an individual that has 1st or 2nd degree relatives as parents) but this is not typical in human data, and we are also looking for extreme cases unlikely to be caused by inbreeding.
- High heterozygosity can result from sample contamination, where multiple samples are accidentally genotyped on a single array.
Average heterozygosity depends on population and genotyping platform, therefore we need to determine a threshold for each sample. Typically, individuals whose rates are more than a few standard deviations (sd) from the mean heterozygosity rate are removed.

We can investigate the rates of homozygosity using the following PLINK command:

plink --bfile /data/module1/4_cleaningPrac/gwas --het

This produces a file plink.het. Have a look at the file using the head command in unix.

head plink.het

The columns are:
FID → family ID, usually this is not used and the same as IID.
IID → individual ID, individual identifier.
O(HOM) → observed number of homozygous positions.
E(HOM) → (plink calculated) expected number of homozygous positions.
N(NM) → number of non-missing positions.
F → (plink calculated) inbreeding coefficient.

We can now use R to determine if there any outliers that should be removed. We need to calculate the homozygosity [i.e. O(HOM)/N(NM)] and define a threshold to remove individuals.

# read in the data and calculate homozygosity rate
het = read.table("plink.het", header = T)
het$hRate = het$O.HOM./het$N.NM.

# make a plot to download to your local machine
jpeg(file="homozygosity.jpg")
hist(het$hRate, breaks=100)
dev.off()

## quartz_off_screen 
##                 2

# normalise the homozygosity rate to N(0,1) and 
# determine how many individuals have homozygosity >5SD from the mean
het$stdRate = scale(het$hRate)
sum(abs(het$stdRate)>5)

## [1] 22

# save the ids of these individuals for later
write.table(file="excessHomozygosity.list",het[abs(het$stdRate)>5,c("FID","IID")],quote=FALSE, row.names=FALSE, col.names=FALSE)

Discordant Sex

Next we will check if an individual is (genetically) male or female, and compare this to their self-reported sex. Determining sex is relatively straightforward from genotype data

Males have a single copy of the X chromosome so usually do not have heterozygous genotypes on the X chromosome.
- The exception is a small pseudo-autosomal region at the end of the X chromosome may show some heterozygosity (with the Y chromosome)
Only a small number of heterozygotes may be attributable to genotyping errors.

We can use the --check-sex option in PLINK to check the genetic and reported sex. For example,

plink --bfile /data/module1/4_cleaningPrac/gwas --check-sex

What does the output say?

Error: --check-sex/--impute-sex requires at least one polymorphic X chromosome

Unfortunately there is no sex chromosome data in this dataset, and so we can not perform this check

Putting the individual QC steps together

We have examined whether there are any individuals that need to be removed. Next we will remove those from our bfiles. Remember specific thresholds you use when working with real data will depend on the data itself!

We will use the excessHomozygosity.list created previously with the --mind 0.05 command in PLINK to filter the individuals and make a new set of bed files. For example, try the following code:

plink \
   --make-bed \
   --bfile /data/module1/4_cleaningPrac/gwas \
   --remove excessHomozygosity.list \
   --mind 0.05 --out gwas_cleanInd

Note also that we have made the PLINK command easier to read by using the \. This is a unix command for line continuation.

Q: How many individuals were removed in total? Does this match up with how many you expected to be removed?

How many individuals are left?

Remove relative or duplicate samples

In GWAS we need to either remove or control for genetic relatedness.

Even distantly related samples can bias GWAS results if not properly accounted for.
- e.g. if we have two related cases in a case-control analysis, their genotypes being on average more similar to each other than the rest of the cohort will provide a slight bias to the estimate of the allele frequency in cases and its associated standard error.
Even this small bias is important when considering that GWAS typically have thousands or millions of statistical tests.
For any pair with a genomic relationship > 0.05, remove the one with the lowest genotyping rate. Alternatively, if there are family groups, a program such as GCTA will remove selectively removes individuals to maximize the remaining sample size rather than doing it at random. See the --grm-cutoff 0.05 option on the GCTA website for details.

Here we will remove related individuals using PLINK and the option --rel-cutoff 0.05.

We’re going to skip this step as it takes a LONG time to run (~30min). However, there is an example of output from this step below.

plink \
--bfile gwas_cleanInd \
--rel-cutoff 0.05 \
--out gwas_cleanInd_rel0.05

Look at the ouput. How many individuals were removed from the analysis by PLINK?. An alternative to removing related individuals is to fit a GWAS model which accounts for related individuals. We will look into fitting such models using GCTA in the next prac.

Remove Ancestry outliers

Typically we also remove individuals who are ancestry outliers. A standard approach is to:

perform a PCA on the genotypes of a diverse set of individuals with ‘known’ ancestry, such as the 1000Genomes dataset.
Project your samples onto PCs.
Remove outliers

The procedure to do this is described in the lecture notes and on the GCTA website. It requires a reference dataset such as the 1000G dataset. We’re going to skip this step also as it takes a LONG time to run.

Part 2: Per SNP QC

The second stage of genotype cleaning involves looking at individual SNPs to determine genotype accuracy.

There are four basic steps to removing SNPs with poor quality data

removal of SNPs with excess missing genotypes
removal of SNPs that deviate from Hardy-Weinberg equilibrium
removal of SNPs with low minor allele frequency
comparing allele frequency to known values

First, we will walk though 1-3 individually to examine what is happening with our data. As before, these steps can be run simultaneously to remove SNPs from the bfile and the code to do this will be provided at the end.

Lastly, we will compare allele frequency to known values.

Excess Missing Genotypes

Aim: remove any SNP with > 5% missing data..

Missingness is an indicator of poor quality SNPs,

caused by poor separation of genotyping clusters on arrays, or low number of sequence reads over a region.
these conditions also make the error rate in the non-missing genotypes higher.

For example, below is an example of a SNP from a genotyping array with poorly separated genotype clusters. We can see the color intensities (x- and y-axes) overlap for the CC (red), CT (green) and TT (blue) genotypes, leading to a high number of missing genotypes (grey).

Run the --missing option in PLINK on your cleaned up genotype file to calculate the missingness rate per SNP.

plink --bfile gwas_cleanInd --missing --out gwas_cleanInd

Examine the output file gwas_cleanInd.lmiss

head gwas_cleanInd.lmiss

F_MISS is the data column we are interested in, it is the rate of missingness (i.e. N_MISS/N_GENO). Use R to determine if there any SNPs that should be removed due to missingness?.

Additional checks are important for case-control studies.

Be aware that:

SNPs that have different rates of missingness between cases and controls can cause bias and false-positive results. These SNPs should be removed.
Missingness can be non-random with respect to the underlying genotype. For example, a structural variant nearby a genotyped SNP in cases causes cases to have a higher proportion of missing genotypes.

Deviation From Hardy-Weinberg Equilibrium (HWE)

Aim: remove SNPs that show a significant deviation from HWE (\(P<1x10^{-6}\)).

*Humans are a large population, so genotype frequencies tend to satisfy HWE…

Poor genotype calling can result in genotype frequencies deviating from Hardy-Weinberg equilibrium, for example a SNP with only homozygous calls.
Poor cluster separation in arrays and lack of heterozygous calls in sequencing can also cause deviations from HWE.

We can use the --het option in PLINK to calculate HWE statistics on all SNPs.

plink --bfile gwas_cleanInd --hardy

Examine the output file plink.hwe

head plink.hwe

Now use R to determine how many SNP will be removed with \(P_{HEW}<1x10^{-6}\)?

Low Minor Allele Frequency (MAF)

Aim: remove rare variants. We will use a MAF threshold of 1%.

For a SNP with minor allele frequency (\(q\)) of 1%, the frequency of the minor homozygote is \(q^2\) or 0.01% under HWE.
Thus we need 10,000 individuals if we expect to see 1 of the rare homozygous genotypes.

To calculate allele frequencies in PLINK use the --freq option, e.g.

plink --bfile gwas_cleanInd --freq

The output from this command is plink.frq. We can use the head command in unix to take a look at the first few lines of the file.

head plink.frq

The relevant column in the output is MAF or minor allele frequency. This is the frequency of the A1 allele. Use R to determine the number of SNP with a minor allele frequency below 1%?.

Running these steps in one go

Luckily, we can run all these filters on our bfiles using one command in PLINK. For example, the code below does the cleaning of SNPs with low genotyping rate, HWE issues and low MAF.

# This cleaning step should be performed on the bfile that has already been cleaned for individuals
plink \
   --bfile gwas_cleanInd \
   --maf 0.01 --geno 0.05 --hwe 0.000001 \
   --make-bed \
   --out gwas_cleanInd_cleanSNP

Q: How many SNPs were removed? Does this match up with how many you expected to be removed ?

How many SNPs are left?

Comparing to Known Allele Frequencies

The final step is to compare allele frequencies from our sample with a reference population. We have good allele frequency estimates for genetic variants in a range of populations. Differences in allele frequency between populations can indicate poor quality genotypes, it can also detect strand alignment and other issues.

Reference allele frequencies are given in the file below. /data/module1/downloads/4_cleaningPrac/reference_allele_frequencies.txt.

Compare these to frequencies to those calculated in previous step in R

dir="/data/module1/downloads/4_cleaningPrac/"

# Read in reference data
reference = read.table(paste0(dir,'reference_allele_frequencies.txt'), header=T)
dim(reference)    # 298,255 SNPs

## [1] 298255      4

head(reference)

##          SNP A1 A2   MAF
## 1  rs3131972  1  2 0.158
## 2  rs3115850  1  2 0.142
## 3 rs12562034  1  2 0.100
## 4  rs4040617  1  2 0.871
## 5  rs4970383  1  2 0.249
## 6   rs950122  1  2 0.196

# Read in allele frequencies from our data
freq = read.table('plink.frq', header=T)
dim(freq)         # 298,255 SNPs

## [1] 298255      6

head(freq)

##   CHR        SNP A1 A2    MAF NCHROBS
## 1   1  rs3131972  1  2 0.1568   23388
## 2   1  rs3115850  1  2 0.1390   21884
## 3   1 rs12562034  1  2 0.1025   23468
## 4   1  rs4040617  2  1 0.1277   23436
## 5   1  rs4970383  1  2 0.2474   23476
## 6   1   rs950122  1  2 0.1978   23228

# check the SNPs have been provided in the same order, and that reference always freq A1 allele
sum(freq$SNP==reference$SNP)    # yes, can compare directly

## [1] 298255

sum(reference$A1==1)        # yes, AF always for A1 allele

## [1] 298255

# We want to compare allele frequencies but first we need to align reference alleles
# confusingly, both files are labelled 'MAF' 
# --> in fact both are calculating the AF of the 'A1' allele. PLINK does this, be warned!
# so in the reference file, allele frequency is always for the A1==1 allele
# if A1==1 in  our sample file then no changes are needed
# but if A1==2 in our sample file then we take 1-MAF
freq$A1.frq = freq$MAF
freq$A1.frq[freq$A1=="2"] = 1 - freq$MAF[freq$A1=="2"]

# We are going to save a plot as a png. This allows us to open the plots locally when working on a cluster
jpeg(filename="Comparing_MAF.jpeg")
plot(reference$MAF ~ freq$A1.frq)
dev.off()

## quartz_off_screen 
##                 2

Typically SNPs that don’t match the reference would need to be excluded. Outlying SNPs are judged relative to the specific study, where SNPs whose allele frequency is more than a few standard deviations from the mean are excluded.

Once the SNPs to exclude have been identified, we can remove these SNPs and make new bfiles using --exclude command in PLINK.

UQWS Genetic Mapping, session 4 - Data Cleaning

Kathryn Kemper and Alesha Hatton

18/6/2024