Genotype data are commonly generated on genotyping chips. In the lab, DNA libraries are bound to the chips and processed in a scanning machine that reads color signals; the signals are then converted into raw PLINK files using GenomeStudio. A chip usually carries half a million to one million probes. Sample size varies, and it affects how we optimize the QC methods and the computational arrangements.
We have developed an in-house pipeline for processing raw PLINK data all the way to profiling the PGS of traits of interest. In this practical, we will go through the pipeline with a small example data set, without worrying about the computational arrangements of real-life research.
The first important tool we will use in this pipeline is PLINK. Its many functions can be browsed in its new manual or old manual.
We will first take a look at the data to find its sample size and number of SNPs.
Raw data in PLINK format consists of a MAP file and a PED file.
The MAP file describes one marker per line and must contain exactly 4 columns:
Chromosome (1-22, X, Y, or 0 if unplaced)
rs# or SNP identifier
Genetic distance (morgans)
Base-pair position (bp units)
1 rs12562034 0 768448
1 rs4040617 0 779322
1 rs4970383 0 838555
1 rs950122 0 846864
The PED file holds the genotypes of each individual; its first six columns are mandatory:
Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype
2176306 2176306 0 0 2 -9 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2 2 2
3162383 3162383 0 0 1 -9 2 2 2 1 1 2 2 2 2 1 2 2 2 1 2 2 2
5816422 5816422 0 0 2 -9 2 2 1 1 1 2 1 2 2 1 2 2 2 1 2 2 2
5350254 5350254 0 0 2 -9 2 2 1 1 2 2 2 2 1 1 2 2 1 1 2 2 1
We can get the sample size and the number of SNPs simply by counting the rows of the PED and MAP files, respectively.
# for convenience
data=target
inpath=/data/module5/sim/
wc -l ${inpath}/${data}.ped
wc -l ${inpath}/${data}.map
We usually convert the data to binary format to save space and speed up computation.
# check where you are, and set your working directory
pwd
mkdir tmp
cd tmp/
plink \
    --file ${inpath}/${data} \
    --make-bed \
    --out ${data}
The binary format (.bed/.bim/.fam) is explained well on the PLINK manual page.
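The conversion produces three files: ${data}.bed (binary genotypes), ${data}.bim (one row per SNP) and ${data}.fam (one row per individual). As a quick sanity check, their row counts should match what we counted from the MAP and PED files; a minimal sketch:
# each .bim row is one SNP and each .fam row is one individual,
# so these counts should match the MAP/PED counts above
wc -l ${data}.bim
wc -l ${data}.fam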
We can use PLINK to generate summary stats at per-SNP and per-individual levels.
plink --bfile ${data} --missing --out ${data}
plink --bfile ${data} --freq --out ${data}
plink --bfile ${data} --hardy --out ${data}
Then we can use R to read in and make histograms of the summary statistics.
filename <- "target"

imiss <- read.table(paste0(filename,".imiss"), header=T, check.names=F)
lmiss <- read.table(paste0(filename,".lmiss"), header=T, check.names=F)
freq  <- read.table(paste0(filename,".frq"),   header=T, check.names=F)
hwe   <- read.table(paste0(filename,".hwe"),   header=T, check.names=F)
# You can run the plotting commands interactively in Jupyter R,
# or un-comment the png and dev.off lines to save the picture.
#png(paste0(filename,"_Quality_of_genotype.png"), type="cairo")
par(mfrow=c(2,2))
hist(1-imiss$F_MISS, breaks="sturges",main="Individuals",col="tan",
xlab="Genotyping Rate", ylab="Number of Individuals")
hist(1-lmiss$F_MISS, breaks="sturges", main="SNPs", col="tan",
xlab="Genotyping Rate", ylab="Number of SNPs")
hist(hwe$P, breaks="sturges", main="HWE P-Value", col="tan",
xlab="HWE P-value", ylab="Number of SNPs")
hist(freq$MAF, breaks="sturges", main="MAF", col="tan",
xlab="MAF", ylab="Number of SNPs")
#dev.off()
The code generates a figure like the one below. It gives you a general idea of how many samples or SNPs will be excluded, but pay attention to the scale of the x-axis.
A very useful way to confirm that samples were not mixed up during handling is to predict sex from the heterozygosity of chromosome X and compare it with the sex recorded in the phenotype column of the fam file.
PLINK reports a PROBLEM when the two sexes do not match, or when the SNP data or pedigree data are ambiguous with regard to sex. More details are on the PLINK manual page.
We need chrX to do this check, but our example data has only autosomes, so we will skip this step in our practice.
# don't run this chunk with the practical data; it has no X chromosome
plink --bfile ${data} --check-sex --out ${data}
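If you do run --check-sex on data that includes chrX, PLINK writes a .sexcheck file with a STATUS column of OK or PROBLEM. A minimal sketch for listing the mismatching samples (again, not runnable on our practical data):
# list individuals whose predicted sex disagrees with the fam file
grep PROBLEM ${data}.sexcheck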
In the QC step, we will use PLINK to exclude SNPs and individuals with high missingness, SNPs with very low MAF, and SNPs that violate Hardy-Weinberg equilibrium.
Target data sets very often have a small sample size, like one or two hundred, and some cross-sample QC methods do not work well in that situation. Compared with the QC steps we have learnt for a GWAS data set, which usually has thousands of samples, we do less filtering here.
# We will keep only chromosome 22 in our practice to reduce computation time.
plink \
--bfile ${data} \
--mind 0.05 \
--geno 0.05 \
--hwe 0.000001 \
--maf 0.01 \
--chr 22 \
--make-bed \
--out ${data}_chr22
Parameters in the command:
--mind: missingness-per-individual threshold
--geno: missingness-per-SNP threshold
--hwe: Hardy-Weinberg equilibrium p-value threshold
--maf: minor allele frequency threshold
Take a look at the log file to see how many SNPs were removed by each filter.
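A quick way to pull those lines out of the log (a sketch; the exact wording of PLINK's log messages can differ between versions):
# show the lines reporting how many variants/people each filter removed
grep "removed" ${data}_chr22.log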
For imputation, we will align the SNPs to the reference human genome, which means making sure that the SNPs in our data and in the reference data have the same IDs and the same positions, and that the alleles are read from the same strand.
Our example data already has all SNPs aligned, so we will skip this section in the practice, but it is very important to check each aspect and process your data carefully in your own research.
To find out which genome build the SNP information in the bim file uses, we can pick a random SNP and check its position in dbSNP at NCBI.
For example, rs1807483 sits at chr22:15474749 in our bim file. Comparing this with its record in dbSNP, we can see that our bim file uses build GRCh37. We are happy with this build.
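To pull a SNP's record out of the bim file before looking it up in dbSNP:
# bim columns: chr, SNP ID, genetic distance, bp position, allele 1, allele 2
grep -w "rs1807483" ${data}.bim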
If you need to change the build of your data, one option is LiftOver; another is the R package bigsnpr with its function snp_modifyBuild.
A convenient option is to use a strand file with the update_build.sh tool in the next step.
Chip manufacturers do not always design probes on the positive strand, so we have to flip all SNPs on the minus strand to the positive strand to match the reference data.
To check whether your data has already been flipped, choose around 10 SNPs from the bim file and check their alleles in dbSNP, as you did when checking the genome build.
The Strand website has prepared strand files for many chips. If you can find your chip there, just download the strand file and use the update_build.sh tool to fix your data, as sketched below.
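If the tool follows its usual calling convention of PLINK stem, strand file, then output stem (this invocation and the strand file name are illustrative assumptions; check the script's own documentation for the exact usage):
# hypothetical example: update ${data} to build 37 using a downloaded strand file
update_build.sh ${data} GSA-24v3-0_A1-b37.strand ${data}_b37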
If you find that many SNPs in your data use customized IDs, in forms such as kgp1234, gsa.101 or chr1:10011, you can use the following R package to update the bim file.
## no need to run this chunk; it is just to show you the R package.
BiocManager::install("SNPlocs.Hsapiens.dbSNP144.GRCh37")
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
ref <- SNPlocs.Hsapiens.dbSNP144.GRCh37
Once the data are well aligned with the reference genome, we can use the PC loading and projection method in GCTA to predict the ancestry of the samples.
### Reference data
refpath="/data/module5/ref/"
pcref='1000G_phase3_20130502_combined_snpsonly'
### pick the set of SNPs shared between the reference and ${data} data
comm -12 <(awk '{print $2}' ${refpath}/${pcref}.05.bim | sort) \
<(awk '{print $2}' ${data}.bim | sort) > common.SNPs.txt
### generate GRM of reference data
gcta --bfile ${refpath}/${pcref}.05 \
--extract common.SNPs.txt \
--make-grm \
--out ${pcref}.05.common
### calculate PC of reference data
gcta --grm ${pcref}.05.common --pca 3 --out ${pcref}.05.common_pca3
### PC loading
gcta \
--bfile ${refpath}/${pcref}.05 \
--extract common.SNPs.txt \
--pc-loading ${pcref}.05.common_pca3 \
--out ${pcref}.05.common_pca3_snp_loading
### PC projection
gcta \
--bfile ${data} \
--extract common.SNPs.txt \
--project-loading ${pcref}.05.common_pca3_snp_loading 3 \
--out ${data}_05.common_pca3
# It took around 4 minutes to run all these scripts.
Taking the PC eigenvectors from the reference data and the target data, we can plot them in a figure using ggplot.
## this is the PC vector of the reference data:
refvec <- "1000G_phase3_20130502_combined_snpsonly.05.common_pca3.eigenvec"

## this is the PC vector of our data
pcvec <- "target_05.common_pca3.proj.eigenvec"

## this is the ancestry information of the samples in the reference data
popstr <- "/data/module5/ref/1000GP_Phase3.sample"

library(ggplot2)
str <- read.table(popstr, header=T)
ref <- read.table(refvec)
pc  <- read.table(pcvec)

ref$str <- str[match(ref$V2, str$ID), "GROUP"]
pc$str  <- "Samples"
colnames(ref) <- c("FID", "IID", "PC1", "PC2", "PC3", "str")
colnames(pc)  <- c("FID", "IID", "PC1", "PC2", "PC3", "str")
data <- rbind(ref, pc)

# you can reorder the categories shown in the legend with levels
data$str <- factor(data$str, levels = c("AFR", "AMR", "EAS", "EUR", "SAS", "Samples"))

# you can choose the color for each ancestry by defining cbPalette
cbPalette <- c("#999999", "#F0E442", "#E69F00", "#009E73", "#FC4E07", "#0072B2")
fig1 <- ggplot(data=data, aes(x=PC1, y=PC2, color=str)) +
    geom_point(size=0.8) +
    scale_color_manual(values=cbPalette) +
    theme_bw() +
    theme(panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(),
          panel.grid.major.y = element_blank(),
          panel.grid.minor.y = element_blank())
fig1

# if you can't plot it in interactive mode, save it with ggsave
#ggsave(fig1, file="PC_plot_with_samples_projected.png", width = 8, height = 6)
This is a plot with our target samples projected onto the 1000G reference:
From the PC plot, we can see that most of our samples are of European ancestry.
There are many ways to assign ancestry to samples from here, such as PC thresholds calculated from the mean and SD of each reference population, or k-means clustering. We leave these for you to explore and test; a rough sketch of the threshold approach follows below.
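As one illustration, here is a rough awk sketch of the PC-threshold approach, labelling a target sample EUR when its PC1 and PC2 both fall within 3 SD of the EUR reference means (the 3-SD cutoff and the EUR focus are arbitrary choices for this sketch):
# pass 1: mean and SD of PC1/PC2 among the EUR reference samples
# (1000GP_Phase3.sample columns: ID POP GROUP SEX; eigenvec columns: FID IID PC1 PC2 PC3)
awk 'NR==FNR {grp[$1]=$3; next}
     grp[$2]=="EUR" {n++; s1+=$3; q1+=$3*$3; s2+=$4; q2+=$4*$4}
     END {m1=s1/n; m2=s2/n;
          print m1, sqrt(q1/n - m1*m1), m2, sqrt(q2/n - m2*m2)}' \
    /data/module5/ref/1000GP_Phase3.sample \
    1000G_phase3_20130502_combined_snpsonly.05.common_pca3.eigenvec > eur_pc_stats.txt

# pass 2: flag the target samples that fall inside the EUR box
read m1 sd1 m2 sd2 < eur_pc_stats.txt
awk -v m1=$m1 -v sd1=$sd1 -v m2=$m2 -v sd2=$sd2 \
    '{eur = ($3 >= m1-3*sd1 && $3 <= m1+3*sd1 && $4 >= m2-3*sd2 && $4 <= m2+3*sd2);
      print $1, $2, (eur ? "EUR" : "other")}' \
    target_05.common_pca3.proj.eigenvec > sample_ancestry.txt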
Why do we care about the ancestry of the samples?
It is a very good checkpoint to confirm that you have done each step correctly. If your samples do not co-localize with any ancestry, that could be genuine, but it could also signal errors in your upstream handling.
Knowing the genetic ancestry of the samples helps us decide which reference data to use for imputation, or as the LD reference in clumping, etc.
It also determines how we should benchmark the PGS of the samples. The histogram below shows that different populations may have different PGS distributions, and that the difference varies between traits. Benchmarking a sample against the wrong population will lead to very wrong risk interpretations.
There are several online imputation servers you can use to impute your data, such as the TOPMed imputation server and the Sanger imputation server. Here we will use open-source tools and the 1000 Genomes reference data to do it in-house.
We will convert the data from PLINK format to VCF format, and use BCFtools to align the reference alleles to those used in the human genome reference data.
chr=22
# Pull out the data for the relevant chromosome and convert to VCF.
plink --bfile ${data}_chr${chr} --recode vcf --out ${data}_chr${chr}
# Sort and compress the VCF file
vcf-sort ${data}_chr${chr}.vcf | bgzip -c > ${data}_chr${chr}.vcf.gz
# Fix the reference allele to match the GRCh37 reference fasta (human_g1k_v37.fasta).
ref2fix=${refpath}/human_g1k_v37.fasta
export BCFTOOLS_PLUGINS=/software/bin/
bcftools +fixref ${data}_chr${chr}.vcf.gz \
    -Oz -o fixed_${data}_chr${chr}.vcf.gz \
    -- -d -f ${ref2fix} -m flip
zcat fixed_${data}_chr${chr}.vcf.gz | bgzip -c > indexed_fixed_${data}_chr${chr}.vcf.gz
# create index file.
tabix indexed_fixed_${data}_chr${chr}.vcf.gz
Although it is not required by all imputation software, here we will reconstruct the haplotypes in our data with external information, a step called phasing.
Both the haplotype reference and the genetic map used here come from the 1000 Genomes Project.
There are many phasing tools; we will use Eagle v2.4.1 in our practice.
geneticmap1=${refpath}/genetic_map_chr${chr}_combined_b37_modified.txt
reference=${refpath}/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
# Use EAGLE to generate phased haplotypes
# it takes about 4 minutes.
eagle \
--vcfRef=$reference \
--vcfTarget=indexed_fixed_${data}_chr${chr}.vcf.gz \
--geneticMapFile=$geneticmap1 \
--vcfOutFormat=z \
--outPrefix=phased_chr${chr} > phasing.log
# index the vcf.gz file
tabix -p vcf phased_chr${chr}.vcf.gz
You can take a look at the phased VCF file:
zcat phased_chr${chr}.vcf.gz | less -S
# less -S prevents line wrapping, so the columns are easier to view
# press q to exit the viewer
We use IMPUTE5 to impute the phased data. To reduce computation time, our practice covers only chromosome 22.
# CAUTION: this genetic map file differs from the one used in phasing!
# impute5 does not want the chr column in the genetic map, so we removed that column
# imputing chr22 takes around 8 min. If you are in a hurry, try the chunked version below.
geneticmap2=${refpath}/genetic_map_chr${chr}_combined_b37.txt
reference=${refpath}/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
impute5_1.1.5_static \
--m $geneticmap2 \
--h $reference \
--g phased_chr${chr}.vcf.gz \
--r ${chr} \
--ne 20000 \
--threads 1 \
--o imputed_chr${chr}.vcf.gz \
--l imputed_chr${chr}.log
If the sample size is big, we can run the imputation in chunks and parallelize them.
# set the boundary of the chunk to impute
intstart=40000001
intend=50000000
impute5_1.1.5_static \
--m $geneticmap2 \
--h $reference \
--g phased_chr${chr}.vcf.gz \
--r ${chr}:${intstart}-${intend} \
--ne 20000 \
--threads 1 \
--o imputed_chr${chr}_chunk.vcf.gz \
--l imputed_chr${chr}_chunk.log
The imputed data are output as a compressed VCF file. We will convert them back to PLINK format for the following analyses.
We will use BCFtools again to extract the INFO score of each imputed SNP from the VCF file; this score represents the per-SNP imputation quality. The INFO score is sensitive to sample size, so be careful using it when you have a very small sample in real studies.
The 1000G reference includes a lot of InDels, which sometimes share the same SNP ID. Duplicated SNP IDs will introduce errors, so we will simply remove them in our practice.
If you are interested in these InDels, or in any SNPs with duplicated or missing IDs, you can refill the bim file in R using forms such as chr1:123456 or chr1:1234_dup2, since PLINK cannot handle duplicated or missing SNP IDs.
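As a rough sketch of that refilling step (shown here with awk rather than R), applied to the bim file produced by the conversion below: it rewrites missing IDs as chr:pos and suffixes later duplicates with _dup2, _dup3, and so on.
# rewrite '.' IDs as chr<chr>:<pos>, then tag repeated IDs with a _dupN suffix
awk 'BEGIN{OFS="\t"}
     {if ($2 == ".") $2 = "chr" $1 ":" $4;
      n = ++seen[$2];
      if (n > 1) $2 = $2 "_dup" n;
      print}' imputed_chr${chr}.bim > tmp.bim && mv tmp.bim imputed_chr${chr}.bim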
## convert the format using plink
plink --vcf imputed_chr${chr}.vcf.gz \
--id-delim '_' \
--keep-allele-order \
--make-bed \
--out imputed_chr${chr}
## get info score, need bcftools
tabix -p vcf imputed_chr${chr}.vcf.gz
bcftools query -f '%CHROM\t%ID\t%QUAL\t%POS\t%REF\t%ALT\t%INFO/AF\t%INFO/INFO\n' \
    imputed_chr${chr}.vcf.gz > imputed_chr${chr}.info
## get the list of SNPs with info > 0.3
awk '$8 > 0.3 {print $2}' imputed_chr${chr}.info > snps_with_info_over_0.3.info
## QC with the info score
plink2 --bfile imputed_chr${chr} \
    --extract snps_with_info_over_0.3.info \
    --rm-dup force-first \
    --make-bed \
    --out imputed_chr${chr}_QCed
## note that we used PLINK2 here, since PLINK does not have the --rm-dup function.
## PLINK and PLINK2 differ slightly, and each has a few specific functions. We are more used to PLINK, so we kept PLINK for all the other steps.
Take a look
How many SNPs in pre-impute data?
How many SNPs in imputed data?
How many SNPs passed QC?
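One way to answer these questions, assuming the file names produced in the steps above:
wc -l ${data}_chr22.bim                         # SNPs before imputation
zcat imputed_chr${chr}.vcf.gz | grep -vc '^#'   # SNPs after imputation
wc -l imputed_chr${chr}_QCed.bim                # SNPs passing the info-score QC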
When we work with the whole genome, we can use the following example commands to merge the imputed chunks and chromosomes.
## don't run this chunk. It's just an example, without usable inputs.
# in vcf format
filelist=$(ls imputed_chr*vcf.gz | tr '\n' '\t')
bcftools concat -Oz -o ${data}_imputed.vcf.gz ${filelist}
# in PLINK format
plink \
    --bfile imputed_chr1_QCed \
    --merge-list file_names_of_imputed_chr_2_to_22.txt \
    --allow-no-sex \
    --make-bed \
    --out ${data}_imputed_qced_autosomes
Sometimes we would like to keep all the SNPs regardless of imputation quality, so that no SNPs used in the predictor are missing.
Taking the imputed data, we will profile the PGS using PLINK.
plink \
    --bfile imputed_chr22_QCed \
    --score ${refpath}/Height_SBayesR_predictor.txt 1 2 3 header sum center \
    --out score_of_height_from_chr${chr}
The parameters after the predictor file mean: columns 1, 2 and 3 hold the SNP ID, the effect allele and the effect size; "header" tells PLINK the file has a header line; "sum" reports the sum of the scores rather than the average; and "center" mean-centers the dosages before scoring.
Check how many SNPs in the predictor exist on chr22 in our raw data, and how many exist in the imputed data; a sketch follows.
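A sketch of that check, assuming (as the --score command implies) that the predictor's first column holds the SNP ID under a header line:
# predictor SNPs present in the pre-imputation chr22 data
comm -12 <(awk 'NR>1 {print $1}' ${refpath}/Height_SBayesR_predictor.txt | sort) \
         <(awk '{print $2}' ${data}_chr22.bim | sort) | wc -l
# predictor SNPs present in the imputed, QCed data
comm -12 <(awk 'NR>1 {print $1}' ${refpath}/Height_SBayesR_predictor.txt | sort) \
         <(awk '{print $2}' imputed_chr22_QCed.bim | sort) | wc -l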
In research, we can either merge all chromosomes and profile the score for each individual in one run, or add up the per-chromosome scores, provided that you used "sum" in the profiling command; a sketch of the latter follows.
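A sketch of the adding-up route: with "sum", PLINK's .profile output carries a SCORESUM column (column 6 in PLINK 1.9), so per-chromosome files can be combined per individual with awk:
# sum SCORESUM across all per-chromosome .profile files, per individual
awk 'FNR > 1 {fid[$2] = $1; total[$2] += $6}
     END {for (iid in total) print fid[iid], iid, total[iid]}' \
    score_of_height_from_chr*.profile > score_of_height_total.txt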
You cannot tell whether a PGS is high or low without comparing it to a reference population, so benchmarking and interpretation of the PGS are essential for a meaningful study. When you benchmark your PGS against a reference data set, make sure the reference is representative of the population, rather than biased towards some disease, one sex, or high/low education, etc.
Once you have a reference data set for PGS benchmarking, check how many SNPs overlap between your data and the reference data, and try to use the same set of predictor SNPs to profile comparable PGS.
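As a rough sketch of benchmarking, here is one way to place each target score within a reference score distribution; ref_scores.profile is a hypothetical reference profiled with the same predictor and the same "sum center" settings:
# percentile of each target SCORESUM within the reference distribution
awk 'NR==FNR {if (FNR > 1) ref[++n] = $6; next}
     FNR > 1 {c = 0;
              for (i = 1; i <= n; i++) if (ref[i] <= $6) c++;
              printf "%s %s %.1f\n", $1, $2, 100 * c / n}' \
    ref_scores.profile score_of_height_from_chr22.profile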
A work by Tian Lin