library(ggplot2)
In this practical session you will learn some basic concepts
associated with phenotype QC and the generation of principal components
(in R). You can choose either Part 1 or Part 2 to complete.
Part 1 will pre-adjust quantitative phenotypes for
covariates.
Part 2 will conduct a simple PCA (Principal Component Analysis,
i.e. an eigenvalue decomposition).
The directories for the data are given below. Please run your
analysis on the cluster in your /scratch/username/
directory and use the paths below to access the data. Once you are in
your scratch directory, open R by typing R
at the command line. If you are in any doubt, you can use the getwd() command in R to see where the files and plots you generate will be saved.
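For example, once R is open you can check (and, if needed, set) your working directory; the path shown below is just a placeholder for your own scratch directory:
getwd()                        # shows the current working directory, where files and plots are saved
# setwd("/scratch/username/")  # placeholder path; replace username with your own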
You have been supplied with 10 quantitative phenotypes and a file of covariates. We will perform basic QC and standardisation on one of these phenotypes.
Choose a phenotype to work with. For this demonstration we will use BMI as an example.
You must work on the server for this part of the practical; save your plots and download them to your laptop to view your results. Do not download the phenotype data.
The data is located in the directory
/data/module1/1_phenotypesPrac/
. Define the path to the
data, which will be used below, in R as follows:
dir="/data/module1/1_phenotypesPrac/"
The files used in Part 1 of the prac are:
BMI.phen
→ raw phenotypes file.
covariateFiltered.cov
→ covariates file.
First read in the data and examine the distribution of the trait. What do you expect, and do the values make sense?
Use the R commands dim()
and head()
to
examine the data structure. The phenotype file has 3 columns: family ID,
individual ID and the phenotype value. You can use hist()
,
part of the R Base Graphics, to quickly plot and visualise the
distribution of the data.
data = read.table(paste0(dir,"BMI.phen"))
dim(data) # there are 11,793 rows, one record per row
## [1] 11793 3
head(data)
## V1 V2 V3
## 1 7653762 7653762 24.08
## 2 8144519 8144519 24.47
## 3 2337680 2337680 24.37
## 4 5219864 5219864 25.83
## 5 1417721 1417721 28.59
## 6 2371103 2371103 26.80
hist(data[,3], breaks = 100) # view for html
jpeg(file="raw_BMI_distribution1.jpg") # make file for download
hist(data[,3], breaks = 100)
dev.off()
## quartz_off_screen
## 2
There seem to be negative values for BMI. This is not plausible as \(BMI = weight/height^2\) and usually ranges from about 15 to 40. A negative value could be a missing value that was encoded as an unrealistic number. Let us remove the negative values and look at the distribution again.
data = data[data[,3]>0,]
hist(data[,3], breaks = 100) # view for html
jpeg(file="raw_BMI_distribution2.jpg") # make file for download
hist(data[,3], breaks = 100)
dev.off()
## quartz_off_screen
## 2
dim(data) # there are 11,783 records remaining, 10 removed
## [1] 11783 3
Better!
Now let us read in the covariates. The covariate file
(covariateFiltered.cov
) is located in the same directory.
The covariate columns are age, sex and the first 5 principal components.
cov = read.table(paste0(dir,"covariateFiltered.cov"), header = T)
dim(cov) # there are 11,780 records, 1 row per record
## [1] 11780 9
colnames(cov)[3:9] = c("age","sex",paste0("PC",1:5))
head(cov)
## FID IID age sex PC1 PC2 PC3 PC4
## 1 7653762 7653762 39 2 0.01008060 -0.00665342 -0.01391810 1.19919e-02
## 2 8144519 8144519 66 2 -0.01998500 -0.00328824 -0.00324179 -9.56271e-03
## 3 2337680 2337680 71 2 -0.00254482 -0.00613635 0.00328573 4.11204e-04
## 4 5219864 5219864 47 1 -0.00693638 -0.00569925 -0.00311431 -3.31955e-03
## 5 1417721 1417721 43 1 -0.01165670 -0.00672607 -0.00211596 -2.10093e-03
## 6 2371103 2371103 38 2 -0.00396343 0.02634790 -0.00540908 -8.12653e-05
## PC5
## 1 0.009173280
## 2 -0.002555470
## 3 -0.000540739
## 4 -0.001726890
## 5 0.014064800
## 6 -0.003650660
We need to be careful to match the covariate file with the phenotype file correctly. Once that's done, have a look at the ranges/counts of the covariates.
Another useful basic function in R is table()
. This
returns the count of each unique number or item in the supplied
vector.
# first match covariate and phenotype files
s = match(cov[,1],data[,1]) # s gives the order of data to match cov
sum(data[s,1]==cov[,1],na.rm=T) # check
## [1] 11770
data = data[s,] # reorder the file
data2 = data.frame(cov,bmi=data[,3]) # combine and make a dataframe for ggplot
data2 = data2[!is.na(data2$bmi),] # remove records with missing phenotypes
table(cov$age) # range of ages for trait is 37 to 76
##
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
## 121 237 258 259 258 225 239 223 241 239 285 262 259 265 301 306 363 348 388 428
## 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
## 413 462 452 529 488 467 418 382 406 335 333 298 309 283 289 272 135 1 2 1
table(cov$sex) # 1 = males, 2 = females
##
## 1 2
## 5346 6434
data2$sex = as.factor(data2$sex) # interpret 1 and 2 as factors rather than integers
We are now going to plot the distribution of the data using the ggplot2 library and some code that I adapted from here.
mu = aggregate(data2$bmi~data2$sex,FUN=mean)
names(mu) = c("sex","bmi")
mu
## sex bmi
## 1 1 29.24613
## 2 2 25.88712
savePlot = ggplot(data2, aes(x=bmi, color=sex)) +
geom_histogram(fill="white", position="dodge", bins = 50)+
geom_vline(data=mu,aes(xintercept=bmi,color=sex), linetype="dashed")
savePlot # view for html
ggsave(file="bmiSex.jpg") # save for download
## Saving 7 x 5 in image
Now we will correct the phenotype for sex and age effects, as well as the first 5 principal components by using a linear model. In studies in human genetics, correcting for sex and age effects is the minimum requirement. Often further covariates (such as genotyping batch) or more complex models (for example with sex-specific effects) are used.
Pre-adjusting phenotypes slightly reduces GWAS power but makes the analysis much faster. Many programs internally pre-adjust the phenotypes for covariates, even if the covariates are supplied at the time of analysis.
Since we are testing for linear effects of age, we need to center age (i.e. correct for the mean) to help with interpretation of the coefficients from the model. Sometimes it is worth checking for non-linear age effects as well, so we will add \(age^2\) to the model.
data2$age = data2$age - mean(data2$age)
data2$age2 = data2$age^2
lm0 = lm(bmi ~ sex + age + age2 + PC1 + PC2 + PC3 + PC4 + PC5, data=data2)
summary(lm0)
##
## Call:
## lm(formula = bmi ~ sex + age + age2 + PC1 + PC2 + PC3 + PC4 +
## PC5, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5297 -2.8330 -0.0456 2.8226 29.0687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.2908006 0.0694176 421.950 <2e-16 ***
## sex2 -3.4111714 0.0786246 -43.386 <2e-16 ***
## age 0.1686041 0.0041958 40.184 <2e-16 ***
## age2 -0.0001749 0.0004193 -0.417 0.6767
## PC1 9.0083273 4.2459643 2.122 0.0339 *
## PC2 1.7252037 4.2469427 0.406 0.6846
## PC3 -5.3130620 4.2464166 -1.251 0.2109
## PC4 3.7630622 4.2473917 0.886 0.3757
## PC5 3.1973422 4.2466482 0.753 0.4515
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.245 on 11761 degrees of freedom
## Multiple R-squared: 0.2323, Adjusted R-squared: 0.2318
## F-statistic: 444.8 on 8 and 11761 DF, p-value: < 2.2e-16
The effect estimated by the model for the intercept is the average BMI for males (sex = 1) at the mean age. Thus the average BMI for males in the data is about \(29.3 kg/m^2\). Is this what you expect?
The next line is the sex effect (sex2), which is the difference between the average of males and females at the mean age.
What is the mean BMI for females according to the linear model, and what is the raw mean for females?
The mean BMI for females from the linear model is \(29.3 - 3.4 = 25.9 kg/m^2\). This is very close to the raw sex averages calculated above.
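As a quick sketch, the same comparison can be made directly from the fitted model coefficients (lm0) and the raw sex means (mu) computed earlier:
coef(lm0)["(Intercept)"] + coef(lm0)["sex2"] # model-based female mean, approximately 25.9
mu                                           # raw sex means calculated above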
We can now extract the residuals from the model, i.e. this is the phenotype (\(y\)) minus the variables fitted in the model (overall mean, sex effect - if relevant, linear effects of age and PC1-PC5). Note here that the residuals from the model are in the same order as the input file.
lm0 = lm(bmi ~ sex + age + PC1 + PC2 + PC3 + PC4 + PC5, data=data2)
data2$resid = lm0$residuals
After pre-adjusting the phenotypes, there are a number of checks normally performed. These include:
Traits with skewed distributions may require a log10(x)
transformation to achieve normality. A RINT (rank-based inverse-normal)
transformation is also possible, although the SNP effects for RINT
transformations are difficult to interpret. A RINT transformation can be
performed with the function below.
InvNorm=function(x) {
return(qnorm((rank(x, na.last="keep")-0.5)/sum(!is.na(x))))
}
Try out the RINT function above on the data. What does it do, and what types of values does it produce?
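One possible way to try it out (rintResid is just an arbitrary name; data2$resid is the pre-adjusted phenotype created above):
rintResid = InvNorm(data2$resid)  # apply the RINT function to the pre-adjusted residuals
summary(rintResid)                # values are roughly symmetric around 0, like a N(0,1) variable
hist(rintResid, breaks = 50)      # compare the shape with the untransformed residuals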
Often pre-adjusted phenotypes are standardised to a unit normal \(N(0,1)\) so that effect sizes can be compared across traits. Normally, the SNP effects from a linear model for a quantitative trait are in trait units, what are the trait units if the trait has been standardised to \(N(0,1)\)?
data2$std = scale(data2$resid) # standardise N(0,1)
sum(abs(data2$std)>5) # how many outliers are more than 5 SD from mean?
## [1] 9
data2 = data2[abs(data2$std)<5,] # remove outliers
# plot final QC'd phenotype
savePlot = ggplot(data2, aes(x=std, color=sex)) +
geom_histogram(fill="white", position="dodge", bins = 50)
savePlot # view for html
ggsave(file="bmiStd.jpg") # save for download
## Saving 7 x 5 in image
# write file for later
write.table(file="bmiStd.phen",data2[,c("FID","IID","bmi","std")],quote=FALSE,row.names=FALSE,col.names=FALSE)
PCA is an approach often used to identify population structure in genomic data. You will be provided with a pre-computed genomic relationship matrix for a small subset of the 1000 Genomes Project data. Further information on the 1000 Genomes Project can be found [here](https://www.internationalgenome.org/).
The genomic relationship matrix (GRM) in this example has been computed for 50 individuals, each drawn from one of 5 ancestry groups. It was computed from about 1.3M SNPs spread evenly throughout the human genome using standard methods in GCTA. We will make a simple GRM in the next practical.
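As background (not needed to complete this prac), a GRM element between individuals \(j\) and \(k\) is typically of the form \(A_{jk} = \frac{1}{m}\sum_{i=1}^{m}\frac{(x_{ij}-2p_i)(x_{ik}-2p_i)}{2p_i(1-p_i)}\), where \(x_{ij}\) is the genotype (0, 1 or 2) of individual \(j\) at SNP \(i\), \(p_i\) is the allele frequency of SNP \(i\) and \(m\) is the number of SNPs.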
If you need an introduction to Principal Component Analysis (PCA), there is a great 20-minute YouTube clip called PCA… Step-by-Step by StatQuest. StatQuest also has many other useful bite-size tutorials on topics related to genetics.
You can download the 1000G data directly to your laptop if you wish. Otherwise complete the practical on the server, save your plots and download them to your laptop.
The files used in Part 2 of the prac are:
1kg_winterSchool.grm.gz
→ pre-made genomic relationship
matrix for the 1000G data.
1kg_winterSchool_superPop.txt
→ file defining the
continental populations of the 1000G samples.
The data is located in the directory
/data/module1/downloads/1_phenotypesPrac/.
Define your path to the data in R, which will be used below, for the cluster using the following code in R. If you have downloaded the data, you will need to update this path to where you saved the data.
dir="/data/module1/downloads/1_phenotypesPrac/"
First we will read in the data, and have a quick look at the format of the file.
# read in the data
x=read.table(paste0(dir,"1kg_winterSchool.grm.gz"))
labels=read.table(paste0(dir,"1kg_winterSchool_superPop.txt"))[,1]
# columns are row number, column number, number of SNP, genomic relationship
head(x)
## V1 V2 V3 V4
## 1 1 1 1365790 0.95116320
## 2 2 1 1365790 0.06590147
## 3 2 2 1365790 0.92498110
## 4 3 1 1365790 0.05928111
## 5 3 2 1365790 0.07948754
## 6 3 3 1365790 0.92168420
tail(x)
## V1 V2 V3 V4
## 1270 50 45 1365790 0.001001333
## 1271 50 46 1365790 0.046047610
## 1272 50 47 1365790 0.034500270
## 1273 50 48 1365790 0.049708950
## 1274 50 49 1365790 0.048485100
## 1275 50 50 1365790 0.878313700
The GRM is a square symmetric matrix. The data read in are the lower-triangular and diagonal elements, as indicated by the row and column numbers. How many elements do you expect in this file for a 50x50 GRM?
In the first column there will be \(n\) elements, \(n-1\) in the 2nd column, \(n-2\) in the 3rd, and so on until \(1\) element in the 50th column, giving \(n(n+1)/2 = 1275\) elements in total. Check this calculation against the dimensions of the GRM provided.
# n = number of individuals
n=50
# expected number of elements in lower-triangular matrix, inc. diagonals
N=sum(n:1) ; N
## [1] 1275
# how many elements read in by R for the lower-triangle + diagonal.
dim(x)
## [1] 1275 4
Each of the 1000G participants has an ancestry label assigned to it. Use the table() function to count the labels; note that AFR = African, AMR = admixed American, EAS = East Asian, SAS = South Asian, EUR = European.
table(labels)
## labels
## AFR AMR EAS EUR SAS
## 7 8 12 11 12
Now construct the GRM as a square matrix so that we can use the
eigen()
function in R to do the eigenvalue decomposition.
The relationship values are in the 4th column of x
. We can
access the 4th column by using the notation object[rows,columns], where
x[,4] will access all rows of the 4th column or x[1,4] will access a
single element from the 1st row and 4th column.
This example also uses a for(i in array) {action}
loop
in R. If you are not familiar with loops in R, try a basic loop such as
for (i in 1:10) {print(i)}
to test it out. Make sure you
understand the code!
# define the matrix using NA, then fill the matrix
GRM=matrix(NA,nrow=n,ncol=n)
for (k in 1:N) {
i=x[k,1] ; j=x[k,2]
GRM[i,j]=x[k,4]
GRM[j,i]=x[k,4]
}
Now conduct the PCA analysis using the eigen()
function.
Save the output and check the structure of the returned object using
str()
. If the matrix is full-rank, we expect 50 eigenvalues
with values > 0, and 50 corresponding vectors of length 50 (i.e. a
50x50 matrix). We’ll plot the first 2 PCs with the labels provided, detailing how much variation each PC captures.
If you are working on the cluster you will need to save the plots below using the same commands as above, i.e.
jpeg(file="myPlot.jpg")
# open the jpeg.
plot(...)
# call the plot.
dev.off()
# close the device.
#eigen value decomposition
PCA=eigen(GRM)
#eigenvalues (vector), eigenvectors (50x50 matrix, 1 column per eigenvector)
str(PCA)
## List of 2
## $ values : num [1:50] 4.78 2.97 1.46 1.43 1.39 ...
## $ vectors: num [1:50, 1:50] 0.0406 0.0471 0.049 0.0482 0.1006 ...
## - attr(*, "class")= chr "eigen"
# proportion of variation explained by PCs
total=sum(PCA$values)
plot(PCA$values[1:10]/total,
las=1, type="b", lwd=1.2, col="blue", xlab="principal component",
ylab="Proportion of total variance", pch=20)
# plot
labels=as.factor(labels)
PC1=PCA$vectors[,1] ; PC2=PCA$vectors[,2]
plot(PC2~PC1, col=labels, pch=20, las=1,
xlab="PC1, 9.2% of variation", ylab="PC2, 5.7% of variation")
legend("bottomleft",legend=levels(labels),fill=1:5)
In practice, we would use software such as GCTA
or
PLINK
at the command line to (1) calculate the GRM, and (2)
conduct the PCA analysis. These programs are much more efficient than R. However, we have done it in R here as it is useful for understanding what’s going on.