This practical has been adapted from International Statistical Genetics Workshop and the GenomicSEM wiki
This practical is available from https://cnsgenomics.com/data/teaching/GNGWS24/module4
The objective of this practical is to encourage you to think about how SEM could help you understand trait relationships within your own work.
GenomicSEM is a nice starting place as it only requires summary statistics, but there will be a learning curve with the syntax! In this practical we will introduce GenomicSEM and help you familiarise yourself with the basics.
What is GenomicSEM?
- An application of SEM to summary-level GWAS data for modelling the joint genetic architecture of genetically correlated traits.
- Can be used to estimate the shared information across correlated polygenic traits, either in aggregate (genome-wide) or at the individual SNP level.
- Utilises the genetic variance-covariance matrix estimated from summary-level data using LDSC.
(Grotzinger et al. 2019): https://www.nature.com/articles/s41562-019-0566-x
We know there is a large degree of comorbidity across psychiatric disorders, in particular major depressive disorder, anxiety, alcohol use and PTSD.
It has been hypothesised the key mechanisms for risk of each of these psychiatric disorders operate through a general factor. This is supported by moderate to high genetic correlations across the 4 disorders.
Given the availability of summary statistics for each of these disorders, we can use GenomicSEM to examine the shared genetic architecture and construct the latent psychiatric factor F above (common factor model). We can then determine if there are any genetic variants associated with the factor by adding individual SNPs to the model (common factor model with individual SNP effects). The final part of the practical introduces you to some more complex models and is an opportunity for you to brainstorm how GenomicSEM (or SEM, more generally) can be informative for the questions you are trying to answer.
We are going to run this practical using R on the cluster. To log in start by opening the terminal.
You can login using the following command:
ssh <username>@<host>
You should have received a welcome package email that contains your username, host and password.
The data for this practical is located in
/data/module4/Practice_3_genomicSEM/data
If you haven’t already you will want to make a directory within your scratch for this module e.g.
mkdir -p /scratch/<username>/module4/Practice_3_genomicSEM
cd /scratch/<username>/module4/Practice_3_genomicSEM
Open R by typing R
# load in the package
library(GenomicSEM)
There will be 25 warnings (that's expected!)
We will be using the following data
- Major Depressive Disorder: cases = 170k; controls = 329k (Howard et al. 2018)
- Anxiety Disorders: cases = 32k; controls = 82k (Purves et al. 2019)
- Alcohol Use Disorder: cases = 8.5k; controls = 20k (Walters et al. 2018)
- PTSD: cases = 2.5k; controls = 7k (Duncan et al. 2017)
The preprocessing of these files is described here.
The data will be available in the following directory
/data/module4/Practice_3_genomicSEM/data
Things to consider about the data if you intend to run GenomicSEM:
- You must use the same ancestral background across your summary statistics, AND this ancestral background must match that of the LD scores used for the ldsc function described below.
- Participant samples used to produce the summary statistics can range from entirely overlapping to entirely independent, and you DO NOT need to know the level of overlap in order to run Genomic SEM; this is one of the major advantages of the method.
- You want the full or very lightly cleaned summary statistics, not pruned or conditionally independent SNPs; this has implications for the estimation of variance components (heritability and correlations).
We are going to run the common factor model in the path diagram above. In practice, you would first investigate possible underlying factor structure with methods such as exploratory factor analysis, but here, we are just going to fit the model.
There are three steps to running a common factor model that does not include SNP effects:
1. Munge the summary statistics (munge)
2. Run LDSC to obtain the genetic covariance and sampling covariance matrices (ldsc)
3. Specify and run the model (usermodel)
The first step in running a Genomic SEM model is to munge the summary statistics. This is akin to harmonising the data. The munge function works to convert the summary statistics to the format expected by LDSC (i.e., on a z-statistic metric). The summary statistics files input into the munge function at a minimum need to contain five pieces of information: a SNP identifier (rsID), the effect allele (A1), the other allele (A2), an effect estimate (e.g., a beta, odds ratio, or Z-statistic), and a p-value.
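As a quick sanity check before munging, you can peek at the first rows of one of the input files (a sketch only; the exact column names in these files may differ from study to study):

```r
# inspect the first few rows of one summary statistics file to confirm
# it contains columns munge can interpret (SNP, A1, A2, effect, P)
head(read.table("/data/module4/Practice_3_genomicSEM/data/MDD_withNeff.txt",
                header = TRUE, nrows = 5))
```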
# 1. files = the name of the summary statistics files
files = c("/data/module4/Practice_3_genomicSEM/data/ALCH_withrsID.txt",
"/data/module4/Practice_3_genomicSEM/data/SORTED_PTSD_EA9_ALL_study_specific_PCs1.txt",
"/data/module4/Practice_3_genomicSEM/data/MDD_withNeff.txt",
"/data/module4/Practice_3_genomicSEM/data/ANX_withNeff.txt")
#2. hm3 = the name of the reference file used to align effects to the same ref allele across traits
hm3 <-"/data/module4/Practice_3_genomicSEM/data/eur_w_ld_chr/w_hm3.snplist"
#3. trait.names = names used to create the .sumstats.gz output files
trait.names<-c("ALCH", "PTSD", "MDD", "ANX")
#4. N = total sample size for traits
# All but PTSD have SNP-specific sum of effective sample sizes so only its sample size is listed here
N=c(NA,5831.346,NA,NA)
#Run the munge function. This creates four .sumstats.gz files (e.g., PTSD.sumstats.gz).
munge(files=files,hm3=hm3, trait.names=trait.names, N=N)
This will take a couple of minutes. Have a read of the output log as this runs. It should give you an idea of what is happening. In general this function is very good at interpreting the files, but you will need to check it has done this correctly!
The second step is to run multivariable LD-Score regression to obtain the genetic covariance (S) matrix and corresponding sampling covariance matrix (V).
The ldsc function takes 5 necessary arguments:
#1. traits = the name of the .sumstats.gz traits
traits<-c("ALCH.sumstats.gz", "PTSD.sumstats.gz", "MDD.sumstats.gz", "ANX.sumstats.gz")
#2. sample.prev = the proportion of cases to total sample size. For quantitative traits list NA
#here we use .5 because we are using the sum of effective sample size across cohorts
#which is more appropriate for performing the ascertainment correction.
sample.prev <- c(.5,.5,.5,.5)
#3. population.prev : the population lifetime prevalence of the traits. For quantitative traits list NA
population.prev <-c(.159, .3, .15,.20)
#4. ld: folder of LD scores
#can be found on the genomicSEM github (https://github.com/GenomicSEM/GenomicSEM)
ld <- "/data/module4/Practice_3_genomicSEM/data/eur_w_ld_chr/"
#5. wld: folder of LD score weights [typically the same as the ld argument]
wld <- "/data/module4/Practice_3_genomicSEM/data/eur_w_ld_chr/"
#6. trait.names: optional sixth argument to list trait names so they can be named in your model
trait.names<-c("ALCH", "PTSD", "MDD", "ANX")
#Run the ldsc function
LDSC_INT<-ldsc(traits=traits, sample.prev=sample.prev, population.prev=population.prev, ld=ld, wld=wld,trait.names=trait.names)
#in many cases you will want to save the LDSC output for later use. Example code provided here
# save(LDSC_INT, file = "LDSC_INT.RData")
# Running LDSC for all files took 0 minutes and 29 seconds
The output (named LDSC_INT here) is a list with 5 named elements:
- LDSC_INT$S is the genetic covariance matrix (on the liability scale for case/control designs).
- LDSC_INT$V is the sampling covariance matrix in the format expected by lavaan.
- LDSC_INT$I is the matrix of LDSC intercepts and cross-trait (i.e., bivariate) intercepts.
- LDSC_INT$N contains the sample sizes (N) for the heritabilities and sqrt(N1*N2) for the co-heritabilities. These are the sample sizes provided in the munging process.
- LDSC_INT$m is the number of SNPs used to construct the LD scores.
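The ordering of V follows the half-vectorisation of S, so (a sketch, following the approach shown on the GenomicSEM wiki) the standard errors of the elements of S can be recovered from the diagonal of V:

```r
# standard errors of the unique (lower-triangular) elements of S
# are the square roots of the diagonal of the sampling covariance matrix V
k <- nrow(LDSC_INT$S)
SE <- matrix(0, k, k)
SE[lower.tri(SE, diag = TRUE)] <- sqrt(diag(LDSC_INT$V))
SE  # e.g., SE[1,1] is the standard error of the ALCH heritability estimate
```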
print(LDSC_INT)
## $V
## [,1] [,2] [,3] [,4] [,5]
## [1,] 6.041603e-04 4.630259e-05 1.329209e-05 2.110818e-05 1.699375e-04
## [2,] 4.630259e-05 1.683117e-03 5.491624e-06 1.012613e-04 7.541207e-05
## [3,] 1.329209e-05 5.491624e-06 3.692467e-05 4.865869e-05 8.966010e-05
## [4,] 2.110818e-05 1.012613e-04 4.865869e-05 2.627033e-04 1.244898e-05
## [5,] 1.699375e-04 7.541207e-05 8.966010e-05 1.244898e-05 1.135989e-02
## [6,] 1.928842e-05 1.794540e-05 1.773068e-06 -7.693756e-08 4.536925e-05
## [7,] 5.939947e-06 9.955430e-05 2.035833e-05 1.510451e-05 3.350365e-04
## [8,] 3.078558e-06 2.224531e-07 6.028298e-06 6.091000e-06 -1.606011e-05
## [9,] 6.866515e-06 3.119102e-06 9.777927e-06 1.858721e-05 5.419991e-05
## [10,] 1.292371e-05 1.406092e-05 7.764292e-06 3.524720e-05 1.212087e-06
## [,6] [,7] [,8] [,9] [,10]
## [1,] 1.928842e-05 5.939947e-06 3.078558e-06 6.866515e-06 1.292371e-05
## [2,] 1.794540e-05 9.955430e-05 2.224531e-07 3.119102e-06 1.406092e-05
## [3,] 1.773068e-06 2.035833e-05 6.028298e-06 9.777927e-06 7.764292e-06
## [4,] -7.693756e-08 1.510451e-05 6.091000e-06 1.858721e-05 3.524720e-05
## [5,] 4.536925e-05 3.350365e-04 -1.606011e-05 5.419991e-05 1.212087e-06
## [6,] 1.336556e-04 1.255636e-04 7.752931e-06 1.314640e-05 3.972090e-05
## [7,] 1.255636e-04 8.250458e-04 -4.762316e-07 5.242814e-06 3.447767e-05
## [8,] 7.752931e-06 -4.762316e-07 1.219319e-05 1.921343e-05 1.590190e-05
## [9,] 1.314640e-05 5.242814e-06 1.921343e-05 5.109221e-05 5.444954e-05
## [10,] 3.972090e-05 3.447767e-05 1.590190e-05 5.444954e-05 2.843342e-04
##
## $S
## ALCH PTSD MDD ANX
## [1,] 0.13938790 0.05977947 0.05943021 0.08500236
## [2,] 0.05977947 0.23937808 0.05799439 0.11428679
## [3,] 0.05943021 0.05799439 0.08503281 0.12667327
## [4,] 0.08500236 0.11428679 0.12667327 0.23329360
##
## $I
## [,1] [,2] [,3] [,4]
## [1,] 1.01069811 0.023791915 0.015906770 0.017004740
## [2,] 0.02379192 0.993973587 0.004565391 0.003975452
## [3,] 0.01590677 0.004565391 0.994969603 0.262655462
## [4,] 0.01700474 0.003975452 0.262655462 1.007313416
##
## $N
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 21647.63 11235.43 96488.81 32830.69 5831.346 49943.61 16951.93 427751
## [,9] [,10]
## [1,] 145314.9 49279.86
##
## $m
## [1] 1173569
How well do you understand the output? Can you use the S matrix to obtain the genetic correlation between MDD and anxiety?
# Genetic correlation is the genetic covariance between traits scaled by the sqrt of the product of their genetic variances
# The S matrix contains the genetic variances on the diagonals and the genetic covariances on the off diagonals
LDSC_INT$S[3,4]/sqrt(LDSC_INT$S[3,3]*LDSC_INT$S[4,4])
# This should match the genetic correlation reported in the ldsc output log
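More generally, base R's cov2cor() converts the whole S matrix into a genetic correlation matrix in one step:

```r
# full genetic correlation matrix from the genetic covariance matrix S
rg <- cov2cor(LDSC_INT$S)
round(rg, 2)
```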
We use the lavaan language to specify the model. We are going to run a common factor model (Model3). An explanation of the output is provided below.
#load in the LDSC object made in step 2 above
#load("LDSC_INT.RData")
#usermodel takes two necessary arguments:
#1. covstruc: the output from multivariable ldsc
covstruc<-LDSC_INT
#2. model: the user specified model
#here we specify a common factor model
INT.model<-"F1=~MDD+PTSD+ALCH+ANX"
#std.lv: optional argument specifying whether variances of latent variables should be set to 1
std.lv=TRUE # Otherwise the first factor loading will be set to 1
# One of the parameters will need to be set to 1 to make our model identifiable
#Run your model
IntResults <- usermodel(covstruc=covstruc, model=INT.model, std.lv=std.lv)
## [1] "Running primary model"
## [1] "Calculating CFI"
## [1] "Calculating Standardized Results"
## [1] "Calculating SRMR"
## elapsed
## 0.262
#print the results of your model
IntResults$results
## lhs op rhs Unstand_Est Unstand_SE STD_Genotype STD_Genotype_SE
## 5 F1 =~ MDD 0.283806747 0.0210027454143111 0.97326125 0.0720249151103093
## 6 F1 =~ PTSD 0.221278068 0.0402049336519291 0.45226835 0.0821745168615321
## 3 F1 =~ ALCH 0.205225639 0.0243218023255069 0.54969157 0.0651453167312673
## 4 F1 =~ ANX 0.445784745 0.0324561138311457 0.92294074 0.0671962594330495
## 1 ALCH ~~ ALCH 0.097270350 0.0258991464601678 0.69783919 0.185806270394915
## 9 PTSD ~~ PTSD 0.190414108 0.106977265855824 0.79545336 0.446896663107149
## 8 MDD ~~ MDD 0.004486541 0.0109566843444685 0.05276254 0.12885240971739
## 2 ANX ~~ ANX 0.034569557 0.0301103372628622 0.14818043 0.129066285483485
## 7 F1 ~~ F1 1.000000000 1.00000000
## STD_All p_value
## 5 0.97326125 1.313540e-41
## 6 0.45226834 3.717880e-08
## 3 0.54969157 3.230100e-17
## 4 0.92294072 6.265549e-43
## 1 0.69783918 1.728330e-04
## 9 0.79545335 7.508426e-02
## 8 0.05276254 6.821876e-01
## 2 0.14818043 2.509289e-01
## 7 1.00000000 NA
#print the model fit of your model
IntResults$modelfit
## chisq df p_chisq AIC CFI SRMR
## df 1.283452 2 0.526383 17.28345 1 0.03621695
This produces an R object, IntResults, that contains the two elements shown above:
IntResults$results prints the unstandardized and standardized results for the common factor model.
“F1 =~ ?” lists the indicator loadings on the common factor.
“V ~~ V” lists the residual variances of the indicators after removing the variance explained by the common factor.
Standardized results (STD_Genotype and STD_Genotype_SE) are obtained from the standardized input (i.e., a genetic correlation matrix), whereas unstandardised results (Unstand_Est and Unstand_SE) are based on the genetic covariance matrix. This means standardised results are standardized with respect to the genetic variance in your phenotypes. However, these results will not be fully standardized if the latent factor is not itself specified to have a variance of 1.0.
IntResults$modelfit prints the model chi-square, degrees of freedom, the p-value for the chi-square test, Akaike Information Criterion (AIC), Comparative Fit Index (CFI), and Standardized Root Mean Square Residual (SRMR). These indices tell us how well the model-implied covariance matrix reproduces the observed covariance matrix.
Can you fill out the common factor model diagram above with the factor loadings and variance terms?
These results can be inserted into a path diagram to produce the following:
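One way to pull out the numbers for the diagram is to subset the results table by the lavaan operator (a sketch using the column names printed above):

```r
# standardized factor loadings ("=~" rows) and residual variances ("~~" rows)
res <- IntResults$results
loadings  <- res[res$op == "=~", c("lhs", "rhs", "STD_All")]
residuals <- res[res$op == "~~", c("lhs", "rhs", "STD_All")]
loadings
residuals
```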
We can now add individual SNP effects onto our factor.
A powerful extension of Genomic SEM is to run a multivariate GWAS for which a common factor defined by genetic indicators is regressed on a SNP. This allows for estimation of a set of summary statistics for the common factor that represent the SNP effects on the common factor.
For this part of the analysis we are going to use a drastically reduced subset of SNPs so the model doesn't take too long to run. These data have already been harmonised, so we can skip the munge step from before. We will use the covariance matrix we made earlier in the LDSC step, LDSC_INT.
As we are regressing the factor onto SNPs, we need the summary statistics as well as our LDSC covariance matrix.
#sumstats takes four necessary arguments, along with other optional arguments that
#you may need for your use case. Here we use six arguments to run the function
#1. files = the name of the summary statistics file
##**note that these are a drastically reduced subset of SNPs for the practical ONLY,
##reflecting a selected set of chromosome 4 variants
##also note that these need to be in the same order as your ldsc object
files<-c("/data/module4/Practice_3_genomicSEM/data/ALCH4.txt",
"/data/module4/Practice_3_genomicSEM/data/PTSD4.txt",
"/data/module4/Practice_3_genomicSEM/data/MDD4.txt",
"/data/module4/Practice_3_genomicSEM/data/ANX4.txt")
#2. ref = the name of the reference file used to obtain SNP MAF
#**note again that this is a drastically reduced subset of SNPs for chromosome 4
#the full reference set is available on the genomicSEM github (https://github.com/GenomicSEM/GenomicSEM)
ref <- "/data/module4/Practice_3_genomicSEM/data/reference.1000G.ch4.txt"
#3. trait.names = the name of the traits to use for column labeling
trait.names<-c("ALCH","PTSD", "MDD", "ANX")
#4. se.logit = whether the standard errors are on a logistic scale
se.logit<-c(F,T,T,T)
#5. linprob: whether a binary outcome was analyzed as continuous -or-
#the file contains only Z-statistics. This is true for ALCH in our data
linprob<-c(T,F,F,F)
#6. sample size. This is only needed for continuous outcomes or outcomes where linprob is TRUE
#we do not provide sample size for ALCH as it is already a column in the GWAS data
N<-c(NA,NA,NA,NA)
#run sumstats putting these pieces together
INT_sumstats<-sumstats(files=files,ref=ref,trait.names=trait.names,se.logit=se.logit,linprob=linprob,N=N)
# Running sumstats for all files took 4 minutes and 52.6425039768219 seconds
#you would typically save the output for future use
#write.csv(INT_sumstats, file = "INT4_sumstats.csv")
To add a SNP to our common factor model we include a single line specifying that we want to regress the factor on a SNP, by adding F1~SNP to the model command. The model is then run for each SNP individually.
We specify that the model output should only save the regression of the factor on the SNP using the sub option (otherwise, the output will contain all factor loading and variance terms estimated for each SNP).
#userGWAS takes three necessary arguments:
#1. covstruc = the output from the ldsc function
covstruc<-LDSC_INT
#2. SNPs = output from sumstats function
SNPs<-INT_sumstats
#3. model = the model to be run
#here we specify a common factor model with the factor regressed on each SNP
model<-"F1=~MDD+PTSD+ALCH+ANX
F1~SNP"
#4. sub = optional argument specifying component of model output to be saved
sub<-"F1~SNP"
#5. parallel = optional argument specifying whether it should be run in parallel
#set to FALSE here just for the practical
parallel<-FALSE
#6. Q_SNP = optional argument specifying whether you want
#the heterogeneity index calculated for your factors
Q_SNP<-TRUE
#run the multivariate GWAS below
INT_GWAS<-userGWAS(covstruc=covstruc, SNPs=SNPs, model=model, sub=sub, parallel=parallel, Q_SNP=Q_SNP)
# because of the sub argument we only save the regression of the factor on the SNP
Let's look at the results:
##print the first five rows of the userGWAS output
INT_GWAS[[1]][1:5,]
## SNP CHR BP MAF A1 A2 lhs op rhs free label est
## 1 rs10030871 4 68786 0.0765408 T C F1 ~ SNP 6 9.072853e-04
## 2 rs6599368 4 69567 0.0755467 A T F1 ~ SNP 6 1.009834e-03
## 3 rs7678633 4 69713 0.0755467 G A F1 ~ SNP 6 1.051505e-03
## 4 rs13130581 4 70392 0.0725646 A G F1 ~ SNP 6 8.300959e-05
## 5 rs13125929 4 71566 0.0725646 T C F1 ~ SNP 6 4.857341e-04
## SE Z_Estimate Pval_Estimate chisq chisq_df chisq_pval AIC
## 1 0.004121185 0.22015156 0.8257531 4.905573 8 0.7676193 18.90557
## 2 0.004110347 0.24568102 0.8059292 4.839512 8 0.7745835 18.83951
## 3 0.004106863 0.25603613 0.7979229 4.901614 8 0.7680382 18.90161
## 4 0.004146428 0.02001954 0.9840278 5.029483 8 0.7544203 19.02948
## 5 0.004153705 0.11693995 0.9069076 5.787671 8 0.6710020 19.78767
## Q_SNP Q_SNP_df Q_SNP_pval error warning
## 1 3.622121 3 0.3052654 0 0
## 2 3.556059 3 0.3135639 0 0
## 3 3.618163 3 0.3057570 0 0
## 4 3.746033 3 0.2902261 0 0
## 5 4.504216 3 0.2119145 0 0
We essentially have a GWAS of our latent factor.
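If you wanted to take these factor summary statistics forward (e.g., for polygenic scoring), one approach suggested on the GenomicSEM wiki is to compute an expected effective sample size for the factor from common SNPs (a sketch; restricting to MAF between 10% and 40% follows that suggestion):

```r
# approximate effective sample size for the factor GWAS,
# computed from SNPs with MAF between 10% and 40%
res  <- INT_GWAS[[1]]
keep <- res$MAF >= .10 & res$MAF <= .40
N_hat <- mean(((res$Z_Estimate[keep] / res$est[keep])^2) /
              (2 * res$MAF[keep] * (1 - res$MAF[keep])))
N_hat
```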
#examine warnings; (0 = good to go)
table(INT_GWAS[[1]]$warning)
##
## 0
## 1000
#examine errors; (0 = good to go)
table(INT_GWAS[[1]]$error)
##
## 0
## 1000
Are there any genome-wide significant factor hits?
#any genome-wide significant factor hits?
table(INT_GWAS[[1]]$Pval_Estimate < 5e-8)
##
## FALSE
## 1000
What about genome-wide significant Q_SNP hits? This is a test of whether the effect of the SNP operates through the common factor or on the individual traits. It can serve as a quality control metric to remove SNPs that do not operate via the factor. In that sense, it also identifies trait-specific SNPs relative to the other traits that load on a factor.
#genome-wide significant Q_hits?
table(INT_GWAS[[1]]$Q_SNP_pval < 5e-8)
#trending significant Q_hits?
table(INT_GWAS[[1]]$Q_SNP_pval < 1e-5)
#What are these hits? (It would be interesting to look at which gene regions they are in.)
INT_GWAS[[1]][(INT_GWAS[[1]]$Q_SNP_pval < 1e-5)==TRUE,]
Below are two examples of more complex models that can be fit using GenomicSEM.
Optional exercise
Take a couple of minutes to chat with the people on your desk about the projects you are working on at the moment. Can you conceive of any ways in which GenomicSEM (or SEM) could be applied within your own projects? Perhaps you can draw this as a path diagram?
For example: Understanding how risk factors contribute to disease? Do they operate through shared pathways, or do they act on the disease through different genetic mechanisms?
Common factor models with multiple factors
We can use the common factor model to understand the underlying structure among different traits.
GWAS by subtraction
This model can be used to explore how genetically correlated traits differ. For example, educational attainment and cognitive performance are strongly genetically correlated. What contributes to the genetic component of educational attainment other than cognitive ability? We can subtract out the influence of cognitive performance from educational attainment to construct a latent variable Non-cog. Adding a SNP to this model allows us to understand the individual genetic components that contribute to Non-cog.
(Demange et al. 2021) https://www.nature.com/articles/s41588-020-00754-2
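For intuition, the GWAS-by-subtraction model from Demange et al. (2021) can be written in lavaan syntax roughly as follows (a sketch based on the published specification; it assumes you have munged summary statistics with trait names EA for educational attainment and CP for cognitive performance):

```r
# Cog ("C") loads on both EA and CP; NonCog ("NC") loads only on EA.
# Residual variances of the indicators are fixed to 0 so all genetic
# variance is routed through the two latent factors.
GWASbySubtraction <- "C  =~ NA*EA + CP
                      NC =~ NA*EA
                      C  ~~ 1*C
                      NC ~~ 1*NC
                      C  ~~ 0*NC
                      EA ~~ 0*EA
                      CP ~~ 0*CP
                      C  ~ SNP
                      NC ~ SNP"
```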
References