Conditional GWAS
Overview
Models
Here we present the results of a genome-wide association study (GWAS) conducted using pyseer (John A. Lees et al. 2018), a tool for identifying genetic associations with complex traits. The GWAS using the linear mixed effects model such that the minimum inhibitory concentration (MIC) of Penicillin G: \[\mathbf{y} \sim \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon\]
with
\[ \begin{align*} \mathbf{u} \sim \mathcal{N} \left( 0, \sigma^2_g \mathbf{G} \right), \quad &\mathbf{\epsilon} \sim \mathcal{N} \left( 0, \sigma^2_g \mathbf{I} \right) \end{align*} \]
Where \(\mathbf{G}\) is the similarity matrix that was obtained from the phylogenetic relationship of the samples, and \(\mathbf{I}\) is the identity matrix.
Symbol | Meaning |
---|---|
𝑦 | A vector containing the logarithm of the MIC data for each sample. |
𝑊 | A design matrix containing the conditional covariates: We used a categorical variable for each unique combination of pencillin binding proteins pbp1a, pbp2b and pbp2x, coded according to the allele scheme proposed in (Li et al. 2016), and benchmarked in (Li et al. 2017). The allele codes were assigned using an inhouse script, but should be comparable (although with different names) to those of Pathogenwatch pipeline. |
𝑎 | Fixed effects for the covariates. |
𝑋 | A design matrix containing the presence or absence of the unitigs. |
𝑏 | Fixed effects of the unitigs. |
𝐾 | The similarity matrix between all pairs of samples. We used a neighbor-joining phylogeny estimated using PopPUNK (John A. Lees et al. 2019). |
𝑢 | Random effects for each row of the kinship matrix. |
ε | Normally distributed error terms. |
Thus we assume that the MIC can be predicted by the covariates with the correlation structure implied by the phylogenetic relationship of the samples. If we take the expectation of the equation we obtain \[E(\mathbf{y}) \sim \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon\] \[ = E(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon)\] \[ = \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b}\] and variance \[V(\mathbf{y}) = V(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon)\] \[ \begin{aligned} V(\mathbf{y}) = V(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon) \\ = V(\mathbf{K} \mathbf{u} + \epsilon) \\ = V(\mathbf{K} \mathbf{u}) + V(\epsilon) \\ = \sigma^2_g \mathbf{G} + \sigma^2_g \mathbf{I} \end{aligned} \]
Filtering of unitigs and defining significance threshold
We filtered the unitigs based on a minimum allele frequency (MAF) between [0.01, 0.99] meaning that the unitigs have to be present in more than 1% and less than 99% of the isolates. We used the number of patterns in the unitigs to define a significance threshold of the p-values of the unitigs. We detected 974212 unique patterns translating to a significance threshold of \(p = 5.13\cdot 10^{-08}\). All downstream results and visualizations were reported for unitigs with a smaller p-value than the threshold. Many of these hits were then annotated using the nomenclature of Bakta (Schwengers et al. 2021). The annotations can be looked up on NCBI for further information.
Results
Significant hits ordered by effect size
In the figures below the unitigs are mapped to reference genomes and. As references we used multiple genomes, and additionally we annotated our
Summary of annotated regions matching the unitigs (Table)
Unitig sequences, significance, effect size and heritability (Table)
Unitigs mapped to reference genomes
In total we extracted 394 significant unitigs.
Note: In past version of this analysis we found 2527 significant unitigs. In the current analysis we have conditioned on all PBPs and assigned new codes to unobserved alleles. This removes most significant hits, demonstrating the importance of properly accounting for PBP composition in the data.
We mapped these unitigs to different reference genomes, summarized in the table below. Streptococcus pneumoniae has a large pan-genome, and none of the references contain all of the unitigs.
Reference strain | Description |
---|---|
ATCC70066 | Multidrug-Resistant Pandemic Clone Streptococcus pneumoniae. 106 out of 394 unitigs were mapped to this genome. https://doi.org/10.1128/JB.01343-08 |
Hu17 - NCBI reference. | Current NCBI reference genome. 116 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP020549.1 |
Hu17 | A high-level beta-lactam resistance and penicillin sensitive phenotype. 116 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_002076835.1/ |
EF3030 | A serotype 19F isolate that colonizes the nasopharynx of mice while being mostly noninvasive. 144 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6509521/ |
ST556 | A Multidrug-Resistant Isolate from an Otitis Media Patient. 124 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370858/ |
TIGR4 | From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 143 out of 394 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/ |
SN75752 | From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 135 out of 394 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/ |