Conditional GWAS

Published

January 3, 2023

Overview

Models

Here we present the results of a genome-wide association study (GWAS) conducted using pyseer (John A. Lees et al. 2018), a tool for identifying genetic associations with complex traits. The GWAS using the linear mixed effects model such that the minimum inhibitory concentration (MIC) of Penicillin G: \[\mathbf{y} \sim \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon\]

with

\[ \begin{align*} \mathbf{u} \sim \mathcal{N} \left( 0, \sigma^2_g \mathbf{G} \right), \quad &\mathbf{\epsilon} \sim \mathcal{N} \left( 0, \sigma^2_g \mathbf{I} \right) \end{align*} \]

Where \(\mathbf{G}\) is the similarity matrix that was obtained from the phylogenetic relationship of the samples, and \(\mathbf{I}\) is the identity matrix.

Table 1: Information about the GWAS model
Symbol	Meaning
𝑦	A vector containing the logarithm of the MIC data for each sample.
𝑊	A design matrix containing the conditional covariates: We used a categorical variable for each unique combination of pencillin binding proteins pbp1a, pbp2b and pbp2x, coded according to the allele scheme proposed in (Li et al. 2016), and benchmarked in (Li et al. 2017). The allele codes were assigned using an inhouse script, but should be comparable (although with different names) to those of Pathogenwatch pipeline.
𝑎	Fixed effects for the covariates.
𝑋	A design matrix containing the presence or absence of the unitigs.
𝑏	Fixed effects of the unitigs.
𝐾	The similarity matrix between all pairs of samples. We used a neighbor-joining phylogeny estimated using PopPUNK (John A. Lees et al. 2019).
𝑢	Random effects for each row of the kinship matrix.
ε	Normally distributed error terms.

Thus we assume that the MIC can be predicted by the covariates with the correlation structure implied by the phylogenetic relationship of the samples. If we take the expectation of the equation we obtain \[E(\mathbf{y}) \sim \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon\] \[ = E(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon)\] \[ = \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b}\] and variance \[V(\mathbf{y}) = V(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon)\] \[ \begin{aligned} V(\mathbf{y}) = V(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon) \\ = V(\mathbf{K} \mathbf{u} + \epsilon) \\ = V(\mathbf{K} \mathbf{u}) + V(\epsilon) \\ = \sigma^2_g \mathbf{G} + \sigma^2_g \mathbf{I} \end{aligned} \]

Filtering of unitigs and defining significance threshold

We filtered the unitigs based on a minimum allele frequency (MAF) between [0.01, 0.99] meaning that the unitigs have to be present in more than 1% and less than 99% of the isolates. We used the number of patterns in the unitigs to define a significance threshold of the p-values of the unitigs. We detected 974212 unique patterns translating to a significance threshold of \(p = 5.13\cdot 10^{-08}\). All downstream results and visualizations were reported for unitigs with a smaller p-value than the threshold. Many of these hits were then annotated using the nomenclature of Bakta (Schwengers et al. 2021). The annotations can be looked up on NCBI for further information.

Results

Significant hits ordered by effect size

In the figures below the unitigs are mapped to reference genomes and. As references we used multiple genomes, and additionally we annotated our

Figure 1: Interactive plot of unitigs mapped to annotated regions of references and assemblies. This scatter plot shows the relationship between the average effect size and the maximum -log10(p-value) of genes associated with penicillin resistance. The size of the points represents the number of unitigs, and the color represents the average minor allele frequency (MAF). Points are labeled with gene name. The plot highlights genes that have high effect size and large -log10(p-value) as well as high unitig count and high MAF value.

Summary of annotated regions matching the unitigs (Table)

Figure 2: Gene hits Annotated regions of mapped unitigs. The table summarizes the locations of unitigs in the genome, including the gene annotated region (gene), the number of unitigs in that region (HITS), the highest observed p-value (MAXP), the average minimum allele frequency (AVG_MAF), and the average effect size (AVG_BETA) of unitigs within the region. The table also includes a column (NAMES_FOR_DISPLAY) to indicate regions that have common gene names, which match the names in the interactive figures above.

Unitig sequences, significance, effect size and heritability (Table)

Figure 3: Unitig sequences. The table displays various characteristics of the significant unitigs including allele frequency (AF), p-value (filter-pvalue), likelihood (lrt-pvalue), effect size (beta), estimated standard error of the effect size (beta-std-err), estimated heritability of the unitig sequence (variant_h2), and the annotated region in which the unitig is found (annotation) as well as the nearest upstream and downstream annotated regions (gene-upstream gene-downstream gene).

Unitigs mapped to reference genomes

In total we extracted 394 significant unitigs.

Note: In past version of this analysis we found 2527 significant unitigs. In the current analysis we have conditioned on all PBPs and assigned new codes to unobserved alleles. This removes most significant hits, demonstrating the importance of properly accounting for PBP composition in the data.

We mapped these unitigs to different reference genomes, summarized in the table below. Streptococcus pneumoniae has a large pan-genome, and none of the references contain all of the unitigs.

Table 2: Information about the reference strains
Reference strain	Description
ATCC70066	Multidrug-Resistant Pandemic Clone Streptococcus pneumoniae. 106 out of 394 unitigs were mapped to this genome. https://doi.org/10.1128/JB.01343-08
Hu17 - NCBI reference.	Current NCBI reference genome. 116 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP020549.1
Hu17	A high-level beta-lactam resistance and penicillin sensitive phenotype. 116 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_002076835.1/
EF3030	A serotype 19F isolate that colonizes the nasopharynx of mice while being mostly noninvasive. 144 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6509521/
ST556	A Multidrug-Resistant Isolate from an Otitis Media Patient. 124 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370858/
TIGR4	From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 143 out of 394 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/
SN75752	From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 135 out of 394 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/

Figure 4: Unitigs mapped to various reference genomes. The reference strains used for the mapping are described in the table above. The data displayed when hovering over the points in the figure represents the annotation attributes extracted from the genes that overlap with the starting position of the unitig (basepair position). It’s noteworthy that the “product” and “gene” attributes can provide valuable insights into the function of the genomic regions that the unitigs overlap with.

References

Lees, John A, Marco Galardini, Stephen D Bentley, Jeffrey N Weiser, and Jukka Corander. 2018. “Pyseer: A Comprehensive Tool for Microbial Pangenome-Wide Association Studies.” Edited by Oliver Stegle. Bioinformatics 34 (24): 4310–12. https://doi.org/10.1093/bioinformatics/bty539.

Lees, John A., Simon R. Harris, Gerry Tonkin-Hill, Rebecca A. Gladstone, Stephanie W. Lo, Jeffrey N. Weiser, Jukka Corander, Stephen D. Bentley, and Nicholas J. Croucher. 2019. “Fast and Flexible Bacterial Genomic Epidemiology with PopPUNK.” Genome Research 29 (2): 304–16. https://doi.org/10.1101/gr.241455.118.

Li, Yuan, Benjamin J. Metcalf, Sopio Chochua, Zhongya Li, Robert E. Gertz, Hollis Walker, Paulina A. Hawkins, et al. 2016. “Penicillin-Binding Protein Transpeptidase Signatures for Tracking and Predicting β-Lactam Resistance Levels in Streptococcus Pneumoniae.” Edited by James M. Hughes. mBio 7 (3). https://doi.org/10.1128/mbio.00756-16.

Li, Yuan, Benjamin J. Metcalf, Sopio Chochua, Zhongya Li, Robert E. Gertz, Hollis Walker, Paulina A. Hawkins, Theresa Tran, Lesley McGee, and Bernard W. Beall. 2017. “Validation of β-Lactam Minimum Inhibitory Concentration Predictions for Pneumococcal Isolates with Newly Encountered Penicillin Binding Protein (PBP) Sequences.” BMC Genomics 18 (1). https://doi.org/10.1186/s12864-017-4017-7.

Schwengers, Oliver, Lukas Jelonek, Marius Alfred Dieckmann, Sebastian Beyvers, Jochen Blom, and Alexander Goesmann. 2021. “Bakta: Rapid and Standardized Annotation of Bacterial Genomes via Alignment-Free Sequence Identification.” Microbial Genomics 7 (11). https://doi.org/10.1099/mgen.0.000685.