Disentangling
  • Home
  • About this project

On this page

  • Overview
    • Models
    • Filtering of unitigs and defining significance threshold
  • Results
    • Significant hits ordered by effect size
    • Summary of annotated regions matching the unitigs (Table)
    • Unitig sequences, significance, effect size and heritability (Table)
    • Unitigs mapped to reference genomes
  • References

Conditional GWAS

Published

January 3, 2023

Overview

Models

Here we present the results of a genome-wide association study (GWAS) conducted using pyseer (John A. Lees et al. 2018), a tool for identifying genetic associations with complex traits. The GWAS using the linear mixed effects model such that the minimum inhibitory concentration (MIC) of Penicillin G: y∼Wa+Xb+Ku+ϵ

with

u∼N(0,σg2G),ϵ∼N(0,σg2I)

Where G is the similarity matrix that was obtained from the phylogenetic relationship of the samples, and I is the identity matrix.

Table 1: Information about the GWAS model
Symbol Meaning
𝑦 A vector containing the logarithm of the MIC data for each sample.
𝑊 A design matrix containing the conditional covariates: We used a categorical variable for each unique combination of pencillin binding proteins pbp1a, pbp2b and pbp2x, coded according to the allele scheme proposed in (Li et al. 2016), and benchmarked in (Li et al. 2017). The allele codes were assigned using an inhouse script, but should be comparable (although with different names) to those of Pathogenwatch pipeline.
𝑎 Fixed effects for the covariates.
𝑋 A design matrix containing the presence or absence of the unitigs.
𝑏 Fixed effects of the unitigs.
𝐾 The similarity matrix between all pairs of samples. We used a neighbor-joining phylogeny estimated using PopPUNK (John A. Lees et al. 2019).
𝑢 Random effects for each row of the kinship matrix.
ε Normally distributed error terms.

Thus we assume that the MIC can be predicted by the covariates with the correlation structure implied by the phylogenetic relationship of the samples. If we take the expectation of the equation we obtain E(y)∼Wa+Xb+Ku+ϵ =E(Wa+Xb+Ku+ϵ) =Wa+Xb and variance V(y)=V(Wa+Xb+Ku+ϵ) V(y)=V(Wa+Xb+Ku+ϵ)=V(Ku+ϵ)=V(Ku)+V(ϵ)=σg2G+σg2I

Filtering of unitigs and defining significance threshold

We filtered the unitigs based on a minimum allele frequency (MAF) between [0.01, 0.99] meaning that the unitigs have to be present in more than 1% and less than 99% of the isolates. We used the number of patterns in the unitigs to define a significance threshold of the p-values of the unitigs. We detected 974212 unique patterns translating to a significance threshold of p=5.13⋅10−08. All downstream results and visualizations were reported for unitigs with a smaller p-value than the threshold. Many of these hits were then annotated using the nomenclature of Bakta (Schwengers et al. 2021). The annotations can be looked up on NCBI for further information.

Results

Significant hits ordered by effect size

In the figures below the unitigs are mapped to reference genomes and. As references we used multiple genomes, and additionally we annotated our

pbp1adusBsufDrecF-2024610152025
0.000.250.500.75Average AFAverage effect sizeMaximum -log10(p-value)
plotly-logomark

Figure 1: Interactive plot of unitigs mapped to annotated regions of references and assemblies. This scatter plot shows the relationship between the average effect size and the maximum -log10(p-value) of genes associated with penicillin resistance. The size of the points represents the number of unitigs, and the color represents the average minor allele frequency (MAF). Points are labeled with gene name. The plot highlights genes that have high effect size and large -log10(p-value) as well as high unitig count and high MAF value.

Summary of annotated regions matching the unitigs (Table)

test
genehitsmaxpavg_mafavg_betanames_for_display
gene
hits
maxp
avg_maf
avg_beta
names_for_display
1cds-AFC94349.1127.09312646527790.016
2cds-AFC94350.2127.09312646527790.016
3cds-AFC94371.1127.09312646527790.01096
4cds-USV25787.1327.09312646527790.01056666666666674.84
5cds-AFC94373.1227.09312646527790.011354.305
6cds-SPNHU17_RS033001127.09312646527790.01234545454545453.71818181818182
7cds-WP_000766577.1|cds-SPNHU17_RS03300116.56224943717960.01163.6
8cds-WP_001032463.117.450996737974210.0173.51
9cds-USV25279.1327.09312646527790.01306666666666673.45666666666667
10pabB3127.09312646527790.01272580645161293.38454838709677pabB
Showing 1 to 10 of 57 entries
Previous123456Next

Figure 2: Gene hits Annotated regions of mapped unitigs. The table summarizes the locations of unitigs in the genome, including the gene annotated region (gene), the number of unitigs in that region (HITS), the highest observed p-value (MAXP), the average minimum allele frequency (AVG_MAF), and the average effect size (AVG_BETA) of unitigs within the region. The table also includes a column (NAMES_FOR_DISPLAY) to indicate regions that have common gene names, which match the names in the interactive figures above.

Unitig sequences, significance, effect size and heritability (Table)

Unitigsaffilter-pvaluelrt-pvaluebetabeta-std-errvariant_h2annotation
Unitigs
af
filter-pvalue
lrt-pvalue
beta
beta-std-err
variant_h2
annotation
1GAGATTCCAGTCAAGGAAGTTCGTGATGAAATGATTG0.9813.14e-371.5e-8-1.640.290.0677NZ_CP020549.1:840583-840619;sufD;sufD;sufD
2CTGCTAATTTAAGCGGGCATCCTGATAGAACCAAAGCTCGAAGA0.01244.38e-103.45e-142.510.3310.0906NZ_CP020549.1:670101-670144;cds-WP_000766577.1;cds-WP_000766577.1;cds-WP_000766577.1
3CTAACGTAGTAGAAAAAAACTCCTCCGCCTAAGAC0.84601.57e-8-2.970.5250.0676NZ_CP020549.1:390377-390411;pbp1a;pbp1a;pbp1a
4GCATGGTCAATATCAACACAAAAAACAATTGTTTT0.990.001251.22e-8-0.6510.1140.0681NZ_CP020549.1:523950-523984;cds-WP_000229101.1;cds-WP_000229101.1;cds-WP_000229101.1
5ACTGAGAGGTGTTAATACTTGACGATTGACTTCTATC0.01244.94e-101.58e-152.840.3560.0952NZ_CP020549.1:669462-669498;pabB;pabB;pabB
6CATAGGCATTGTAGCCCGCCTCCTGCTCTATC0.01081.94e-128.07e-2860.5470.13NZ_CP020549.1:668601-668632;pabB;pabB;pabB
7CACAATCATCATGTTTTCAGAGCGATTTTTGGGGTCCTGTTC0.01241.11e-91.58e-152.840.3560.0952NZ_CP020549.1:668779-668820;pabB;pabB;pabB
8AAGGTCAAGACTTTCTCTTCTACCTGTCCTCTTTCCAGCAAATGCTGACGGTAAATTCCTG0.01241.81e-91.58e-152.840.3560.0952NZ_CP020549.1:669716-669776;pabB;pabB;pabB
9TTGAGCGATAGCTTCTTTATTGGCAACAGGGAT0.9882.9e-152.43e-8-1.120.20.0667NZ_CP020549.1:1809049-1809081;groL;groL;groL
10GTCAATATCAACACAAAAAACAATTGTTTTAT0.990.0009342.14e-8-0.6330.1130.067NZ_CP020549.1:523955-523986;cds-WP_000229101.1;cds-WP_000229101.1;cds-WP_000229101.1
Showing 1 to 10 of 394 entries
Previous12345…40Next

Figure 3: Unitig sequences. The table displays various characteristics of the significant unitigs including allele frequency (AF), p-value (filter-pvalue), likelihood (lrt-pvalue), effect size (beta), estimated standard error of the effect size (beta-std-err), estimated heritability of the unitig sequence (variant_h2), and the annotated region in which the unitig is found (annotation) as well as the nearest upstream and downstream annotated regions (gene-upstream gene-downstream gene).

Unitigs mapped to reference genomes

In total we extracted 394 significant unitigs.

Note: In past version of this analysis we found 2527 significant unitigs. In the current analysis we have conditioned on all PBPs and assigned new codes to unobserved alleles. This removes most significant hits, demonstrating the importance of properly accounting for PBP composition in the data.

We mapped these unitigs to different reference genomes, summarized in the table below. Streptococcus pneumoniae has a large pan-genome, and none of the references contain all of the unitigs.

Table 2: Information about the reference strains
Reference strain Description
ATCC70066 Multidrug-Resistant Pandemic Clone Streptococcus pneumoniae. 106 out of 394 unitigs were mapped to this genome. https://doi.org/10.1128/JB.01343-08
Hu17 - NCBI reference. Current NCBI reference genome. 116 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP020549.1
Hu17 A high-level beta-lactam resistance and penicillin sensitive phenotype. 116 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_002076835.1/
EF3030 A serotype 19F isolate that colonizes the nasopharynx of mice while being mostly noninvasive. 144 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6509521/
ST556 A Multidrug-Resistant Isolate from an Otitis Media Patient. 124 out of 394 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370858/
TIGR4 From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 143 out of 394 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/
SN75752 From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 135 out of 394 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/
  • ATCC700669
  • Hu17 - reference
  • Hu17
  • EF3030
  • ST556
  • TIGR4
  • SN75752
0200000400000600000800000100000012000001400000160000018000002000000220000001020
log10.p
plotly-logomark
40000060000080000010000001200000140000016000001800000200000001020
log10.p
plotly-logomark
40000060000080000010000001200000140000016000001800000200000001020
log10.p
plotly-logomark
020000040000060000080000010000001200000140000016000001800000200000001020
log10.p
plotly-logomark
020000040000060000080000010000001200000140000016000001800000200000001020
log10.p
plotly-logomark
020000040000060000080000010000001200000140000016000001800000200000001020
log10.p
plotly-logomark
020000040000060000080000010000001200000140000016000001800000200000001020
log10.p
plotly-logomark

Figure 4: Unitigs mapped to various reference genomes. The reference strains used for the mapping are described in the table above. The data displayed when hovering over the points in the figure represents the annotation attributes extracted from the genes that overlap with the starting position of the unitig (basepair position). It’s noteworthy that the “product” and “gene” attributes can provide valuable insights into the function of the genomic regions that the unitigs overlap with.

References

Lees, John A, Marco Galardini, Stephen D Bentley, Jeffrey N Weiser, and Jukka Corander. 2018. “Pyseer: A Comprehensive Tool for Microbial Pangenome-Wide Association Studies.” Edited by Oliver Stegle. Bioinformatics 34 (24): 4310–12. https://doi.org/10.1093/bioinformatics/bty539.
Lees, John A., Simon R. Harris, Gerry Tonkin-Hill, Rebecca A. Gladstone, Stephanie W. Lo, Jeffrey N. Weiser, Jukka Corander, Stephen D. Bentley, and Nicholas J. Croucher. 2019. “Fast and Flexible Bacterial Genomic Epidemiology with PopPUNK.” Genome Research 29 (2): 304–16. https://doi.org/10.1101/gr.241455.118.
Li, Yuan, Benjamin J. Metcalf, Sopio Chochua, Zhongya Li, Robert E. Gertz, Hollis Walker, Paulina A. Hawkins, et al. 2016. “Penicillin-Binding Protein Transpeptidase Signatures for Tracking and Predicting β-Lactam Resistance Levels in Streptococcus Pneumoniae.” Edited by James M. Hughes. mBio 7 (3). https://doi.org/10.1128/mbio.00756-16.
Li, Yuan, Benjamin J. Metcalf, Sopio Chochua, Zhongya Li, Robert E. Gertz, Hollis Walker, Paulina A. Hawkins, Theresa Tran, Lesley McGee, and Bernard W. Beall. 2017. “Validation of β-Lactam Minimum Inhibitory Concentration Predictions for Pneumococcal Isolates with Newly Encountered Penicillin Binding Protein (PBP) Sequences.” BMC Genomics 18 (1). https://doi.org/10.1186/s12864-017-4017-7.
Schwengers, Oliver, Lukas Jelonek, Marius Alfred Dieckmann, Sebastian Beyvers, Jochen Blom, and Alexander Goesmann. 2021. “Bakta: Rapid and Standardized Annotation of Bacterial Genomes via Alignment-Free Sequence Identification.” Microbial Genomics 7 (11). https://doi.org/10.1099/mgen.0.000685.