Global genomic framework for typhoid

It’s been over a year since we published the first global whole-genome snapshot of nearly 2000 genomes of the typhoid bacterium, Salmonella Typhi, in Nature Genetics.

That paper focused on the emergence and global dissemination of what we’ve been calling for years the “H58” clone (see this blog post). This clone accounted for nearly half of all the isolates sequenced, and is a big deal because it tends to be multidrug resistant (MDR), carrying a suite of resistance genes that render all the cheap, first-line drugs like chloramphenicol, ampicillin, and trimethoprim-sulfamethoxazole useless for treatment. Detailed genomic epi studies show the local impact of the arrival of MDR H58 in countries as widespread as Malawi and Cambodia; and the emergence of a fluoroquinolone-resistant H58 sublineage in India and Nepal recently stopped a treatment trial because the current standard of care – ciprofloxacin – was resulting in frequent treatment failure.

While H58 is important, the global Typhi population contains a lot of genomic diversity outside the H58 clone, and we’ve now turned our attention to the rest of the population in a new paper in Nature Communications: “An extended genotyping framework for Salmonella enterica serovar Typhi, the cause of human typhoid”.

First, we decided that we needed to revisit the haplotyping scheme of Roumagnac et al (from which H58 gets its name), which was based on just ~80 genes, using the whole genome phylogeny. Here is the tree inferred from core genome SNPs in 1832 Typhi strains, with the old haplotypes indicated by the coloured ring around the outside. It’s pretty easy to see that some haplotypes (like H52 and H1) actually comprise multiple distinct phylogenetic lineages (low resolution), while others subdivide lineages (excessive resolution).


Whole genome SNP tree for 1832 strains, outer ring indicates haplotypes based on mutations in 80 genes as defined in Roumagnac et al, Science 2006.

We used BAPS to define genetic clusters at various levels (thanks to Tom Connor for running this). We settled on 3 levels of hierarchical clustering, indicated in the tree below:

• 4 nested primary clusters (inner-most ring; yellow, green, blue, red). These have 100% bootstrap support and are each characterised by >20 SNPs

• Clusters are further divided into 16 clades (middle ring and labels). The median pairwise distance between isolates in the same clade is 109 SNPs, while the inter-clade SNP distance averages 243 SNPs.

• Clades are further divided into 49 subclades, indicated by alternating background shading colours. The median pairwise distance between isolates in the same subclade is 25 SNPs.


Tree indicating new phylo-informed genotypes. Primary clusters 1-4 are indicated in the inner ring. Branch colours indicate clades, which are also labelled on the outside and coloured in the outer ring. Subclades are indicated by alternating background shading.
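The within-clade and within-subclade distances quoted in the list above are medians over pairwise SNP differences. Here is a minimal Python sketch of that calculation, assuming each isolate is represented by a string of alleles at the same variable sites. The function names and toy data are invented for illustration; this is not the code used in the paper.

```python
from itertools import combinations
from statistics import median


def snp_distance(a, b):
    """Count the sites at which two equal-length allele strings differ."""
    if len(a) != len(b):
        raise ValueError("allele strings must cover the same sites")
    return sum(x != y for x, y in zip(a, b))


def median_within_group_distance(alleles, groups):
    """Median SNP distance over all pairs of isolates that share a group.

    alleles: isolate name -> string of alleles at the variable sites
    groups:  isolate name -> group label (e.g. a clade or subclade)
    """
    members = {}
    for isolate, label in groups.items():
        members.setdefault(label, []).append(isolate)
    distances = [
        snp_distance(alleles[i], alleles[j])
        for isolates in members.values()
        for i, j in combinations(isolates, 2)
    ]
    return median(distances) if distances else None


# Toy data, invented for illustration only.
toy_alleles = {"s1": "AAAA", "s2": "AAAT", "s3": "AATT", "s4": "TTTT", "s5": "TTTA"}
toy_groups = {"s1": "1.1", "s2": "1.1", "s3": "1.1", "s4": "1.2", "s5": "1.2"}
print(median_within_group_distance(toy_alleles, toy_groups))
```

Passing clade labels or subclade labels as `groups` gives the statistic at either level of the hierarchy.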

One of the key reasons we wanted to define the phylogenetic lineages in this way is to make them easier to identify and talk about. I’ve always been a fan of MLST for this reason, since it’s much easier to talk about K. pneumoniae ST258, ST11, ST15 etc than ‘that lineage that has reference strain X in it’. So we introduce a hierarchical nomenclature system, similar to the one currently in use for Mycobacterium tuberculosis, where the 4 primary clusters (1, 2, 3, 4) are subdivided into 16 clades (1.1, 1.2; 2.1, 2.2, etc) which in turn are subdivided into 49 subclades (1.1.1, 1.1.2, etc). This has the advantage of conveying hierarchical relationships between groups – e.g. 2.2.1 and 2.2.2 are sister subclades within clade 2.2, which is a sister clade of 2.1.

The subclades are easier to distinguish in the collapsed rectangular tree on the right, where each subclade is represented by just one strain.

Some BAPS clusters were polyphyletic and consisted of isolates belonging to rare phylogenetic lineages whose common ancestor in the tree coincided with the common ancestor of an entire clade (n=9) or primary cluster (n=2). These groups contain isolates that, given increased numbers, may emerge as distinct clusters that form sister taxa within the parent clade (or primary cluster), and were given the suffix ‘.0’ rather than a defined cluster number (e.g. 3.0 or 3.1.0) to indicate non-equivalence with the properly differentiated sister clades (n=16) or subclades (n=49). As more genomes are added, these are expected to be more clearly differentiated into distinct groups and given proper clade/subclade designations.
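Because the names are dotted paths, the relationships between groups can be read straight off the labels. Here is a small Python sketch of that idea. The helper names are hypothetical, and the ‘.0’ check simply assumes that a final label of 0 marks an undifferentiated group, as described above.

```python
def parent(genotype):
    """Return the enclosing group, e.g. '2.2.1' -> '2.2'; None for a primary cluster."""
    parts = genotype.split(".")
    return ".".join(parts[:-1]) if len(parts) > 1 else None


def are_sisters(a, b):
    """True if two different groups sit directly under the same parent."""
    return a != b and parent(a) is not None and parent(a) == parent(b)


def is_undifferentiated(genotype):
    """True for '.0' groups such as '3.0' or '3.1.0' (assumed convention)."""
    parts = genotype.split(".")
    return len(parts) > 1 and parts[-1] == "0"


print(parent("2.2.1"), are_sisters("2.2.1", "2.2.2"), is_undifferentiated("3.1.0"))
```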

Next we defined a set of 68 SNPs that can be used to genotype isolates into these groups. We chose one SNP for each primary cluster, clade and subclade (preferentially choosing intragenic SNPs in well-conserved core genes). The SNPs are detailed in a supplementary spreadsheet, and we provide a script to assign strains to genotypes based on an input BAM or VCF file generated by mapping to the reference genome for Typhi strain CT18.

An isolate that belongs to a differentiated subclade such as 2.1.4 will be hierarchically identified by carrying the derived allele for primary cluster 2 (but not the nested clusters 3 and 4); the derived allele for clade 2.1 (but no other clades) and the derived allele for subclade 2.1.4 (but no other subclades). It is possible for an isolate to carry derived alleles for a primary cluster and clade with no further differentiation into subclade.
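The hierarchical logic can be sketched in a few lines of Python. This is an illustration only, not the logic of the script we provide. It assumes the marker SNPs have already been called, so the input is just the set of group labels for which the isolate carries the derived allele. It also assumes that, because the primary clusters are nested, the deepest (highest-numbered) primary cluster carried is the one to report.

```python
def assign_genotype(derived):
    """Assign a genotype from the set of groups with a derived marker allele.

    derived: set of labels such as {"2", "2.1", "2.1.4"}
    Returns the most specific consistent label, or None if no primary
    cluster allele is derived.
    """
    primary = [g for g in derived if g.count(".") == 0]
    if not primary:
        return None
    # Assumption: primary clusters are nested, so report the deepest one.
    best = max(primary, key=int)
    # Descend one level at a time: clade (one dot), then subclade (two dots).
    for depth in (1, 2):
        candidates = [
            g for g in derived
            if g.count(".") == depth and g.startswith(best + ".")
        ]
        if len(candidates) > 1:
            raise ValueError(f"ambiguous calls under {best}: {sorted(candidates)}")
        if not candidates:
            break  # no further differentiation below this level
        best = candidates[0]
    return best


print(assign_genotype({"2", "2.1", "2.1.4"}))
```

An isolate with a derived primary cluster and clade allele but no subclade allele is reported at clade level, matching the last case described above.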

The clone formerly known as H58
Under the new scheme, the infamous H58 clone is named subclade 4.3.1, which so far has no sister clades. I suspect those of us familiar with Typhi population genomics will keep referring to it informally as H58 for some time, since that name is now well known… but I will try to re-train myself to call it 4.3.1 (H58).

Now the fun part: exploring the geographical distribution of these lineages.


Figure 1c from the paper. Pie colours indicate clades found in each WHO region in the global data set (key is in the tree figure above).

In the paper we go on to show that:
• clades are widely geographically distributed, while subclades are geographically constrained (see heatmap below);
• genotyping can be used to predict the geographical origin of travel-associated typhoid in patients in London;
• even better predictions can be obtained based on genome-wide SNP distances to our reference panel of >1800 isolates… but of course that involves a lot more computationally intensive comparisons than a quick screen of a new isolate’s BAM file.


Figure 2 of the paper, showing the geographical distribution of subclades, which shows most subclades are restricted to a single region. For this analysis, the effect of local outbreaks has been minimised by replacing groups of strains that share the same subclade and year and country of isolation with a single representative strain.
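The outbreak-thinning step in that caption amounts to keeping one strain per (subclade, year, country) combination. Here is a minimal Python sketch, assuming each strain is a dictionary with those three fields. Keeping the first strain seen as the representative is an arbitrary choice made here for illustration, not necessarily the rule used in the paper.

```python
def one_per_group(strains):
    """Keep a single representative per (subclade, year, country).

    strains: iterable of dicts with 'subclade', 'year' and 'country' keys.
    The first strain seen for each combination is kept (arbitrary choice).
    """
    seen = set()
    representatives = []
    for strain in strains:
        key = (strain["subclade"], strain["year"], strain["country"])
        if key not in seen:
            seen.add(key)
            representatives.append(strain)
    return representatives


# Toy records, invented for illustration only.
toy_strains = [
    {"name": "a", "subclade": "3.1.1", "year": 2010, "country": "Nigeria"},
    {"name": "b", "subclade": "3.1.1", "year": 2010, "country": "Nigeria"},
    {"name": "c", "subclade": "3.1.1", "year": 2011, "country": "Nigeria"},
    {"name": "d", "subclade": "4.3.1", "year": 2010, "country": "Nepal"},
]
print([s["name"] for s in one_per_group(toy_strains)])
```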

You can read the full details in the paper, but here I just want to highlight that you can now explore the global genomic framework for Typhi – including genotype designations as well as temporal and geographic data – interactively in MicroReact.



How does genotyping help with studying local populations?

We have already begun using the new genotyping scheme in local typhoid studies. I find this a really helpful way to describe/summarise the local populations, and place them in the context of the global population without resorting to large trees.

For example, in this recent Nigerian study, we described the population like this: “The majority of isolates (84/128, 66%) belonged to genotype 3.1.1, which is relatively common across Africa, predominantly western and central countries. In the wider African collection genotype 3.1.1 was represented by isolates from neighbouring Cameroon and across West Africa (Benin, Togo, Ivory Coast, Burkina Faso, Mali, Guinea and Mauritania) suggesting long-term inter-country exchange within the region. Most of the remaining isolates belonged to four other genotypes (4.1, 2.2, 2.3.1 and 0.0.3).”

Of course genotype assignment is not the end of the story – we still want to build whole-genome trees to explore the relationships of local isolates with those from other countries. Importantly, working with genotypes means that we can achieve this without needing to build a megatree of all isolates in the local + global collections (n>2000). Instead, we can use the genotypes to identify which strains from the global collection are relatives of the Nigerian isolates, and build a much smaller tree that still captures all of the information about transmission/transfer between Nigeria and other countries:


The tree and map were made using MicroReact; you can recreate them here: http://microreact.org/project/styphi_nigeria To get this colour scheme, just click on the eye icon (bottom left) and select ‘country’; to get the fan-style tree, click the settings button (top right) and click the fan shape.
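The selection step behind a tree like this can be sketched in Python as follows. This assumes genotypes have already been assigned to both the local and the global isolates, and it treats ‘relatives’ as global isolates with exactly the same genotype as at least one local isolate. That is a simplifying assumption made here; the function names and toy data are also invented.

```python
from collections import Counter


def genotype_summary(local):
    """List (genotype, count, rounded percentage), most common first.

    local: isolate name -> genotype label
    """
    counts = Counter(local.values())
    total = len(local)
    return [(g, n, round(100 * n / total)) for g, n in counts.most_common()]


def select_context(local, global_collection):
    """Names of global isolates whose genotype also occurs locally."""
    wanted = set(local.values())
    return sorted(
        name for name, genotype in global_collection.items() if genotype in wanted
    )


# Toy data, invented for illustration only.
toy_local = {"n1": "3.1.1", "n2": "3.1.1", "n3": "4.1"}
toy_global = {"g1": "3.1.1", "g2": "2.2", "g3": "4.1", "g4": "4.3.1"}
print(genotype_summary(toy_local))
print(select_context(toy_local, toy_global))
```

Only the selected global isolates plus the local ones then need to go into the tree, rather than the whole collection.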

Another example is in our recent paper on isolates collected in Thailand before and after the introduction of their national vaccination program (pre-print here):

  • Genotype 3.2.1 was the most common (n=14, 32%), followed by genotype 2.1.7 (n=10, 23%)
  • Genotypes 2.0 (n=1, 2%) and 4.1 (n=3, 7%) were observed only in 1973 (pre-vaccine period)
  • Genotypes 2.1.7 (n=10, 23%), 2.3.4 (n=1, 2%), 3.4.0 (n=2, 5%), 3.0.0 (n=3, 7%), 3.1.2 (n=2, 5%), were observed only after 1981 (post-vaccine period)
  • Genotypes 3.2.1 and 2.4.0 were observed amongst both pre- and post-vaccine isolates, but the subclade phylogenies show that these are more likely to represent re-introduction of strains from neighbouring countries than persistence within Thailand throughout the immunisation program.

Clarifier: Bacterial populations and communities

[This was originally posted by Kat on her BacPathGenomics blog, April 2011]

Two areas where next-gen sequencing is making a big impact in the bacterial world are the analysis of ‘bacterial populations’ and ‘bacterial communities’. While these might sound similar, they are actually very different.

In common parlance we sometimes use ‘population’ and ‘community’ somewhat interchangeably in talking about groups of humans. We might say that in Melbourne, coffee drinking is common in the local population, or that it is common in the local community. What we mean is that it’s common among people living in Melbourne.

In biology, the term ‘population’ has a specific meaning – a group of individuals of the same species (i.e. able to interbreed; but note this concept is complex in bacteria), defined by time and space. So we can talk about the current human population of the Earth, or the human population of Melbourne 20 years ago. Note that this is intimately tied up with the concept of species, as separation into two distinct populations is a key step towards diverging into different species. On the other hand, ‘community’ refers more generally to the group of organisms inhabiting a particular ecological niche, which could include any number of species. So for example we could talk about the population of karri trees (a species of eucalyptus found in the south west of Western Australia), or the community of plants inhabiting the karri forest.

Bacterial populations

When we talk about bacterial populations, what we mean is investigating the population structure of a particular bacterial species/subtype… in theory this aims to understand the population in its entirety, but in practice usually involves studying lots of individual members of the population and making inferences about the population as a whole. We can attempt to understand populations at different levels of localisation…. e.g. we can study a highly localised population, like the population of Salmonella Typhi inhabiting the gall bladder of a typhoid carrier; or more expansive populations of Salmonella Typhi circulating in a city, a country or around the globe.

Sequencing has been a great tool for understanding bacterial populations, by allowing lots of individual members of a population (i.e. individual bacterial isolates or colonies) to be compared at the sequence level. Sequence data is ideal for this, as the differences between individuals are often tiny (i.e. there is very little variation) since they belong to a single population, and DNA sequence data allows us to detect single nucleotide changes (i.e. provides high resolution). Also, since we have well-developed models of sequence evolution (i.e. how nucleotide changes accumulate), sequence data can be interpreted using phylogenetic analysis. This really kicked off a decade ago with multi-locus sequence typing (MLST; see wikipedia entry(!) or Maiden et al, 1998 for more info) and is now expanding rapidly with the advent of sequencing platforms that allow whole genomes of hundreds of isolates to be sequenced (e.g. 96 bacterial isolates can be readily sequenced in a single run of the Illumina HiSeq, using multiplexing).

This kind of analysis can be used in public health microbiology and infectious disease epidemiology to trace outbreaks or transmission (sometimes called molecular epidemiology or genomic epidemiology). It can also be used to study the evolution of drug resistance or pathogenesis/virulence in bacterial populations (microevolution, since it is occurring within populations), or the impact of a novel vaccine or drug on a given bacterial population, all of which can be useful for designing and monitoring public health interventions or making treatment recommendations.

A great recent example is the study by Nick Croucher (an immensely talented PhD student) from the Sanger Institute, and numerous collaborators, who compared the genomes of 240 Streptococcus pneumoniae isolates of the PMEN1 subtype, collected from all over the world since 1984. By comparing the genomes of these isolates, they found evidence of frequent homologous recombination with other S. pneumoniae, including exchange of genes encoding the capsule targeted by vaccination and acquisition of drug resistance genes. Assuming the sequenced isolates are reasonably representative of the global population of S. pneumoniae PMEN1, this indicates that the PMEN1 population is not isolated from the rest of the S. pneumoniae population but that there is constant gene exchange within and between S. pneumoniae groups, allowing the bacteria to escape the effects of human interventions including vaccine-induced immunity and exposure to antimicrobial drugs. We already ‘knew’ this could happen in bacterial pathogen populations, but this study provides direct evidence of it occurring in response to a specific vaccine and specific drugs used for treatment. See pubmed entry, unfortunately you need access to Science magazine to read the article.

Bacterial communities

On the other hand, when we talk about bacterial communities, what we mean is investigating the communities of bacteria present in a given sample. This is akin to walking through the forest and taking note of each plant you see, and the analysis methods borrow heavily from ecology. Studies of bacterial communities are being done in just about every kind of sample in which you would expect to find bacteria – from environmental samples (e.g. underwater caves; windscreen splatter) to human body sites (faeces or the gut; skin; nasal passages; read more at the Human Microbiome Project site).

The analysis usually focuses on determining which bacterial taxa (e.g. a genus, species or subgroup) were present in each sample, and their relative abundance. These can be compared across samples to identify taxa that are only present in certain kinds of environments, or whose presence is associated with another property of the sample (e.g. presence in the nose may be associated with development of otitis media). Communities can be examined more holistically to identify broad differences in the bacterial community structures associated with different samples.
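As a rough illustration of these two basic calculations, here is a Python sketch that assumes each sample has already been reduced to a table of read counts per taxon. The function names and the shape of the input are assumptions made for this example.

```python
def relative_abundance(counts):
    """Convert taxon -> read count into taxon -> fraction of the sample."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


def taxa_specific_to(samples, environment):
    """Taxa seen in samples from one environment and in no others.

    samples: list of (environment label, taxon -> read count) pairs
    """
    inside, outside = set(), set()
    for label, counts in samples:
        present = {taxon for taxon, n in counts.items() if n > 0}
        if label == environment:
            inside |= present
        else:
            outside |= present
    return inside - outside


# Toy counts, invented for illustration only.
toy_samples = [
    ("nose", {"Moraxella": 30, "Staphylococcus": 10}),
    ("gut", {"Bacteroides": 50, "Staphylococcus": 0}),
    ("gut", {"Bacteroides": 20, "Escherichia": 5}),
]
print(relative_abundance(toy_samples[0][1]))
print(taxa_specific_to(toy_samples, "gut"))
```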

Sequencing has dramatically improved the ease with which bacterial communities can be studied, via sequencing of DNA extracted from a given sample (e.g. a soil sample; a fecal sample). Two approaches are possible – sequence the raw DNA extract or amplify a conserved bacterial gene (using PCR) and sequence that. The first is true ‘metagenomics’, as you are sequencing all of the genomes present in the original sample, but this takes a lot of sequencing effort and you may not need or want to know every single gene present in the sample. At the moment, Illumina platforms are most appropriate for this application as they have the highest throughput, however their short read lengths make assembly and analysis difficult. The second way, which usually targets the conserved 16S ribosomal RNA gene (‘16S sequencing’), is a more tractable way of determining what species/subgroups of bacteria are present in the sample and estimating their relative abundance. Multiplexing can be achieved by incorporating sample-specific barcodes into the amplicons during PCR, allowing hundreds of samples to be analysed in a single sequencing run.
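To make the barcode idea concrete, here is a minimal Python sketch of sorting reads back into samples after a multiplexed run. It assumes the barcode sits at the very start of each read, that all barcodes have the same length, and that only exact matches are accepted. Real demultiplexing tools typically also tolerate sequencing errors, which this sketch does not attempt.

```python
def demultiplex(reads, barcodes):
    """Split reads by leading barcode.

    reads:    iterable of read sequences (strings)
    barcodes: barcode sequence -> sample name
    Returns (sample name -> list of reads with the barcode trimmed off,
             list of reads that matched no barcode).
    """
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes all barcodes have the same length")
    (k,) = lengths
    by_sample = {}
    unassigned = []
    for read in reads:
        sample = barcodes.get(read[:k])
        if sample is None:
            unassigned.append(read)
        else:
            by_sample.setdefault(sample, []).append(read[k:])
    return by_sample, unassigned


# Toy barcodes and reads, invented for illustration only.
toy_barcodes = {"ACGT": "soil", "TTAA": "gut"}
toy_reads = ["ACGTGGGG", "TTAACCCC", "ACGTAAAA", "GGGGGGGG"]
print(demultiplex(toy_reads, toy_barcodes))
```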