Global genomic framework for typhoid

It’s been over a year since we published the first global whole-genome snapshot of nearly 2000 genomes of the typhoid bacterium, Salmonella Typhi in Nature Genetics.

That paper focused on the emergence and global dissemination of what we’ve been calling for years the “H58” clone (see this blog post). This clone accounted for nearly half of all the isolates sequenced, and is a big deal because it tends to be multidrug resistant (MDR), carrying a suite of resistance genes that render all the cheap, first-line drugs like chloramphenicol, ampicillin, and trimethoprim-sufamethoxazole useless for treatment. Detailed genomic epi studies show the local impact of the arrival of MDR H58 in countries as widespread as Malawi and Cambodia; and the emergence of fluoroquinolone resistant H58 sublineage in India and Nepal recently stopped a treatment trial because the current standard of care – ciprofloxacin – was resulting in frequent treatment failure.

While H58 is important, the global Typhi population contains a lot of genomic diversity outside the H58 clone, and we’ve turned our attention to the rest of the population now in a new paper in Nature Communications: “An extended genotyping framework for Salmonella enterica serovar Typhi, the cause of human typhoid

First, we decided that we needed to revisit the haplotyping scheme of Roumagnac et al (from which H58 gets its name), which was based on just ~80 genes, using the whole genome phylogeny. Here is the tree inferred from core genome SNPs in 1832 Typhi strains, with the old haplotypes indicated by the coloured ring around the outside. It’s pretty easy to see that some haplotypes (like H52 and H1) actually comprise multiple distinct phylogenetic lineages (low resolution), while others subdivide lineages (excessive resolution).


Whole genome SNP tree for 1832 strains, outer ring indicates haplotypes based on mutations in 80 genes as defined in Roumagnac et al, Science 2006.

We used BAPS to define genetic clusters at various levels (thanks to Tom Connor for running this). We settled on 3 levels of hierarchical clustering, indicated in the tree below:

• 4 nested primary clusters (inner-most ring; yellow, green, blue, red). These have 100% bootstrap support and are each characterised by >20 SNPs

• Clusters are further divided into 16 clades (middle ring and labels). The median pairwise distance between isolates in the same clade is 109 SNPs, while the inter-clade SNP distance averages 243 SNPs.

• Clades are further divided into 49 subclades, indicated by alternating background shading colours. The median pairwise distance between isolates in the same subclade is 25 SNPs.


Tree indicating new phylo-informed genotypes. Primary clusters 1-4 are indicated in the inner ring. Branch colours indicate clades, which are also labelled on the outside and coloured in the outer ring. Subclades are indicated by alternating background shading.

subcladerectOne of the key reasons we wanted to define the phylogenetic lineages in this way is to make them easier to identify and talk about. I’ve always been a fan of MLST for this reason, since it’s much easier to talk about K. pneumoniae ST258, ST11, ST15 etc than ‘that lineage that has reference strain X in it’. So we introduce a hierarchical nomenclature system, similar to the one currently in use for Mycobacterium tuberculosis, where the 4 primary clusters (1, 2, 3, 4) are subdivided into 16 clades (1.1, 1.2; 2.1, 2.2, etc) which in turn are subdivided into 49 subclades (1.1.1, 1.1.2, etc). This has the advantage of conveying hierarchical relationships between groups – e.g. 2.2.1 and 2.2.2 are sister subclades within clade 2.2, which is a sister clade of 2.1.

The subclades are easier to distinguish in the collapsed rectangular tree on the right, where each subclade is represented by just one strain.

Some BAPS clusters were polyphyletic and consisted of isolates belonging to rare phylogenetic lineages whose common ancestor in the tree coincided with the common ancestor of an entire clade (n=9) or primary cluster (n=2). These groups contain isolates that, given increased numbers, may emerge as distinct clusters that form sister taxa within the parent clade (or primary cluster), and were given the suffix ‘.0’ rather than a defined cluster number (e.g. 3.0 or 3.1.0) to indicate non-equivalence with the properly differentiated sister clades (n=16) or subclades (n=49). As more genomes are added, these are expected to be more clearly differented into distinct groups and given proper clade/subclade designations.

Next we defined a set of 68 SNPs that can be used to genotype isolates into these groups. We chose one SNP for each primary cluster, clade and subclade (preferentially choosing intragenic SNPs in well-conserved core genes). The SNPs are detailed in a supplementary spreadsheet, and we provide a script to assign strains to genotypes based on an input BAM or VCF file generated by mapping to the reference genome for Typhi strain CT18.

An isolate that belongs to a differentiated subclade such as 2.1.4 will be hierarchically identified by carrying the derived allele for primary cluster 2 (but not the nested clusters 3 and 4); the derived allele for clade 2.1 (but no other clades) and the derived allele for subclade 2.1.4 (but no other subclades). It is possible for an isolate to carry derived alleles for a primary cluster and clade with no further differentiation into subclade.

The clone formally known as H58
Under the new scheme, the infamous H58 clone is named subclade 4.3.1, which so far has no sister clades. I suspect those of us familiar with Typhi population genomics will keep referring to it informally as H58 for some time, since that name is now well known… but I will try to re-train myself to call it 4.3.1 (H58).

Now the fun part: exploring the geographical distribution of these lineages.


Figure 1c from the paper. Pie colours indicate clades found in each WHO region in the global data set (key is in the tree figure above).

In the paper we go on to show that:
• clades are widely geographically distributed, while subclades are geographically constrained (see heatmap below);
• genotyping can be used to predict the geographical origin of travel-associated typhoid in patients in London;
• even better predictions can be obtained based on genome-wide SNP distances to our reference panel of >1800 isolates… but of course that involves a lot more computationally intensive comparisons than a quick screen of a new isolate’s BAM file.

Screen Shot 2016-08-29 at 11.47.15 pm.png

Figure 2 of the paper, showing the geographical distribution of subclades, which shows most subclades are restricted to a single region. For this analysis, the effect of local outbreaks has been minimised by replacing groups of strains that share the same subclade and year and country of isolation with a single representative strain.

You can read the full details in the paper, but here I just want to highlight that you can now explore the global genomic framework for Typhi – including genotype designations as well as temporal and geographic data – interactively in MicroReact.



How does genotyping help with studying local populations?

We have already begun using the new genotyping scheme in local typhoid studies. I find this a really helpful way to describe/summarise the local populations, and place them in the context of the global population without resorting to large trees.

For example in this recent Nigerian study, we described the population like this: “The majority of isolates (84/128, 66%) belonged to genotype 3.1.1 , which is relatively common across Africa, predominantly western and central countries. In the wider African collection genotype 3.1.1 was represented by isolates from neighbouring Cameroon and across West Africa (Benin, Togo, Ivory Coast, Burkina Faso, Mali, Guinea and Mauritania) suggesting long-term inter-country exchange within the region. Most of the remaining isolates belonged to four other genotypes (4.1, 2.2, 2.3.1 and 0.0.3).”

Of course genotype assignment is not the end of the story – we still want to build whole-genome trees to explore the relationships of local isolates with those from other countries. Importantly, working with genotypes means that we can achieve this without needing to build a megatree of all isolates in the local + global collections (n>2000). Instead, we can use the genotypes to identify which strains from the global collection are relatives of the Nigerian isolates, and build a much smaller tree that still captures all of the information about transmission/transfer between Nigeria and other countries:


The tree and map were made using MicroReact, you can recreate theme here: http://microreact.org/project/styphi_nigeria To get this colour scheme just click on the eye icon (bottom left) and select ‘country’; and to get the fan style tree, click the settings button (top right) and click the fan shape.

Another example is in our recent paper on isolates collected in Thailand before and after the introduction of their national vaccination program (pre-print here):

  • Genotype 3.2.1 was the most common (n=14, 32%), followed by genotype 2.1.7 (n=10, 23%)
  • Genotypes 2.0 (n=1, 2%) and 4.1 (n=3, 7%) were observed only in 1973 (pre-vaccine period)
  • Genotypes 2.1.7 (n=10, 23%), 2.3.4 (n=1, 2%), 3.4.0 (n=2, 5%), 3.0.0 (n=3, 7%), 3.1.2 (n=2, 5%), were observed only after 1981 (post-vaccine period)
  • Genotypes 3.2.1 and 2.4.0 were observed amongst both pre- and post-vaccine isolates, but the subclade phylogenies show that these more likely to represent re-introduction of strains from neighbouring countries than persistence within Thailand throughout the immunisation program.

Locating drug resistance regions in short reads, using Bandage and ISMapper

In our recent paper on Salmonella Typhi, we described the multidrug resistant (MDR) clone H58 that is sweeping the world.

We’ve been monitoring this clone for almost 10 years (and the data suggests it actually emerged in the early 1990s), but the big change noted recently is the movement of the drug resistance locus out of the large conjugative (IncHI1) plasmid and into the chromosome.

I think this is critically important, because it means the large plasmid that carried the multidrug resistance genes into the cell can be lost, relieving the bacteria of the burden associated with replicating and expressing the plasmid genome, without losing multidrug resistance.

However, figuring out the location of the MDR locus is tricky, because it is flanked by copies of the IS1 transposase. This is what it looks like (the red genes on the ends are the IS1 copies).

Multidrug resistance locus in Salmonella Typhi

Multidrug resistance locus in Salmonella Typhi

Repeated sequences like these flanking IS1 sequences complicate assemblies… because there are multiple possible paths in and out of the repeated sequence, most assemblers will place the repeated sequence in its own contig, and the various paths in and out of it in separate contigs.

Here’s an example of an assembly graph showing the MDR locus above, visualised using Bandage, developed by Ryan Wick in my group… The green bit is a BLAST hit to the whole IS1 transposase, and you can see there are multiple alternative paths through the IS:

bandage IS1

And here is the same graph, but this time colouring in all the genes encoded within the MDR locus (BLAST hits to each open reading frame are shown in a different colour; the IS1 has 2 ORFs):

Screen Shot 2015-07-20 at 3.33.59 pm

So, inferring the location of the MDR locus from short read data is important but tricky. What do do?

Well, turns out it’s pretty easy to figure out what’s happening using Bandage! In this example, we can’t tell where the IS1 (and MDR locus) is inserted by looking at the assembled contigs. But by looking at this graph, we can see there are only two paths out of the IS1; one leads into the MDR locus (and back to the IS1), and the other leads to a single contig. By exporting that contig sequence from Bandage and blasting it in NCBI, it’s pretty trivial to discover that this is a piece of Typhi chromosome!

Here’s an example assembly graph from a different strain:
Screen Shot 2015-07-20 at 3.45.54 pmWhen I use Bandage to blast for the IncHI1 plasmid sequence (hits in blue) as well as the MDR locus genes, it’s pretty easy to see that the MDR locus is located in the plasmid:

Screen Shot 2015-07-20 at 3.47.05 pm

So basically – Bandage can be super useful for locating resistance genes (or any genes of interest really), in the presence of repeat sequences!

In the Typhi paper, we had hundreds of genomes to analyse so of course didn’t manually inspect each graph. Instead we used ISMapper (from my PhD student Jane Hawkey) to determine the IS1 insertion sites in each strain, and coupled this with what we knew about the presence of the plasmid and resistance genes in each strain (using SRST2, also from our group), to figure out which strains had the MDR locus and where:


Of course, all this is easy with long read sequencing… in the paper we used PacBio to confirm some the insertion sites in cyaA and yidA, and the guys at Public Health England independently found the yidA site using Nanopore.

But with the right tools, you can actually figure out a lot of these questions using Illumina data alone.

A global picture of typhoid bacteria

New paper out: (a bit delayed due to travelling the world for science…)

Phylogeographical analysis of the dominant multidrug-resistant H58 clade of Salmonella Typhi identifies inter- and intracontinental transmission events

Nature Genetics 2015 Jun;47(6):632-9. doi: 10.1038/ng.3281

This paper provides a whole-genome snapshot of nearly 2000 genomes of the typhoid bacterium, Salmonella Typhi. The strains involved come from 63 countries contributed by dozens of people around the world, and were sequenced at the Sanger Institute with funding from the Wellcome Trust.

You can get the raw sequence reads under accession ERP001718, and play with the phylogenetic tree and associated map at the new MicroReact website:

Countries included in the study

Countries included in the study

I will post more later about plotting trees & metadata dynamically with MicroReact, and statically with Python and R.

But back to typhoid. This project is special for me for a number of reasons…

  • It is about Typhi, the bug that suckered me into directing my genomics skills into studying pathogens and infectious disease, and was the subject of my PhD project with Gordon Dougan and Julian Parkhill at the Sanger Institute, and Duncan Maskell at Cambridge.
  • It is a natural continuation of my PhD project, with the grunt work done by the new PhD student who took over typhoid genomics work in the Dougan lab when I moved back to Australia (Vanessa Wong, MD PhD), with me helping to direct the analysis from down here in Melbourne.
  • It is a great illustration of how sequencing has changed… The first Typhi genome sequence was done at the Sanger Institute using capillary sequencing, and was published in 2001 (Parkhill et al, Nature). In my PhD project (also at the Sanger Institute), I analysed 19 Typhi genomes sequenced with two sequencing platforms that were new and super-duper back in 2006: 454 (now dead) and Solexa (now known as Illumina – currently ruling the sequencing world globally). This was published in 2008, seven years after the first genome (Holt et al, Nature Genetics). Now, another 7 years later in 2015, we are publishing almost 2000 genomes.
  • Typhi is still one of the best examples I know of how sequencing has transformed bacterial surveillance and opened up a whole new field of genomic epidemiology. Before we could look at whole genomes, every Typhi strain looked pretty much the same genetically… there is so little variation, that lower resolution approaches like MLST just couldn’t tell us anything. Now that we can capture whole genomes relatively easily, we can track the transmission and evolution of these bugs essentially in real time.

So what did all this sequencing achieve? Basically we learnt a lot about a particularly tricky clone, called H58, that has spread quite rapidly across Asia and Africa and is responsible for most cases of multidrug resistant typhoid (infections that don’t respond to treatment with most antibiotics). About half of all our isolates belonged to this clone.

  • By comparing root-to-tip branch lengths in the phylogenetic tree of H58 to the isolation dates of each strain, we found evidence of a temporal signal. So we did BEAST analysis, using the isolation date of each strain to date the tips and model mutation rates and divergence dates for H58. This showed that SNPs accumulated slowly in the Typhi H58 genome, at a rate of ~2 SNPs every 3 years. This placed the emergence of H58 at ~1989, just before our oldest example of H58 (1992). We haven’t been able to do proper dating in Typhi before, probably because most of the samples we’ve looked at previously have been phylogenetically diverse strains that are separated by centuries of evolution including periods of epidemic transmission (higher mutation rate per unit time) and long-term carriage (lower mutation rate per unit time). Here we probably have enough data from a period of epidemic transmission of H58 that the signal from epidemic transmission is detectable. I think this is very similar to Mycobacterium tuberculosis (TB), which notoriously has very little temporal signal, and yet a localised 4-decade transmission chain in Argentina showed very strong temporal signal.pathogen_linear_regression_fullANDh58
  • The geographical distribution of the H58 isolates tell us a lot about the routes by which H58 has travelled the world. The tree of H58 is so big that it’s hard to see what’s happening…. so to make it easier, I used R to collapse localised subclades of H58 that contained isolates from a single country (panel A – the size of the circle reflects the number of isolates in the subclade), and showed the time span for each subclade next to the tree (panel B). Occasionally there were one or two isolates within a localised subclade that were sourced from neighbouring countries, indicating transfer to those countries… these are shown in panel C.


  • We inferred these geographical patterns of the spread of H58, based on the tree and the regions of isolation:


  • We learnt a lot about the evolution of multidrug resistance in Typhi H58. We knew that resistance to all first-line antibiotics was usually encoded in one big transposon, which came into H58 in a IncHI1 plasmid. But the new collection showed that this transposon has transferred into the Typhi H58 chromosome, not once but many times! These transfers have happened in separate events, in different parts of the world, and into different parts of the chromosome. This is what the transposon looks like, and two of the insertion sites relative to the reference chromosome (CT18):

Acquired multidrug resistance in Typhi H58

  • Finding transposon insertion sites is tricky! The transposon has copies of the IS1 transposase at either end, which we think are responsible for moving the whole transposon around. This poses a problem for genome assembly with short reads. One way around this is to sequence with long reads… the figure above shows two different insertion sites that we confirmed by using PacBio sequencing to get complete genomes. But we had >850 H58 genomes sequenced using Illumina which gives us short reads, so really we needed to figure out the insertion sites as best we could using the Illumina data. Luckily my PhD student Jane Hawkey had been working on a method to do this, called ISMapper. Using this approach, we could identify all the IS1 insertion sites in every Illumina-sequenced genome. We also found a couple of additional plasmids. This is where all the different multidrug resistance determinants are in the H58 population:


Finally, a clear message from this study is that we need to do a lot more sequencing of Typhi! While we have a lot of genomes here, there are large geographic areas that we just don’t know much about. Plus, we have seen that antibiotic resistance is evolving and changing fast, and we will need to keep up with this using ongoing genomic surveillance.

Typhoid in Kathmandu and Open Biology OA journal

[Originally posted by Kat on her BacPathGenomics blog, Oct 2011]

A paper I’ve been working on for a few years on typhoid in Kathmandu yesterday had the honour of being the first paper ever published by the new open access journal of the Royal Society, Open Biology. I’m very keen on open access publishing and always try to submit to OA journals, but there is still a limited choice of truly OA journals. I love PLoS and BMC and submit to both regularly, but I think it’s really important to have a diverse range of OA journals – and therefore diversity in editors, editorial policies & styles, subject areas, etc – to make open access work for everyone.

So I’m excited to be have a paper in the new Open Biology, who publish under a Creative Commons 3 license (reuse/modify/distribute with attribution). Only time will tell how well the journal does, but it will only become great if us scientists are willing to submit good manuscripts. One incentive to do this is that Open Biology aims for a quick turn-around time of 4 weeks from submission to decision. Much as I love PLoS and BMC, they’ve never managed anywhere near that sort of turn-around. For info on Open Biology, see their ‘About’ page http://rsob.royalsocietypublishing.org/site/misc/about.xhtml.

So what is the paper? Thanks to OA, I can reproduce it here… (or you can read online or PDF)

Combined high-resolution genotyping and geospatial analysis reveals modes of endemic urban typhoid fever transmission

Stephen Baker1,2,*,†, Kathryn E. Holt3,4,†, Archie C. A. Clements5, Abhilasha Karkey2, Amit Arjyal2, Maciej F. Boni1,6, Sabina Dongol2, Naomi Hammond4, Samir Koirala2, Pham Thanh Duy1, Tran Vu Thieu Nga1, James I. Campbell1, Christiane Dolecek1,2, Buddha Basnyat2, Gordon Dougan4 and Jeremy J. Farrar1,2

Open Biol October 2011 1:110008; doi:10.1098/rsob.110008.

Basically, it uses genotyping and GPS to study typhoid fever in Kathmandu, Nepal. We examined 4-years worth of typhoid cases and looked at where the patients lived within the city (using GPS) and subtyped the bacteria responsible for their infections using high throughput SNP typing.

Firstly, we found that about 3/4 of the patients were infected with Salmonella Typhi and 1/4 were infected with Salmonella Paratyphi A. If you aren’t familiar with Salmonella, these are two serotypes of Salmonella enterica which, rather than causing gastrointestinal disease (ie food poisoning) like most Salmonella serotypes, cause the systemic infection known as typhoid. Typhi and Paratyphi A are quite different genetically, but have undergone convergent evolution to cause the same disease syndrome (see earlier paper in BMC Genomics).

Temporal distribution of Typhi (red) and Paratyphi A (blue) cases

Then we looked at the spatial distribution of the patients homes, and found that they were clustered in specific “hotspot” areas of Kathmandu:

Spatial risk model for Typhi infection (see paper for separate map for Paratyphi A risk)

Contrary to expectation, these hotspots weren’t the most densely populated areas…you might expect more people = more cases, but this wasn’t the case. Some complicated spatial statistics, done by Archie Clements at University of Queensland, confirmed that the hostpots weren’t associated with population density or hospital referral patterns, but were in low-elevation areas local people source their water from stone waterspouts.

Spatial distribution of Typhi cases, and location of water spouts

To see if the waterspouts could really be a source of typhoid transmission, we tested water samples for the presence of Typhi or Paratyphi A using culture and PCR. Culturing didn’t work, but it is notoriously difficult to culture Typhi from water samples that are not pre-enriched for bacteria…however PCR (using this method we published earlier in BMC Infectious Diseases) detected Typhi in 3/4 of water samples and Paratyphi A in 2/3.

Stone water spouts in Kathmandu (taken by co-author Stephen Baker)

We also looked at the population of bacteria causing the typhoid fever. We examined Salmonella Typhi isolated from the blood of typhoid fever patients, and used SNP typing to analyse the Typhi DNA and examine the population structure. We typed 113 SNPs (single nucleotide polymorphisms, ie point mutations) that we already knew about from previous variation discovery efforts. About 2/3 of isolates had the same haplotype, so to discriminate further within this local subgroup we sequenced 40 of the Typhi to identify novel SNPs arising in the local population (local microevolution) and typed these SNPs as well. Most of the Typhi belonged to the H58 lineage, which is common in other typhoid endemic zones we’ve looked at previously (Mekong Delta Vietnam – Holt & Dolecek 2011, PLoS NTD [OA]; Nairobi, Kenya – Kariuki 2010, J Clin Micro [free]; globally – Holt & Phan 2011, PLoS NTD [OA], Roumagnac 2006, Science [free in PMC]).

Typhi tree, red bars indicate frequency of genotypes in Kathmandu collection; red zones are H58 lineage and H58G sublineage

As the map above shows, the different Typhi genotypes were distributed randomly, with no spatial or temporal clustering. The exception was a probable outbreak in the west of the city, outside the hotspot zone, where 28 cases of infection with the same Typhi genotype were recorded in a two-month period – see yellow shaded area in map above, and zoomed in below:

Localised outbreak of Typhi genotype H58G-b4

Finally, we looked at what was happening in households from which multiple typhoid infections were studied. You might expect these household disease clusters to represent shared infections, which are transmitted between members of the household. However in most of these household typhoid clusters, the cases were caused by different organisms – either Paratyphi in one case and Typhi in the other, or multiple cases caused by different genotypes of Typhi. Cases with the same causative Typhi genotype are linked with dashed lines in this figure, you can see they are the exception rather than the rule:

Distribution of typhoid-causing bacteria in households with multiple typhoid cases

So, all in all we found typhoid fever infections clustered in spatial hotspots within Kathmandu, and that this clustering was explained not by population density but by low elevation and proximity to stone water spouts which are used to supply water. This implicates the water spouts in typhoid transmission via dissemination of Typhi and Paratyphi A around the city, supported by the detection of Typhi and Paratyphi A in the majority of water samples taken from these spouts. The diversity of Typhi genotypes we detected indicates that transmission occurs via water that is contaminated with a diverse population of Typhi, rather than point source outbreaks (with the exception of one outbreak, which actually occurred outside the hotspot zone). The diversity of Typhi genotypes within households suggests that this sort of transmission – ie dissemination via contamination of the water supply – contributes more to the overall typhoid burden than direct person-to-person transmission.

How does this contamination happen? Well, it is possible for people to carry Typhi and Paratyphi A in the gall bladder, without ever noticing an infection…for example, Typhoid Mary was a famous carrier of Typhi. Carriers shed the bacteria in their feces, so any food or water contaminated with their fecal material becomes a vehicle for typhoid transmission. Our study suggests that there are many Typhi and Paratyphi A carriers in Kathmandu who are unknowingly shedding the bacteria, so that whenever sewage seeps into the groundwater that feeds the stone water spout, the water becomes contaminated and can pass on the infection to those who drink the water. Most of the typhoid cases occur in the monsoon season, when flooding is likely to promote seepage of sewage into the underground aquifers that supply the water spouts. Hence the study suggests that endemic typhoid in Kathmandu is essentially a question of water infrastructure, and could potentially be dramatically reduced by supplying clean drinking water to people living in these few hotspot areas.