genomic epi

Global genomic framework for typhoid

It’s been over a year since we published the first global whole-genome snapshot of nearly 2000 genomes of the typhoid bacterium, Salmonella Typhi in Nature Genetics.

That paper focused on the emergence and global dissemination of what we’ve been calling for years the “H58” clone (see this blog post). This clone accounted for nearly half of all the isolates sequenced, and is a big deal because it tends to be multidrug resistant (MDR), carrying a suite of resistance genes that render all the cheap, first-line drugs like chloramphenicol, ampicillin, and trimethoprim-sufamethoxazole useless for treatment. Detailed genomic epi studies show the local impact of the arrival of MDR H58 in countries as widespread as Malawi and Cambodia; and the emergence of fluoroquinolone resistant H58 sublineage in India and Nepal recently stopped a treatment trial because the current standard of care – ciprofloxacin – was resulting in frequent treatment failure.

While H58 is important, the global Typhi population contains a lot of genomic diversity outside the H58 clone, and we’ve turned our attention to the rest of the population now in a new paper in Nature Communications: “An extended genotyping framework for Salmonella enterica serovar Typhi, the cause of human typhoid

First, we decided that we needed to revisit the haplotyping scheme of Roumagnac et al (from which H58 gets its name), which was based on just ~80 genes, using the whole genome phylogeny. Here is the tree inferred from core genome SNPs in 1832 Typhi strains, with the old haplotypes indicated by the coloured ring around the outside. It’s pretty easy to see that some haplotypes (like H52 and H1) actually comprise multiple distinct phylogenetic lineages (low resolution), while others subdivide lineages (excessive resolution).


Whole genome SNP tree for 1832 strains, outer ring indicates haplotypes based on mutations in 80 genes as defined in Roumagnac et al, Science 2006.

We used BAPS to define genetic clusters at various levels (thanks to Tom Connor for running this). We settled on 3 levels of hierarchical clustering, indicated in the tree below:

• 4 nested primary clusters (inner-most ring; yellow, green, blue, red). These have 100% bootstrap support and are each characterised by >20 SNPs

• Clusters are further divided into 16 clades (middle ring and labels). The median pairwise distance between isolates in the same clade is 109 SNPs, while the inter-clade SNP distance averages 243 SNPs.

• Clades are further divided into 49 subclades, indicated by alternating background shading colours. The median pairwise distance between isolates in the same subclade is 25 SNPs.


Tree indicating new phylo-informed genotypes. Primary clusters 1-4 are indicated in the inner ring. Branch colours indicate clades, which are also labelled on the outside and coloured in the outer ring. Subclades are indicated by alternating background shading.

subcladerectOne of the key reasons we wanted to define the phylogenetic lineages in this way is to make them easier to identify and talk about. I’ve always been a fan of MLST for this reason, since it’s much easier to talk about K. pneumoniae ST258, ST11, ST15 etc than ‘that lineage that has reference strain X in it’. So we introduce a hierarchical nomenclature system, similar to the one currently in use for Mycobacterium tuberculosis, where the 4 primary clusters (1, 2, 3, 4) are subdivided into 16 clades (1.1, 1.2; 2.1, 2.2, etc) which in turn are subdivided into 49 subclades (1.1.1, 1.1.2, etc). This has the advantage of conveying hierarchical relationships between groups – e.g. 2.2.1 and 2.2.2 are sister subclades within clade 2.2, which is a sister clade of 2.1.

The subclades are easier to distinguish in the collapsed rectangular tree on the right, where each subclade is represented by just one strain.

Some BAPS clusters were polyphyletic and consisted of isolates belonging to rare phylogenetic lineages whose common ancestor in the tree coincided with the common ancestor of an entire clade (n=9) or primary cluster (n=2). These groups contain isolates that, given increased numbers, may emerge as distinct clusters that form sister taxa within the parent clade (or primary cluster), and were given the suffix ‘.0’ rather than a defined cluster number (e.g. 3.0 or 3.1.0) to indicate non-equivalence with the properly differentiated sister clades (n=16) or subclades (n=49). As more genomes are added, these are expected to be more clearly differented into distinct groups and given proper clade/subclade designations.

Next we defined a set of 68 SNPs that can be used to genotype isolates into these groups. We chose one SNP for each primary cluster, clade and subclade (preferentially choosing intragenic SNPs in well-conserved core genes). The SNPs are detailed in a supplementary spreadsheet, and we provide a script to assign strains to genotypes based on an input BAM or VCF file generated by mapping to the reference genome for Typhi strain CT18.

An isolate that belongs to a differentiated subclade such as 2.1.4 will be hierarchically identified by carrying the derived allele for primary cluster 2 (but not the nested clusters 3 and 4); the derived allele for clade 2.1 (but no other clades) and the derived allele for subclade 2.1.4 (but no other subclades). It is possible for an isolate to carry derived alleles for a primary cluster and clade with no further differentiation into subclade.

The clone formally known as H58
Under the new scheme, the infamous H58 clone is named subclade 4.3.1, which so far has no sister clades. I suspect those of us familiar with Typhi population genomics will keep referring to it informally as H58 for some time, since that name is now well known… but I will try to re-train myself to call it 4.3.1 (H58).

Now the fun part: exploring the geographical distribution of these lineages.


Figure 1c from the paper. Pie colours indicate clades found in each WHO region in the global data set (key is in the tree figure above).

In the paper we go on to show that:
• clades are widely geographically distributed, while subclades are geographically constrained (see heatmap below);
• genotyping can be used to predict the geographical origin of travel-associated typhoid in patients in London;
• even better predictions can be obtained based on genome-wide SNP distances to our reference panel of >1800 isolates… but of course that involves a lot more computationally intensive comparisons than a quick screen of a new isolate’s BAM file.

Screen Shot 2016-08-29 at 11.47.15 pm.png

Figure 2 of the paper, showing the geographical distribution of subclades, which shows most subclades are restricted to a single region. For this analysis, the effect of local outbreaks has been minimised by replacing groups of strains that share the same subclade and year and country of isolation with a single representative strain.

You can read the full details in the paper, but here I just want to highlight that you can now explore the global genomic framework for Typhi – including genotype designations as well as temporal and geographic data – interactively in MicroReact.



How does genotyping help with studying local populations?

We have already begun using the new genotyping scheme in local typhoid studies. I find this a really helpful way to describe/summarise the local populations, and place them in the context of the global population without resorting to large trees.

For example in this recent Nigerian study, we described the population like this: “The majority of isolates (84/128, 66%) belonged to genotype 3.1.1 , which is relatively common across Africa, predominantly western and central countries. In the wider African collection genotype 3.1.1 was represented by isolates from neighbouring Cameroon and across West Africa (Benin, Togo, Ivory Coast, Burkina Faso, Mali, Guinea and Mauritania) suggesting long-term inter-country exchange within the region. Most of the remaining isolates belonged to four other genotypes (4.1, 2.2, 2.3.1 and 0.0.3).”

Of course genotype assignment is not the end of the story – we still want to build whole-genome trees to explore the relationships of local isolates with those from other countries. Importantly, working with genotypes means that we can achieve this without needing to build a megatree of all isolates in the local + global collections (n>2000). Instead, we can use the genotypes to identify which strains from the global collection are relatives of the Nigerian isolates, and build a much smaller tree that still captures all of the information about transmission/transfer between Nigeria and other countries:


The tree and map were made using MicroReact, you can recreate theme here: To get this colour scheme just click on the eye icon (bottom left) and select ‘country’; and to get the fan style tree, click the settings button (top right) and click the fan shape.

Another example is in our recent paper on isolates collected in Thailand before and after the introduction of their national vaccination program (pre-print here):

  • Genotype 3.2.1 was the most common (n=14, 32%), followed by genotype 2.1.7 (n=10, 23%)
  • Genotypes 2.0 (n=1, 2%) and 4.1 (n=3, 7%) were observed only in 1973 (pre-vaccine period)
  • Genotypes 2.1.7 (n=10, 23%), 2.3.4 (n=1, 2%), 3.4.0 (n=2, 5%), 3.0.0 (n=3, 7%), 3.1.2 (n=2, 5%), were observed only after 1981 (post-vaccine period)
  • Genotypes 3.2.1 and 2.4.0 were observed amongst both pre- and post-vaccine isolates, but the subclade phylogenies show that these more likely to represent re-introduction of strains from neighbouring countries than persistence within Thailand throughout the immunisation program.

Genomics of atypical enteropathogenic E. coli

Well our paper on “atypical” enteropathogenic E. coli is finally out, in the new journal Nature Microbiology. The hard work on this paper was done by Danielle Ingle, whom I co-supervise together with Roy Robins-Browne and will soon become the first of my students to complete a PhD. Unfortunately the paper is paywalled at the moment, so I’ll try to summarise the key elements here.

For those who aren’t familiar with the wacky world of E. coli pathotypes, the LEE genomic island is the darling of E. coli pathogenicity genetics. It encodes a type three secretion system, essentially a syringe structure that enables the bacteria to inject its own proteins (known as secreted effector proteins) directly into human cells and – ahem – mess them up (for more technical details see the impressive body of work from Melbourne University’s very own Jacqueline Pearson and Liz Hartland). E. coli that possess shiga toxin as well as the LEE cause nasty bloody diarrhoea and are defined as enterohaemorrhagic E. coli (EHEC), while those with LEE + the bfp adhesion factor cause diarrhoea and are defined as enteropathogenic E. coli (EPEC). Strains carrying the LEE but lacking both shiga toxin and bfp can also cause diarrhoea, but are also sometimes isolated from healthy individuals with no symptoms of diarrhoaea, and it is this big group of LEE + ??? strains that are defined as atypical EPEC (aEPEC), while those with bfp are dubbed typical EPEC.

No one has been able to find a common virulence factor that differentiates diarrhoeagenic aEPEC from asymptomatic strains… which is not that surprising considering that aEPEC is really just an umbrella term for a diverse group of organisms that happen to carry the LEE genomic island.

In this paper, we set out to define the population structure of aEPEC strains using genomics, and also to investigate variation in the LEE island itself. Our strain set is ~200 newly sequenced aEPEC that were isolated from children with diarrhoea, and from asymptomatic age-matched controls, as part of the Global Enteric Multicenter Study (GEMS) – a massive study into the etiology of childhood diarrhoea across seven sites in Africa and Asia, funded by the Gates Foundation. We also included lots of publicly available genomes from NCBI, which included ~60 additional aEPEC.

So what did we find? Well from a population structure point of view, we confirmed what we suspected from the beginning – that the ~200 aEPEC strains actually represent dozens of distinct lineages or clonal groups within the E. coli whole genome phylogeny. We tried making the core genome tree in a few different ways, including mapping reads to a reference genome vs using Roary to generate core gene alignments from assemblies; with and without removing recombinant regions identified using ClonalFrameML. The alternative trees all tell the same population structure story, as you can see below. An interactive version of the mapping-based, recombination-filtered tree (which appears in Figure 1 of the paper, panel a below) is available to play with in MicroReact.


Supplementary Figure 1

The 10 clades highlighted in the tree are those containing >5 aEPEC in our collection, which represent the most common aEPEC lineages. The figures below show that these 10 aEPEC groups are present across the Asian and African GEMS sites; most also appear in non-GEMS collections from Europe as well as North and South America, indicating they are globally distributed.


Next we looked in depth at the LEE genomic island itself. ClonalFrame analysis identified lots of recombination across the locus, particularly affecting the proteins that make direct contact with host cells:

Screen Shot 2016-01-19 at 9.46.45 am

Danielle also did a lot of work to examine selection pressures on the various LEE genes, and found evidence for co-evolutionary constraints on groups of LEE genes (see the paper for details).

We used the recombination-free ClonalFrame tree, which represents the underlying clonal or vertical evolutionary relationships between LEE sequences, to define LEE lineages and subtypes. We also made a MLST-style scheme for the LEE, to make it easy to identify types from sequence data without having to compare to reference trees. The sequence files are included in the SRST2 distribution and can be downloaded from

Screen Shot 2016-01-19 at 9.50.41 am

There were three clearly distinct LEE lineages, separated from one other by 4-7% divergence (similar to genus-level differences between bacteria) and with different preferences for chromosomal insertion site. One of the LEE lineages (lineage 1) was entirely novel and we found it only in aEPEC isolates from the GEMS study. Functionality of the LEE was confirmed in all GEMS strains, including these novel lineage 1 examples.

Importantly, different aEPEC lineages had distinct LEE variants, often integrated at different chromosomal sites, confirming that aEPEC lineages have evolved numerous times independently via distinct LEE acquisition events. Also, some aEPEC lineages contained strains carrying different LEE lineages and/or subtypes, sometimes integrated at different sites, indicating ongoing evolution within some aEPEC clades.


Unsurprisingly, some of the “aEPEC” lineages also include examples of EHEC and/or typical EPEC isolates, i.e. subclades within the LEE-containing lineage that have subsequently acquired shiga toxin (via phage) or a bfp plasmid. We also found extensive variation in the complement of known LEE-secreted effectors (which are scattered around the genome and known as non-LEE encoded effectors or nle genes), both within and between lineages.

There’s lots more work to do to unravel the genetic basis for pathogenicity in E. coli, including identifying more of the colonisation factors, regulators and effectors that are important for causing disease in humans. But I think this analysis is a great step towards clarifying the population structure of EPEC and the LEE more generally, which opens the door for much more informative diagnostics and epidemiological studies. This work should also help immensely in guiding future pathogenicity research… most research on EPEC or EHEC pathogenicity has so far focused on a handful of strains, but our genomic analysis shows there is a much wider diversity of strain backgrounds, LEE variants and effector gene content that also needs to be considered.

Microbial Genomics methods

MGen (Microbial Genomics) was created last year by the UK’s Microbiology Society, with the aim of becoming the go-to journal for microbial genomics research. As a Senior Editor, I was asked to help mark the occasion of MGen’s six month anniversary (on Jan 15, 2016), by reviewing the 24 papers published in the journal’s first six months since launch.
The full review is is available over on the MGen site, but I wanted to draw particular attention here to trends in the analysis tools used by the genomic epidemiology papers published so far in MGen, which largely reflects what’s been happening across the field generally.

tools word cloud

One third of the articles are cutting edge genomic epidemiologyin action, using genomics to investigate the evolution and transmission of a range of pathogens, from anthrax to dysentery and food poisoning.

Interestingly all of these studies used similar methodologies, reflecting the maturity of this field of research.

Common approaches include: (i) sequencing large numbers of isolates using high-throughput Illumina platforms; (ii) the identification of SNPs (single nucleotide polymorphisms) using read mapping approaches (with BWA, SMALT, SAMtools and GATK being popular tools); and (iii) uniform use of RAxML for generating maximum likelihood phylogenies.

Some also used BEAST to estimate mutation rates, divergence dates and phylogeographical patterns. Interestingly, half of these papers utilised MLST (multi-locus sequence typing) to identify clades, showing that this sub-genomic approach based on capillary sequencing of ~7 gene fragments is still considered useful by many genomic researchers. [NOTE: nice to see that most who were inferring MLST from Illumina data were using our SRST or SRST2 software.]

Almost all of the genomic epidemiology studies took steps to remove SNPs introduced via recombination, in order to capture the underlying signals of vertical inheritance that are so important for transmission studies. Popular tools were BratNextGen, Gubbins and ClonalFrameML, which were all published within the last 3 years.

For pan genome analysis, Velvet and SPAdes were the most popular tools for bacterial genome assembly, with Prokka and Prodigal for gene annotation, and LS-BSR and related approaches being commonly used to cluster orthologous groups of genes.


Most of this won’t be any surprise to people working in bacterial genomic epi, but I think it’s great to see that consensus is emerging on how best to do these sorts of analyses, and to at last have some reliable tools for detecting and accounting for recombination.

The area of least agreement remains SNP calling – which mainly comes down to which read mappers to use, and which SNP calling algorithms and filtering to go with? This is a complex area, as highlighted in the recent review “Best practices for evaluating single nucleotide variant calling methods for microbial genomics” in Frontiers in Genetics, which did a very thorough job of examining the issues that need to be considered, but (quite deliberately) doesn’t provide an answer to “which tool is best?”.

Although there is still no real consensus on exact methods for SNP calling, I think most of the tools people are using (ie a good, stable read mapper followed by SNP calling with an established tool like SAMtools or GATK, with some basic filtering to remove low-evidence or ambiguous calls) end up with very similar answers (as we saw with the NGS outbreak analysis challenge session held at the ASM NGS meeting in September 2015). All in all it seems to me that the use of genomics for public health & diagnostic microbiology is in far better shape in this respect than clinical human genomics, which is going through something of a crisis involving wide discrepancies in variant calling as well as uncertainty around data interpretation.

A global picture of typhoid bacteria

New paper out: (a bit delayed due to travelling the world for science…)

Phylogeographical analysis of the dominant multidrug-resistant H58 clade of Salmonella Typhi identifies inter- and intracontinental transmission events

Nature Genetics 2015 Jun;47(6):632-9. doi: 10.1038/ng.3281

This paper provides a whole-genome snapshot of nearly 2000 genomes of the typhoid bacterium, Salmonella Typhi. The strains involved come from 63 countries contributed by dozens of people around the world, and were sequenced at the Sanger Institute with funding from the Wellcome Trust.

You can get the raw sequence reads under accession ERP001718, and play with the phylogenetic tree and associated map at the new MicroReact website:

Countries included in the study

Countries included in the study

I will post more later about plotting trees & metadata dynamically with MicroReact, and statically with Python and R.

But back to typhoid. This project is special for me for a number of reasons…

  • It is about Typhi, the bug that suckered me into directing my genomics skills into studying pathogens and infectious disease, and was the subject of my PhD project with Gordon Dougan and Julian Parkhill at the Sanger Institute, and Duncan Maskell at Cambridge.
  • It is a natural continuation of my PhD project, with the grunt work done by the new PhD student who took over typhoid genomics work in the Dougan lab when I moved back to Australia (Vanessa Wong, MD PhD), with me helping to direct the analysis from down here in Melbourne.
  • It is a great illustration of how sequencing has changed… The first Typhi genome sequence was done at the Sanger Institute using capillary sequencing, and was published in 2001 (Parkhill et al, Nature). In my PhD project (also at the Sanger Institute), I analysed 19 Typhi genomes sequenced with two sequencing platforms that were new and super-duper back in 2006: 454 (now dead) and Solexa (now known as Illumina – currently ruling the sequencing world globally). This was published in 2008, seven years after the first genome (Holt et al, Nature Genetics). Now, another 7 years later in 2015, we are publishing almost 2000 genomes.
  • Typhi is still one of the best examples I know of how sequencing has transformed bacterial surveillance and opened up a whole new field of genomic epidemiology. Before we could look at whole genomes, every Typhi strain looked pretty much the same genetically… there is so little variation, that lower resolution approaches like MLST just couldn’t tell us anything. Now that we can capture whole genomes relatively easily, we can track the transmission and evolution of these bugs essentially in real time.

So what did all this sequencing achieve? Basically we learnt a lot about a particularly tricky clone, called H58, that has spread quite rapidly across Asia and Africa and is responsible for most cases of multidrug resistant typhoid (infections that don’t respond to treatment with most antibiotics). About half of all our isolates belonged to this clone.

  • By comparing root-to-tip branch lengths in the phylogenetic tree of H58 to the isolation dates of each strain, we found evidence of a temporal signal. So we did BEAST analysis, using the isolation date of each strain to date the tips and model mutation rates and divergence dates for H58. This showed that SNPs accumulated slowly in the Typhi H58 genome, at a rate of ~2 SNPs every 3 years. This placed the emergence of H58 at ~1989, just before our oldest example of H58 (1992). We haven’t been able to do proper dating in Typhi before, probably because most of the samples we’ve looked at previously have been phylogenetically diverse strains that are separated by centuries of evolution including periods of epidemic transmission (higher mutation rate per unit time) and long-term carriage (lower mutation rate per unit time). Here we probably have enough data from a period of epidemic transmission of H58 that the signal from epidemic transmission is detectable. I think this is very similar to Mycobacterium tuberculosis (TB), which notoriously has very little temporal signal, and yet a localised 4-decade transmission chain in Argentina showed very strong temporal signal.pathogen_linear_regression_fullANDh58
  • The geographical distribution of the H58 isolates tell us a lot about the routes by which H58 has travelled the world. The tree of H58 is so big that it’s hard to see what’s happening…. so to make it easier, I used R to collapse localised subclades of H58 that contained isolates from a single country (panel A – the size of the circle reflects the number of isolates in the subclade), and showed the time span for each subclade next to the tree (panel B). Occasionally there were one or two isolates within a localised subclade that were sourced from neighbouring countries, indicating transfer to those countries… these are shown in panel C.


  • We inferred these geographical patterns of the spread of H58, based on the tree and the regions of isolation:


  • We learnt a lot about the evolution of multidrug resistance in Typhi H58. We knew that resistance to all first-line antibiotics was usually encoded in one big transposon, which came into H58 in a IncHI1 plasmid. But the new collection showed that this transposon has transferred into the Typhi H58 chromosome, not once but many times! These transfers have happened in separate events, in different parts of the world, and into different parts of the chromosome. This is what the transposon looks like, and two of the insertion sites relative to the reference chromosome (CT18):

Acquired multidrug resistance in Typhi H58

  • Finding transposon insertion sites is tricky! The transposon has copies of the IS1 transposase at either end, which we think are responsible for moving the whole transposon around. This poses a problem for genome assembly with short reads. One way around this is to sequence with long reads… the figure above shows two different insertion sites that we confirmed by using PacBio sequencing to get complete genomes. But we had >850 H58 genomes sequenced using Illumina which gives us short reads, so really we needed to figure out the insertion sites as best we could using the Illumina data. Luckily my PhD student Jane Hawkey had been working on a method to do this, called ISMapper. Using this approach, we could identify all the IS1 insertion sites in every Illumina-sequenced genome. We also found a couple of additional plasmids. This is where all the different multidrug resistance determinants are in the H58 population:


Finally, a clear message from this study is that we need to do a lot more sequencing of Typhi! While we have a lot of genomes here, there are large geographic areas that we just don’t know much about. Plus, we have seen that antibiotic resistance is evolving and changing fast, and we will need to keep up with this using ongoing genomic surveillance.

Global and local views of Shigella sonnei population genomics

If you have seen me give a talk in the last couple of years, chances are you would have heard a bit about Shigella sonnei. This is because it has been my favourite project in recent years, for two main reasons:

(1) it involved looking in-depth at phylogeography and evolution of the same organism at two different scales – first globally, over hundreds of years and then locally in Vietnam, over about 15 years; and

(2) it was done with two people I really enjoy working with – Steve Baker (based at the Oxford University Clinical Research Unit in Vietnam) and Nick Thomson (based at the Sanger Institute).

Here are the papers:

Shigella sonnei genome sequencing and phylogenetic analysis indicate recent global dissemination from Europe

Holt KE, et al. Nature Genetics 2012 [PubMed]

This study used whole genome sequencing of a global collection of 132 Shigella sonnei, an increasingly important cause of dysentery, to reconstruct the evolutionary history of the bacterium. Phylogenetic analysis showed that the current S. sonnei population descends from a common ancestor that existed less than 500 years ago and that diversified into several distinct lineages with unique characteristics. Furthermore the analysis suggests that the majority of this diversification occurred in Europe and was followed by more recent establishment of local pathogen populations on other continents, predominantly due to the pandemic spread of a single, rapidly evolving, multidrug-resistant lineage.

Commentaries on the paper are available in Nature Genetics and Nature Reviews Gastroenterology and Hepatology.

Dissemination of S. sonnei lineages out of Europe. Reprinted by permission from Macmillan Publishers Ltd: Nature Genetics 44:1056, copyright 2012.

Dissemination of S. sonnei lineages out of Europe. Reprinted by permission from Macmillan Publishers Ltd: Nature Genetics 44:1056, copyright 2012.

Tracking the establishment of local endemic populations of an emergent enteric pathogen

Holt KE, et al. PNAS 2013 [PubMed]

This study continues the Shigella sonnei story by examining the arrival of the rapidly evolving multidrug-resistant lineage in one particular country – Vietnam. We sequenced over 250 genomes of S. sonneiisolated over a 15-year period, and found that the multidrug-resistant lineage successfully established itself in Ho Chi Minh City, pushing out other dysentery-causing bacteria to become the dominant cause of dysentery.

This was likely helped by the acquisition of a colicin (toxin) system that enabled it to kill competing bacteria it came into contact with (including otherShigella), forming a new clone we called the VN (Viet Nam) clone. The VN clone spread to other cities in Vietnam, and we found evidence of convergent evolution of drug resistance mutations and plasmids in all three local populations we examined.

Phylogeny of Vietnamese S. sonnei and map of Vietnam, showing the inferred path of evolution and geographical spread.

Phylogeny of Vietnamese S. sonnei and map of Vietnam, showing the inferred path of evolution and geographical spread.