Population genomics of Klebsiella

Well, after almost 6 years, our Klebsiella pneumoniae genomics paper is finally out!

It’s a beast of a thing and there are still a million and one questions to address just from this one data set. For those interested in looking at the data for themselves, the raw reads are available under accession ERP000165, the assemblies are in Sylvain Brisse’s Klebsiella pneumoniae BIGSdb at the Pasteur Institute, and the tree + metadata are available for your interactive viewing pleasure in MicroReact.

The paper itself is open access in PNAS, you can read it here.

Genomic analysis of diversity, population structure, virulence, and antimicrobial resistance in Klebsiella pneumoniae, an urgent threat to public health

Screen Shot 2015-06-18 at 4.10.42 pm

Whole genome diversity in K. pneumoniae

There have been lots of really nice Klebs genomics papers out in the last 18 months or so, examining the evolution of the ST258 clone that carries the KPC gene (K. pneumoniae carbapaenemase) and is wreaking havoc in hospitals all over the place (including recently in Melbourne), and also several hospital-based studies tracking transmission and evolution of local drug-resistant outbreaks.

But that is just the tip of the K. pneumoniae iceberg.

Our paper asks a completely different set of questions, which you could basically sum up as “what the hell is Klebsiella pneumoniae anyway?”

To do this, we sequenced ~300 genomes of really diverse K. pneumoniae strains. We didn’t have much information about genetic diversity to go on, so we chose strains with different phenotypes (antimicrobial resistance patterns, capsular serotypes or sequence types where known), from different sources (human and animal, asymptomatic carriage and infections of various kinds), and from different geographical locations.

This was done by an international group of collaborators who pooled their resources, not only sharing their precious strain collections but also digging through hospital and other records to find as much information about the strains as possible.

You can view the tree and associated metadata, including geographical origin and source information, over on Microreact.Screen Shot 2015-06-18 at 4.09.33 pm

We found out some pretty interesting things about Klebsiella pneumoniae, including the fact that what’s identified as K. pneumoniae using standard tests is actually a mixed bag of three related species, that now have their own names: K. pneumoniae (KpI group, which includes the majority of clinical isolates and all the stuff you might have heard of like the clone that causes rhinoscleromatis, and the KPC clone ST258, and the hypervirulent clone ST23); K. quasipneumoniae; and K. variicola (plant associated and usually nitrogen-fixing).

By now, this species stuff has been nutted out (mainly by co-author Sylvain Brisse from Institut Pasteur) by analysing marker gene sequences, but it’s really important to be able to show that those patterns hold at the whole-genome level, and we found some interesting things about the distribution of the rarer species (see the paper for details).


Importantly, we did the whole pan-genome analysis thing and found that as a population, K. pneumoniae has more genes than humans. Almost 30,000 in fact. Each individual strain has ~5,500 genes, but <2,000 of those are core genes that are common to all K. pneumoniae. The rest are accessory genes that can come and go, helping the bug to adapt to new environments.


One of the cool things we were able to do with our data set, which you just can’t do with genomic studies focused on specific clones or outbreaks, was to look at statistical associations between accessory genes and phenotypes. Admittedly our available phenotypes were pretty limited, but we found a few important things.


We screened for genes associated with virulence in humans by focusing in on invasive infections, and comparing gene frequencies in human isolates from invasive community-acquired infections (i.e. the kind of infections that land you in hospital) vs. those in human carriage isolates or hospital acquired infections (i.e. the kind of infections that get you when you are already in hospital for something else and are particularly vulnerable to infection).

The only genes that were significantly associated with invasive infection in humans were rmpA and rmpA2, which upregulate capsule production, and genes related to iron acquisition (specifically acquired siderophore systems that can help to steal iron from animal hosts – see paper for details). These genes have been known about for some time, based on mouse models and knowledge of other pathogens, however we were able to show that these genes are significantly associated with invasive K. pneumoniae disease in humans, which is not something that can be proven directly using experimental systems. (The siderophore story actually goes a bit deeper than the iron issue… it’s a bit too complex to go into here but I recommend reading Michael Bachman’s work e.g. “Interaction of lipocalin 2, transferrin, and siderophores determines the replicative niche of Klebsiella pneumoniae during pneumonia” in MBio, 2012).


Interestingly, doing the same test in bovine isolates showed that the story is very different: we had a lot of isolates from dairy herds, including clinical and subclinical mastitis; asymptomatic carriage isolates and strains from the farm environment… and found that an acquired lactose operon was almost perfectly associated with mastitis in cows! Something similar has been observed before in Streptococcus agalactiae.


Resistance genes were associated with human hospital isolates and human carriage isolates. This is far from an ideal study design to test this, as we had different types of collections from different geographical regions; however, even when you look within different local collections you see the same patterns: (a) comparing bovine and human isolates from NY state, the resistance genes were all in human isolates not cow isolates; (b) comparing human carriage and infection isolates (both nosocomial and community acquired) in Vietnam, the resistance genes were mainly in human carriage and hospital isolates, not in community infections; (c) in the remaining countries, isolates from infections acquired in hospital had more resistance genes than those that were considered nosocomial (diagnosed within 48 hours of admission).

Screen Shot 2015-06-18 at 5.12.24 pm

What’s really interesting is that while resistance genes and virulence genes are both highly mobile components of the accessory genome, they were essentially orthogonal in their distribution. The resistance genes were mainly in hospital acquired infections and carriage isolates, whereas the virulence strains were mainly found in isolates from community acquired infections.

resistance-virulence-axis-2So far, this has resulted in the emergence of two very different kinds of K. pneumoniae clones of importance to human health: hypervirulent clones, and multidrug resistant clones. This is pretty lucky, as it means the hypervirulent clones are generally sensitive to antibiotics (although antimicrobial treatment is difficult for some conditions, like liver abscess), and the problem of untreatable highly drug resistant Klebs infections has not spread outside of hospitals.

Unfortunately, our luck appears to be runnning out and we are already starting to see the convergence of virulence and resistance. Hypervirulent ST23 strains, which have all four of the acquired siderophore systems, are accumulating antibiotic resistance genes. And about half of the KPC Klebs ST258 strains causing problems in hospitals globally have one of the siderophore gene clusters, yersiniabactin, which has been shown in clinical ST258 isolates to confer enhanced ability to cause pneumonia. How long till the other virulence genes creep in? We need to be watching!

Also, our data indicates that there are plenty of other hypervirulent or multidrug resistant Klebs clones emerging out there… convergence of virulence and resistance could happen in any one of them, so we need to be thinking and monitoring beyond the well-known ST23 and ST258 strains.

In any case, genomic surveillance is going to become really important for Klebsiella

A global picture of typhoid bacteria

New paper out: (a bit delayed due to travelling the world for science…)

Phylogeographical analysis of the dominant multidrug-resistant H58 clade of Salmonella Typhi identifies inter- and intracontinental transmission events

Nature Genetics 2015 Jun;47(6):632-9. doi: 10.1038/ng.3281

This paper provides a whole-genome snapshot of nearly 2000 genomes of the typhoid bacterium, Salmonella Typhi. The strains involved come from 63 countries contributed by dozens of people around the world, and were sequenced at the Sanger Institute with funding from the Wellcome Trust.

You can get the raw sequence reads under accession ERP001718, and play with the phylogenetic tree and associated map at the new MicroReact website:

Countries included in the study

Countries included in the study

I will post more later about plotting trees & metadata dynamically with MicroReact, and statically with Python and R.

But back to typhoid. This project is special for me for a number of reasons…

  • It is about Typhi, the bug that suckered me into directing my genomics skills into studying pathogens and infectious disease, and was the subject of my PhD project with Gordon Dougan and Julian Parkhill at the Sanger Institute, and Duncan Maskell at Cambridge.
  • It is a natural continuation of my PhD project, with the grunt work done by the new PhD student who took over typhoid genomics work in the Dougan lab when I moved back to Australia (Vanessa Wong, MD PhD), with me helping to direct the analysis from down here in Melbourne.
  • It is a great illustration of how sequencing has changed… The first Typhi genome sequence was done at the Sanger Institute using capillary sequencing, and was published in 2001 (Parkhill et al, Nature). In my PhD project (also at the Sanger Institute), I analysed 19 Typhi genomes sequenced with two sequencing platforms that were new and super-duper back in 2006: 454 (now dead) and Solexa (now known as Illumina – currently ruling the sequencing world globally). This was published in 2008, seven years after the first genome (Holt et al, Nature Genetics). Now, another 7 years later in 2015, we are publishing almost 2000 genomes.
  • Typhi is still one of the best examples I know of how sequencing has transformed bacterial surveillance and opened up a whole new field of genomic epidemiology. Before we could look at whole genomes, every Typhi strain looked pretty much the same genetically… there is so little variation, that lower resolution approaches like MLST just couldn’t tell us anything. Now that we can capture whole genomes relatively easily, we can track the transmission and evolution of these bugs essentially in real time.

So what did all this sequencing achieve? Basically we learnt a lot about a particularly tricky clone, called H58, that has spread quite rapidly across Asia and Africa and is responsible for most cases of multidrug resistant typhoid (infections that don’t respond to treatment with most antibiotics). About half of all our isolates belonged to this clone.

  • By comparing root-to-tip branch lengths in the phylogenetic tree of H58 to the isolation dates of each strain, we found evidence of a temporal signal. So we did BEAST analysis, using the isolation date of each strain to date the tips and model mutation rates and divergence dates for H58. This showed that SNPs accumulated slowly in the Typhi H58 genome, at a rate of ~2 SNPs every 3 years. This placed the emergence of H58 at ~1989, just before our oldest example of H58 (1992). We haven’t been able to do proper dating in Typhi before, probably because most of the samples we’ve looked at previously have been phylogenetically diverse strains that are separated by centuries of evolution including periods of epidemic transmission (higher mutation rate per unit time) and long-term carriage (lower mutation rate per unit time). Here we probably have enough data from a period of epidemic transmission of H58 that the signal from epidemic transmission is detectable. I think this is very similar to Mycobacterium tuberculosis (TB), which notoriously has very little temporal signal, and yet a localised 4-decade transmission chain in Argentina showed very strong temporal signal.pathogen_linear_regression_fullANDh58
  • The geographical distribution of the H58 isolates tell us a lot about the routes by which H58 has travelled the world. The tree of H58 is so big that it’s hard to see what’s happening…. so to make it easier, I used R to collapse localised subclades of H58 that contained isolates from a single country (panel A – the size of the circle reflects the number of isolates in the subclade), and showed the time span for each subclade next to the tree (panel B). Occasionally there were one or two isolates within a localised subclade that were sourced from neighbouring countries, indicating transfer to those countries… these are shown in panel C.


  • We inferred these geographical patterns of the spread of H58, based on the tree and the regions of isolation:


  • We learnt a lot about the evolution of multidrug resistance in Typhi H58. We knew that resistance to all first-line antibiotics was usually encoded in one big transposon, which came into H58 in a IncHI1 plasmid. But the new collection showed that this transposon has transferred into the Typhi H58 chromosome, not once but many times! These transfers have happened in separate events, in different parts of the world, and into different parts of the chromosome. This is what the transposon looks like, and two of the insertion sites relative to the reference chromosome (CT18):

Acquired multidrug resistance in Typhi H58

  • Finding transposon insertion sites is tricky! The transposon has copies of the IS1 transposase at either end, which we think are responsible for moving the whole transposon around. This poses a problem for genome assembly with short reads. One way around this is to sequence with long reads… the figure above shows two different insertion sites that we confirmed by using PacBio sequencing to get complete genomes. But we had >850 H58 genomes sequenced using Illumina which gives us short reads, so really we needed to figure out the insertion sites as best we could using the Illumina data. Luckily my PhD student Jane Hawkey had been working on a method to do this, called ISMapper. Using this approach, we could identify all the IS1 insertion sites in every Illumina-sequenced genome. We also found a couple of additional plasmids. This is where all the different multidrug resistance determinants are in the H58 population:


Finally, a clear message from this study is that we need to do a lot more sequencing of Typhi! While we have a lot of genomes here, there are large geographic areas that we just don’t know much about. Plus, we have seen that antibiotic resistance is evolving and changing fast, and we will need to keep up with this using ongoing genomic surveillance.

Bacterial colonization of the airways during the first year of life

The latest paper from the group is out in Cell Host & Microbe:

The Infant Nasopharyngeal Microbiome Impacts Severity of Lower Respiratory Infection and Risk of Asthma Development
Shu Mei Teo, Danny Mok, Kym Pham, Merci Kusel, Michael Serralha, Niamh Troy, Barbara J. Holt, Belinda J. Hales, Michael L. Walker, Elysia Hollams, Yury A. Bochkov, Kristine Grindle, Sebastian L. Johnston, James E. Gern, Peter D. Sly, Patrick G. Holt, Kathryn E. Holt*, Michael Inouye*

Also see preprint in BioRxiv

Graphical abstract

Basically, what we did is to perform 16S rRNA sequencing (V4 region, via MiSeq) on post-nasal aspirates (basically snot washed out of babies’ noses) collected from children as part of a cohort study called the Childhood Asthma Study run by the Telethon Kids institute in Perth, Western Australia.

Samples were collected at around 2 months, 6 months and 12 months of age, during periods of respiratory health (i.e. no symptoms of respiratory illnesss for at least 1 month). Also, every time a baby developed symptoms of respiratory illness, a study nurse would come and examine the child and take another sample. The snots were divided into 4 aliquots each and cryofrozen until last year, when one aliquot each was thawed for DNA extraction.

The 16S data showed very simple communities of bacteria in each sample… basically each sample was dominated by a single genus, either Staphylococcus, Corynebacterium, Alloiococcus, Moraxella, Streptococcus, or Haemophilus. Temporal colonization patterns indicated that most children were initially colonized with skin-associated organisms (Staphylococcus or Corynebacterium), which was replaced during the first year of life by stable colonization with Alloiococcus or Moraxella.


Amongst the samples taken during episodes of respiratory illness, we found that Moraxella, Streptococcus, or Haemophilus were very common. We also found the presence of these bacterial genera was associated with increased severity of symptoms of respiratory illness – samples with these bacteria were more likely to be from infections that spread to the lower respiratory tract (i.e. where the child had wheeze or rattly chest) and experienced fever (signalling an inflammatory response).

Importantly, we found that the children who experienced lower respiratory illness with fever were more likely to be chronic wheezers (i.e. asthmatic) at the age of 5 or 10 years, especially if they became sensitised to aeroallergens during their second year of life.

This suggests that Moraxella, Streptococcus, or Haemophilus colonization in infancy may be risk factors for severe respiratory illness and later asthma development. This could potentially explain earlier studies showing that colonization with M. catarrhalis, S. pneumoniae or H. influenzae (detected by culture) at the age of 1 month is associated with increased risk of asthma in childhood.

Many more details are available in the paper!

Global and local views of Shigella sonnei population genomics

If you have seen me give a talk in the last couple of years, chances are you would have heard a bit about Shigella sonnei. This is because it has been my favourite project in recent years, for two main reasons:

(1) it involved looking in-depth at phylogeography and evolution of the same organism at two different scales – first globally, over hundreds of years and then locally in Vietnam, over about 15 years; and

(2) it was done with two people I really enjoy working with – Steve Baker (based at the Oxford University Clinical Research Unit in Vietnam) and Nick Thomson (based at the Sanger Institute).

Here are the papers:

Shigella sonnei genome sequencing and phylogenetic analysis indicate recent global dissemination from Europe

Holt KE, et al. Nature Genetics 2012 [PubMed]

This study used whole genome sequencing of a global collection of 132 Shigella sonnei, an increasingly important cause of dysentery, to reconstruct the evolutionary history of the bacterium. Phylogenetic analysis showed that the current S. sonnei population descends from a common ancestor that existed less than 500 years ago and that diversified into several distinct lineages with unique characteristics. Furthermore the analysis suggests that the majority of this diversification occurred in Europe and was followed by more recent establishment of local pathogen populations on other continents, predominantly due to the pandemic spread of a single, rapidly evolving, multidrug-resistant lineage.

Commentaries on the paper are available in Nature Genetics and Nature Reviews Gastroenterology and Hepatology.

Dissemination of S. sonnei lineages out of Europe. Reprinted by permission from Macmillan Publishers Ltd: Nature Genetics 44:1056, copyright 2012.

Dissemination of S. sonnei lineages out of Europe. Reprinted by permission from Macmillan Publishers Ltd: Nature Genetics 44:1056, copyright 2012.

Tracking the establishment of local endemic populations of an emergent enteric pathogen

Holt KE, et al. PNAS 2013 [PubMed]

This study continues the Shigella sonnei story by examining the arrival of the rapidly evolving multidrug-resistant lineage in one particular country – Vietnam. We sequenced over 250 genomes of S. sonneiisolated over a 15-year period, and found that the multidrug-resistant lineage successfully established itself in Ho Chi Minh City, pushing out other dysentery-causing bacteria to become the dominant cause of dysentery.

This was likely helped by the acquisition of a colicin (toxin) system that enabled it to kill competing bacteria it came into contact with (including otherShigella), forming a new clone we called the VN (Viet Nam) clone. The VN clone spread to other cities in Vietnam, and we found evidence of convergent evolution of drug resistance mutations and plasmids in all three local populations we examined.

Phylogeny of Vietnamese S. sonnei and map of Vietnam, showing the inferred path of evolution and geographical spread.

Phylogeny of Vietnamese S. sonnei and map of Vietnam, showing the inferred path of evolution and geographical spread.

Behind the paper: SRST2 for short read sequence typing of bacterial pathogens

[Originally posted by Kat on her BacPathGenomics blog, Nov 2014]

I’m pleased to say our paper SRST2: Rapid genomic surveillance for public health and hospital microbiology labs is now out in the Genome Medicine/Genome Biology special issue Genomics of Infectious Diseases!

Having posted the code to GitHub over a year ago, and the paper to BioRxiv in June, it is great to have it finally published. I must say I love the BioRxiv experience, it is great to know the paper was viewed >1500 times while it was stuck in the peer review/publication process.

This project has been a fun journey, which started when I was lucky enough to get a fantastic summer intern, Harriet Dashnow (follow: @hdashnow), funded by VSLCI (our wonderful local supercompute cluster). Harriet’s job was to figure out how to apply our SRST code, which was designed to extract MLST information from Illumina reads (see code and BMC Genomics paper), to detect and type resistance genes and alleles. She did a fantastic job, but in doing so we realised that we needed to overhaul the scoring system to something that was faster and more robust. So, Mike Inouye designed a new scoring approach from the ground up, which Harriet and Bernie Pope from VLSCI turned into the new Python code that is SRST2. We then did a whole lot of testing on public data (mainly by Lesley Raven and Mark Schultz from the Holt & Inouye labs, with some guidance from Justin Zobel in the Computing & Info Systems department) and had our friends at the local reference lab, the Microbiological Diagnostic Unit (mainly Takehiro Tomita), test it out for themselves on a load of Listeria monocytogenes they had recently sequenced on their MiSeq.

What we came up with was some code that draws on the fundamental tools of the genomics trade (bowtie2, samtools & python) to rapidly and reliably detect genes and alleles direct from short read data. SRST2 has two modes, (i) MLST, where you can provide MLST alleles and profile definitions, and it will report sequence types; and (ii) gene detection and allele calling, suitable for screening for acquired genes such as resistance genes, virulence genes and/or plasmid genes.

Figure 1 from the paper outlines how SRST2 works:

The coolest thing about SRST2 is that it is consistent and reliable, even at low read depths, which is absolutely critical for use in a routine diagnostic/surveillance workflow. The robustness of SRST2 stems from the fact that we use a mapping approach coupled with a pretty clever approach to scoring potential matches.

The alternative is to generate de novo assemblies, and then interrogate them with BLAST or other tools… the problem with that approach is that, while de novo assemblers like Velvet and SPAdes can do a great a job of assembling bacterial genomes, there is no way to guarantee you will get a good assembly with every read set. Anyone who works with biggish bacterial data sets of 100s-1000s of genomes knows – it is very hard to standardize the assembly process and establish a protocol that works consistently and reliably every time. And if you are needing to type loads of strains as part of routine service or investigations, you need to know you will get a reliable result every time. Figure 3 in the paper compares the overall performance (call rate x accuracy) of SRST2 vs BLAST analysis of highly optimised assemblies:

As you can see, once you get up to 15-20x read depth, SRST2 will consistently give you an answer, and that answer will be accurate. Whereas using assemblies – even highly optimised ones – you will always miss out on 1-5% of the answers.

We were also asked by a reviewer to compare SRST2 with the only other publicly-available program designed for MLST typing from short reads: the MLST Typer web service at the Danish Center for Genomic Epidemiology. Because this requires uploading read sets to the other side of the world, we only did this for one of our test data sets, 44 Salmonella genomes. The results are in Table 5, and showed that the website got the correct ST for only 14/44 genomes. SRST2 got correct STs for 43/44 genomes (for the 44th strain, SRST2 returned 6/7 alleles correctly and suggested that there is actually a SNP in the 7th allele, which could actually be correct but we cannot verify; MLST Typer did not call anything for this strain).

When it comes to detecting the presence of genes, such as acquired antibiotic resistance genes, SRST2 also outperformed assembly based approaches. We got accurate detection at depths >3x, and picked up many resistance genes that did not assemble well, especially at depths below 15x (see Fig 5 of the paper). Again, we compared to ResFinder on the Center for Genomic Epidemiology website, and found SRST2 performs much better (see Table 3 in the paper). I don’t want to criticize the tools on the Center for Genomic Epidemiology website, as they are really useful for people who have limited resources to analyse genome data… but I do think people need to be aware that the successful use of those tools rely heavily on first obtaining a high quality assembly, and you can very easily miss important antibiotic resistance genes.

So what are the main uses for SRST2?

The key jobs SRST2 is good for are:

  • detecting the presence of genes
  • identifying specific alleles for those genes from a known set of sequencescalculating sequence types (STs) specified by MLST-style schemas.

So it is great for detecting acquired genes, such as antibiotic resistance genes, virulence genes or plasmid genes; or extracting MLST information. To this end, the SRST2 download includes:

  • a script to help you pull down the latest MLST data from for >100 MLST schemes
  • instructions and code for typing virulence genes from the Virulence Factor Database (VFDB)
  • SRST2-compatible versions of popular resistance gene databases (ARG-Annot, ResFinder) and the PlasmidFinder database of plasmid replicons

For example, in the paper we show the use of SRST2 for investigating changing patterns of vancomycin resistance amongst Enterococcus faecium isolated at a local hospital (data from this 2013 MBio paper). We analysed the Illumina readsets using SRST2 to detect resistance genes and perform MLST. SRST2 returned a table of strains with columns to indicate sequence types, alleles for each locus in the MLST scheme, and the presence of acquired resistance genes (shown here viewed in Excel):


You will notice that the depth values are VERY high for this data set, as the sequencing was done at relatively low-plex on a HiSeq. This means that it takes a long time to map (and VERY long time to assemble) these read sets… however note that you can obtain the same results much faster by telling SRST2 to use just the first 1 or 2 million reads per strain, which is more than enough to get accurate results (–stop_after 1000000).

We used R to summarize and plot the results, which are shown in Figure 7 in the paper (R code is in the /scripts directory in the SRST2 download):

Panel (a) shows that vanB vancomycin resistance became more frequent over the 12 years investigated, while panel (b) shows that the STs also changed during this time period… so is the increase in resistance linked to the introduction of new van-resistant ST252 and ST203 strains?

Panel (c) puts the ST & resistance results together… the tree on the left is a dendrogram depicting the similarity between sequence types based on number of shared alleles; MLST profiles (STs and their component alleles) are shown next to this. On the right, the frequency of each ST is shown as a bar graph, which highlights that there are 3 main types, ST17, ST252 and ST203, which all share 5-6 alleles. The heatmap in the middle shows the frequency of the various resistance genes within each ST – it is clear that the vanB locus is present in all 3 STS, but only at ~50% in each… so the increase in vancomycin resistance is not explained by the introduction of resistant ST252 and ST203 clones; rather, vancomycin resistance comes and goes, via movement of the vanB operon, in all 3 common STs.

The whole genome phylogenetic analysis in the 2013 MBio paper shows this in much finer detail of course; but the basic conclusions could be drawn from a very rapid analysis with SRST2.

The same is true for hospital outbreak investigations (see examples in the paper) – in a few minutes SRST2 can show whether a potential transmission case has the same ST and acquired genes profile as confirmed outbreak cases; if the ST matches, genome-wide SNP trees can be used to investigate further.

My next post will have some pro tips for using SRST2.