Author: kat

Bacterial genomics researcher in Melbourne, Australia

Postdoc jobs available in the lab

These applications are now closed, but if you have experience in bacterial population genomics or microbiome work and are interested in joining the lab, please get in touch.

We are currently advertising two new positions in the lab, based at the University of Melbourne. Feel free to contact me if you would like more information or to discuss which position may suit you best.

(1) Research Fellow (postdoc or RA) – NHMRC funded

Salary: $62,973 – $85,452 p.a. (*PhD Entry Level $79,609 p.a.) plus 9.5% superannuation

This position is for a research assistant or postdoctoral researcher who will use bioinformatics and genomic analysis to investigate bacterial pathogens. The appointee will develop and apply phylogenetic, statistical and/or computational analyses to study the evolution and gene regulation of E. coli and other bacterial pathogens. They will have the opportunity to contribute to and develop related projects within the research group, and produce high impact publications.

Close date: 6 Apr 2015 (extended from 19 Mar 2015)

Note this is advertised for 1 year but there will be opportunities to extend.

For position description and to apply online, see http://jobs.unimelb.edu.au/caw/en/job/885526/research-fellow

(2) Postdoctoral Research Fellow – Wellcome Trust funded

Salary: $62,973* – $85,452 p.a. (*PhD Entry Level $79,609 p.a.) plus 9.5% superannuation

This position is for a postdoctoral researcher to join an international research team working together to understand and control typhoid fever. The position is based in the Holt lab (http://holtlab.net) at the University of Melbourne and is funded by the Wellcome Trust Strategic Award: “A strategic vision to drive the control of enteric fever through vaccination”, led by the University of Oxford and joined by collaborating research groups at the Wellcome Trust Sanger Institute, University of Liverpool, Liverpool School of Tropical Medicine and Hygiene, Yale University, Princeton University and partner labs in typhoid-endemic areas of Africa and Asia.

Close date: 6 Apr 2015 (extended from 19 Mar 2015)

Note the position is initially for 2 years with the potential for extension.

For position description and to apply online, see http://jobs.unimelb.edu.au/caw/en/job/885527/research-fellow

Bandage – View and navigate assembly graphs

New from Ryan Wick, a wonderful MSc (Bioinformatics) student in the lab:

Bandage – Bioinformatics Application for Navigating De novo Assembly Graphs Easily

http://rrwick.github.io/Bandage/

De novo assembly graphs contain assembled contigs (nodes) but also the connections between those contigs (edges), which are not easily accessible to users. Bandage is a program for visualising assembly graphs using graph layout algorithms. By displaying connections between contigs, Bandage opens up new possibilities for analysing de novo assemblies that are not possible by looking at contigs alone.

BLAST searches of an assembly can be conducted within Bandage… the resulting hits are highlighted in the graph!

Note: works with the assembly graphs output by Velvet, SPAdes and Trinity.

Tools for bacterial comparative genomics

Yesterday I spoke at a workshop for JAMS TOAST (Sydney’s Joint Academic Microbiology Seminars – bioinformatics workshop)… I was asked to cover tools for comparative genomics, so I put together a list of the tried and tested programs that I find most useful for this kind of analysis. So here is the list.

First, a few caveats…

These are mostly tools with a graphical user interface (mostly Java based)… this means they should be pretty accessible to most users, however if you want to do analyses that are a bit more custom or niche, you will have to get your hands dirty and use the commandline (which you should learn to do anyway!!)

These tools are useful for small-ish scale genomic comparisons, in the order of 2-20 genomes.

Most of these tools are for assembled data, hence we start with how to assemble your data… this will become less of an issue as we move to long read sequencing with PacBio and MinION etc, but for the moment most of the data I work with is from large scale sequencing projects with Illumina (100s-1000s) so we use mapping-based approaches for a lot of tasks… so I have included a few comments about this at the end.

Beginner’s guide with walk-through tutorial

Some of these tools, particularly the visualisation of whole genome comparisons (using Artemis & ACT, Mauve, and BRIG), are covered in the tutorial from our 2013 “Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data”. So if you want a walk-through, that’s a good place to start. Note that we have updated the tutorial (as of July 2017) to version 2, available here.

First things first – Are my reads good quality?

FastQC – Generate graphical reports of read quality from the fastq files.
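To give a feel for what a per-base quality plot summarises, here is a minimal Python sketch that averages Phred scores at each read position. It is not how FastQC is implemented; the function name is made up for illustration, and it assumes plain 4-line FASTQ records with Phred+33 encoded qualities.

```python
from collections import defaultdict
from io import StringIO


def mean_quality_per_position(fastq_handle, offset=33):
    """Return the mean Phred score at each read position (0-based).

    Assumes 4-line FASTQ records and Phred+33 encoding (an assumption;
    older Illumina data may use a different offset).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        fastq_handle.readline()  # sequence line (not needed here)
        fastq_handle.readline()  # '+' separator line
        qual = fastq_handle.readline().rstrip("\n")
        for i, ch in enumerate(qual):
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [totals[i] / counts[i] for i in sorted(totals)]


# Two toy reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\n5II\n")
print(mean_quality_per_position(example))  # [30.0, 40.0, 40.0, 40.0]
```

For real data you would open the fastq file (or a gzip handle) instead of the toy StringIO, and of course FastQC reports far more than this one summary.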

Assembly

SPAdes – de Bruijn graph assembly, incorporating multiple kmers and read pairing information in the building of the graph. Think of this as a more sophisticated version of Velvet… in my experience, it nearly always provides better assemblies than Velvet, except on the rare occasion (1-5% of read sets) where it fails to get a good assembly at all. In which case, try Velvet!

Velvet – The first and most widely used de Bruijn graph assembler built to tackle the problem of short reads. Graphs are built using a single kmer value, and read pairing information used for scaffolding only (unlike SPAdes, where multiple kmers are incorporated into a single graph and read pairing is also used directly in building the graph). How do you know what kmer to use? Use Velvet Optimiser. Hate the command line? Try Vague, a GUI wrapper for Velvet.
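If the idea of a de Bruijn graph built from a single kmer value is unfamiliar, the toy Python sketch below may help. It uses the textbook formulation (nodes are (k-1)-mers, each kmer contributes one edge) and ignores everything that makes real assemblers hard: reverse complements, sequencing errors, coverage and read pairing. It is an illustration only, not how Velvet or SPAdes are implemented, and the function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a toy de Bruijn graph from reads using a single kmer size.

    Returns a dict mapping each (k-1)-mer node to the sorted list of
    (k-1)-mer nodes that follow it in at least one read.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    return {node: sorted(successors) for node, successors in graph.items()}


reads = ["ACGTAC", "GTACG"]  # made-up reads for illustration
for node, successors in sorted(de_bruijn_edges(reads, k=4).items()):
    print(node, "->", successors)
```

Changing k changes which overlaps are visible in the graph, which is why the choice of kmer matters so much for a single-kmer assembler.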

How do I judge if I have a good assembly? Try QUAST
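One of the headline numbers in a QUAST report is N50: the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal Python sketch of the calculation, using made-up contig lengths (QUAST itself reports many more metrics than this):

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical assembly of five contigs totalling 300 bp.
print(n50([100, 80, 60, 40, 20]))  # 80
```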

What other assemblers are there? What’s best for what task? Take a look at Nucleotid.es and Assemblathon.

How can I view my assembly graphs? Try Bandage – freshly released from Ryan Wick, an MSc (Bioinformatics) student in my lab. Bandage allows you to view and manipulate de Bruijn graphs output by Velvet or SPAdes… lots of super cool features and useful applications, see the github site for examples.

Working with assembled data

Now you have a nice set of assembled contigs – where are all the genes?

Whole genome annotation

RAST – Web tool (upload contigs), uses the subsystems in the SEED database and provides detailed annotation and pathway analysis. Takes several hours per genome but I think this is the best way to get a high quality annotation (if you have only a few genomes to annotate).

Prokka – Standalone command line tool, takes just a few minutes per genome. This is the best way to get good quality annotation in a flash, which is particularly useful if you have loads of genomes or need to annotate a pangenome or metagenome. Note however that the quality of functional information is not as good as RAST, and you will need several extra steps if you want to do functional profiling and pathway analysis of your genome(s)… which is in-built in RAST.

Annotating specific types of features

Resistance genes

CARD – best combination of easy interface + pretty good database
ARG-Annot – best quality database (in my experience, focusing on Enterobacteriaceae)
ResFinder – easy interface, database needs ongoing development

Virulence genes

PATRIC – for certain bugs only, but has good online tools for genome comparisons.
VFDB – broader range of species, but varying levels of comprehensiveness and you need to do more of the work yourself.

Insertion sequences

ISsaga – Upload your genome and have ISsaga find all the transposases in your genome using their ISfinder database

Phage

PHAST – Upload your genome and this will identify likely prophage regions, summarising these at the level of whole phage and also individual genes.

Viewing your genome – The Artemis Genome Browser

There are zillions of genome browsers out there, but I still love Artemis… and not just because I’m from the Sanger Institute. Unlike most genome browsers, Artemis was custom-built for bacterial genomes, which let’s face it are really quite different from humans and other eukaryotes.

The default view shows you your sequence and annotation, with 6 frame translation and allows you to easily edit or create features in the annotation, graph sequence-based functions like GC content and GC skew, and do all manner of other useful things. It’s been around for a zillion years (well, at least 10 or so) and is very well developed and supported.
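For anyone curious what those GC plots are actually computing, here is a rough Python sketch of GC content and GC skew in sliding windows, using the usual definition of skew as (G - C) / (G + C). The window and step sizes are arbitrary, and this is not taken from Artemis itself, whose defaults and implementation may differ.

```python
def gc_windows(seq, window, step):
    """Return (start, gc_content, gc_skew) for each full window of seq.

    gc_content = (G + C) / window
    gc_skew    = (G - C) / (G + C), reported as 0.0 if the window has no G or C
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        gc_content = (g + c) / window
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start, gc_content, gc_skew))
    return results


# Toy sequence: first window is GC-rich and G-heavy, second has no G or C.
print(gc_windows("GGGCATAT", window=4, step=4))
# [(0, 1.0, 0.5), (4, 0.0, 0.0)]
```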

Artemis has lots of cool features built in, including the ‘BamView’ feature that allows you to view BAM files that show the alignment of reads mapped to your genome, zoomed in to the base level or zoomed out to look at coverage and SNP distributions… this is also super handy for viewing RNAseq data, as you can easily see the stacks of reads derived from coding regions.

Artemis also has DNA Plotter built in, which you can use to generate those pretty circular figures of your genome sequences and their features.

Plus, when you’ve got used to using Artemis to get to know your shiny new genome, you can move on to viewing comparisons against other genomes using ACT – the Artemis Comparison Tool.

Comparing whole genome assemblies

NOTE: Walk-throughs of these tools, using examples from the 2011 E. coli outbreak in Germany, are covered in the “Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data”.

ACT (Artemis Comparison Tool) – Visualises BLAST (or similar) comparisons of genomes. This is most useful for comparisons of two or a few genomes, and makes it easy to spot and zoom in to regions of difference.

Mauve – Whole genome alignment and viewer that can output SNPs, regions of difference, homologous blocks, etc. It can also be used to assess assembly quality against a reference, using Mauve Contig Metrics.

BRIG (BLAST Ring Image Generator) – Gives a global view of whole genome comparisons by visualising BLAST comparisons via pretty circular figures. This is suitable for comparing lots of genomes, although because you have to enter each one through the GUI, it’s tricky to do more than a dozen or so.

Whole genome SNP-based phylogenies (from assembled data)

You can’t go past Adam Phillippy’s Harvest Suite

Parsnp – Compare genomes to a reference (using MUMmer) to identify core genome SNPs and build a phylogeny

Gingr – View the phylogeny and associated SNP calls (VCF format)… also useful for visualising tree + VCF that you have created in other ways, e.g. from mapping.
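To make “core genome SNPs” concrete: given genomes that have already been aligned to each other, a core SNP is a column where every genome has a real base (no gaps or Ns) and at least two genomes differ. The Python sketch below shows just that filtering step on a made-up alignment. It is not Parsnp’s algorithm (the hard part, the alignment itself, is assumed to be done already) and the function name is hypothetical.

```python
def core_snp_sites(alignment):
    """Return 0-based positions of core SNPs in a multiple alignment.

    alignment: dict mapping genome name -> aligned sequence (equal lengths).
    A site counts only if every genome has A, C, G or T there (core) and
    the genomes do not all share the same base (variable).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    sites = []
    for pos in range(length):
        column = [alignment[name][pos].upper() for name in names]
        if all(base in "ACGT" for base in column) and len(set(column)) > 1:
            sites.append(pos)
    return sites


# Made-up alignment: position 4 varies but has a gap, so it is not core.
toy = {"ref": "ACGTACA", "iso1": "ACGTTCG", "iso2": "AGGT-CA"}
print(core_snp_sites(toy))  # [1, 6]
```

Concatenating the bases at these positions for each genome gives the SNP alignment that is typically fed into a tree-building program.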

Detecting recombination in whole genome comparisons

Gubbins – A new implementation of the approach first used in Nick Croucher’s 2011 Science paper on Streptococcus pneumoniae. Command-line driven and runs pretty fast (<2 hours usually on our data).

BRAT NextGen – Uses a similar idea to Gubbins but is based on Bayesian clustering and is GUI-driven… sounds nice, but actually I find it less convenient than Gubbins as there are manual steps required and then you need to run lots of iterations to get significance values.

Mapping based analyses

Why?

If you have specific questions to answer, where precise variant detection is important (e.g. allele calling, MLST, SNP detection, typing, mutation detection), mapping provides greater sensitivity and specificity than assembled data. Basically, if you want to be really sure about a variant call, you should be using the full information available in the reads rather than relying on the assembler and consensus base caller to get things right every time. See our SRST2 paper if you don’t believe me.

Also, if you need quick answers to specific questions, this is almost always going to be achieved faster and more accurately if you work direct from reads without attempting to generate high quality assemblies first.
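As a toy illustration of what “using the full information in the reads” means, here is a Python sketch that calls a base at one position from the pile of read bases covering it, and refuses to make a call when depth or agreement is too low. The thresholds and function name are made up for illustration; real variant callers (and SRST2) also use base and mapping qualities and are considerably more sophisticated.

```python
from collections import Counter


def call_base(pileup_bases, min_depth=5, min_fraction=0.9):
    """Return the majority base at a site, or 'N' if it is poorly supported.

    pileup_bases: string (or list) of read bases covering one position.
    min_depth and min_fraction are illustrative thresholds, not recommendations.
    """
    counts = Counter(b.upper() for b in pileup_bases if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too few reads to be confident
    base, support = counts.most_common(1)[0]
    return base if support / depth >= min_fraction else "N"


print(call_base("AAAAAAAAAA"))  # A  (10/10 reads agree)
print(call_base("AAAAAATTTT"))  # N  (only 60% support for the majority base)
print(call_base("AAA"))         # N  (depth below min_depth)
```

An assembler’s consensus sequence gives you a single base at each position; the reads let you see how strong the evidence behind that base actually is.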

The basics

For mapping our go-to is BWA or Bowtie2 (getting from fastq -> BAM). For processing of BAMs we use: SAMtools and BAMtools for variant calling, and BAMstats and BEDtools for summarising coverage and other information from the alignments.
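For the fastq -> BAM step, the usual shape of the commands is: index the reference, map the paired reads, then sort and index the alignments. The Python sketch below only builds and prints those command lines (it does not run anything). The file names are placeholders, the function name is hypothetical, and the exact options you need will depend on your BWA and SAMtools versions, so treat this as a starting point and check the tools’ own help output.

```python
import shlex


def mapping_commands(reference, reads_1, reads_2, out_bam):
    """Return shell command strings for a basic BWA + SAMtools mapping run.

    Assumes paired-end reads and a reasonably recent SAMtools (one where
    'samtools sort -o' is available). Nothing is executed here.
    """
    ref, r1, r2, bam = (shlex.quote(p) for p in (reference, reads_1, reads_2, out_bam))
    return [
        f"bwa index {ref}",                                      # build the reference index
        f"bwa mem {ref} {r1} {r2} | samtools sort -o {bam} -",  # map, then sort into a BAM
        f"samtools index {bam}",                                 # index the sorted BAM
    ]


# Placeholder file names for illustration only.
for cmd in mapping_commands("ref.fasta", "iso1_1.fastq.gz", "iso1_2.fastq.gz", "iso1.bam"):
    print(cmd)
```

The sorted, indexed BAM is what you would then load into Artemis (BamView) or pass on to variant calling and coverage summaries.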

Pipelines for specific tasks

There are loads of pipelines around the place that use the basic tools above to do specific tasks. A few of ours are:

SRST2 – MLST, resistance genes, virulence genes
ISMapper – IS (insertion sequence / transposase) insertions
RedDog – Whole genome SNP-based phylogenies

Global and local views of Shigella sonnei population genomics

If you have seen me give a talk in the last couple of years, chances are you would have heard a bit about Shigella sonnei. This is because it has been my favourite project in recent years, for two main reasons:

(1) it involved looking in-depth at phylogeography and evolution of the same organism at two different scales – first globally, over hundreds of years and then locally in Vietnam, over about 15 years; and

(2) it was done with two people I really enjoy working with – Steve Baker (based at the Oxford University Clinical Research Unit in Vietnam) and Nick Thomson (based at the Sanger Institute).

Here are the papers:

Shigella sonnei genome sequencing and phylogenetic analysis indicate recent global dissemination from Europe

Holt KE, et al. Nature Genetics 2012 [PubMed]

This study used whole genome sequencing of a global collection of 132 Shigella sonnei, an increasingly important cause of dysentery, to reconstruct the evolutionary history of the bacterium. Phylogenetic analysis showed that the current S. sonnei population descends from a common ancestor that existed less than 500 years ago and that diversified into several distinct lineages with unique characteristics. Furthermore the analysis suggests that the majority of this diversification occurred in Europe and was followed by more recent establishment of local pathogen populations on other continents, predominantly due to the pandemic spread of a single, rapidly evolving, multidrug-resistant lineage.

Commentaries on the paper are available in Nature Genetics and Nature Reviews Gastroenterology and Hepatology.

Dissemination of S. sonnei lineages out of Europe. Reprinted by permission from Macmillan Publishers Ltd: Nature Genetics 44:1056, copyright 2012.

Tracking the establishment of local endemic populations of an emergent enteric pathogen

Holt KE, et al. PNAS 2013 [PubMed]

This study continues the Shigella sonnei story by examining the arrival of the rapidly evolving multidrug-resistant lineage in one particular country – Vietnam. We sequenced over 250 genomes of S. sonnei isolated over a 15-year period, and found that the multidrug-resistant lineage successfully established itself in Ho Chi Minh City, pushing out other dysentery-causing bacteria to become the dominant cause of dysentery.

This was likely helped by the acquisition of a colicin (toxin) system that enabled it to kill competing bacteria it came into contact with (including other Shigella), forming a new clone we called the VN (Viet Nam) clone. The VN clone spread to other cities in Vietnam, and we found evidence of convergent evolution of drug resistance mutations and plasmids in all three local populations we examined.

Phylogeny of Vietnamese S. sonnei and map of Vietnam, showing the inferred path of evolution and geographical spread.

Typhoid in Kathmandu and Open Biology OA journal

A paper I’ve been working on for a few years on typhoid in Kathmandu yesterday had the honour of being the first paper ever published by the new open access journal of the Royal Society, Open Biology. I’m very keen on open access publishing and always try to submit to OA journals, but there is still a limited choice of truly OA journals. I love PLoS and BMC and submit to both regularly, but I think it’s really important to have a diverse range of OA journals – and therefore diversity in editors, editorial policies & styles, subject areas, etc – to make open access work for everyone.

So I’m excited to have a paper in the new Open Biology, who publish under a Creative Commons 3 license (reuse/modify/distribute with attribution). Only time will tell how well the journal does, but it will only become great if we scientists are willing to submit good manuscripts. One incentive to do this is that Open Biology aims for a quick turn-around time of 4 weeks from submission to decision. Much as I love PLoS and BMC, they’ve never managed anywhere near that sort of turn-around. For info on Open Biology, see their ‘About’ page https://royalsocietypublishing.org/rsob/about.

So what is the paper? Thanks to OA, I can reproduce it here… (or you can read online or PDF)

Combined high-resolution genotyping and geospatial analysis reveals modes of endemic urban typhoid fever transmission

Stephen Baker1,2,*,†, Kathryn E. Holt3,4,†, Archie C. A. Clements5, Abhilasha Karkey2, Amit Arjyal2, Maciej F. Boni1,6, Sabina Dongol2, Naomi Hammond4, Samir Koirala2, Pham Thanh Duy1, Tran Vu Thieu Nga1, James I. Campbell1, Christiane Dolecek1,2, Buddha Basnyat2, Gordon Dougan4 and Jeremy J. Farrar1,2

Open Biol October 2011 1:110008; doi:10.1098/rsob.110008.

Basically, it uses genotyping and GPS to study typhoid fever in Kathmandu, Nepal. We examined 4 years’ worth of typhoid cases and looked at where the patients lived within the city (using GPS) and subtyped the bacteria responsible for their infections using high throughput SNP typing.

Firstly, we found that about 3/4 of the patients were infected with Salmonella Typhi and 1/4 were infected with Salmonella Paratyphi A. If you aren’t familiar with Salmonella, these are two serotypes of Salmonella enterica which, rather than causing gastrointestinal disease (ie food poisoning) like most Salmonella serotypes, cause the systemic infection known as typhoid. Typhi and Paratyphi A are quite different genetically, but have undergone convergent evolution to cause the same disease syndrome (see earlier paper in BMC Genomics).

Temporal distribution of Typhi (red) and Paratyphi A (blue) cases

Then we looked at the spatial distribution of the patients’ homes, and found that they were clustered in specific “hotspot” areas of Kathmandu:

Spatial risk model for Typhi infection (see paper for separate map for Paratyphi A risk)

Contrary to expectation, these hotspots weren’t the most densely populated areas…you might expect more people = more cases, but this wasn’t the case. Some complicated spatial statistics, done by Archie Clements at University of Queensland, confirmed that the hotspots weren’t associated with population density or hospital referral patterns, but were in low-elevation areas where local people source their water from stone waterspouts.

Spatial distribution of Typhi cases, and location of water spouts

To see if the waterspouts could really be a source of typhoid transmission, we tested water samples for the presence of Typhi or Paratyphi A using culture and PCR. Culturing didn’t work, but it is notoriously difficult to culture Typhi from water samples that are not pre-enriched for bacteria…however PCR (using this method we published earlier in BMC Infectious Diseases) detected Typhi in 3/4 of water samples and Paratyphi A in 2/3.

Stone water spouts in Kathmandu (taken by co-author Stephen Baker)

We also looked at the population of bacteria causing the typhoid fever. We examined Salmonella Typhi isolated from the blood of typhoid fever patients, and used SNP typing to analyse the Typhi DNA and examine the population structure. We typed 113 SNPs (single nucleotide polymorphisms, ie point mutations) that we already knew about from previous variation discovery efforts. About 2/3 of isolates had the same haplotype, so to discriminate further within this local subgroup we sequenced 40 of the Typhi to identify novel SNPs arising in the local population (local microevolution) and typed these SNPs as well. Most of the Typhi belonged to the H58 lineage, which is common in other typhoid endemic zones we’ve looked at previously (Mekong Delta Vietnam – Holt & Dolecek 2011, PLoS NTD [OA]; Nairobi, Kenya – Kariuki 2010, J Clin Micro [free]; globally – Holt & Phan 2011, PLoS NTD [OA], Roumagnac 2006, Science [free in PMC]).
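If you haven’t worked with SNP typing data before: each isolate ends up with an allele call at every typed SNP, and the string of alleles across all the SNPs is its haplotype. Counting how often each haplotype occurs is then straightforward. The Python sketch below uses three invented isolates typed at four SNPs purely for illustration (the isolate names, alleles and function name are all hypothetical and are not data from the paper).

```python
from collections import Counter


def haplotype_frequencies(snp_calls):
    """Return the fraction of isolates carrying each haplotype.

    snp_calls: dict mapping isolate name -> string of alleles at the typed
    SNP loci (same loci, same order, for every isolate).
    Haplotypes are returned most common first.
    """
    counts = Counter(snp_calls.values())
    total = len(snp_calls)
    return {hap: n / total for hap, n in counts.most_common()}


# Invented example: two isolates share a haplotype, one differs at the last SNP.
calls = {"isolate_1": "ACGT", "isolate_2": "ACGT", "isolate_3": "ACGA"}
for hap, freq in haplotype_frequencies(calls).items():
    print(hap, round(freq, 2))
```

When one haplotype dominates like this, the typed SNPs cannot separate those isolates any further, which is why extra SNPs discovered by sequencing local isolates are needed to get more resolution.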

Typhi tree, red bars indicate frequency of genotypes in Kathmandu collection; red zones are H58 lineage and H58G sublineage

As the map above shows, the different Typhi genotypes were distributed randomly, with no spatial or temporal clustering. The exception was a probable outbreak in the west of the city, outside the hotspot zone, where 28 cases of infection with the same Typhi genotype were recorded in a two-month period – see yellow shaded area in map above, and zoomed in below:

Localised outbreak of Typhi genotype H58G-b4

Finally, we looked at what was happening in households from which multiple typhoid infections were studied. You might expect these household disease clusters to represent shared infections, which are transmitted between members of the household. However in most of these household typhoid clusters, the cases were caused by different organisms – either Paratyphi in one case and Typhi in the other, or multiple cases caused by different genotypes of Typhi. Cases with the same causative Typhi genotype are linked with dashed lines in this figure; you can see they are the exception rather than the rule:

Distribution of typhoid-causing bacteria in households with multiple typhoid cases

So, all in all we found typhoid fever infections clustered in spatial hotspots within Kathmandu, and that this clustering was explained not by population density but by low elevation and proximity to stone water spouts which are used to supply water. This implicates the water spouts in typhoid transmission via dissemination of Typhi and Paratyphi A around the city, supported by the detection of Typhi and Paratyphi A in the majority of water samples taken from these spouts. The diversity of Typhi genotypes we detected indicates that transmission occurs via water that is contaminated with a diverse population of Typhi, rather than point source outbreaks (with the exception of one outbreak, which actually occurred outside the hotspot zone). The diversity of Typhi genotypes within households suggests that this sort of transmission – ie dissemination via contamination of the water supply – contributes more to the overall typhoid burden than direct person-to-person transmission.

How does this contamination happen? Well, it is possible for people to carry Typhi and Paratyphi A in the gall bladder, without ever noticing an infection…for example, Typhoid Mary was a famous carrier of Typhi. Carriers shed the bacteria in their feces, so any food or water contaminated with their fecal material becomes a vehicle for typhoid transmission. Our study suggests that there are many Typhi and Paratyphi A carriers in Kathmandu who are unknowingly shedding the bacteria, so that whenever sewage seeps into the groundwater that feeds the stone water spout, the water becomes contaminated and can pass on the infection to those who drink the water. Most of the typhoid cases occur in the monsoon season, when flooding is likely to promote seepage of sewage into the underground aquifers that supply the water spouts. Hence the study suggests that endemic typhoid in Kathmandu is essentially a question of water infrastructure, and could potentially be dramatically reduced by supplying clean drinking water to people living in these few hotspot areas.

Holt Lab

microbial genomics