Tools for bacterial comparative genomics

Yesterday I spoke at a workshop for JAMS TOAST (Sydney’s Joint Academic Microbiology Seminars – bioinformatics workshop)… I was asked to cover tools for comparative genomics, so I put together a list of the tried and tested programs that I find most useful for this kind of analysis. So here is the list.

First, a few caveats…

These are mostly tools with a graphical user interface (mostly Java based)… this means they should be pretty accessible to most users, however if you want to do analyses that are a bit more custom or niche, you will have to get your hands dirty and use the commandline (which you should learn to do anyway!!)

These tools are useful for small-ish scale genomic comparisons, in the order of 2-20 genomes.

Most of these tools are for assembled data, hence we start with how to assemble your data… this will become less of an issue as we move to long read sequencing with PacBio and MinION etc, but for the moment most of the data I work with is from large scale sequencing projects with Illumina (100s-1000s) so we use mapping-based approaches for a lot of tasks… so I have included a few comments about this at the end.

Beginner’s guide with walk-through tutorial

Some of these tools, particularly the visualisation of whole genome comparisons (using Artemis & ACT, Mauve, and BRIG) are covered the in the tutorial from our 2013 “Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data“. So if you want a walk-through, that’s a good place to start. Note that we have updated the tutorial (as of July 2017) to version 2, available here.


First things first – Are my reads good quality?

FastQC – Generate graphical reports of read quality from the fastq files.


Assembly

SPAdes – de Bruijn graph assembly, incorporating multiple kmers and read pairing information in the building of the graph. Think of this as a more sophisticated version of Velvet… in my experience, it nearly always provides better assemblies than Velvet, except on the rare occasion (1-5% of read sets) where it fails to get a good assembly at all. In which case, try Velvet!

Velvet – The first and most widely used de Bruijn graph assembler built to tackle the problem of short reads. Graphs are built using a single kmer value, and read pairing information used for scaffolding only (unlike SPAdes, where multiple kmers are incorporated into a single graph and read pairing is also used directly in building the graph). How do you know what kmer to use? Use Velvet Optimiser. Hate the command line? Try Vague, a GUI wrapper for Velvet.

How do I judge if I have a good assembly? Try QUAST

What other assemblers are there? What’s best for what task? Take a look at Nucelotid.es and Assemblathon.

How can I view my assembly graphs? Try Bandage – freshly released from Ryan Wick, a MSc (Bioinformatics) student in my lab. Bandage allows you to view and manipulate de Druijn graphs output by Velvet or SPAdes… lots of super cool features and useful applications, see the github site for examples.


Working with assembled data

Now you have a nice set of assembled contigs – where are all the genes?

Whole genome annotation

RAST – Web tool (upload contigs), uses the subsystems in the SEED database and provides detailed annotation and pathway analysis. Takes several hours per genome but I think this is the best way to get a high quality annotation (if you have only a few genomes to annotate).

Prokka – Standalone command line tool, takes just a few minutes per genome. This is the best way to get good quality annotation in a flash, which is particularly useful if you have loads of genomes or need to annotate a pangenome or metagenome. Note however that the quality of functional information is not as good as RAST, and you will need several extra steps if you want to do functional profiling and pathway analysis of your genome(s)… which is in-built in RAST.

Annotating specific types of features

Resistance genes

  • CARD – best combination of easy interface + pretty good database
  • ARG-Annot  – best quality database (in my experience, focusing on Enterobacteriaceae)
  • ResFinder – easy interface, database needs ongoing development

Virulence genes

  • PATRIC – for certain bugs only, but has good online tools for genome comparisons.
  • VFDB – broader range of species, but varying levels of comprehensiveness and you need to do more of the work yourself.

Insertion sequences

  • IS saga – Upload your genome and have IS saga find all the transposes in your genome using their IS finder database

Phage

  • PHAST – Upload your genome and this will identify likely prophage regions, summarising these at the level of whole phage and also individual genes.

Viewing your genome – The Artemis Genome Browser

There are zillions of genome browsers out there, but I still love Artemis… and not just because I’m from the Sanger Institute. Unlike most genome browsers, Artemis was custom-built for bacterial genomes, which let’s face it are really quite different from humans and other eukaryotes.

The default view shows you your sequence and annotation, with 6 frame translation and allows you to easily edit or create features in the annotation, graph sequence-based functions like GC content and GC skew, and do all manner of other useful things. It’s been around for a zillion years (well, at least 10 or so) and is very well developed and supported.

Artemis has lots of cool features built in, including the ‘BamView’ feature that allows you to view BAM files that show the alignment of reads mapped to your genome, zoomed in to the base level or zoomed out to look at coverage and SNP distributions… this is also super handy for viewing RNAseq data, as you can easily see the stacks of reads derived from coding regions.

Artemis also has DNA Plotter built in, which you can use to generate those pretty circular figures of your genome sequences and their features.

Plus, when you’ve got used to using Artemis to get to know your shiny new genome, you can move on to viewing comparisons against other genomes using ACT – the Artemis Comparison Tool.


Comparing whole genome assemblies

NOTE: Walk-throughs of these tools, using examples from the 2011 E. coli outbreak in Germany, are covered in the “Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data“.

ACT (Artemis Comparison Tool) – Visualises BLAST (or similar) comparisons of genomes. This is most useful for comparisons of two or a few genomes, and makes it easy to spot and zoom in to regions of difference.

Mauve – Whole genome alignment and viewer that can output SNPs, regions of difference, homologous blocks, etc. It can also be used to assess assembly quality against a reference, using Mauve Contig Metrics.

BRIG (BLAST Ring Image Generator) – Gives a global view of whole genome comparisons by visualising BLAST comparisons via pretty circular figures. This is suitable for comparing lots of genomes, although because you have to enter each one through the GUI, it’s tricky to do more than a dozen or so.


Whole genome SNP-based phylogenies (from assembled data)

You can’t go past Adam Phiippy’s Harvest Suite

Parsnp – Compare genomes to a reference (using MUMmer) to identify core genome SNPs and build a phylogeny

Gingr – View the phylogeny and associated SNP calls (VCF format)… also useful for visualising tree + VCF that you have created in other ways, e.g. from mapping.


Detecting recombination in whole genome comparisons

Gubbins – A new implementation of the approach first used in Nick Croucher’s 2011 Science paper on Streptococcus pneumoniae. Command-line driven and runs pretty fast (<2 hours usually on our data).

BRAT NextGen – Uses a similar idea to Gubbins but using Bayesian clustering is GUI-driven… sounds nice, but actually I find it less convenient than Gubbins as there are manual steps required and then you need to run lots of iterations to get significance values.



Mapping based analyses

Why?

If you have specific questions to answer, where precise variant detection is important (e.g. allele calling, MLST, SNP detection, typing, mutation detection), mapping provides greater sensitivity and specificity than assembled data. Basically, if you want to be really sure about a variant call, you should be using the full information available in the reads rather than relying on the assembler and consensus base caller to get things right every time. See our SRST2 paper if you don’t believe me.

Also, if you need quick answers to specific questions, this is almost always going to be achieved faster and more accurately if you work direct from reads without attempting to generate high quality assemblies first.

The basics

For mapping our go-to is BWA or Bowtie2 (getting from fastq -> BAM). For processing of BAMs we use: SAMtools and BAMtools for variant calling, and BAMstats and BEDtools for summarising coverage and other information from the alignments.

Pipelines for specific tasks

There are loads of pipelines around the place that use the basic tools above to do specific tasks. A few of ours are:

  • SRST2 – MLST, resistance genes, virulence genes
  • ISMapper – IS (insertion sequence / tranposase) insertions
  • RedDog – Whole genome SNP-based phylogenies

8 comments

  1. This is a really good overview, thank you! I can’t seem to find RedDog though. Has it been taken offline?

    1. Sorry, did not see this comment! Yes it call SNPs wherever they appear across the whole genome, you can post-filter out whatever you like.

Comment