Microbial Genomics methods

MGen (Microbial Genomics) was created last year by the UK’s Microbiology Society, with the aim of becoming the go-to journal for microbial genomics research. As a Senior Editor, I was asked to help mark the occasion of MGen’s six-month anniversary (on Jan 15, 2016) by reviewing the 24 papers published in the journal’s first six months.
The full review is available over on the MGen site, but I wanted to draw particular attention here to trends in the analysis tools used by the genomic epidemiology papers published so far in MGen, which largely reflect what’s been happening across the field generally.

[Image: word cloud of analysis tools]

One third of the articles are cutting-edge genomic epidemiology in action, using genomics to investigate the evolution and transmission of a range of pathogens, from anthrax to dysentery and food poisoning.

Interestingly, all of these studies used similar methodologies, reflecting the maturity of this field of research.

Common approaches include: (i) sequencing large numbers of isolates using high-throughput Illumina platforms; (ii) the identification of SNPs (single nucleotide polymorphisms) using read mapping approaches (with BWA, SMALT, SAMtools and GATK being popular tools); and (iii) uniform use of RAxML for generating maximum likelihood phylogenies.
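To make step (ii) a little more concrete, here is a minimal Python sketch of what typically feeds the phylogeny in step (iii): picking out the variable sites from an alignment of mapped pseudo-genomes. It is not taken from any of the papers; the function name and the toy sequences are made up for illustration, and I have assumed that ambiguous calls (N or gaps) should not count as evidence of variation.

```python
from typing import Dict, List, Tuple


def variable_sites(alignment: Dict[str, str]) -> Tuple[List[int], Dict[str, str]]:
    """Return 1-based positions of SNP sites and the SNP-only alignment.

    `alignment` maps isolate name -> sequence; all sequences must have the
    same length (e.g. consensus calls against one reference genome).
    """
    names = list(alignment)
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be non-empty and the same length")
    length = lengths.pop()

    positions: List[int] = []
    for i in range(length):
        # Assumption: only unambiguous bases count; N and '-' are ignored.
        bases = {alignment[name][i].upper() for name in names} & set("ACGT")
        if len(bases) > 1:
            positions.append(i + 1)

    snp_alignment = {
        name: "".join(alignment[name][p - 1] for p in positions) for name in names
    }
    return positions, snp_alignment


# Toy data, purely illustrative
example = {
    "ref":      "ACGTACGTAC",
    "isolate1": "ACGTACGTAC",
    "isolate2": "ACGAACGTAC",
    "isolate3": "ACGAACNTCC",
}
positions, snps = variable_sites(example)
print(positions)  # positions of the variable sites
print(snps)       # one short SNP string per isolate
```

The resulting SNP-only alignment is the sort of input that would then go to a tree-building program; how ambiguous sites are handled varies between pipelines, so treat that choice as a parameter rather than a rule.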

Some also used BEAST to estimate mutation rates, divergence dates and phylogeographical patterns. Interestingly, half of these papers utilised MLST (multi-locus sequence typing) to identify clades, showing that this sub-genomic approach based on capillary sequencing of ~7 gene fragments is still considered useful by many genomic researchers. [NOTE: nice to see that most who were inferring MLST from Illumina data were using our SRST or SRST2 software.]
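For readers less familiar with MLST, the core idea is just a lookup: each isolate gets an allele number at each of the ~7 loci, and the combination of allele numbers defines a sequence type (ST). The sketch below shows that lookup in Python. The locus names, allele numbers and STs are all hypothetical, and this is not how SRST or SRST2 work internally (they do the hard part of calling alleles from reads); it only illustrates the final profile-to-ST step.

```python
from typing import Dict, Optional, Tuple

# Hypothetical 7-locus scheme; real schemes use named housekeeping genes.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")

# Hypothetical profile table: allele numbers at each locus -> sequence type
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 1, 2, 1, 1, 1, 1): 2,
    (3, 4, 2, 5, 1, 7, 2): 11,
}


def assign_st(alleles: Dict[str, int],
              profiles: Dict[Tuple[int, ...], int] = PROFILES) -> Optional[int]:
    """Return the sequence type for an allele profile, or None if novel."""
    key = tuple(alleles[locus] for locus in LOCI)
    return profiles.get(key)


def loci_differing(a: Dict[str, int], b: Dict[str, int]) -> int:
    """Count loci at which two isolates carry different alleles."""
    return sum(1 for locus in LOCI if a[locus] != b[locus])


isolate_a = dict(zip(LOCI, (1, 1, 1, 1, 1, 1, 1)))
isolate_b = dict(zip(LOCI, (1, 1, 2, 1, 1, 1, 1)))
isolate_c = dict(zip(LOCI, (9, 9, 9, 9, 9, 9, 9)))

print(assign_st(isolate_a), assign_st(isolate_b), assign_st(isolate_c))
print(loci_differing(isolate_a, isolate_b))
```

An isolate whose profile is not in the table comes back as None here, which corresponds to a novel combination of alleles.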

Almost all of the genomic epidemiology studies took steps to remove SNPs introduced via recombination, in order to capture the underlying signals of vertical inheritance that are so important for transmission studies. Popular tools were BratNextGen, Gubbins and ClonalFrameML, which were all published within the last 3 years.
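The detection of recombinant regions is the hard part and is what those tools are for. The step that follows is simple by comparison: drop any SNP that falls inside a region flagged as recombinant. Here is a rough Python sketch of that masking step, assuming the regions have already been exported as 1-based inclusive (start, end) coordinates on the reference. The function names and coordinates are invented for illustration, and real tools generally report regions per branch or per isolate rather than as one global list.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end)


def merge_regions(regions: Iterable[Region]) -> List[Region]:
    """Merge overlapping or directly adjacent regions into a sorted list."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def drop_recombinant_snps(snp_positions: Iterable[int],
                          regions: Iterable[Region]) -> List[int]:
    """Keep only SNP positions that lie outside every recombinant region."""
    merged = merge_regions(regions)
    starts = [start for start, _ in merged]
    kept: List[int] = []
    for pos in snp_positions:
        # Find the last region starting at or before this position
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= merged[i][1]:
            continue  # inside a recombinant region, so masked
        kept.append(pos)
    return kept


# Illustrative coordinates only
snp_positions = [120, 1500, 1510, 1620, 9000]
recombinant_regions = [(1400, 1550), (1500, 1700)]
kept = drop_recombinant_snps(snp_positions, recombinant_regions)
print(kept)
```

Dense clusters of SNPs like the three around position 1500 in the toy data are the typical signature of an imported segment, which is why removing them matters so much for branch lengths and dating.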

For pan-genome analysis, Velvet and SPAdes were the most popular tools for bacterial genome assembly, Prokka and Prodigal were the usual choices for gene annotation, and LS-BSR and related approaches were commonly used to cluster orthologous groups of genes.

Most of this won’t be any surprise to people working in bacterial genomic epi, but I think it’s great to see that consensus is emerging on how best to do these sorts of analyses, and to at last have some reliable tools for detecting and accounting for recombination.

The area of least agreement remains SNP calling, which mainly comes down to which read mapper to use, and which SNP calling algorithm and filtering to go with. This is a complex area, as highlighted in the recent review “Best practices for evaluating single nucleotide variant calling methods for microbial genomics” in Frontiers in Genetics, which did a very thorough job of examining the issues that need to be considered, but (quite deliberately) doesn’t provide an answer to “which tool is best?”.

Although there is still no real consensus on exact methods for SNP calling, I think most of the approaches people are using (i.e. a good, stable read mapper followed by SNP calling with an established tool like SAMtools or GATK, with some basic filtering to remove low-evidence or ambiguous calls) end up with very similar answers, as we saw with the NGS outbreak analysis challenge session held at the ASM NGS meeting in September 2015. All in all, it seems to me that the use of genomics for public health & diagnostic microbiology is in far better shape in this respect than clinical human genomics, which is going through something of a crisis involving wide discrepancies in variant calling as well as uncertainty around data interpretation.
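As an illustration of what “basic filtering” usually amounts to, here is a small Python sketch that keeps a SNP call only if it has adequate quality, adequate read depth, and a clear majority of reads supporting the alternate base. The record layout and the threshold values are my own assumptions for the example, not the settings of any particular paper or tool; in practice these values would be read from a VCF produced by the SNP caller.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SnpCall:
    position: int    # 1-based position on the reference
    ref: str         # reference base
    alt: str         # alternate base called in the isolate
    quality: float   # Phred-scaled call quality
    depth: int       # total reads covering the site
    alt_reads: int   # reads supporting the alternate base


def passes_filters(call: SnpCall,
                   min_quality: float = 30.0,
                   min_depth: int = 10,
                   min_alt_fraction: float = 0.9) -> bool:
    """Return True if the call survives the (illustrative) thresholds."""
    if call.quality < min_quality:
        return False                      # low-confidence call
    if call.depth < max(min_depth, 1):
        return False                      # too few reads (also avoids /0)
    # Mixed or ambiguous sites have a low alternate-allele fraction
    return call.alt_reads / call.depth >= min_alt_fraction


def filter_calls(calls: Iterable[SnpCall]) -> List[SnpCall]:
    """Apply the default thresholds to a collection of calls."""
    return [call for call in calls if passes_filters(call)]


# Made-up calls to show each filter in action
calls = [
    SnpCall(101, "A", "G", quality=60.0, depth=40, alt_reads=40),  # kept
    SnpCall(202, "C", "T", quality=12.0, depth=35, alt_reads=35),  # low quality
    SnpCall(303, "G", "A", quality=55.0, depth=4, alt_reads=4),    # low depth
    SnpCall(404, "T", "C", quality=50.0, depth=30, alt_reads=16),  # ambiguous
]
kept_calls = filter_calls(calls)
print([call.position for call in kept_calls])
```

Different groups pick different cut-offs, but for a haploid bacterial genome the three checks above capture the general shape of the filtering most pipelines apply.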
