Just playing with ANDI, for building whole genome trees rapidly from large sets of genome assemblies
Code: https://github.com/evolbioinf/andi/
Paper: http://bioinformatics.oxfordjournals.org/content/31/8/1169
It seems really great, and lightning fast (few seconds on ~100 genomes).
The output is a phylip distance file, which is not something I normally work with, having spent the last 5+ years inferring trees with ML from DNA alignments and not thinking much about distance based methods!
So I wanted to figure out how to handle this output simply in R.
The answer is easy, code below… the only tricky thing is that there were a couple of genomes in my genome set that returned NaN for pairwise distances with other genomes, which need to be removed before you can infer a tree.
Note there are lots more tree inference options in the ape library, nj is just one example.
# function to remove rows with unknowns
dropNA <- function(d) {
na <- apply(d,1,function(x){sum(x=="NaN")})
while (max(na)>0) {
na.drop <- c(1:length(na))[na==max(na)]
d <- d[-na.drop,-na.drop]
na <- apply(d,1,function(x){sum(x=="NaN")})
}
d
}
# read distance file output by ANDI
d<-read.table("andi.phy",header=F,skip=1,row.names=1)
colnames(d) <- rownames(d)
# remove rows with unknowns
d <- dropNA(d)
# infer tree from distance
t<-nj(as.dist(d))
# write tree
write.tree(t,file="tree.nwk")