R code to infer tree from ANDI output

Just playing with ANDI, for building whole genome trees rapidly from large sets of genome assemblies
Code: https://github.com/evolbioinf/andi/
Paper: http://bioinformatics.oxfordjournals.org/content/31/8/1169

It seems really great, and lightning fast (few seconds on ~100 genomes).

The output is a phylip distance file, which is not something I normally work with, having spent the last 5+ years inferring trees with ML from DNA alignments and not thinking much about distance based methods!

So I wanted to figure out how to handle this output simply in R.

The answer is easy, code below… the only tricky thing is that there were a couple of genomes in my genome set that returned NaN for pairwise distances with other genomes, which need to be removed before you can infer a tree.

Note there are lots more tree inference options in the ape library, nj is just one example.


# function to remove rows with unknowns 
dropNA <- function(d) {
na <- apply(d,1,function(x){sum(x=="NaN")})
while (max(na)>0) {
na.drop <- c(1:length(na))[na==max(na)]
d <- d[-na.drop,-na.drop]
na <- apply(d,1,function(x){sum(x=="NaN")})
}
d
}

# read distance file output by ANDI
d<-read.table("andi.phy",header=F,skip=1,row.names=1)
colnames(d) <- rownames(d)

# remove rows with unknowns
d <- dropNA(d)

# infer tree from distance
t<-nj(as.dist(d))

# write tree
write.tree(t,file="tree.nwk")

Comment