Bio.Phylo: Phylogenetic trees and networks

The Bio.Phylo module provides data types and methods for handling phylogenetic trees and networks.

Phylogenies

# Bio.Phylo.Phylogeny - Type.

Phylogeny represents a phylogenetic tree.

The type is parametric with two parameters C and B.

This is because it is common to annotate the tips, clades, and branches of a phylogeny with data, either to build a richer model of evolution or to do other things, such as dictating aesthetic values when plotting.

Type parameter C dictates what datatype can be stored in the phylogeny to annotate clades and tips. Type parameter B dictates what datatype can be stored in the phylogeny to annotate branches. Think C for clades and B for branches.

Constructors

You can create a very simple unresolved phylogeny (a star phylogeny) by providing the tips as a vector of strings or a vector of symbols.

tips = [:A, :B, :C]
tree = Phylogeny(tips)
Bio.Phylo.Phylogeny{Float64,Bio.Phylo.BasicBranch}({5, 3} directed graph,Bio.Indexers.Indexer{Int64}(Dict(:C=>3,:B=>2,:A=>1,:Root=>4),Symbol[:A,:B,:C,:Root]),Float64[],Dict{Pair{Int64,Int64},Bio.Phylo.BasicBranch}(),3,false,true)
tips = ["A", "B", "C"]
tree = Phylogeny(tips)
Bio.Phylo.Phylogeny{Float64,Bio.Phylo.BasicBranch}({5, 3} directed graph,Bio.Indexers.Indexer{Int64}(Dict(:C=>3,:B=>2,:A=>1,:Root=>4),Symbol[:A,:B,:C,:Root]),Float64[],Dict{Pair{Int64,Int64},Bio.Phylo.BasicBranch}(),3,false,true)

Roots

You can test whether a phylogeny is rooted or re-rootable, and get its root vertex. You can also test whether a given vertex of a phylogeny is the root.

# Bio.Phylo.isrooted - Function.

isrooted(x::Phylogeny)

Test whether a Phylogeny is rooted.

Examples

isrooted(my_phylogeny)

# Bio.Phylo.isrerootable - Function.

isrerootable(x::Phylogeny)

Test whether a Phylogeny is re-rootable.

Examples

isrerootable(my_phylogeny)

# Bio.Phylo.root - Function.

Get the vertex that represents the root of the tree.

isrooted(tree)
false
isrerootable(tree)
true
root(tree)
4

Divergence time estimation

Phylo has a submodule called Dating which contains methods for divergence time estimation between sequences.

Dating methods

Currently Phylo.Dating has two types, which are passed as function arguments to dictate how coalescence times are computed. Both inherit from the abstract type DatingMethod.

# Bio.Phylo.Dating.SimpleEstimate - Type.

A very simple expected divergence time estimate. It assumes a strict molecular clock, under which the divergence time is equal to:

$ t = d / (2\mu) $

Where $d$ is the evolutionary distance computed for two aligned sequences, and $\mu$ is the substitution rate.
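As a rough illustration of the formula above, here is a minimal Julia sketch. It is not the Bio.Phylo.Dating implementation: the function name `simple_divergence_time` and the example values are hypothetical, chosen only to show the arithmetic.

```julia
# Hypothetical helper (not part of Bio.Phylo.Dating): expected divergence
# time under a strict molecular clock, t = d / (2μ).
function simple_divergence_time(d::Real, mu::Real)
    mu > 0 || throw(ArgumentError("substitution rate must be positive"))
    return d / (2 * mu)
end

# Illustrative values only (assumptions): a distance of 0.02 substitutions
# per site and a rate of 1e-8 substitutions per site per generation.
t = simple_divergence_time(0.02, 1e-8)
```

With these assumed values, `t` comes out at roughly one million generations. The factor of two reflects that both lineages accumulate substitutions after they diverge.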

# Bio.Phylo.Dating.SpeedDating - Type.

SpeedDate is the name given to a method of estimating a divergence time between two DNA sequence regions that was first implemented in the R package HybridCheck in order to date regions of introgression in large sequence contigs.

The coalescence time is estimated from the number of mutations that have occurred between two aligned sequences. The calculation uses a strict molecular clock, which assumes a constant substitution rate both through time and across taxa. Modelling mutation accumulation as a series of $n$ independent Bernoulli trials, the probability of observing $k$ or fewer mutations between two sequences of length $n$ is given by the binomial cumulative distribution:

$ Pr(X \le k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^i (1 - p)^{n-i} $

Where $p$ is the probability of observing a single mutation between the two aligned sequences. The value of $p$ depends on two key factors: the substitution rate and the coalescence time. If you assume a molecular clock, whereby two DNA sequences are both accumulating mutations at a rate $\mu$ for $t$ generations, then you may define $p = 2\mu t$.

Using these assumptions, the SpeedDate method finds the root of the following function with respect to $2\mu t$ for $Pr(X \le k) = 0.05$, $0.5$, and $0.95$, and then divides each of the three roots by twice the assumed substitution rate.

$ f(n, k, 2\mu t, Pr(X \le k)) = \left( \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} (2\mu t)^i (1 - 2\mu t)^{n-i} \right) - Pr(X \le k) $

This results in an upper, middle, and lower estimate of the coalescence time $t$ of the two sequences (expressed as the number of generations).
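The calculation described above can be sketched in plain Julia as follows. This is an illustration under stated assumptions, not the Bio.Phylo.Dating or HybridCheck code. The names `binomial_cdf`, `find_p`, and `speed_date_sketch` are hypothetical, and bisection is used here only because it is simple; the source does not say which root-finding method the package uses.

```julia
# Pr(X <= k) for X ~ Binomial(n, p), built term by term so that no large
# binomial coefficients are needed.
function binomial_cdf(n::Integer, k::Real, p::Real)
    kmax = floor(Int, k)
    kmax < 0 && return 0.0
    kmax >= n && return 1.0
    p >= 1 && return 0.0            # every site differs, so X = n > k
    term = (1 - p)^n                # the i = 0 term
    total = term
    for i in 0:(kmax - 1)
        term *= (n - i) / (i + 1) * p / (1 - p)   # term i -> term i + 1
        total += term
    end
    return total
end

# Find p = 2μt such that Pr(X <= k) = q. The CDF decreases as p grows,
# so bisection on [0, 1] is enough (assumes 0 <= k < n and 0 < q < 1).
function find_p(n::Integer, k::Real, q::Real; tol::Real = 1e-12)
    lo, hi = 0.0, 1.0
    while hi - lo > tol
        mid = (lo + hi) / 2
        if binomial_cdf(n, k, mid) > q
            lo = mid                # CDF still too high: p must be larger
        else
            hi = mid
        end
    end
    return (lo + hi) / 2
end

# Upper, middle, and lower estimates of t, in that order.
function speed_date_sketch(n::Integer, k::Real, mu::Real)
    return map(q -> find_p(n, k, q) / (2 * mu), (0.05, 0.5, 0.95))
end

# Illustrative values only: 10 mutations over 1000 aligned sites,
# with an assumed rate of 1e-8 per site per generation.
upper, middle, lower = speed_date_sketch(1000, 10, 1e-8)
```

Because the cumulative probability falls as $p$ rises, the $0.05$ quantile yields the largest $p$ and therefore the upper estimate of $t$, while the $0.95$ quantile yields the lower estimate.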
