Ggtree

From EEBedia
Revision as of 20:30, 10 April 2018 by Kevin Keegan

EEB 5349: Phylogenetics

by Kevin Keegan

Goals

To introduce you to the R package ggtree for plotting phylogenetic trees.

Introduction

Getting Started

You will need R version 3.4 or greater. Download the tree file moths.txt and save it in a convenient place on your hard drive.

Installing and Loading Packages

See what versions of R are available:

module avail

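Load R. The command below loads version 3.3.1; since this tutorial calls for R 3.4 or greater, substitute a suitable version from the list if one is available: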

module load R/3.3.1

Start R

R

You will need to install the following packages:

BiocInstaller
ape
Biostrings
ggplot2
ggtree
phytools
ggrepel
stringr
stringi
abind
treeio

You can install a single package from CRAN like so:

install.packages("ape")

Or install multiple packages like so:

install.packages(c("ape", "phytools"))

Several of the above packages (Biostrings, ggtree and treeio, for example) are part of the Bioconductor project and are installed through Bioconductor's own installer rather than with a plain install.packages call. You can find installation instructions and extensive documentation for these packages on the Bioconductor website.
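
Because the package list mixes CRAN and Bioconductor packages, it can help to install them in two batches. The following is a minimal sketch, not the only way to do it. It assumes a recent R release on which the BiocManager package is available from CRAN, and it leaves BiocInstaller out on the assumption that BiocManager takes its place. The function name install_tutorial_packages is made up for this example.

```r
# Packages from the list above, split by where they are hosted.
# Assumption: Biostrings, ggtree and treeio come from Bioconductor; the rest come from CRAN.
cran_pkgs <- c("ape", "ggplot2", "phytools", "ggrepel", "stringr", "stringi", "abind")
bioc_pkgs <- c("Biostrings", "ggtree", "treeio")

# Hypothetical helper: installs only what is not already present
install_tutorial_packages <- function() {
  missing_cran <- setdiff(cran_pkgs, rownames(installed.packages()))
  if (length(missing_cran) > 0) install.packages(missing_cran)

  # BiocManager is assumed to be the Bioconductor installer for your R version
  if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  missing_bioc <- setdiff(bioc_pkgs, rownames(installed.packages()))
  if (length(missing_bioc) > 0) BiocManager::install(missing_bioc)
}

# Run install_tutorial_packages() once to install whatever is missing.
```

Nothing is installed until you call the function yourself, so you can check the two package lists first.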

Now load each of the above packages with the library function, one call per package, like so:

library("ape")
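
If you would rather not type one library call per package, a loop can do it. This is only a sketch: the vector name pkgs is made up here, and BiocInstaller is left out on the assumption that it is only needed for installing.

```r
pkgs <- c("ape", "Biostrings", "ggplot2", "ggtree", "phytools",
          "ggrepel", "stringr", "stringi", "abind", "treeio")

# Report any package that is not installed instead of stopping at the first failure
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
}

# library() needs character.only = TRUE when the package name is held in a variable
for (p in setdiff(pkgs, not_installed)) {
  library(p, character.only = TRUE)
}
```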

Read in the Tree File

We're dealing with a tree in the Newick file format, which the function read.newick from the package treeio can handle:

tree <- read.newick("moths.txt")

R can handle more than just Newick formatted tree files. To see which other file formats from the various phylogenetic software R can handle, check out treeio. Note: the functionality within treeio used to be part of the ggtree package itself, but the authors recently split ggtree in two, with one part (ggtree) handling mostly plotting and the other part (treeio) handling mostly file input/output operations.
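
If you have never looked inside a Newick file, it is just nested parentheses. Here is a small sketch that builds a toy tree from a Newick string using read.tree from ape. The four taxa and their branch lengths are made up and have nothing to do with moths.txt.

```r
library(ape)

# A made-up four-taxon tree in Newick format; names and branch lengths are illustrative only
newick_text <- "((A:1,B:1):1,(C:1,D:1):1);"
toy <- read.tree(text = newick_text)

print(toy)        # summary of the tree: number of tips and internal nodes
toy$tip.label     # the tip labels stored in the tree object
```

The same kind of object comes back when you read a tree from a file, so anything you try on the toy tree can be tried on the real one.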

Let's quickly plot the tree to see what it looks like using the plot function from the ape package:

plot(tree)

Notice the tree has all of its tips labeled. It's also a little cramped. You can expand the plot window to try to get the tree to display more legibly. We'll eventually use the function ggsave (from the package ggplot2) to control the dimensions of the plot when we finally export it to a PDF file. But until then, you'll need to expand the plot window to get the tree to display reasonably well. Don't worry about getting it to fit at this moment.

Now plot the tree using the ggtree package:

ggtree(tree)

What happened to our tree!? The plot function from the ape package plotted the tree with tip labels, but ggtree plotted just the bare bones of the tree. By default ggtree plots almost nothing, on the assumption that you will add what you want to your tree plot. You add elements to the plot using geoms, in the same way that you would add elements to plots using the package ggplot2. The grammar/logic of ggtree is meant to model that of ggplot2 and not the R language in general. The syntax of ggtree/ggplot2 makes them easily extensible, but by no means intuitive to someone used to R and plotting trees using ape. To see the geoms available to ggtree, check out its reference manual on Bioconductor.
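
To make the "add what you want" idea concrete, here is a rough sketch of building a plot one layer at a time. It uses a made-up four-taxon tree rather than moths.txt, and the object names toy and p are just for illustration.

```r
library(ape)
library(ggtree)

toy <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")  # illustrative tree

p <- ggtree(toy)            # bare tree, no labels
p <- p + geom_tiplab()      # add a layer of tip labels
p <- p + geom_tippoint()    # add a layer with a point at each tip
print(p)                    # draw the finished plot
```

Because the plot is stored in an object, you can keep adding geoms to it with + until it shows what you need.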

Adding/Altering Tree Elements with Geoms

Tip Labels

OK, this tree would be more useful with tip labels. Let's add them using geom_tiplab:

ggtree(tree)+geom_tiplab()

Those tip labels are nice but a little big. geom_tiplab has a bunch of arguments that you can play around with, including one for the text size. You can read more about the available arguments in the ggtree manual. Plot the tree again but with smaller labels:

ggtree(tree)+geom_tiplab(size=3.5)

This tree is a little tall. It might be better to do a circular tree:

ggtree(tree, layout="circular")+geom_tiplab2(size=3.5)

OK, that's a bit easier to work with. Notice we are using geom_tiplab2 and not geom_tiplab to show labels on the circular tree. Don't ask me why there are two different tip label geoms for different tree layouts :)

Clade Colors

In order to color (and later label) clades, we need to tell ggtree which nodes subtend each clade of interest. Just like with the plot function in ape, you can plot a tree with node numbers, see which nodes subtend the clade of interest, and then tell ggtree the nodes that define the clades. Another way to get your node of interest is to use the findMRCA function (find most recent common ancestor) from the phytools package. We will pass the function two tip labels that define each clade of interest.

In their study, Keegan et al. (in review) found that the Amphipyrinae (as currently classified) is polyphyletic -- astoundingly polyphyletic. Let's color two clades: one for what they found to be true Amphipyrinae, and one for a tribe (Stiriini) currently classified in Amphipyrinae that they show to be far removed phylogenetically, and that thus has no business being in Amphipyrinae.

amphipyrinae_clade <- findMRCA(tree, c("*Redingtonia_alba_KLKDNA0031","MM01162_Amphipyra_perflua"))
stiriini_clade <- findMRCA(tree, c("*Chrysoecia_scira_KLKDNA0002","*Annaphila_diva_KLKDNA0180"))
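
If you want to see both approaches side by side, here is a small sketch on a made-up four-taxon tree (not moths.txt). It plots the node numbers with geom_text and then looks up two nodes with findMRCA. The names toy, node_ab and node_cd are just for illustration.

```r
library(ape)
library(ggtree)
library(phytools)

toy <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")  # illustrative tree

# Approach 1: plot the tree with every node number printed next to its node
p_nodes <- ggtree(toy) +
  geom_tiplab(offset = 0.2) +
  geom_text(aes(label = node), hjust = -0.3)
print(p_nodes)

# Approach 2: ask for the most recent common ancestor of two tips
node_ab <- findMRCA(toy, c("A", "B"))
node_cd <- findMRCA(toy, c("C", "D"))
print(c(node_ab, node_cd))
```

The numbers returned by findMRCA should match the numbers printed at the corresponding internal nodes in the plot.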

You can't (as far as I know) tell ggtree directly, as in ape, that the lineages descending from a given node should all be a certain color. What we need to do is define a group that consists of the clades we want colored, and then tell ggtree to color the tree according to that group.

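# The node numbers are passed by position, since the name of this argument differs between package versions
tree <- groupClade(tree, c(amphipyrinae_clade, stiriini_clade), group_name = "group")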

In the above line of code, we apply the groupClade function to the object tree and assign the result back to tree. This does not prune tree down to only the Amphipyrinae and Stiriini clades; the whole tree is kept, with the grouping information attached to it. If you now execute ggtree(tree, layout="circular")+geom_tiplab2(size=3.5), the plot will still look the same. We need to amend the command to tell it to style the tree by the grouping of clades we just made, called "group":

ggtree(tree, layout="circular",aes(color=group, linetype="solid"))+geom_tiplab2(size=3.5)

As you can see the tree gets colored according to some default color scheme. We can define our own color scheme. Let's call it "palette":

palette <- c("#000000", "#009E73","#e5bc06")

The values in palette are colors written as hexadecimal codes. You can Google one of these hexadecimal values and a little interactive hexadecimal color picker will pop up. Feel free to pick two colors of your choosing to use in the palette -- but leave #000000 as it is. When you're designing a figure for publication, be sure to consider how easily your colors can be distinguished from each other by colorblind folks.
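
If the hexadecimal codes look opaque, base R can translate them into red, green and blue values for you. A quick sketch, where the name palette_hex is made up to keep it apart from the palette object:

```r
# Same three colors as above, under a different (hypothetical) name
palette_hex <- c("#000000", "#009E73", "#e5bc06")

# col2rgb returns one column per color, with rows for red, green and blue (0 to 255)
rgb_values <- col2rgb(palette_hex)
print(rgb_values)
```

Each pair of hexadecimal digits is one channel, so #009E73 has no red, a lot of green, and a moderate amount of blue.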

Now let's amend the ggtree command and tell it to use the colors we defined:

ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + geom_tiplab2(size=3.5) + scale_colour_manual(values = palette)

The order in which clades are colored is determined by the order of the nodes in the groupClade command. Every lineage in the tree not within a defined clade (i.e. not within amphipyrinae_clade or stiriini_clade) is automatically colored according to the first palette value. The first defined clade (amphipyrinae_clade) is colored according to the second palette value, and the second defined clade (stiriini_clade) is colored according to the third palette value.
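
If relying on the order makes you nervous, one option is to name the colors after the group levels. The sketch below does this on a made-up four-taxon tree. It assumes that groupClade stores the grouping as an attribute of the tree and labels the groups "0", "1" and "2" (with "0" for everything outside the defined clades), so check the levels on your own tree before trusting the names.

```r
library(ape)
library(ggtree)

toy <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")  # illustrative tree
node_ab <- getMRCA(toy, c("A", "B"))
node_cd <- getMRCA(toy, c("C", "D"))

grouped <- groupClade(toy, c(node_ab, node_cd), group_name = "group")

# Assumption: the grouping is stored as an attribute named after group_name
print(levels(attr(grouped, "group")))

# Assumption: the levels are "0" (ungrouped), "1" (first node) and "2" (second node)
clade_colours <- c("0" = "#000000", "1" = "#009E73", "2" = "#e5bc06")

p_grouped <- ggtree(grouped, aes(color = group)) +
  geom_tiplab() +
  scale_colour_manual(values = clade_colours)
print(p_grouped)
```

With named values, scale_colour_manual matches colors to groups by name rather than by position.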

Clade Labels

Let's add some labels to the two clades. It's relatively straightforward now that we've already defined the subtending nodes:

ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + 
geom_tiplab2(size=3.5) + 
scale_colour_manual(values = palette) +
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae") +
geom_cladelabel(node=stiriini_clade, label="Stiriini")

Node Labels

Scale Bar
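
One possible way to add a scale bar is the geom_treescale geom from ggtree. This is only a sketch on a made-up four-taxon tree with a rectangular layout; how the scale bar looks on the circular moth tree has not been checked here.

```r
library(ape)
library(ggtree)

toy <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")  # illustrative tree

# geom_treescale adds a bar showing branch length; the defaults pick its position and width
p_scale <- ggtree(toy) + geom_tiplab() + geom_treescale()
print(p_scale)
```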

Export Plot to PDF
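
A minimal sketch of exporting with ggsave from ggplot2, which lets you set the page dimensions instead of resizing the plot window. The toy tree, the output path and the 10 by 10 inch size are all placeholders; for the full moth tree you will likely want a larger page.

```r
library(ape)
library(ggplot2)
library(ggtree)

toy <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")  # illustrative tree
p <- ggtree(toy) + geom_tiplab()

# Hypothetical output location; replace with a path of your choosing
out_file <- file.path(tempdir(), "toy_tree.pdf")

# The file extension selects the PDF device; width and height set the page size
ggsave(out_file, plot = p, width = 10, height = 10, units = "in")
```

Passing the plot object explicitly with plot = p means you do not have to display the plot first.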

Cite ggtree

citation("ggtree")

References

Yu G, Smith D, Zhu H, Guan Y and Lam TT (2017). “ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.” Methods in Ecology and Evolution, 8, pp. 28-36. doi: 10.1111/2041-210X.12628, http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract.