EEB 5349: Phylogenetics
by Kevin Keegan
- 1 Goals
- 2 Introduction
- 3 Getting Started
- 4 Running ggtree on your Computer
- 5 Getting Help
- 6 References
Goals
To introduce you to the R package ggtree for plotting phylogenetic trees.
This tutorial is written with the cluster user in mind, but feel free to perform it with your own local version of R (>=3.4). There are instructions at the end of this tutorial on how to get your local version of R set up for this exercise.
Get Situated on the Cluster
Log onto the cluster like normal but with an added flag to allow for any graphics to be displayed on your computer.
ssh firstname.lastname@example.org -Y
Be sure to get off the head node to avoid litigation and subsequent incarceration:
Navigate to the folder you want to be working in for the R portion of the lab and download the tree file we'll be working with:
For more information on the curl command and what options you can use with it consult Wikipedia
Start R and Load Packages
See what versions of R are available:
Load R version 3.4.4
module load R/3.4.4
You'll need to load the following packages:
BiocInstaller Biostrings ape ggplot2 ggtree phytools ggrepel stringr stringi abind treeio
You can load packages like so:
If the package easypackages were installed and loaded, you could load packages like so:
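Here is a minimal sketch of one way to load everything in a single step using only base R. The variable names `pkgs` and `installed` are made up for this example. BiocInstaller is left out because it is only needed for installing, not for plotting.

```r
# Packages named in this tutorial (BiocInstaller omitted: install-time only)
pkgs <- c("Biostrings", "ape", "ggplot2", "ggtree", "phytools",
          "ggrepel", "stringr", "stringi", "abind", "treeio")

# Check which packages are installed, so a missing one is reported clearly
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach the packages that are available
invisible(lapply(pkgs[installed], library, character.only = TRUE))
```

The easypackages package offers a `libraries()` function for the same purpose; check its help page for the exact call form.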
Read in the Tree File
We're dealing with a tree in the Newick file format, which the function read.newick from the package treeio can handle:
tree <- read.newick("moths.txt")
R can handle more than just Newick formatted tree files. To see which other file formats from the various phylogenetic software packages R can handle, check out treeio. The functionality within treeio used to be part of the ggtree package itself, but the authors recently split ggtree in two, with one part (ggtree) handling mostly plotting and the other part (treeio) handling mostly file input/output operations.
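If you want to see what read.newick gives you back without the course tree file, here is a small sketch. The four-tip Newick string and the temporary file are stand-ins made up for this example; they are not the moth data.

```r
library(treeio)

# Hypothetical four-tip tree standing in for moths.txt
newick <- "((A:1,B:1)90:1,(C:1,D:1)60:1);"
tmp <- tempfile(fileext = ".txt")
writeLines(newick, tmp)

# Read the file the same way the tutorial reads moths.txt
toy <- read.newick(tmp)

class(toy)        # the kind of object returned
toy$tip.label     # names at the tips
toy$node.label    # labels on internal nodes (support values live here)
```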
Let's quickly plot the tree to see what it looks like using the plot function from the ape package:
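With the tutorial tree, the call is presumably just `plot(tree)`. The sketch below does the same thing on a made-up four-tip tree so it can be run anywhere; `toy` and the `cex` value are assumptions for this example.

```r
library(ape)

# Hypothetical toy tree built from a Newick string
toy <- read.tree(text = "((A:1,B:1)90:1,(C:1,D:1)60:1);")

# plot() dispatches to ape's plotting method for phylo objects,
# which draws tip labels by default; cex shrinks the label text
plot(toy, cex = 0.8)
```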
Notice the tree has all of its tips labeled. It's also a little cramped. You can expand the plot window to try to get the tree to display more legibly. We'll eventually use the function ggsave to control the dimensions of the plot when we finally export it to a PDF file. Don't worry about getting it to display well at the moment.
Now plot the tree using the ggtree package:
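With the tutorial tree, the bare call is presumably `ggtree(tree)`. Here is the same idea sketched on a made-up four-tip tree (`toy` and `p` are names invented for this example):

```r
library(ape)
library(ggtree)

# Hypothetical toy tree
toy <- read.tree(text = "((A:1,B:1)90:1,(C:1,D:1)60:1);")

# ggtree() with no added layers draws only the branches
p <- ggtree(toy)
print(p)
```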
What happened to our tree!? The plot function from the ape package plotted the tree with tip labels, but ggtree plotted just the bare bones of the tree. ggtree by default plots almost nothing, assuming you will add what you want to your tree plot. The grammar/logic of ggtree is meant to model that of ggplot2 and not the R language in general. The syntax of ggtree/ggplot2 makes them easily extendable and particularly useful for graphics, but is by no means intuitive to someone used to R and plotting trees using ape.
Adding/Altering Tree Elements with Geoms and Geom-Like Functions
ggtree has a variety of functions available to you that allow you to add different elements to a tree. Many of them have the prefix "geom_" and are collectively referred to as geoms. We'll only go over some of them. You start with a bare bones tree and add elements to the tree, function by function, until you get the tree looking like you want it to. You'll see as we progress through this tutorial that visualizing trees in ggtree is a truly additive process.
OK this tree would be more useful with tip labels. Let's add them using geom_tiplab:
ggtree(tree) + geom_tiplab()
This tree is a little crowded. You can expand the graphics window vertically to get it all to fit, but it might be better to do a circular tree:
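A circular tree is requested through the layout argument of ggtree. Here is a sketch on a made-up five-tip tree; with the tutorial tree you would substitute `tree` for `toy`.

```r
library(ape)
library(ggtree)

# Hypothetical toy tree
toy <- read.tree(text = "(((A:1,B:1):1,(C:1,D:1):1):1,E:3);")

# Same tree, drawn in a circle; geom_tiplab2 is the tip label geom
# this tutorial uses for circular layouts
p_circ <- ggtree(toy, layout = "circular") + geom_tiplab2()
print(p_circ)
```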
OK that's a bit easier to work with. Those tip labels are nice but a little big. geom_tiplab has a bunch of arguments that you can play around with, including one for the text size. You can read more about the available arguments for a given function in the ggtree manual. Plot the tree again but with smaller labels:
ggtree(tree, layout="circular") + geom_tiplab2(size=3.5)
Notice we are using geom_tiplab2 and not geom_tiplab to show labels on the circular tree. Don't ask me why there are two different tip label geoms for different tree layouts :)
The tree is still a little crowded, but at this point just play around with the size of the graphics window so you can work with it. We'll finalize how the tree looks later on using the ggsave function.
In order to label clades, we need to tell ggtree which nodes subtend each clade we want to label. Just like with the plot function in ape, you can plot a tree with node numbers, see which nodes subtend the clade of interest and then tell ggtree the nodes that define the clades you want to label. Another way to get your node of interest is to use the findMRCA function (find most recent common ancestor) from the phytools package. We will pass the function two tip labels as arguments that define each clade of interest. In their study, Keegan et al (in review) found the Amphipyrinae (as currently classified taxonomically) is polyphyletic -- astoundingly polyphyletic. Let's color two clades: one for what they found to be true Amphipyrinae, and one for a tribe (Stiriini) currently classified taxonomically in Amphipyrinae, that they show to be far removed phylogenetically and thus has no business being classified within Amphipyrinae.
amphipyrinae_clade <- findMRCA(tree, c("*Redingtonia_alba_KLKDNA0031","MM01162_Amphipyra_perflua"))
stiriini_clade <- findMRCA(tree, c("*Chrysoecia_scira_KLKDNA0002","*Annaphila_diva_KLKDNA0180"))
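Both ways of finding a subtending node can be sketched on a made-up five-tip tree. The tree, the tip names and the variable names are assumptions for this example; geom_text2 is ggtree's text geom that accepts a subset aesthetic, used here to number only the internal nodes.

```r
library(ape)
library(ggtree)
library(phytools)

# Hypothetical toy tree
toy <- read.tree(text = "(((A:1,B:1):1,(C:1,D:1):1):1,E:3);")

# Option 1: draw the tree with internal node numbers and read them off
p_nodes <- ggtree(toy) +
  geom_tiplab() +
  geom_text2(aes(subset = !isTip, label = node), hjust = -0.3)
print(p_nodes)

# Option 2: ask for the most recent common ancestor of two tips
ab_node <- findMRCA(toy, c("A", "B"))
cd_node <- findMRCA(toy, c("C", "D"))
```

Internal nodes are numbered after the tips, so both results are larger than the number of tips in the tree.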
You can't (as far as I know) tell ggtree directly, as in ape, that the lineages descending from a given node should all be a certain color. What we need to do is define a group that consists of the clades we want colored, and tell ggtree that it should color the tree according to the group.
tree <- groupClade(tree, c(amphipyrinae_clade, stiriini_clade), group_name = "group")
In the above line of code, we apply the groupClade function to the object tree and assign the result back to tree. This does not reduce tree to only the Amphipyrinae and Stiriini clades; it is the same tree with grouping information attached. If you were to execute ggtree(tree, layout="circular") + geom_tiplab2(size=3.5) now, the plot would still look the same. We need to amend the command to tell it to style the tree by the grouping of clades we just made called "group":
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + geom_tiplab2(size=3.5)
As you can see the tree gets colored according to some default color scheme. We can define our own color scheme. Let's call it "palette":
palette <- c("#000000", "#009E73","#e5bc06")
The values in palette are color values represented by a hexadecimal value. You can Google one of these hexadecimal values and a little interactive hexadecimal color picker will pop up. Feel free to pick two colors of your choosing to use in the palette -- but leave #000000 as it is. When you're designing a figure for publication, be sure to consider how easily your colors can be distinguished from each other by colorblind folks.
Now let's amend the ggtree command and tell it to use the colors we defined:
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + geom_tiplab2(size=3.5) + scale_colour_manual(values = palette)
The order in which clades are colored is determined by the order of clades in the groupClade command. Every lineage in the tree not within a defined clade (i.e. not within amphipyrinae_clade or stiriini_clade) is automatically colored according to the first palette value. The first defined clade (amphipyrinae_clade) is colored according to the second palette value, and the second defined clade (stiriini_clade) is colored according to the third palette value.
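The grouping and coloring steps can be tried on a made-up five-tip tree. In this sketch ape's getMRCA stands in for findMRCA to keep the dependencies small, and the tree, tip names and variable names are assumptions for this example.

```r
library(ape)
library(ggtree)
library(ggplot2)

# Hypothetical toy tree with two clades worth coloring
toy <- read.tree(text = "(((A:1,B:1):1,(C:1,D:1):1):1,E:3);")
ab_node <- getMRCA(toy, c("A", "B"))
cd_node <- getMRCA(toy, c("C", "D"))

# Attach grouping information; the nodes are passed by position
toy <- groupClade(toy, c(ab_node, cd_node))

# One palette value for ungrouped lineages, then one per clade
palette <- c("#000000", "#009E73", "#e5bc06")

p <- ggtree(toy, aes(color = group)) +
  geom_tiplab() +
  scale_colour_manual(values = palette)
print(p)
```

Swapping the order of the two nodes in the groupClade call is a quick way to check which clade picks up which palette value in your installed version.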
Let's add some labels to the two clades. It's relatively straightforward now that we've already defined the subtending nodes:
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + geom_tiplab2(size=3.5) + scale_colour_manual(values = palette) + geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae") + geom_cladelabel(node=stiriini_clade, label="Stiriini")
OK we should move those labels so they're not directly over the tree:
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + geom_tiplab2(size=3.5) + scale_colour_manual(values = palette) + geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) + geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE)
You might have noticed that adding labels caused the rest of the tree to squish together. ggtree will try to fit everything into whatever size graphics window you have open. Try playing around with expanding and contracting the graphics window to see this functionality in action. Don't worry about getting everything to display perfectly in the graphics window, because we will use the function ggsave to create a PDF -- with definable dimensions -- to control how big the plot is, and thus how the tree looks with its many elements. You may wish to go back and change some of the tree elements after seeing your figure in PDF form.
Let's add some node labels. You can add labels that show the number of the node, but what you would probably like to do is show nodal support values (e.g. bootstraps) which are stored as node labels. We can display the node labels using geom_label.
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + geom_tiplab2(size=3.5) + scale_colour_manual(values = palette) + geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) + geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE) + geom_label(aes(label=label))
You should see A LOT of node labels appear. They get redrawn when you change the size of the graphics window which is quite mesmerizing to watch. Let's subset the node labels in order to just show the ones we want and reduce some of the clutter. We'll first create a dataframe from the data within tree:
q <- ggtree(tree)
d <- q$data
First let's select only internal nodes (we don't need to show the leaf node labels, as we've already done that with geom_tiplab2):
d <- d[!d$isTip,]
Now let's get rid of the root node (in the ggtree data, the root is the one node that is its own parent):
d <- d[d$parent != d$node,]
And finally get rid of any node labels less than 75:
subset_labels <- d[as.double(d$label) > 75,]
Note that the object tree still has all of its labels. All we did was build a ggtree plot object from tree called q, copy its data into d, and keep a subset of those rows in subset_labels. Before, when we plotted the tree with node labels, we didn't specify which ones to label, so ggtree labeled all of them. Now alter your geom_label call: use the data argument available to geom_label to display the dataset you just created, consisting of a subset of node labels. Right now the only argument available to geom_label that we are using is the aes argument. Look in the ggtree manual for details on the argument that allows you to specify the data passed to geom_label.
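Here is a sketch of the same subsetting idea on a made-up four-tip tree with two support values, 90 and 60. The tree and the names `d`, `keep` and `support` are assumptions for this example; the point is only that geom_label takes a data argument and draws labels for the rows you hand it.

```r
library(ape)
library(ggtree)
library(ggplot2)

# Hypothetical toy tree; 90 and 60 are stored as node labels
toy <- read.tree(text = "((A:1,B:1)90:1,(C:1,D:1)60:1);")
p <- ggtree(toy) + geom_tiplab()

# Copy the plot data and keep internal nodes only
d <- as.data.frame(p$data)
d <- d[!d$isTip, ]

# Convert labels to numbers; unlabeled nodes (such as the root) become NA
d$support <- suppressWarnings(as.numeric(d$label))
keep <- d[!is.na(d$support) & d$support > 75, ]

# Label only the well-supported nodes by passing the subset as data
p_support <- p + geom_label(data = keep, aes(label = label))
print(p_support)
```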
Scale Bar and Title
Try adding a scale bar using the scale bar geom. I've added in some of the available arguments:
Add a title using ggtitle. Use it just like you would a geom:
ggtitle("This is a Title")
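The scale bar geom in ggtree is geom_treescale. Here is a sketch that adds both a scale bar and a title to a made-up four-tip tree; the tree and every argument value below are assumptions chosen for this example, so expect to adjust them for the moth tree.

```r
library(ape)
library(ggtree)
library(ggplot2)

# Hypothetical toy tree
toy <- read.tree(text = "((A:1,B:1)90:1,(C:1,D:1)60:1);")

p_scaled <- ggtree(toy) +
  geom_tiplab() +
  # x and y position the bar, width is its length in branch-length units
  geom_treescale(x = 0, y = 0.5, width = 0.5, offset = 0.1, fontsize = 3) +
  ggtitle("Toy tree with a scale bar")
print(p_scaled)
```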
Export Plot to PDF
ggsave cannot plot phylo objects (like tree) directly like ape can. You must first apply your ggtree function to your phylo object, and assign the result to a new variable. Let's call that variable tree_save:
tree_save <- ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + geom_tiplab2(size=3.5) + scale_colour_manual(values = palette) + geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) + geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE)
Now you can export tree_save to a PDF (width and height are in inches by default):
ggsave("moth_tree.pdf", plot=tree_save, width=30, height=30)
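To try the export step without the moth tree, here is a sketch that saves a made-up four-tip tree to a temporary PDF. The tree, the output path and the 6 by 6 inch page size are assumptions for this example.

```r
library(ape)
library(ggtree)
library(ggplot2)

# Hypothetical toy tree and plot
toy <- read.tree(text = "((A:1,B:1)90:1,(C:1,D:1)60:1);")
p <- ggtree(toy) + geom_tiplab()

# Write the plot to a PDF with explicit dimensions (inches by default)
out <- tempfile(fileext = ".pdf")
ggsave(filename = out, plot = p, width = 6, height = 6)

file.exists(out)
```

Changing width and height and re-running ggsave is usually the quickest way to see how page size affects crowding.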
If the layout of your tree just isn't quite what you wanted, go back and play around with the geoms and geom-like functions until the PDF is to your liking.
Remember to cite ggtree if you use it in a published work!
Running ggtree on your Computer
You will need to install the following packages:
BiocInstaller Biostrings ape ggplot2 ggtree phytools ggrepel stringr stringi abind treeio
The package BiocInstaller is special. You can think of it as a meta-package, as it is used to handle the installation and interoperability of a suite of closely related open-source bioinformatics packages.
Install BiocInstaller like so:
You can, and probably should, install Bioconductor packages using BiocInstaller, and not through the regular install.packages("package_name") method. To install packages via Bioconductor:
Or install multiple packages like so:
Now load all of the above packages like so:
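Here is a hedged sketch of the install step, wrapped in a function so that nothing is downloaded until you call it. The function name and the two package vectors are made up for this example. The split assumes Biostrings, ggtree and treeio come from Bioconductor and the rest from CRAN. The R 3.5+ branch uses BiocManager, which is not mentioned in this tutorial: it replaced BiocInstaller in later Bioconductor releases. The older biocLite route may no longer work on current installations.

```r
cran_pkgs <- c("ape", "ggplot2", "phytools", "ggrepel",
               "stringr", "stringi", "abind")
bioc_pkgs <- c("Biostrings", "ggtree", "treeio")

# Hypothetical helper: installs everything when called, does nothing on definition
install_tutorial_packages <- function() {
  install.packages(cran_pkgs)
  if (getRversion() >= "3.5.0") {
    # Assumption: newer R, where BiocManager is the Bioconductor installer
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install(bioc_pkgs)
  } else {
    # R 3.4 era route via BiocInstaller's biocLite()
    source("https://bioconductor.org/biocLite.R")
    biocLite(bioc_pkgs)
  }
}

# Run this line yourself when you are ready to install:
# install_tutorial_packages()
```

Once the packages are installed, load them with one library() call per package, for example library(ggtree).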
Getting Help
The Google Group for ggtree is fairly active. The lead author of ggtree chimes in regularly to answer people's questions -- just be sure you've read the documentation first!
References
Yu G, Smith D, Zhu H, Guan Y and Lam TT (2017). “ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.” Methods in Ecology and Evolution, 8, pp. 28-36. doi: 10.1111/2041-210X.12628, http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract.