http://hydrodictyon.eeb.uconn.edu/eebedia/api.php?action=feedcontributions&user=Paul+Lewis&feedformat=atomEEBedia - User contributions [en]2019-06-20T15:14:06ZUser contributionsMediaWiki 1.25.2http://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Molecular_systematics_Spring_2018&diff=40143Molecular systematics Spring 20182019-03-07T14:14:08Z<p>Paul Lewis: </p>
<hr />
<div>2 credits, half-semester module, 19 March-29 April 2018 <br />
<br />
'''Lectures:''' <br/><br />
M & W 11:00-12:15, Bio-Pharm 3rd floor conference room. <br />
<br />
'''Labs:'''<br/><br />
M 2:30-4:30; Th 2:00-4:00 (Each lab session starts in 3rd floor conference room then moves to BioPharm 325). <br />
<br />
<br />
'''Instructor:'''<br/><br />
Chris Simon, Biopharm 305D, 6-4640, <chris.simon@uconn.edu><br />
Graduate Assistant: Katie Taylor, TLS 479, Katie.taylor@uconn.edu, 6-5479<br />
<br />
'''Readings:''' will be posted as PDFs. <br/><br />
<br />
Optional reference books: 1) Paul Lewis's unpublished text; 2) The Phylogenetic Handbook (eds. Philippe Lemey, Marco Salemi, and Anne-Mieke Vandamme, 2010); 3) Inferring Phylogenies (Felsenstein 2004, Sinauer); 4) Molecular Evolution: A Phylogenetic Approach (Page & Holmes 1998, Blackwell); 5) Molecular Systematics, 2nd ed. (Hillis, Moritz & Mable, eds. 1996, Sinauer), especially Chapter 11 by Swofford et al. on Phylogenetic Inference. <br />
<br />
'''Lecture Goals:''' The course will focus on the basics of molecular systematics theory and practice from the point of view of the data. We will explore the ways in which an understanding of processes of evolution of molecular data can help in the construction of evolutionary trees. Lectures will examine some of the most serious problems in evolutionary tree construction: nucleotide bias, alignment, homoplasy, among-site rate variation, taxon sampling, long branches, big trees, heterogeneous rates of evolution among branches, covarion shifts. <br />
<br />
'''Laboratory Goals:''' Labs will cover basic techniques in molecular systematics from DNA extraction to sequencing, alignment and cloning. This lab will be of interest to both experienced and novice molecular systematists because we will try newly developed kits/techniques and compare them to older ones and we will pursue a class project. <br />
<br />
'''Short Assignments:'''<br />
<br />
'''1)''' For each topic a bibliography will be provided including one focal paper for which the PDF will be posted. Each student will need to turn in a one-page summary of the importance of each focal paper (1 or occasionally 2 papers per week). <br />
<br />
'''2)''' The week prior to the start of classes you will be given a checklist discussing practical considerations, organization and data checks for molecular systematics. In certain sections you are asked to answer questions and explain how these procedures are modified in your lab.<br />
<br />
'''3)''' There will be a short "secondary structure alignment assignment" during the semester. <br />
<br />
'''4)''' Each student will keep a laboratory notebook and hand in data collected during the course in the form of an alignment and a nexus data file. Various exercises will be performed in laboratory and some will be finished outside of class. These are detailed in the laboratory syllabus. <br />
<br />
'''5)''' For each lab, one student will give a 10-15 minute PowerPoint presentation on techniques used in that day’s lab. Ursula will be available to advise you, but use web searches and try to do as much as possible on your own. These PowerPoint presentations will be posted on the class website so that, when you teach a molecular systematics class in the future, you can use them as a starting point for developing lectures of your own.<br />
<br />
'''Final Exam:''' The final exam will be a take-home test in which each student critiques the first draft of a paper submitted to Systematic Biology (submitted in the past, but making comments as if it were submitted today). Each student will also compare the submitted version to the published version. The answer key will be the actual review containing the reviewers’, associate editor’s, and editor’s comments (with permission of authors, reviewers and editors) and a list of critical points that need to be considered by the authors.<br />
<br />
'''Final Due Dates: Sunday 29th April: Lab project and notebook due. Take Home FINAL EXAM handed out. Sunday 6th May: Take home final due.'''<br />
<br />
'''Syllabus:''' <br />
<br />
=='''Schedule'''==<br />
{| border="1" cellpadding="2" <br />
!style="background:#99cccc;" width="90" align="center"|Day<br />
!style="background:#2A52BE;" width="365"|Topics<br />
!style="background:#008080;" width="315"|Reading/Assignment/Bibliography<br />
!style="background:#00B7EB;" width="275"|Lab<br />
|-<br />
|Monday <br/> Mar 19 ||Lecture 1. An introduction to looking at your data: How molecules evolve. Data checks {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/EEB_5350_Day_1_Lecture_&_Lab_Data_Checks_2018.pdf}}<br/><br />
<br />
|| Read Simon et al. 1994. 651-670 (up to the section that starts on the bottom of the second column). Too large to post, will be emailed to you. How Molecules Evolve & Model Choice Bibliography: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/How_molecules_evolve,_model_choice_Readings_2018.pdf}}||'''LAB:''' Data checks at every step. Mechanics of Lab; Qiagen kit extractions. Qiagen kit extraction protocol: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/DNeasy-Blood--Tissue-Quick-start-Protocol-EN.pdf}}<br />
|-<br />
|Wednesday Mar 21 ||Lecture 2. How molecules evolve, continued. {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/EEB_5350_Day_2_How_molecules_evolve_(cont.).pdf}}<br/> || Read Sullivan and Swofford 2001 for Monday March 26th. Among Site Rate Variation Readings: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/EEB_5350_ASRV_readings_2018.pdf}} ||'''Mini-presentation:''' DNA extraction (Katie) {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/DNA_Extraction_2018.pptx}} <br/> '''LAB:''' Qiagen extractions continued and plant extractions. CTAB plant extraction protocol: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/MicroPrep_CTAB_Plant_DNA.pdf}}<br />
|-<br />
|Monday <br/> Mar 26 || Lecture 3. ASRV, models of evolution, and the history of molecular systematics. Calculating the probability of substitution for sites, Fitch and Margoliash invariant sites models & negative binomial models, Weighting stems and loops. {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/EEB_5350_Day_3_Among_Site_Rate_Variation_2018.pptx.pdf}}|| ||'''Mini-presentation:''' Primer Design ( Katie ) {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Primer_Design_2018.pptx}} <br/> '''LAB:''' Explanation of class Tettigades project {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Tettigades_Intro_2018.pptx}} and making gels, running extractions on gels, DNA extraction quantification, troubleshooting and improving “universal” primers for COI. <br />
Qubit protocol: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/qubit_assays_quick_start_guide.pdf}}<br />
Nanodrop manual {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Nanodrop_1000_v3.7_manual.pdf}}<br />
Nanodrop protocol: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Nanodrop_DNA_Spec_Protocol.pdf}}<br />
<br />
|-<br />
|Wednesday Mar 28 || Lecture 4. Correlated changes- should consider stems vs loops; How much to down weight and how to partition when weighting is problematic; Different methods for calculating & accommodating ASRV; For probability of substitution, using a tree is more effective than an alignment; The interaction of tree shape and ASRV; The two components of evolutionary trees; (equal weights aka evenly weighted; misnomer “unweighted” parsimony); Effects of Ignoring ASRV {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Day_4._ASRV_(Cont.)_Combining_Data_27_Mar_2018.pdf}} <br />
||Read for Monday April 2nd: Bull et al. 1993, a classic paper from the Hillis Lab on partitioning and combining data. {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Bull_et_al._1993_Syst._Biol.pdf}} ||'''Mini-presentation:''' The polymerase chain reaction ( Zoe ) {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Polymerase_Chain_Reaction.pptx}}<br/> '''LAB:''' Setting up PCR reactions. PCR protocol: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/PCR_protocol_2018.docx}}<br />
<br />
|-<br />
|Monday <br/> Apr 2 || Lecture 5. History of “combining data”, As many kinds of data as possible, non-specificity hypothesis, To combine or not to combine? That is the question. Lack of agreement among character subsets, Random error vs systematic error, Assumptions of combined analysis, Bull et al. vs. Chippindale & Wiens; ASRV & ALRV, Homothermia {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Day_5._EEB_5350_Combining_Data._2018.pdf}}|| Read and summarize for class by Monday April 9th: Pagel, M. and A. Meade. 2004. Read and summarize for class on Wednesday, April 11th: Kainer, D. and R. Lanfear. 2015. Combining Data, Partitioning, Species Trees readings {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Combining_Data,_Partitioning,_Comparing_Trees_Readings_2018.doc}}, Pagel and Meade 2004 {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Pagel_and_Meade._2004._Mixture_Model.pdf}}, Kainer and Lanfear 2015 {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Kainer_and_Landfear._2015._Effects_of_Partitioning_on_Phylogen.pdf}} || '''Mini-presentation:''' Different methods for cleaning PCR products for sequencing reactions ( Tanner ) (Tanner's presentation is lost, but here is a presentation on the same topic from a prior year of the class, which might be a useful reference {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/PCR_and_Sequencing_cleanup_Ursula.pdf}} ) <br/> '''LAB:''' Running PCR products on gels, purifying PCR products with ExoSAP-IT, and setting up sequencing reactions. PCR clean up protocol and Cycle sequencing protocol: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/PCR_Clean_Up_+_Cycle_Sequencing.docx}}<br />
<br />
|-<br />
|Wednesday Apr 4 ||Lecture 6. Tests for combining data; testing whether the same tree underlies each data partition. Partitioning; Choosing among models for pre-assigned partitions; Automated partition assignment and partition simplification; Model averaging and mixture models{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Day_6._Partitions_and_Mixtures_4Apr2018.pdf}}<br/> || ||'''Mini-presentation:''' Numts ( Johnny ) {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/NUMTS_Presentation.pptx}} <br/> '''LAB:''' Cleaning and putting samples on the ABI; Looking at sequences using Geneious. Sephadex cleaning protocol and loading ABI machine protocol: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Sequencing_and_Cleaning_protocol.docx}}<br />
|-<br />
|Monday <br/> Apr 09 ||Lecture 7. What is a long branch?; The meaning of “basal”; Node density artifacts; Felsenstein 1978- when will parsimony be positively misleading?; Penny & Hendy 1989- long branch attraction; Huelsenbeck & Hillis simulations to explore tree space. Accuracy of different phylogenetic methods; Swofford et al. 2001. Bias in Phylogeny estimation due to long branches: Parsimony vs. likelihood in tree space; Remaining uncommitted {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Day_7._EEB_5350._Long_Branches_2018.pdf}} ||Covarion, Heterotachy, Nucleotide Bias Readings {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Covarion,_Heterotachy,_bias_readings_2018rev.pdf}} <br> <br>Read and summarize for Class (Due Monday, April 16) Gruenheit, Nicole, Peter J. Lockhart, Mike Steel, and William Martin. 2008. {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Gruenheit_et_al._(Lockhart)_2008._Covarion_under_changing_proportions_var_sites.pdf}} ||'''Mini-presentation:''' How BigDye works, chromatograms, and troubleshooting ( Diler )<br/> '''LAB:''' Viewing and interpreting sequencing results, setting up long range PCR. Long range PCR protocol {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/LR_PCR_protocol_2018.docx}}<br />
|-<br />
|Wednesday Apr 11 ||Lecture 8. ALRV: heterotachy, covarion models; Among Lineage rate variation: Covarion evolution: codon models {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Day_8)_Covarion_evolution,_heterotachy_2018.pptx.pdf}} <br/> || ||'''Mini-presentation:''' Depositing sequences in GenBank ( Tanner ) ppt: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Submitting_to_GenBank.pptx}} submission protocol: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/GenBank_sequence_submission_protocol_19May2016.doc}} example feature table: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/COII_feature_table.xlsx}} <br/> '''LAB:''' Running long range PCR gel {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/LRGel.jpeg}}, cleaning long range PCR product, setting up 2nd short PCR, making reagents for bead cleanup protocol. Protocol for making bead cleanup mix and testing: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/serapure_v2.pdf}}<br />
<br />
|-<br />
|Monday <br/> Apr 16 ||Lecture 9. Heterotachous evolution continued, Covarion Models, The Case for Stationary Genes, Mixture of Branch Lengths for building trees and studying selection. Covarion Mixture Models.{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/EEB_5350_Day_9)_Covarion_evolution_(cont),_Mixture_of_branch_lengths_2018.pptx.pdf}}<br />
|| ||'''Mini-presentation:''' Ancient DNA & Museum DNA protocols ( Zoe ) <br/> '''LAB:''' Running short range PCR gel {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/SRGel.jpeg}}, cleaning PCR product, setting up sequencing reaction, testing bead cleanup method {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/BeadCleanupTesting.jpeg}}<br />
|-<br />
|Wednesday Apr 18 ||Lecture 10: Problems associated with nodal support {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Day_10)_EEB_5350_Big_trees,_Branch_support_2018.pdf}} ||Nodal Support Readings {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Nodal_support_readings_Sp_2018.doc}}<br />
Read and Summarize for Next week.... Monday 23 April 18.<br />
Salichos L, Stamatakis A, Rokas A. 2014. Novel information theory-based measures for quantifying incongruence among phylogenetic trees. Molecular Biology and Evolution 31:1261-1271.{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Salichos_et_al._2014._(Stamatakis,_Rokas)_Novel_Information_Theory_Based_Meas.pdf}} (No need to summarize the derivation, just the introduction and the applications).<br />
||'''Mini-presentation:''' RNA: extraction and what it can be used for ( Diler ) <br/> '''LAB:''' Cleaning and putting samples on the ABI, and starting RNA isolation with Trizol. Trizol RNA from tissue protocol: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Trizol_RNA_extraction_from_Tissue.docx}}<br />
|-<br />
|Monday <br/> Apr 23 ||Lecture 11) Nodal support continued. Spectral analysis, Internode certainty, SplitsTrees. Misc. topics: Big Trees; more taxa or more sequences. {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Day_11,_part_1)_Branch_Support_(cont.)_Support_&_Conflict_2018.pdf}} || ||'''LAB:''' Finish RNA isolation, Compare sequencing results from long range and typical PCR <br />
|-<br />
|Wednesday Apr 25 ||Lecture 12: Secondary structure & alignment. {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Day_12)_Part_1._Alignment_&_Secondary_Structure_2018.pdf}} Molecular Clocks {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Day_12)_Part_2._Molecular_clocks.pdf}}|| Secondary structure assignment {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/EEB_5350_secondary_structure_assignment_Sp18.pdf}} and templates for Magicicada {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Magicicada_12S_rRNA.pdf}} and conserved motif template {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/rRNA_12S_3rd_Domain_Template_2018_04_23_09_57_12_OCR.pdf}}. <br />
Hickson et al. 1996. Conserved sequence motifs, alignment, and secondary structure for the third domain of animal 12S rRNA. {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Hickson_et_al._1996._rRNA_structure_&_alignment._Mol_Biol_Evol.pdf}} <br />
Molecular clock readings: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Molecular_Clock_Readings_2018.pdf}}<br />
Structure and alignment readings: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Structure_Alignmnt_readings_2018.pdf}}<br />
||<br/> '''Mini-presentation:''' Gel electrophoresis ( Johnny ) {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Gel_Electrophoresis.pptx}}<br/> '''LAB:''' Katie presents on Next Generation Sequencing and Applications<br />
|-<br />
|Sunday <br/> Apr 29 || Lab notebook due. Take home final handed out.|| ||<br />
|-<br />
|Sunday <br/> May 6th||Final Exam due, emailed to Katie who will transmit the anonymous papers to Chris along with a list of pseudonyms|| || <br />
|}<br />
<br/><br />
'''Final Exam Files''' Reviewer's instructions: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Final_Exam_EEB350_Reviewer's_instructns_S_2018.pdf}} <br />
Submitted manuscript: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Shull_et_al._28Oct_MS_submitted.pdf}}<br />
Submitted figures: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Shull_et_al._submitted_figures1-7,_Tabs_1-5.pdf}}<br />
Published manuscript: {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/molsyst-eeb5350/Published_Shull_et_al._2001_SYB.pdf}}<br/></div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Seminar_speaker_sign-up&diff=40109Seminar speaker sign-up2019-02-25T15:38:20Z<p>Paul Lewis: /* Friday, March 1, 2019 */</p>
<hr />
<div><br />
== '''Spencer Barrett''' ==<br />
<br />
'''Institution:''' University of Toronto <br><br />
'''Website: ''' http://labs.eeb.utoronto.ca/BarrettLab/Sbarrett.html <br><br />
'''Seminar Title: ''' "Genomic insights into the evolution and ecology of plant sexual diversity" <br><br />
'''Time and Place:''' 3:30 PM, Thursday, February 28, 2019, BioPhysics 131 <br><br />
'''Contact:''' Pamela Diggle - pamela.diggle@uconn.edu <br><br />
<br />
==Wednesday, February 27, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|6:00 || Dinner: Carlos Garcia-Robledo, Erin Kuprewicz || TBD<br />
|}<br />
<br />
==Thursday, February 28, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00-9:30 || Carl Schlichting || Breakfast at Tolland Inn<br />
|-<br />
|9:30-10:00 ||Morgan Tingley || BioPharm 205D <br />
|-<br />
|10:00-10:30 || Don Les || TLS? <br />
|-<br />
|10:30-11:00 || Cindi Jones || BioPharm 400<br />
|-<br />
|11:00-12:00 || Greg Anderson || TLS 383<br />
|-<br />
|12:00-1:30 ||Lunch with grad students || Bamford TLS171<br />
|-<br />
|1:30-2:00 || John Silander || TLS 184<br />
|-<br />
|2:00-2:30 || Dan Bolnick || BioPharm <br />
|-<br />
|2:30-2:45 || break || BioPharm 500A<br />
|-<br />
|2:45-3:15 ||EEB 3894 class || Bamford<br />
|-<br />
|3:15-3:30 ||set up for seminar || <br />
|-<br />
|3:30-4:30 || Seminar "Genomic insights into the evolution and ecology of plant sexual diversity" || BioPhysics 131<br />
|-<br />
|4:30-5:00 || Snacks || Bamford<br />
|-<br />
|5:00 - 5:30 || Break || <br />
|-<br />
|6:00 || Pam Diggle, Paul Lewis, Louise Lewis, Janine Caira || Dinner<br />
|}<br />
<br />
==Friday, March 1, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00-9:30 || || Breakfast at Tolland Inn<br />
|-<br />
|9:30-10:00 || || <br />
|-<br />
|10:00-10:30 || Kent Holsinger || BioPharm 6th floor penthouse<br />
|-<br />
|10:30-11:00 || Foen Peng || BioPharm 319<br />
|-<br />
|11:00-12:00 ||Yaowu Yuan || BioPharm 300A<br />
|-<br />
|12:00-1:30 || lunch Robi Bagchi|| Biopharm 205C<br />
|-<br />
|1:30-2:00 || leave for airport || <br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Seminar_speaker_sign-up&diff=40108Seminar speaker sign-up2019-02-25T15:38:04Z<p>Paul Lewis: /* Thursday, February 28, 2019 */</p>
<hr />
<div><br />
== '''Spencer Barrett''' ==<br />
<br />
'''Institution:''' University of Toronto <br><br />
'''Website: ''' http://labs.eeb.utoronto.ca/BarrettLab/Sbarrett.html <br><br />
'''Seminar Title: ''' "Genomic insights into the evolution and ecology of plant sexual diversity" <br><br />
'''Time and Place:''' 3:30 PM, Thursday, February 28, 2019, BioPhysics 131 <br><br />
'''Contact:''' Pamela Diggle - pamela.diggle@uconn.edu <br><br />
<br />
==Wednesday, February 27, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|6:00 || Dinner: Carlos Garcia-Robledo, Erin Kuprewicz || TBD<br />
|}<br />
<br />
==Thursday, February 28, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00-9:30 || Carl Schlichting || Breakfast at Tolland Inn<br />
|-<br />
|9:30-10:00 ||Morgan Tingley || BioPharm 205D <br />
|-<br />
|10:00-10:30 || Don Les || TLS? <br />
|-<br />
|10:30-11:00 || Cindi Jones || BioPharm 400<br />
|-<br />
|11:00-12:00 || Greg Anderson || TLS 383<br />
|-<br />
|12:00-1:30 ||Lunch with grad students || Bamford TLS171<br />
|-<br />
|1:30-2:00 || John Silander || TLS 184<br />
|-<br />
|2:00-2:30 || Dan Bolnick || BioPharm <br />
|-<br />
|2:30-2:45 || break || BioPharm 500A<br />
|-<br />
|2:45-3:15 ||EEB 3894 class || Bamford<br />
|-<br />
|3:15-3:30 ||set up for seminar || <br />
|-<br />
|3:30-4:30 || Seminar "Genomic insights into the evolution and ecology of plant sexual diversity" || BioPhysics 131<br />
|-<br />
|4:30-5:00 || Snacks || Bamford<br />
|-<br />
|5:00 - 5:30 || Break || <br />
|-<br />
|6:00 || Pam Diggle, Paul Lewis, Louise Lewis, Janine Caira || Dinner<br />
|}<br />
<br />
==Friday, March 1, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00-9:30 || || Breakfast at Tolland Inn<br />
|-<br />
|9:30-10:00 || Paul Lewis || TLS 164 <br />
|-<br />
|10:00-10:30 || Kent Holsinger || BioPharm 6th floor penthouse<br />
|-<br />
|10:30-11:00 || Foen Peng || BioPharm 319<br />
|-<br />
|11:00-12:00 ||Yaowu Yuan || BioPharm 300A<br />
|-<br />
|12:00-1:30 || lunch Robi Bagchi|| Biopharm 205C<br />
|-<br />
|1:30-2:00 || leave for airport || <br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Seminar_speaker_sign-up&diff=40104Seminar speaker sign-up2019-02-25T14:37:43Z<p>Paul Lewis: /* Friday, February 29, 2019 */</p>
<hr />
<div><br />
== '''Spencer Barrett''' ==<br />
<br />
'''Institution:''' University of Toronto <br><br />
'''Website: ''' http://labs.eeb.utoronto.ca/BarrettLab/Sbarrett.html <br><br />
'''Seminar Title: ''' "Genomic insights into the evolution and ecology of plant sexual diversity" <br><br />
'''Time and Place:''' 3:30 PM, Thursday, February 28, 2019, BioPhysics 131 <br><br />
'''Contact:''' Pamela Diggle - pamela.diggle@uconn.edu <br><br />
<br />
==Wednesday, February 27, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|6:00 || Dinner: Carlos Garcia-Robledo, Erin Kuprewicz || TBD<br />
|}<br />
<br />
==Thursday, February 28, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00-9:30 || Carl Schlichting || Breakfast at Tolland Inn<br />
|-<br />
|9:30-10:00 ||Morgan Tingley || BioPharm 205D <br />
|-<br />
|10:00-10:30 || Don Les || TLS? <br />
|-<br />
|10:30-11:00 || Cindi Jones || BioPharm 400<br />
|-<br />
|11:00-12:00 || Greg Anderson || TLS 383<br />
|-<br />
|12:00-1:30 ||Lunch with grad students || Bamford TLS171<br />
|-<br />
|1:30-2:00 || John Silander || TLS 184<br />
|-<br />
|2:00-2:30 || Dan Bolnick || BioPharm <br />
|-<br />
|2:30-3:00 || || <br />
|-<br />
|3:00-3:30 ||Break || BioPharm 500A<br />
|-<br />
|3:30-4:30 || Seminar "Genomic insights into the evolution and ecology of plant sexual diversity" || BioPhysics 131<br />
|-<br />
|4:30-5:00 || Snacks || Bamford<br />
|-<br />
|5:00 - 5:30 || Break || <br />
|-<br />
|6:00 || Pam Diggle, || Dinner<br />
|}<br />
<br />
==Friday, March 1, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00-9:30 || || Breakfast at Tolland Inn<br />
|-<br />
|9:30-10:00 || Paul Lewis || TLS 164 <br />
|-<br />
|10:00-10:30 || Kent Holsinger || BioPharm 6th floor penthouse<br />
|-<br />
|10:30-11:00 || Foen Peng || BioPharm 319<br />
|-<br />
|11:00-12:00 ||Yaowu Yuan || BioPharm 300A<br />
|-<br />
|12:00-1:30 || lunch Robi Bagchi|| Biopharm 205C<br />
|-<br />
|1:30-2:00 || leave for airport || <br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Seminar_speaker_sign-up&diff=40103Seminar speaker sign-up2019-02-25T14:35:12Z<p>Paul Lewis: /* Friday, February 29, 2019 */</p>
<hr />
<div><br />
== '''Spencer Barrett''' ==<br />
<br />
'''Institution:''' University of Toronto <br><br />
'''Website: ''' http://labs.eeb.utoronto.ca/BarrettLab/Sbarrett.html <br><br />
'''Seminar Title: ''' "Genomic insights into the evolution and ecology of plant sexual diversity" <br><br />
'''Time and Place:''' 3:30 PM, Thursday, February 28, 2019, BioPhysics 131 <br><br />
'''Contact:''' Pamela Diggle - pamela.diggle@uconn.edu <br><br />
<br />
==Wednesday, February 27, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|6:00 || Dinner: Carlos Garcia-Robledo, Erin Kuprewicz || TBD<br />
|}<br />
<br />
==Thursday, February 28, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00-9:30 || Carl Schlichting || Breakfast at Tolland Inn<br />
|-<br />
|9:30-10:00 ||Morgan Tingley || BioPharm 205D <br />
|-<br />
|10:00-10:30 || Don Les || TLS? <br />
|-<br />
|10:30-11:00 || Cindi Jones || BioPharm 400<br />
|-<br />
|11:00-12:00 || Greg Anderson || TLS 383<br />
|-<br />
|12:00-1:30 ||Lunch with grad students || Bamford TLS171<br />
|-<br />
|1:30-2:00 || John Silander || TLS 184<br />
|-<br />
|2:00-2:30 || Dan Bolnick || BioPharm <br />
|-<br />
|2:30-3:00 || || <br />
|-<br />
|3:00-3:30 ||Break || BioPharm 500A<br />
|-<br />
|3:30-4:30 || Seminar "Genomic insights into the evolution and ecology of plant sexual diversity" || BioPhysics 131<br />
|-<br />
|4:30-5:00 || Snacks || Bamford<br />
|-<br />
|5:00 - 5:30 || Break || <br />
|-<br />
|6:00 || Pam Diggle, || Dinner<br />
|}<br />
<br />
==Friday, March 1, 2019 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00-9:30 || || Breakfast at Tolland Inn<br />
|-<br />
|9:30-10:00 || Paul Lewis || TLS 164 <br />
|-<br />
|10:00-10:30 || Kent Holsinger || BioPharm 6th floor penthouse<br />
|-<br />
|10:30-11:00 || Foen Peng || BioPharm 319<br />
|-<br />
|11:00-12:00 ||Yaowu Yuan || BioPharm 300A<br />
|-<br />
|12:00-1:30 || lunch Robi Bagchi|| Biopharm 205C<br />
|-<br />
|1:30-2:00 || leave for airport || <br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Systematics_Seminar&diff=39608Systematics Seminar2018-09-24T14:31:18Z<p>Paul Lewis: /* Theme and Schedule for Spring 2018 */</p>
<hr />
<div>This is the home page of the UConn EEB department's Systematics Seminar (EEB 6486). This is a graduate seminar devoted to issues of interest to graduate students and faculty who make up the systematics program at the University of Connecticut. <br />
<br />
[[Systematics Listserv|Click here for information about joining and using the Systematics email list]]<br />
<br />
== Meeting time and place ==<br />
<br />
We meet at 11:05 in the Bamford Room (TLS 171B)<br />
<br />
== Theme and Schedule for Fall 2018 ==<br />
We will largely be discussing papers on character mapping, reticulation, and biogeography+dating. Any students who would like to sign up to present a practice talk or talk through ideas related to their research are encouraged to do so!<br />
<br />
=== Aug. 31 ===<br />
Planning meeting (no readings)<br />
<br />
=== Sep. 7 ===<br />
<br />
Conflicts between the results of morphological and molecular datasets in squamate reptiles. <br />
<br />
Paper and supplemental files:<br />
https://dropbox.uconn.edu/dropbox?n=PhylogIguniaRootProblem18.zip&p=Wzhn6V64Bz4T9W7qH<br />
<br />
Discussion led by Jack Phillips<br />
<br />
jackson.phillips@uconn.edu<br />
<br />
=== Sep. 14 ===<br />
<br />
Diler and Eric discuss [https://doi.org/10.1093/sysbio/syy019 The Biogeography of Deep Time Reticulation]<br />
<br />
=== Sep. 21 ===<br />
<br />
Diler discusses [https://doi.org/10.1371/journal.pgen.1005896 Inferring Phylogenetic Networks with Maximum Pseudolikelihood under Incomplete Lineage Sorting]<br />
<br />
Watch Cecile Ané's [https://www.youtube.com/watch?v=PF4j_JOQP0c PhyloSeminar] and check out her [http://www.stat.wisc.edu/~ane/PhyloNetworks/MBL2018-networkmodels.pdf slides] from the 2018 Molecular Evolution Workshop at Woods Hole for more information on phylogenetic networks.<br />
<br />
=== Sep. 28 ===<br />
<br />
Katie discusses [https://doi.org/10.1093/sysbio/syy023 HyDe: A Python Package for Genome-Scale Hybridization Detection]<br />
<br />
=== Oct. 5 ===<br />
<br />
Kevin discusses something...<br />
<br />
=== Oct. 12 ===<br />
<br />
=== Oct. 19 ===<br />
<br />
=== Oct. 26 ===<br />
<br />
Katie and Kevin give practice ESA talks<br />
<br />
=== Nov. 2 ===<br />
<br />
=== Nov. 9 ===<br />
<br />
=== Nov. 16 ===<br />
<br />
=== Nov. 23 ===<br />
<br />
'''THANKSGIVING BREAK! WOO!'''<br />
<br />
=== Nov. 30 ===<br />
<br />
=== Dec. 7 ===<br />
<br />
== Information for discussion leaders ==<br />
'''Seminar Format:''' Registered students should be prepared to lead discussions, perhaps more than once depending on the number of participants. <br />
<br />
The leader(s) will be responsible for (1) selecting the readings, (2) announcing the selection, (3) giving an introductory presentation, (4) driving discussion and (5) setting up and putting away the projector. <br />
<br />
'''Readings:''' In consultation with the instructors, each leader should assign one primary paper for discussion and up to two other ancillary papers or resources. The readings should be posted to EEBedia at least 5 days in advance.<br />
<br />
'''Announcing the reading:''' The leader should add an entry to the schedule (see below) by editing this page. There are two ways to create a link to the paper:<br />
<br />
1. If the paper is available online through our library, it is sufficient to create a link to the DOI:<br />
<nowiki>:[http://dx.doi.org/10.1093/sysbio/syv041 Doyle et al. 2015. Syst. Biol. 64:824-837.]</nowiki><br />
In this case, you need not give all the citation details because the DOI should always be sufficient to find the paper. The colon (:) at the beginning of the link causes the link to be indented and placed on a separate line. Note that the DOI is in the form of a URL, starting with <code><nowiki>http://dx.doi.org/</nowiki></code>. Here is how the above link looks embedded in this EEBedia page:<br />
:[http://dx.doi.org/10.1093/sysbio/syv041 Doyle et al. 2015. Syst. Biol. 64:824-837.]<br />
<br />
2. If the paper is not available through the library, upload a PDF of the paper to [http://dropbox.uconn.edu the UConn dropbox], being sure to use the secure version so that it can be password protected. Copy the URL provided by dropbox, and create a link to it as follows (see the [[Dropbox Test]] page for other examples):<br />
<nowiki>:[https://dropbox.uconn.edu/dropbox?n=SystBiol-2015-Doyle-824-37.pdf&p=ELPFIc5NtO3c4V44Ls Doyle et al. 2015.]</nowiki><br />
In this case, you should provide a full citation to the paper for the benefit of those that visit the site long after the dropbox link has expired; however, the full details need not be part of the link text. Here is what this kind of link looks like embedded in this EEBedia page:<br />
<br />
:[https://dropbox.uconn.edu/dropbox?n=SystBiol-2015-Doyle-824-37.pdf&p=ELPFIc5NtO3c4V44Ls Doyle et al. 2015.] Full citation: Vinson P. Doyle, Randee E. Young, Gavin J. P. Naylor, and Jeremy M. Brown. 2015. Can We Identify Genes with Increased Phylogenetic Reliability? Systematic Biology 64 (5): 824-837. doi:10.1093/sysbio/syv041<br />
<br />
If you have ancillary papers, upload those to the dropbox individually and create separate links. <br />
<br />
Finally, send a note to the [[Systematics Listserv]] letting everyone know that a paper is available. <br />
<br />
'''Introductory PowerPoint/KeyNote Presentation:''' Introduce your topic with a 10- to 15-minute PowerPoint or KeyNote presentation. Dedicate at least 2/3 of that time to placing the subject into its broader context and at most 1/3 of it to introducing the paper, special definitions, taxa, methods, etc. (For example, for a reading on figs and fig-wasps, broaden the scope to plant-herbivore co-evolution.) Never exceed 15 minutes. Add images, include short movie clips, visit web resources, etc. to keep the presentation engaging. Although your presentation should not be a review of the primary reading, showing key figures from the readings may be helpful (and appreciated). You may also want to provide more detail and background about ancillary readings, which likely have not been read by all. <br />
<br />
'''Discussion:''' You are responsible for driving the discussion. Assume everyone in attendance has read the main paper. There are excellent suggestions for generating class discussions on Chris Elphick’s Current Topics in Conservation Biology course site. See section under expectations. <br />
<br />
Prepare 3-5 questions that you expect will spur discussion. Ideally, you would distribute questions a day or two before our class meeting.<br />
<br />
'''Projector:''' <br />
The Bamford room has joined the modern world--you should just need to plug in your computer or USB key to project.<br />
<br />
== Past Seminars ==<br />
* [[Systematics Seminar Spring 2018|Spring 2018]]<br />
* [[Systematics Seminar Fall 2017|Fall 2017]]<br />
* [[Systematics Seminar Fall 2014|Fall 2014]]<br />
* [[Systematics Seminar Fall 2013|Fall 2013]]<br />
* [[Systematics Seminar Spring 2012|Spring 2012]]<br />
* [[Systematics Seminar Fall 2011|Fall 2011]]<br />
* [http://darwin.eeb.uconn.edu/wiki/index.php/Statistical_phylogeography Spring 2011] (we joined Kent Holsinger's seminar on Statistical Phylogeography this semester)<br />
* [[Systematics Seminar Fall 2010|Fall 2010]]<br />
* [[Systematics Seminar Spring 2010|Spring 2010]]<br />
* [[Systematics Seminar Fall 2009|Fall 2009]]<br />
* [[Systematics Seminar Fall 2008|Fall 2008]]<br />
* [[Systematics Seminar Spring 2008|Spring 2008]]<br />
* [[Systematics Seminar Fall 2007|Fall 2007]]<br />
* [[Systematics Seminar Spring 2007|Spring 2007]]<br />
* [http://hydrodictyon.eeb.uconn.edu/courses/systematicsseminar/SystSemFall2006.html Fall 2006]<br />
* [http://hydrodictyon.eeb.uconn.edu/courses/systematicsseminar/SystSemSpring2005.html Spring 2005]<br />
* [http://hydrodictyon.eeb.uconn.edu/courses/systematicsseminar/SystSemFall2004.html Fall 2004]<br />
* [http://hydrodictyon.eeb.uconn.edu/courses/phylomath/ Spring 2004]<br />
<br />
[[Category:EEB Seminars]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Seminar_speaker_sign-up&diff=39542Seminar speaker sign-up2018-09-07T20:15:44Z<p>Paul Lewis: </p>
<hr />
<div><br />
== '''Charles Mann''' ==<br />
<br />
'''Institution:''' Science/Wired/Atlantic Monthly <br><br />
'''Website: ''' http://www.charlesmann.org <br><br />
'''Seminar Title: ''' The Edge of the petri dish <br><br />
'''Time and Place:''' 4:00 PM, Thursday, September 13th, 2018, in the Konover Auditorium, Dodd Center <br><br />
'''Contact:''' Greg Anderson <br><br />
<br />
==Thursday, September 13th, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:30 - 9:30 || Greg and Mona Anderson || Breakfast at the Fitch House<br />
|-<br />
|9:30-10:00 || Gene Likens || Torrey 386<br />
|-<br />
|10:00-10:30 || || <br />
|-<br />
|10:30-11:00 || || <br />
|-<br />
|11:00-12:00 || || <br />
|-<br />
|12:00-12:20 || break || <br />
|- <br />
|12:20-1:30 || Lunch with students || TLS 171b<br />
|-<br />
|1:30-2:00 || Meeting with NSS group || TBA<br />
|-<br />
|2:00-2:45 || open || <br />
|-<br />
|2:45-3:15 || Kent Holsinger || Graduate School<br />
|-<br />
|3:15-4:00 || Talk preparations || Dodd Center <br />
|-<br />
|4:00 - 5:00 || Talk || Dodd Center, Konover Auditorium<br />
|-<br />
|5:30 || Dinner with John Volin, Dan Weiner, Pam Diggle, possibly Kevin McBride || TBD<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Seminar_speaker_sign-up&diff=39516Seminar speaker sign-up2018-09-04T19:16:17Z<p>Paul Lewis: /* Wednesday, September 6th, 2018 */</p>
<hr />
<div><br />
== '''Jason Fridley''' ==<br />
<br />
<br />
<br />
'''Institution:''' Syracuse University <br><br />
'''Website: '''https://sites.google.com/site/fridleylab/home <br><br />
'''Seminar Title: '''The modern invasive species problem: a world Darwin envisioned? <br><br />
'''Time and Place:''' 3:30 PM, Thursday, September 6th, 2018, in Biophysics 131 <br><br />
'''Contact:''' Robert Bagchi <br><br />
<br />
== Wednesday, September 5th, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| Late || John Silander || Dinner (Silander residence)<br />
|}<br />
<br />
==Thursday, September 6th, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00 - 9:30 || Val and Dipsi || Breakfast at Tolland Inn<br />
|-<br />
|9:30-10:00 || Robert Bagchi || PharmBio 205C<br />
|-<br />
|10:00-10:30 ||Tom Harrington || PharmBio 404<br />
|-<br />
|10:30-11:00 ||Don Les || TLS 375<br />
|-<br />
|11:00-12:00 || Mike Willig|| CESE (Willig will pick up from outside TLS) <br />
|-<br />
|12:00-1:00 || Lunch with grad students|| Bamford Room (TLS 171b)<br />
|- <br />
|1:00-1:30 || James Mickley || BioPharm 219<br />
|-<br />
|1:30-2:00 || Mark Urban || BioPharm 200A<br />
|-<br />
|2:00-2:45 || Tanisha Williams || TLS 180<br />
|-<br />
|2:45-3:15 || Meet with EEB 3894 || Bamford Room (TLS 171b) <br />
|-<br />
|3:30-4:30 || SEMINAR: The modern invasive species problem: a world Darwin envisioned? || BPB 131 <br />
|-<br />
|4:30 - 5:00 || Post-seminar snacks || Bamford Room (TLS 171b)<br />
|-<br />
|5:45 || Dinner (take-out)|| Bagchi-Davis residence (48 Fellen Rd)<br />
|}<br />
<br />
==Friday, September 7th, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|8:00 - 9:30 || || Breakfast at Tolland Inn<br />
|-<br />
|9:30 - 10:00 || Kristen Nolting || TLS 180<br />
|-<br />
|10:00 - 10:30 || John Silander || TLS 184<br />
|-<br />
|10:30 || Depart ||<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Seminar_speaker_sign-up&diff=39259Seminar speaker sign-up2018-05-01T11:17:43Z<p>Paul Lewis: /* Friday, May 4th, 2018 */</p>
<hr />
<div><br />
== '''Michael Landis''' ==<br />
<br />
<br />
<br />
'''Institution:''' Yale University <br><br />
'''Website: '''https://donoghuelab.yale.edu/people/michael-landis<br><br />
'''Seminar Title: '''Dating the silversword radiation using Hawaiian paleogeography <br><br />
'''Time and Place:''' 11:00 AM, Friday, May 4th, 2018, in Bamford Room <br><br />
'''Contact:''' Chris Simon <br><br />
<br />
==Friday, May 4th, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|9-9:30 || Chris Simon || Meet at 17 Silver Falls to be guided through construction on campus<br />
|-<br />
|9:30-10:00 || || <br />
|-<br />
|10:00-10:30 || || <br />
|-<br />
|10:30-11:00 || Kevin Keegan || <br />
|-<br />
|11:00-12:00 || Seminar || Bamford Room<br />
|-<br />
|12:00-1:30 || Lunch Chuck & Augie's || Chris Simon, Paul Lewis, Kevin Keegan, Suman Neupane, Dave Wagner, (+ 2 more spaces)<br />
|-<br />
|1:30-2:00 || Paul Lewis || TLS 164<br />
|-<br />
|2:00-2:30 || || <br />
|-<br />
|2:30-3:00 || || <br />
|-<br />
|3:00-4:00 || Chris Simon || Biopharm 305d <br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39232Phylogenetics: BEAST2 Lab2018-04-25T22:45:58Z<p>Paul Lewis: /* (q)login to the cluster */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the strict clock, random local clocks or the uncorrelated lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Login to the Bioinformatics Facility Dell cluster (the -Y allows graphics to display on your local machine if you have an XWindow client installed):<br />
ssh -Y username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. If you have logged into the cluster using "ssh -Y" and if you have an XWindow client that can handle the display, you can start BEAUTi by simply typing "beauti" at the prompt. If that does not work, the easiest option is to download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that if you are running BEAUTi locally, you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# '''strict clock''' (set priors to what tutorial advises)<br />
# '''uncorrelated lognormal relaxed clock''' (set priors to what tutorial advises, except use Gamma with shape=0.001, scale=1000 prior for ucldMean.cclock)<br />
# '''random local clocks''' (set priors to what tutorial advises, and use Poisson with Lambda=0.5 for RRateChanges.cclock prior)<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
==== Steps for setting up your XML file for steppingstone ====<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== XML file before modification ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== XML file after modification ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
<br />
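The four edits listed above can also be applied with a short script. The following is only a sketch, not part of BEAST2: the function name wrap_for_path_sampling is made up for illustration, and it assumes the file contains exactly one top-level <run ...> element, as in the BEAUTi output shown above.<br />

```python
import re

# Opening lines of the wrapper, copied from step 3 of the instructions above.
WRAPPER_OPEN = (
    "<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' "
    "deleteOldLogs='true' rootdir='ss'>\n"
    "cd $(dir)\n"
    "java -cp $(java.class.path) beast.app.beastapp.BeastMain "
    "$(resume/overwrite) -java -seed $(seed) beast.xml\n"
)


def wrap_for_path_sampling(xml_text):
    """Apply the four manual edits to the text of a BEAUTi-generated xml file.

    Hypothetical helper: assumes a single <run ...> ... </run> element.
    """
    # Steps 1 and 2: rename the original run element to mcmc.
    renamed = re.sub(r"<run(?=[\s>])", "<mcmc", xml_text, count=1)
    renamed = renamed.replace("</run>", "</mcmc>", 1)
    # Step 3: insert the PathSampler wrapper just before <mcmc.
    start = renamed.index("<mcmc")
    wrapped = renamed[:start] + WRAPPER_OPEN + renamed[start:]
    # Step 4: close the wrapper just after </mcmc>.
    end = wrapped.index("</mcmc>") + len("</mcmc>")
    return wrapped[:end] + "\n</run>" + wrapped[end:]


example = (
    '<beast version="2.5">\n'
    '<run id="mcmc" spec="MCMC" chainLength="6000000">\n'
    "</run>\n"
    "</beast>\n"
)
modified = wrap_for_path_sampling(example)
print(modified)
```

To use it on a real file, read the file into a string, pass it to the function, and write the result to a new file so that the original BEAUTi output is preserved.<br />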
==== Explanation ====<br />
The steppingstone method (called path sampling in the BEAST documentation) requires samples from a series of MCMC analyses, each exploring a slightly different distribution (and, strangely, none of these equals the posterior distribution!). Each distribution represents a power posterior: a distribution like the posterior except that the likelihood is raised to a power between 0.0 and 1.0. Each power posterior encloses the next one, like matryoshka dolls, allowing accurate estimation of the ratio of the areas of that pair of distributions. Multiplying successive ratios results in cancellation of the numerator of one with the denominator of the next, so that the end result is an estimate of the ratio of the area of the posterior kernel to the area of the prior. Because the area of the prior is 1.0, this overall ratio equals the marginal likelihood. <br />
<br />
The '''nrOfSteps''' run attribute specifies the number of these MCMC analyses to perform, and the '''alpha''' run attribute determines the spacing of the distributions. The XML file you created with BEAUTi specifies everything needed for each component MCMC analysis, so our modifications to the XML file mainly involve wrapping the original run in a shell that causes it to be executed nrOfSteps times. The PathSampler module specified in the spec run attribute handles putting the results together to obtain an estimate of the marginal likelihood.<br />
<br />
The only run attributes I haven't mentioned are deleteOldLogs and rootdir. The '''deleteOldLogs''' attribute comes into play if you try to repeat an analysis. The PathSampler module will not start an analysis if old log files are still lying around unless you've told it that it is okay to delete old logs. The '''rootdir''' attribute specifies the directory used to store the results of each separate MCMC analysis. These have to be stored somewhere until all are finished. Here I've had you specify rootdir='ss', which causes an ss folder to appear inside your beastlab folder. Inside that ss folder you should see nrOfSteps folders appear, numbered step0, step1, ..., step<nrOfSteps-1>.<br />
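The power-posterior argument above can be checked numerically on a toy problem where the marginal likelihood is known exactly. The sketch below is not what BEAST2 does internally. It assumes a normal model with unit variance and a standard normal prior on the mean, it draws directly from each power posterior (possible here because the model is conjugate) instead of running MCMC, and it assumes that alpha spaces the powers at evenly spaced quantiles of a Beta(alpha,1) distribution. All function names are made up for illustration.<br />

```python
import math
import random


def beta_schedule(nr_of_steps, alpha):
    """Powers from 0 to 1 at evenly spaced quantiles of Beta(alpha, 1)."""
    return [(k / nr_of_steps) ** (1.0 / alpha) for k in range(nr_of_steps + 1)]


def log_likelihood(theta, data):
    """Log likelihood of the data under Normal(theta, 1)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2 for y in data)


def exact_log_marginal(data):
    """Closed-form log marginal likelihood for a Normal(0, 1) prior on theta."""
    n, s, ss = len(data), sum(data), sum(y * y for y in data)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n) - 0.5 * (ss - s * s / (1 + n))


def steppingstone_log_marginal(data, betas, draws, rng):
    """Sum the log of each estimated ratio between successive power posteriors."""
    n, s = len(data), sum(data)
    total = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        # The power posterior at b_prev is Normal(mean, 1/precision) for this model.
        precision = 1.0 + b_prev * n
        mean = b_prev * s / precision
        sd = 1.0 / math.sqrt(precision)
        # Each term is the log of likelihood**(b_next - b_prev) at one draw.
        terms = [(b_next - b_prev) * log_likelihood(rng.gauss(mean, sd), data)
                 for _ in range(draws)]
        biggest = max(terms)  # factor out the largest term for numerical stability
        total += biggest + math.log(sum(math.exp(t - biggest) for t in terms) / draws)
    return total


data = [0.3, -0.5, 1.2, 0.8, -0.1]
betas = beta_schedule(32, 0.3)
estimate = steppingstone_log_marginal(data, betas, 2000, random.Random(1))
exact = exact_log_marginal(data)
print("steppingstone: %.3f  exact: %.3f" % (estimate, exact))
```

Note that samples are drawn at every power in the schedule except the last one (1.0), which is why none of the sampled distributions is the posterior itself.<br />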
<br />
==== Running your XML file ====<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
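If a run is interrupted, it can be handy to check which step folders exist under rootdir. The following is a small sketch based only on the folder naming described above (step0, step1, ...); the function name missing_step_dirs is made up, and the demonstration builds a temporary folder so that it does not touch a real ss directory.<br />

```python
import os
import tempfile


def missing_step_dirs(rootdir, nr_of_steps):
    """Return the names of expected step folders that are absent from rootdir."""
    expected = ["step%d" % i for i in range(nr_of_steps)]
    return [name for name in expected
            if not os.path.isdir(os.path.join(rootdir, name))]


# Demonstration on a throwaway folder in which step1 is deliberately left out.
with tempfile.TemporaryDirectory() as demo_root:
    for name in ("step0", "step2"):
        os.mkdir(os.path.join(demo_root, name))
    demo_missing = missing_step_dirs(demo_root, 3)
print(demo_missing)
```

For the analysis in this lab the call would be missing_step_dirs("ss", 32), run from inside your beastlab directory.<br />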
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. In this lab we set only alpha, nrOfSteps, rootdir, and deleteOldLogs and let the defaults apply to all other options. Note that the term ''particles'' is synonymous with ''steppingstones'' or ''ratios''.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run a chain for a single step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards overestimates, while going from prior to posterior the marginal likelihood estimate is biased towards underestimates (default true)<br />
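The rule given for the '''hosts''' option (particle i gets host number i modulo k) can be illustrated as follows. This is a sketch of the rule as stated in the list above, not BEAST2's own code; the function name and the host names are made up, and zero-based numbering of particles and hosts is an assumption.<br />

```python
def host_for_particle(particle_index, hosts):
    """Pick the host for a particle from a comma separated list of hosts."""
    # Whitespace is removed before splitting, as the option description says.
    host_list = [h for h in "".join(hosts.split()).split(",") if h]
    if not host_list:
        raise ValueError("the hosts list is empty")
    return host_list[particle_index % len(host_list)]


# Hypothetical host names; with three hosts, particles 0 and 3 share a host.
assignments = [host_for_particle(i, "node1, node2 ,node3") for i in range(5)]
print(assignments)
```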
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39231Phylogenetics: BEAST2 Lab2018-04-25T22:45:06Z<p>Paul Lewis: /* Download BEAST2 */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the strict clock, random local clocks or the uncorrelated lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Login to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. If you have logged into the cluster using "ssh -Y" and if you have an XWindow client that can handle the display, you can start BEAUTi by simply typing "beauti" at the prompt. If that does not work, the easiest option is to download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that if you are running BEAUTi locally, you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# '''strict clock''' (set priors to what tutorial advises)<br />
# '''uncorrelated lognormal relaxed clock''' (set priors to what tutorial advises, except use Gamma with shape=0.001, scale=1000 prior for ucldMean.cclock)<br />
# '''random local clocks''' (set priors to what tutorial advises, and use Poisson with Lambda=0.5 for RRateChanges.cclock prior)<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
==== Steps for setting up your XML file for steppingstone ====<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== XML file before modification ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== XML file after modification ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
<br />
==== Explanation ====<br />
The steppingstone method (called path sampling in the BEAST documentation) requires samples from a series of MCMC analyses, each exploring a slightly different distribution (and, strangely, none of these equals the posterior distribution!). Each distribution represents a power posterior: a distribution like the posterior except that the likelihood is raised to a power between 0.0 and 1.0. Each power posterior encloses the next one, like matryoshka dolls, allowing accurate estimation of the ratio of the areas of that pair of distributions. Multiplying successive ratios results in cancellation of the numerator of one with the denominator of the next, so that the end result is an estimate of the ratio of the area of the posterior kernel to the area of the prior. Because the area of the prior is 1.0, this overall ratio equals the marginal likelihood. <br />
<br />
The '''nrOfSteps''' run attribute specifies the number of these MCMC analyses to perform, and the '''alpha''' run attribute determines the spacing of the distributions. The XML file you created with BEAUTi specifies everything needed for each component MCMC analysis, so our modifications to the XML file mainly involve wrapping the original run in a shell that causes it to be executed nrOfSteps times. The PathSampler module specified in the spec run attribute handles putting the results together to obtain an estimate of the marginal likelihood.<br />
<br />
The only run attributes I haven't mentioned are deleteOldLogs and rootdir. The '''deleteOldLogs''' attribute comes into play if you try to repeat an analysis. The PathSampler module will not start an analysis if old log files are still lying around unless you've told it that it is okay to delete old logs. The '''rootdir''' attribute specifies the directory used to store the results of each separate MCMC analysis. These have to be stored somewhere until all are finished. Here I've had you specify rootdir='ss', which causes an ss folder to appear inside your beastlab folder. Inside that ss folder you should see nrOfSteps folders appear, numbered step0, step1, ..., step<nrOfSteps-1>.<br />
<br />
==== Running your XML file ====<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. In this lab we set only alpha, nrOfSteps, rootdir, and deleteOldLogs and let the defaults apply to all other options. Note that the term ''particles'' is synonymous with ''steppingstones'' or ''ratios''.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run a chain for a single step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards overestimates, while going from prior to posterior the marginal likelihood estimate is biased towards underestimates (default true)<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39230Phylogenetics: BEAST2 Lab2018-04-25T22:40:42Z<p>Paul Lewis: </p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the strict clock, random local clocks or the uncorrelated lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Log in to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# '''strict clock''' (set priors to what tutorial advises)<br />
# '''uncorrelated lognormal relaxed clock''' (set priors to what tutorial advises, except use Gamma with shape=0.001, scale=1000 prior for ucldMean.cclock)<br />
# '''random local clocks''' (set priors to what tutorial advises, and use Poisson with Lambda=0.5 for RRateChanges.cclock prior)<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
==== Steps for setting up your XML file for steppingstone ====<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== XML file before modification ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== XML file after modification ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
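The four edits listed above can also be scripted. The following is only a minimal sketch, not part of BEAST2: the function name <code>wrap_for_steppingstone</code> is hypothetical, and it assumes the file contains exactly one "<run" and one "</run>", as BEAUTi output normally does. The wrapper text is copied from the steps above.<br />

```python
# Wrapper text copied from the lab steps; everything else here is illustrative.
WRAPPER_OPEN = (
    "<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' "
    "deleteOldLogs='true' rootdir='ss'>\n"
    "cd $(dir)\n"
    "java -cp $(java.class.path) beast.app.beastapp.BeastMain "
    "$(resume/overwrite) -java -seed $(seed) beast.xml\n"
)


def wrap_for_steppingstone(xml_text):
    """Rename <run> to <mcmc> and wrap it in a PathSampler <run> element."""
    # Assumption: exactly one run element, so plain text replacement is safe.
    if xml_text.count("<run") != 1 or xml_text.count("</run>") != 1:
        raise ValueError("expected exactly one <run ...> ... </run> element")
    # Steps 2 and 4: close the renamed element, then close the new wrapper.
    text = xml_text.replace("</run>", "</mcmc>\n</run>")
    # Steps 1 and 3: rename the opening tag and put the wrapper in front of it.
    return text.replace("<run", WRAPPER_OPEN + "<mcmc")


# Tiny stand-in for a BEAUTi-generated file (real files are much longer).
example = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<beast>
<run id="mcmc" spec="MCMC" chainLength="6000000">
</run>
</beast>
"""
print(wrap_for_steppingstone(example))
```

Editing by hand in a text editor is just as good for a single file; a script like this mainly helps if you need to convert several clock-model files the same way.<br />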
<br />
==== Explanation ====<br />
The steppingstone method (called path sampling in the BEAST documentation) requires samples from a series of MCMC analyses, each exploring a slightly different distribution (and, strangely, none of these equals the posterior distribution!). Each distribution represents a power posterior: a distribution like the posterior except that the likelihood is raised to a power between 0.0 and 1.0. Each power posterior encloses the next one, like matryoshka dolls, allowing accurate estimation of the ratio of the areas of that pair of distributions. Multiplying successive ratios results in cancellation of the numerator of one with the denominator of the next, so that the end result is an estimate of the ratio of the area of the posterior kernel to the area of the prior. Because the area of the prior is 1.0, this overall ratio equals the marginal likelihood. <br />
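The cancellation described above can be written out explicitly. The notation here is my own and is not taken from the BEAST documentation: <math>\theta</math> stands for all model parameters, <math>p(\theta)</math> for the prior, <math>L(\theta)</math> for the likelihood, and <math>0 = \beta_0 < \beta_1 < \dots < \beta_K = 1</math> for the powers.<br />

```latex
% c_beta is the area under the power posterior kernel with power beta
c_{\beta} = \int p(\theta)\, L(\theta)^{\beta}\, d\theta ,
\qquad c_{0} = 1 , \qquad c_{1} = \text{marginal likelihood}

% the product of successive ratios telescopes
\frac{c_{1}}{c_{0}}
  = \prod_{k=1}^{K} \frac{c_{\beta_{k}}}{c_{\beta_{k-1}}}
  = \frac{c_{\beta_{1}}}{c_{\beta_{0}}} \cdot
    \frac{c_{\beta_{2}}}{c_{\beta_{1}}} \cdots
    \frac{c_{\beta_{K}}}{c_{\beta_{K-1}}}

% each ratio is an expectation under the power posterior with the smaller power
\frac{c_{\beta_{k}}}{c_{\beta_{k-1}}}
  = \mathrm{E}_{\beta_{k-1}}\!\left[ L(\theta)^{\beta_{k} - \beta_{k-1}} \right]
  \approx \frac{1}{n} \sum_{i=1}^{n} L(\theta_{i})^{\beta_{k} - \beta_{k-1}} ,
\qquad \theta_{i} \sim \frac{p(\theta)\, L(\theta)^{\beta_{k-1}}}{c_{\beta_{k-1}}}
```

Because the last ratio is estimated from samples drawn at <math>\beta_{K-1} < 1</math>, no chain ever needs to sample the posterior itself.<br />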
<br />
The '''nrOfSteps''' run attribute specifies the number of these MCMC analyses to perform, and the '''alpha''' run attribute determines the spacing of the distributions. The XML file you created with BEAUTi specifies everything needed for each component MCMC analysis, so our modifications to the XML file mainly involve wrapping the original run in a shell that causes it to be executed nrOfSteps times. The PathSampler module specified in the spec run attribute handles putting the results together to obtain an estimate of the marginal likelihood.<br />
<br />
The only run attributes I haven't mentioned are deleteOldLogs and rootdir. The '''deleteOldLogs''' attribute comes into play if you try to repeat an analysis. The PathSampler module will not start an analysis if old log files are still lying around unless you've told it that it is okay to delete old logs. The '''rootdir''' attribute specifies the directory used to store the results of each separate MCMC analysis. These have to be stored somewhere until all are finished. Here I've had you specify rootdir=ss, which causes an ss folder to appear inside your beastlab folder. Inside that ss folder you should see nrOfSteps folders appear, numbered step0, step1, ..., step<nrOfSteps-1>.<br />
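If you want to confirm that every step folder has been created, a quick check along the following lines may help. This is only an illustrative sketch: the helper name <code>missing_step_dirs</code> is made up, and it assumes the step0, step1, ... naming described above.<br />

```python
import os


def missing_step_dirs(rootdir, nr_of_steps):
    """Return the expected step folders that do not (yet) exist in rootdir."""
    expected = [f"step{i}" for i in range(nr_of_steps)]
    return [name for name in expected
            if not os.path.isdir(os.path.join(rootdir, name))]


# Run from inside the beastlab folder; an empty list means all 32 are present.
print(missing_step_dirs("ss", 32))
```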
<br />
==== Running your XML file ====<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. In this lab we set only alpha, nrOfSteps, rootdir and deleteOldLogs and let the defaults apply to all other options. Note that the term ''particles'' is synonymous with ''steppingstones'' or ''ratios''.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run a chain for a single step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards overestimates, while going from prior to posterior the marginal likelihood estimate is biased towards underestimates (default true)<br />
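To get a feel for what the '''alpha''' option does, the sketch below computes a set of powers from evenly spaced quantiles of a Beta(alpha,1) distribution. This is an assumption about how the spacing works, not code taken from BEAST2; the function name <code>power_schedule</code> is hypothetical. It relies only on the fact that the Beta(alpha,1) cumulative distribution function is x^alpha, so the quantile at probability q is q^(1/alpha).<br />

```python
def power_schedule(nr_of_steps=32, alpha=0.3):
    """Powers between 0 (prior) and 1 (posterior), one boundary per step."""
    if alpha <= 0:
        # Uniform intervals, as described for alpha <= 0 in the list above.
        return [k / nr_of_steps for k in range(nr_of_steps + 1)]
    # Evenly spaced quantiles of Beta(alpha, 1): quantile(q) = q ** (1 / alpha).
    return [(k / nr_of_steps) ** (1.0 / alpha) for k in range(nr_of_steps + 1)]


betas = power_schedule(32, 0.3)
# With alpha < 1 most powers are crowded near 0, i.e. near the prior.
print(["%.5f" % b for b in betas[:5]], "...", ["%.5f" % b for b in betas[-3:]])
```

Values of alpha below 1 concentrate the steps near the prior, where the power posterior changes most rapidly as the power increases.<br />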
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39229Phylogenetics: BEAST2 Lab2018-04-25T22:38:49Z<p>Paul Lewis: /* Marginal Likelihood Estimation */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the Random Local Clocks or the Uncorrelated Lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Log in to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# '''strict clock''' (set priors to what tutorial advises)<br />
# '''uncorrelated lognormal relaxed clock''' (set priors to what tutorial advises, except use Gamma with shape=0.001, scale=1000 prior for ucldMean.cclock)<br />
# '''random local clocks''' (set priors to what tutorial advises, and use Poisson with Lambda=0.5 for RRateChanges.cclock prior)<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
==== Steps for setting up your XML file for steppingstone ====<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== XML file before modification ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== XML file after modification ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3' deleteOldLogs='true' rootdir='ss'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
<br />
==== Explanation ====<br />
The steppingstone method (called path sampling in the BEAST documentation) requires samples from a series of MCMC analyses, each exploring a slightly different distribution (and, strangely, none of these equals the posterior distribution!). Each distribution represents a power posterior: a distribution like the posterior except that the likelihood is raised to a power between 0.0 and 1.0. Each power posterior encloses the next one, like matryoshka dolls, allowing accurate estimation of the ratio of the areas of that pair of distributions. Multiplying successive ratios results in cancellation of the numerator of one with the denominator of the next, so that the end result is an estimate of the ratio of the area of the posterior kernel to the area of the prior. Because the area of the prior is 1.0, this overall ratio equals the marginal likelihood. <br />
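The idea can be tried on a toy problem where the exact answer is known. The sketch below is not what BEAST2 does internally: it assumes a made-up model (data normally distributed around an unknown mean mu with variance 1, and a Normal(0,1) prior on mu) in which every power posterior is itself a normal distribution, so samples can be drawn directly instead of by MCMC. The data values, function names and the Beta(alpha,1) quantile spacing of the powers are all assumptions made for illustration.<br />

```python
import math
import random


def log_likelihood(mu, data):
    """Log likelihood of data under Normal(mu, 1)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2 for y in data)


def exact_log_marginal(data):
    """Closed-form log marginal likelihood for the Normal(0, 1) prior on mu."""
    n, s = len(data), sum(data)
    ss = sum(y * y for y in data)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n)
            - 0.5 * (ss - s * s / (1 + n)))


def steppingstone(data, n_steps=32, alpha=0.3, n_samples=2000, seed=1):
    """Sum of log ratios, each estimated from the power posterior below it."""
    rng = random.Random(seed)
    n, s = len(data), sum(data)
    betas = [(k / n_steps) ** (1.0 / alpha) for k in range(n_steps + 1)]
    log_ml = 0.0
    for k in range(1, n_steps + 1):
        b_prev, b_next = betas[k - 1], betas[k]
        # The power posterior with power b_prev is Normal(mean, 1/precision).
        precision = 1.0 + b_prev * n
        mean, sd = b_prev * s / precision, 1.0 / math.sqrt(precision)
        terms = [(b_next - b_prev) * log_likelihood(rng.gauss(mean, sd), data)
                 for _ in range(n_samples)]
        # Average L^(b_next - b_prev) on the log scale (log-sum-exp trick).
        m = max(terms)
        log_ml += m + math.log(sum(math.exp(t - m) for t in terms) / n_samples)
    return log_ml


data = [1.2, 0.8, 1.5, 0.3, 1.1, 0.9, 1.7, 0.6, 1.0, 1.4]  # invented values
estimate = steppingstone(data)
exact = exact_log_marginal(data)
print("steppingstone estimate:", estimate)
print("exact log marginal likelihood:", exact)
```

Note that the loop samples from powers betas[0] through betas[n_steps-1] only, so the posterior itself (power 1.0) is never sampled, which matches the remark above.<br />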
<br />
The '''nrOfSteps''' run attribute specifies the number of these MCMC analyses to perform, and the '''alpha''' run attribute determines the spacing of the distributions. The XML file you created with BEAUTi specifies everything needed for each component MCMC analysis, so our modifications to the XML file mainly involve wrapping the original run in a shell that causes it to be executed nrOfSteps times. The PathSampler module specified in the spec run attribute handles putting the results together to obtain an estimate of the marginal likelihood.<br />
<br />
The only run attributes I haven't mentioned are deleteOldLogs and rootdir. The '''deleteOldLogs''' attribute comes into play if you try to repeat an analysis. The PathSampler module will not start an analysis if old log files are still lying around unless you've told it that it is okay to delete old logs. The '''rootdir''' attribute specifies the directory used to store the results of each separate MCMC analysis. These have to be stored somewhere until all are finished. Here I've had you specify rootdir=ss, which causes an ss folder to appear inside your beastlab folder. Inside that ss folder you should see nrOfSteps folders appear, numbered step0, step1, ..., step<nrOfSteps-1>.<br />
<br />
==== Running your XML file ====<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. In this lab we set only alpha, nrOfSteps, rootdir and deleteOldLogs and let the defaults apply to all other options. Note that the term ''particles'' is synonymous with ''steppingstones'' or ''ratios''.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run a chain for a single step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards overestimates, while going from prior to posterior the marginal likelihood estimate is biased towards underestimates (default true)<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39228Phylogenetics: BEAST2 Lab2018-04-25T22:15:17Z<p>Paul Lewis: /* Run options */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the Random Local Clocks or the Uncorrelated Lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Log in to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# '''strict clock''' (set priors to what tutorial advises)<br />
# '''uncorrelated lognormal relaxed clock''' (set priors to what tutorial advises, except use Gamma with shape=0.001, scale=1000 prior for ucldMean.cclock)<br />
# '''random local clocks''' (set priors to what tutorial advises, and use Poisson with Lambda=0.5 for RRateChanges.cclock prior)<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== Before ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== After ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. We only used the first two in this lab and let the defaults apply to all other options. Note that the term ''particles'' is synonymous with ''steppingstones'' or ''ratios''.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run the chain for in a single step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards overestimates, while going from prior to posterior biases the marginal likelihood (ML) estimate towards underestimates (default true)<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39219Phylogenetics: BEAST2 Lab2018-04-25T17:33:28Z<p>Paul Lewis: /* Important: read before starting the tutorial */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the Random Local Clocks or the Uncorrelated Lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Login to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
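If the copy fails with a "No such file or directory" error, a likely cause is that the module has not been loaded in the current shell, so the BEAST variable is empty. A minimal sketch of a guarded copy follows; the function name copy_example_data is made up for illustration, and the file location is simply the one used in the cp command above.

```shell
# Hypothetical helper (not part of BEAST2): copy the example data file
# only if the BEAST variable has been set by "module load beast/2.5.0".
copy_example_data() {
    if [ -z "${BEAST:-}" ]; then
        echo "BEAST is not set; run 'module load beast/2.5.0' first" >&2
        return 1
    fi
    # Same copy as in the lab text: the source file, then "." for the
    # current directory
    cp "$BEAST/examples/nexus/primate-mtDNA.nex" .
}
```

Typing copy_example_data inside your beastlab directory should then either copy the file or tell you that the module still needs to be loaded.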
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# '''strict clock''' (set priors to what tutorial advises)<br />
# '''uncorrelated lognormal relaxed clock''' (set priors to what tutorial advises, except use Gamma with shape=0.001, scale=1000 prior for ucldMean.cclock)<br />
# '''random local clocks''' (set priors to what tutorial advises, and use Poisson with Lambda=0.5 for RRateChanges.cclock prior)<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== Before ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== After ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
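The four manual edits can also be scripted. The sketch below is one possible way to do it with awk, not something BEAST2 provides; the function name add_path_sampler is hypothetical, and it assumes the opening "<run " tag and the closing "</run>" tag each occur exactly once in the file, as in the Before example.

```shell
# Hypothetical helper: wrap the BEAUti-generated <run> element in a
# PathSampler <run>, following the four manual steps listed in the lab.
# Usage: add_path_sampler input.xml output.xml
add_path_sampler() {
    awk -v q="'" '
    /<run / {
        # Step 3: lines added before the renamed element
        print "<run spec=" q "beast.inference.PathSampler" q " nrOfSteps=" q "32" q " alpha=" q "0.3" q ">"
        print "cd $(dir)"
        print "java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml"
        # Step 1: <run becomes <mcmc (the modified line is printed below)
        sub(/<run /, "<mcmc ")
    }
    /<\/run>/ {
        # Step 2: </run> becomes </mcmc>; step 4: add the new closing </run>
        sub(/<\/run>/, "</mcmc>")
        print
        print "</run>"
        next
    }
    { print }
    ' "$1" > "$2"
}
```

For example, add_path_sampler myfile.xml myfile_ps.xml would write the modified copy to myfile_ps.xml (a made-up name) and leave the original untouched, so you can still compare the two files by eye.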
<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
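Moving the earlier output out of the way can be done by hand with mkdir and mv. As a rough sketch (the function name stash_previous_run and the directory name previous_run are invented here, and the file endings .log, .trees and .state are an assumption about what your xml file tells beast to write):

```shell
# Hypothetical helper: move output from an earlier run into a subdirectory
# so that the next beast run cannot overwrite it.
# Assumption: the output files end in .log, .trees or .state; adjust the
# patterns to match the file names specified in your xml file.
stash_previous_run() {
    dest="${1:-previous_run}"
    mkdir -p "$dest"
    for f in *.log *.trees *.state; do
        # An unmatched pattern stays as literal text, so test before moving
        if [ -e "$f" ]; then
            mv "$f" "$dest/"
        fi
    done
}
```

Run it from inside the beastlab directory before starting the new analysis; the xml files themselves are left where they are.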
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. We only used the first two in this lab and let the defaults apply to all other options.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run the chain for in a single step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards overestimates, while going from prior to posterior biases the marginal likelihood (ML) estimate towards underestimates (default true)<br />
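To get a feel for what alpha does, recall that the quantile function of a Beta(alpha,1) distribution is u^(1/alpha). The sketch below prints one value per step, obtained from evenly spaced values of u between 0 and 1. The evenly spaced grid and the function name beta_steps are assumptions made for illustration; the exact grid used inside BEAST2 may differ.

```shell
# Hypothetical illustration (not BEAST2 code): print n step values spaced
# by the quantiles of a Beta(alpha,1) distribution.
# Usage: beta_steps nrOfSteps alpha
beta_steps() {
    awk -v n="$1" -v alpha="$2" 'BEGIN {
        for (i = 0; i < n; i++) {
            # Evenly spaced u in [0,1] (assumed grid)
            u = (n > 1) ? i / (n - 1) : 1
            # Beta(alpha,1) quantile is u^(1/alpha); uniform if alpha <= 0
            b = (alpha > 0) ? u ^ (1 / alpha) : u
            printf "%.4f\n", b
        }
    }'
}
```

Running beta_steps 32 0.3 shows that a small alpha crowds most of the values close to 0, whereas beta_steps 32 0 gives uniform intervals, matching the description of the alpha option above.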
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39218Phylogenetics: BEAST2 Lab2018-04-25T17:32:56Z<p>Paul Lewis: /* Important: read before starting the tutorial */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the Random Local Clocks or the Uncorrelated Lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Login to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# '''strict clock''' (set priors to what tutorial advises)<br />
# '''uncorrelated lognormal relaxed clock''' (set priors to what tutorial advises, except use Gamma(0.001,1000) prior for ucldMean.cclock)<br />
# '''random local clocks''' (set priors to what tutorial advises, and use Poisson with Lambda=0.5 for RRateChanges.cclock prior)<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== Before ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== After ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. We only used the first two in this lab and let the defaults apply to all other options.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run the chain for in a single step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards overestimates, while going from prior to posterior biases the marginal likelihood (ML) estimate towards underestimates (default true)<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39217Phylogenetics: BEAST2 Lab2018-04-25T17:32:42Z<p>Paul Lewis: /* Important: read before starting the tutorial */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the Random Local Clocks or the Uncorrelated Lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Login to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# '''strict clock''' (set priors to what tutorial advises)<br />
# '''uncorrelated lognormal relaxed clock''' (set priors to what tutorial advises, except use Gamma(0.001,1000) prior for ucldMean.cclock)<br />
# '''random local clocks''' (set priors to what tutorial advises, and use Poisson with Lambda=0.5 for RRateChanges.cclock prior)<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== Before ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== After ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. We only used the first two in this lab and let the defaults apply to all other options.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run the chain for in a single step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards overestimates, while going from prior to posterior biases the marginal likelihood (ML) estimate towards underestimates (default true)<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39216Phylogenetics: BEAST2 Lab2018-04-25T17:30:10Z<p>Paul Lewis: /* Important: read before starting the tutorial */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the Random Local Clocks or the Uncorrelated Lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Login to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# '''strict clock'''<br />
#* set priors to what the tutorial advises<br />
# '''uncorrelated lognormal relaxed clock'''<br />
#* set priors to what the tutorial advises, except use a Gamma(0.001,1000) prior for ucldMean.cclock<br />
# '''random local clocks'''<br />
#* set priors to what the tutorial advises, and use a Poisson prior with Lambda=0.5 for RRateChanges.cclock<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== Before ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== After ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
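If you would rather not make these edits by hand, the four search-and-replace steps can be scripted. The Python sketch below only illustrates those steps and is not part of BEAST2: the function name wrap_in_path_sampler is made up for this example, and it assumes the xml file contains exactly one run element, as in the Before example above.

```python
# Text inserted in front of the renamed <mcmc ...> element (copied from the steps above).
PATH_SAMPLER_OPEN = (
    "<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'>\n"
    "cd $(dir)\n"
    "java -cp $(java.class.path) beast.app.beastapp.BeastMain "
    "$(resume/overwrite) -java -seed $(seed) beast.xml\n"
)


def wrap_in_path_sampler(xml_text):
    """Return xml_text with its single <run> element renamed to <mcmc>
    and wrapped in a PathSampler <run> element."""
    # Assumption: a BEAUTi-generated file has exactly one <run ...> ... </run> pair.
    if xml_text.count("<run") != 1 or xml_text.count("</run>") != 1:
        raise ValueError("expected exactly one <run ...> element")
    # Steps 1 and 2: rename the existing element.
    text = xml_text.replace("<run", "<mcmc", 1).replace("</run>", "</mcmc>", 1)
    # Step 3: add the PathSampler opening tag and launch script before <mcmc.
    text = text.replace("<mcmc", PATH_SAMPLER_OPEN + "<mcmc", 1)
    # Step 4: close the new outer element after </mcmc>.
    text = text.replace("</mcmc>", "</mcmc>\n</run>", 1)
    return text
```

To apply it to a file, read the file's text, pass it through the function, and write the result under a new file name so that the original xml file is preserved.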
<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
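Once each clock model has a log marginal likelihood, the models can be compared by taking differences: the model with the largest value fits best, and the difference between two log marginal likelihoods is the log Bayes factor. A minimal Python sketch follows; the function name and the three numbers are placeholders invented for illustration, not results from this data set.

```python
def rank_models(log_marginal_likelihoods):
    """Return the best-fitting model name and, for every model, the log Bayes
    factor of the best model against it (0.0 for the best model itself)."""
    best = max(log_marginal_likelihoods, key=log_marginal_likelihoods.get)
    best_value = log_marginal_likelihoods[best]
    log_bayes_factors = {
        model: best_value - value
        for model, value in log_marginal_likelihoods.items()
    }
    return best, log_bayes_factors


# Hypothetical values only; substitute the numbers reported by your own runs.
example = {
    "strict clock": -5500.0,
    "uncorrelated lognormal": -5480.0,
    "random local clocks": -5490.0,
}
best_model, log_bf = rank_models(example)
print(best_model, log_bf)
```

Working on the log scale avoids exponentiating very large negative numbers, which would underflow to zero.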
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. We only used the first two in this lab and let the defaults apply to all other options.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run the chain for at each step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to take steps from posterior to prior or the other way around. Going from posterior to prior biases the marginal likelihood (ML) estimate towards overestimates, while going from prior to posterior biases it towards underestimates (default true)<br />
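To see what the alpha option does to the spacing of steps, here is a Python sketch that places step boundaries at evenly spaced quantiles of a Beta(alpha,1) distribution, whose quantile function is p^(1/alpha). This is the spacing commonly described for path sampling and stepping-stone analyses; treating it as exactly what BEAST2 computes internally is an assumption, and the function name is made up for this example.

```python
def beta_quantile_steps(n_steps, alpha=0.3):
    """Return n_steps + 1 power-posterior exponents between 0 (prior) and
    1 (posterior), spaced as quantiles of Beta(alpha, 1)."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    fractions = [i / n_steps for i in range(n_steps + 1)]
    if alpha <= 0:
        # Mirrors the documented behaviour: uniform intervals when alpha <= 0.
        return fractions
    # The Beta(alpha, 1) CDF is x**alpha, so its quantile function is p**(1/alpha).
    return [p ** (1.0 / alpha) for p in fractions]


for exponent in beta_quantile_steps(8, alpha=0.3):
    print(exponent)
```

With alpha below 1, most of the boundaries crowd near 0 (the prior end), which is where the power posterior changes most rapidly.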
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39215Phylogenetics: BEAST2 Lab2018-04-25T17:10:23Z<p>Paul Lewis: /* Important: read before starting the tutorial */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the Random Local Clocks or the Uncorrelated Lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Log in to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# strict clock (this is what the tutorial advises)<br />
# uncorrelated lognormal relaxed clock (use a Gamma(0.001,1000) prior for ucldMean.cclock; leave the ucldStdev.cclock prior set to its default, which is Gamma(.5396, .3819))<br />
# random local clocks<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== Before ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== After ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. We only used the first two in this lab and let the defaults apply to all other options.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run the chain for at each step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to take steps from posterior to prior or the other way around. Going from posterior to prior biases the marginal likelihood (ML) estimate towards overestimates, while going from prior to posterior biases it towards underestimates (default true)<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39214Phylogenetics: BEAST2 Lab2018-04-25T17:01:36Z<p>Paul Lewis: /* Download BEAST2 */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the Random Local Clocks or the Uncorrelated Lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Log in to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
(Note that you will need to download the primate-mtDNA.nex file you copied into your beastlab folder to your laptop in order to import it into BEAUTi.)<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# strict clock (this is what the tutorial advises)<br />
# uncorrelated lognormal relaxed clock<br />
# random local clocks<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== Before ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== After ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. We only used the first two in this lab and let the defaults apply to all other options.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run the chain for at each step (default 100000L)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to take steps from posterior to prior or the other way around. Going from posterior to prior biases the marginal likelihood (ML) estimate towards overestimates, while going from prior to posterior biases it towards underestimates (default true)<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BEAST2_Lab&diff=39213Phylogenetics: BEAST2 Lab2018-04-25T16:49:55Z<p>Paul Lewis: Created page with "{| border="0" |- |rowspan="2" valign="top"|150px |<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_S..."</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to use BEAST2 to estimate divergence times under a relaxed clock model and to assess whether the Random Local Clocks or the Uncorrelated Lognormal model fits the data best using estimates of the log marginal likelihood.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Log in to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load java/1.8.0<br />
module load beast/2.5.0<br />
This makes a more recent version of BEAST2 available to you and loads Java 8 (actually version 1.8), which is required by BEAST 2.5.0.<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir beastlab<br />
cd beastlab<br />
<br />
=== Download and save the data file ===<br />
<br />
The tutorial uses a data file named ''primate-mtDNA.nex'' that is located on the cluster. Here's how to copy it to your beastlab directory:<br />
cp $BEAST/examples/nexus/primate-mtDNA.nex .<br />
The $BEAST part represents a bash variable (bash is the scripting language you are using to communicate with the operating system). The module load beast/2.5.0 command above created the BEAST variable, which points to the directory on the cluster's hard drive where beast2 was installed. The cp command stands for copy, so the command above will copy the file primate-mtDNA.nex to the current directory, which should be your newly-created beastlab directory (the period at the end of the cp command is shorthand for the current directory). <br />
<br />
=== Download BEAST2 ===<br />
BEAST2 comprises two main programs: BEAST and BEAUTi. BEAUTi is a graphical application that helps you create the xml file that BEAST runs. Because BEAUTi is graphical, it is difficult to run on the cluster, so please download the latest version of BEAST2 to your own laptop and run BEAUTi from there. Here's the web site:<br />
<br />
http://www.beast2.org<br />
<br />
=== Divergence Dating Tutorial ===<br />
<br />
The tutorial at the web site address shown below walks you through the process of dating a primate mtDNA tree but does not show you how to assess model fit using marginal likelihood estimation. After you finish the tutorial, I will show you how to estimate the marginal likelihood of the model using BEAST2. <br />
<br />
==== Important: read before starting the tutorial ====<br />
In section 2.3 (Setting the clock model), we will try using three different clock models. Each person will be assigned one of the following:<br />
<br />
# strict clock (this is what the tutorial advises)<br />
# uncorrelated lognormal relaxed clock<br />
# random local clocks<br />
<br />
It is important that you '''pay attention''' and '''do not simply plow through section 2.3 at full tilt''' because then everyone will end up doing only the strict clock analysis!<br />
<br />
==== Here is the tutorial ====<br />
http://beast2-dev.github.io/beast-docs/beast2/DivergenceDating/DivergenceDatingTutorial.html<br />
<br />
Note that you will run BEAUTi on your own laptop to generate the xml file, then move the xml file (e.g. myfile.xml) to the cluster and run beast as follows:<br />
<br />
beast myfile.xml<br />
<br />
=== Marginal Likelihood Estimation ===<br />
<br />
BEAST2 can estimate the marginal likelihood of a model but unfortunately this currently (as of April 2018, BEAST 2.5.0) involves modifying the xml file directly (there is no menu option in BEAUTi that will do this for you). Open your xml file using your favorite text editor (e.g. BBEdit/TextWrangler on Mac, Notepad++ on Windows) and make the following modifications:<br />
<br />
# Search for "<run" and change it to "<mcmc"<br />
# Search for "</run>" and change it to "</mcmc>"<br />
# Before "<mcmc", add the following lines<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
# After "</mcmc>" add the following line:<br />
</run><br />
<br />
The sections below show what the file looks like before and after these modifications. <br />
<br />
==== Before ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</run><br />
</beast><br />
<br />
==== After ====<br />
<br />
<?xml version="1.0" encoding="UTF-8" standalone="no"?><br />
<beast ...><br />
.<br />
.<br />
.<br />
<run spec='beast.inference.PathSampler' nrOfSteps='32' alpha='0.3'><br />
cd $(dir)<br />
java -cp $(java.class.path) beast.app.beastapp.BeastMain $(resume/overwrite) -java -seed $(seed) beast.xml <br />
<mcmc id="mcmc" spec="MCMC" chainLength="6000000"><br />
.<br />
.<br />
.<br />
</mcmc><br />
</run><br />
</beast><br />
<br />
Upload this new xml file to your beastlab directory (you may wish to move any output files from the previous run first so they will not be overwritten) on the cluster and run it using beast as before. When beast is finished, it will spit out the log marginal likelihood that it estimated.<br />
<br />
=== Further Information (no need to read this for today's lab) ===<br />
Read this section only if you are trying to run this lab entirely on your own computer (not the cluster) or if you are looking for more information about marginal likelihood estimation in BEAST2.<br />
<br />
==== Marginal likelihood estimation on your own laptop ====<br />
You used the cluster to perform the BEAST run that estimated the marginal likelihood. Were you to try this on your own computer, you would discover that you need to install the BEASTLabs and MODEL_SELECTION packages before performing the analysis. This is done using BEAUTi's ''File > Manage Packages...'' menu item. The packages are normally installed in the ''.beast'' directory inside your home directory on whichever computer you use to run BEAUTi. Rather than have everyone do this separately (and because some of you cannot use graphical applications such as BEAUTi if they are run on the cluster), I moved these packages to a place where BEAST can find them on the cluster (/usr/local/share/beast/2.5).<br />
<br />
==== Run options ====<br />
For reference, here are all the options that you can include in the <run ...> xml tag. We only used the first two in this lab and let the defaults apply to all other options.<br />
* '''alpha''': alpha parameter of Beta(alpha,1) distribution used to space out steps, default 0.3. If alpha <= 0, uniform intervals are used.<br />
* '''nrOfSteps''': the number of steps to use, default 8<br />
* '''rootdir''': root directory for storing particle states and log files (default /tmp)<br />
* '''mcmc''': MCMC analysis used to specify model and operations in each of the particles<br />
* '''chainLength''': number of samples to run the chain for at each step (default 100000)<br />
* '''burnInPercentage''': burn-In Percentage used for analysing log files (default 50)<br />
* '''preBurnin''': number of samples that are discarded for the first step, but not the others (default 100000)<br />
* '''value''': script for launching a job: <br />
** $(dir) is replaced by the directory associated with the particle<br />
** $(java.class.path) is replaced by a java class path used to launch this application<br />
** $(java.library.path) is replaced by a java library path used to launch this application<br />
** $(seed) is replaced by a random number seed that differs with every launch<br />
** $(host) is replaced by a host from the list of hosts<br />
* '''hosts''': comma separated list of hosts. If there are k hosts in the list, for particle i the term $(host) in the script will be replaced by the (i modulo k) host in the list. Note that whitespace is removed<br />
* '''doNotRun''': Set up all files but do not run analysis if true. This can be useful for setting up an analysis on a cluster (default false) <br />
* '''deleteOldLogs''': delete existing log files from root dir (default false)<br />
* '''posterior2prior''': whether to do steps from posterior to prior or the other way around. Going from posterior to prior is biased towards overestimates of the marginal likelihood, while going from prior to posterior is biased towards underestimates (default true)<br />
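<br />
To see what the '''alpha''' option does to the spacing of the steps, here is a small base R sketch. It assumes that the step boundaries are evenly spaced quantiles of a Beta(alpha,1) distribution; that is the usual way such a spacing is constructed, but the exact indexing convention used inside BEAST2 is an assumption here, not something taken from its source code. The variable names are made up for this example.<br />
<br />
```r
# Assumption: step boundaries are evenly spaced quantiles of Beta(alpha, 1).
n_steps <- 32
alpha   <- 0.3
p       <- seq(0, 1, length.out = n_steps + 1)

beta_quantiles <- qbeta(p, shape1 = alpha, shape2 = 1)
beta_closed    <- p^(1 / alpha)   # closed form: the Beta(alpha, 1) CDF is x^alpha

# The two ways of computing the boundaries agree.
all.equal(beta_quantiles, beta_closed)

# Count how many of the boundaries fall below 0.1.
sum(beta_quantiles < 0.1)
```
<br />
With alpha = 0.3, about half of the boundaries fall below 0.1, so most of the steps are spent near the prior end of the path, where the power posterior tends to change fastest.<br />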
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Ggtree&diff=39137Ggtree2018-04-13T17:11:24Z<p>Paul Lewis: /* Start R and Load Packages */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan<br />
<br />
== Goals ==<br />
<br />
To introduce you to the R package [http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract ggtree] for plotting phylogenetic trees.<br />
<br />
== Introduction ==<br />
<br />
== Getting Started ==<br />
<br />
This tutorial is written with the cluster user in mind, but feel free to perform it with your own local version of <tt>R</tt> (>=3.4). There are instructions at the end of this tutorial on how to get your local version of <tt>R</tt> set up for this exercise.<br />
<br />
====Get Situated on the Cluster====<br />
<br />
Log on to the cluster as usual, but add the <tt>-Y</tt> flag, which enables trusted X11 forwarding so that graphics generated on the cluster can be displayed on your computer.<br />
<br />
ssh username@bbcsrv3.biotech.uconn.edu -Y<br />
<br />
Be sure to get off the head node to avoid litigation and subsequent incarceration:<br />
<br />
qlogin<br />
<br />
Navigate to the folder you want to be working in for the R portion of the lab and download the tree file we'll be working with:<br />
<br />
curl -OL http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/moths.txt<br />
<br />
For more information on the curl command and the options you can use with it, consult [https://en.wikipedia.org/wiki/CURL Wikipedia].<br />
<br />
====Start R and Load Packages====<br />
<br />
See what versions of R are available:<br />
module avail<br />
<br />
Load R version 3.4.4:<br />
module load R/3.4.4<br />
<br />
Start R:<br />
R<br />
<br />
You'll need to load the following packages:<br />
<br />
BiocInstaller<br />
Biostrings<br />
ape<br />
ggplot2<br />
ggtree<br />
phytools<br />
ggrepel<br />
stringr<br />
stringi<br />
abind<br />
treeio<br />
<br />
You can load packages like so:<br />
<br />
library("BiocInstaller")<br />
<br />
To make it easier, you can load the <tt>easypackages</tt> library,<br />
<br />
library(easypackages)<br />
<br />
and then load all the libraries at once with this command:<br />
<br />
libraries("BiocInstaller","Biostrings","ape","ggplot2","ggtree","phytools","ggrepel","stringr","stringi","abind","treeio")<br />
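<br />
If one of the packages is missing, <tt>library</tt> stops with an error at the first failure. A minimal sketch in base R that first reports which packages are not installed, and attaches them all only if none is missing, might look like this (the variable names are made up for this example):<br />
<br />
```r
pkgs <- c("BiocInstaller", "Biostrings", "ape", "ggplot2", "ggtree", "phytools",
          "ggrepel", "stringr", "stringi", "abind", "treeio")

# requireNamespace() returns FALSE instead of stopping when a package is absent.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach everything; character.only = TRUE lets library() take a string.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```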
<br />
====Read in the Tree File====<br />
<br />
We're dealing with a tree in the Newick file format, which the function <tt>read.newick</tt> from the package <tt>treeio</tt> can handle:<br />
<br />
tree <- read.newick("moths.txt")<br />
<br />
R can handle more than just Newick-formatted tree files. To see which output formats from various phylogenetic programs R can read, check out [https://bioconductor.org/packages/release/bioc/html/treeio.html <tt>treeio</tt>]. The functionality within <tt>treeio</tt> used to be part of the <tt>ggtree</tt> package itself, but the authors recently split <tt>ggtree</tt> in two, with one part (<tt>ggtree</tt>) handling mostly plotting and the other part (<tt>treeio</tt>) handling mostly file input/output operations.<br />
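<br />
To get a feel for what a Newick tree looks like once it is inside R, you can parse a tiny made-up tree from a text string. Note that this sketch uses <tt>read.tree</tt> from <tt>ape</tt> rather than <tt>read.newick</tt> from <tt>treeio</tt>, and that the four-taxon tree and its support values are invented for illustration:<br />
<br />
```r
library(ape)

# Toy Newick string; the numbers after each closing parenthesis are support values.
toy <- read.tree(text = "((A:1,B:1)90:1,(C:1,D:1)80:1);")

toy$tip.label    # tip names
toy$node.label   # internal node labels (support values), stored as text
Ntip(toy)        # number of tips
Nnode(toy)       # number of internal nodes
```
<br />
The support values come back as character strings, which is why they need to be converted to numbers before they can be compared with a threshold.<br />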
<br />
Let's quickly plot the tree to see what it looks like using the <tt>plot</tt> function from the <tt>ape</tt> package:<br />
<br />
plot(tree)<br />
<br />
Notice the tree has all of its tips labeled. It's also a little cramped. You can expand the plot window to try to get the tree to display more legibly. We'll eventually use the function <tt>ggsave</tt> to control the dimensions of the plot when we finally export it to a PDF file. Don't worry about getting it to display well at the moment.<br />
<br />
Now plot the tree using the <tt>ggtree</tt> package:<br />
<br />
ggtree(tree)<br />
<br />
What happened to our tree!? The <tt>plot</tt> function from the <tt>ape</tt> package plotted the tree with tip labels, but <tt>ggtree</tt> plotted just the bare bones of the tree. <tt>ggtree</tt> by default plots almost nothing, assuming you will add what you want to your tree plot. The grammar/logic of <tt>ggtree</tt> is meant to model that of <tt>ggplot2</tt> and not the <tt>R</tt> language in general. The syntax of <tt>ggtree/ggplot2</tt> makes them easily extendable and particularly useful for graphics, but is by no means intuitive to someone used to <tt>R</tt> and plotting trees using <tt>ape</tt>.<br />
<br />
===Adding/Altering Tree Elements with Geoms and Geom-Like Functions===<br />
<br />
<tt>ggtree</tt> has a variety of functions available to you that allow you to add different elements to a tree. Many of them have the prefix <tt>geom_</tt> and are collectively referred to as <tt>geoms</tt>. We'll only go over some of them. You start with a bare bones tree and add elements to the tree, function by function, until you get the tree looking like you want it to. You'll see as we progress through this tutorial that visualizing trees in <tt>ggtree</tt> is a truly ''additive'' process.<br />
<br />
=====Tip Labels=====<br />
<br />
OK, this tree would be more useful with tip labels. Let's add them using <tt>geom_tiplab</tt>:<br />
<br />
ggtree(tree) + geom_tiplab()<br />
<br />
This tree is a little crowded. You can expand the graphics window vertically to get it all to fit, but it might be better to do a circular tree:<br />
<br />
ggtree(tree, layout="circular")<br />
<br />
OK that's a bit easier to work with, although the tip labels are missing because we did not add a tip label geom to this command. The tip labels in the earlier plot were nice but a little big. <tt>geom_tiplab</tt> has a bunch of arguments that you can play around with, including one for the text size. You can read more about the available arguments for a given function in [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf the <tt>ggtree</tt> manual]. Plot the circular tree again, this time with smaller tip labels:<br />
<br />
ggtree(tree, layout="circular") + geom_tiplab2(size=3.5)<br />
<br />
Notice we are using <tt>geom_tiplab2</tt> and not <tt>geom_tiplab</tt> to show labels on the circular tree. Don't ask me why there are two different tip label geoms for different tree layouts :)<br />
<br />
The tree is still a little crowded, but at this point just play around with the size of the graphics window so you can work with it. We'll finalize how the tree looks later on using the <tt>ggsave</tt> function.<br />
<br />
=====Clade Colors=====<br />
<br />
In order to color and label clades, we need to tell <tt>ggtree</tt> which node subtends each clade of interest. Just like with the plot function in ape, you can plot a tree with node numbers, see which nodes subtend the clades of interest, and then pass those node numbers to <tt>ggtree</tt>. Another way to get your node of interest is to use the <tt>findMRCA</tt> function (find '''m'''ost '''r'''ecent '''c'''ommon '''a'''ncestor) from the <tt>phytools</tt> package. For each clade of interest, we will pass the function two tip labels whose most recent common ancestor defines the clade. In their study, Keegan et al (in review) found that Amphipyrinae (as currently classified taxonomically) is polyphyletic -- astoundingly polyphyletic. Let's color two clades: one for what they found to be true Amphipyrinae, and one for a tribe (Stiriini) that is currently classified taxonomically in Amphipyrinae but that they show to be far removed phylogenetically, and which thus has no business being classified within Amphipyrinae.<br />
<br />
amphipyrinae_clade <- findMRCA(tree, c("*Redingtonia_alba_KLKDNA0031","MM01162_Amphipyra_perflua"))<br />
stiriini_clade <- findMRCA(tree, c("*Chrysoecia_scira_KLKDNA0002","*Annaphila_diva_KLKDNA0180"))<br />
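<br />
The idea of defining a clade by the most recent common ancestor of two tips is easier to see on a tiny tree. The sketch below uses <tt>getMRCA</tt> from <tt>ape</tt> as a stand-in for <tt>findMRCA</tt> from <tt>phytools</tt>, and the four-taxon tree is invented for illustration:<br />
<br />
```r
library(ape)

toy <- read.tree(text = "((A,B),(C,D));")

# Tips are numbered 1..4 and internal nodes 5..7, so MRCAs are node numbers > 4.
ab_node   <- getMRCA(toy, c("A", "B"))   # node subtending the (A,B) clade
root_node <- getMRCA(toy, c("A", "C"))   # A and C only meet at the root

ab_node
root_node
extract.clade(toy, ab_node)$tip.label    # tips descending from that node
```
<br />
Any two tips that span the clade will do, as long as their most recent common ancestor is the node at the base of the clade you want.<br />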
<br />
You can't (as far as I know) tell <tt>ggtree</tt> directly, as in ape, that the lineages descending from a given node should all be a certain color. What we need to do is define a group that consists of the clades we want colored, and then tell ggtree to color the tree according to that group.<br />
<br />
tree <- groupClade(tree, node=c(amphipyrinae_clade, stiriini_clade), group_name = "group")<br />
<br />
In the above line of code, we apply the <tt>groupClade</tt> function to the object <tt>tree</tt>. Although we assign the result back to <tt>tree</tt>, we are not reducing the tree to only the Amphipyrinae and Stiriini clades; the whole tree is kept and the grouping information is simply attached to it. If you were to execute <tt>ggtree(tree, layout="circular") + geom_tiplab2(size=3.5)</tt> now, the plot would still look the same. We need to amend the command to tell it to style the tree by the grouping of clades we just made called "group":<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) +<br />
geom_tiplab2(size=3.5)<br />
<br />
As you can see the tree gets colored according to some default color scheme. We can define our own color scheme. Let's call it "palette":<br />
<br />
palette <- c("#000000", "#009E73","#e5bc06")<br />
<br />
The values in palette are colors, each written as a [https://en.wikipedia.org/wiki/Hexadecimal hexadecimal] code. You can Google one of these hexadecimal codes and a little interactive hexadecimal color picker will pop up. Feel free to pick two colors of your choosing to use in the palette -- but leave #000000 as it is. When you're designing a figure for publication, be sure to consider how easily your colors can be distinguished from each other by [http://www.somersault1824.com/tips-for-designing-scientific-figures-for-color-blind-readers/ colorblind] folks.<br />
<br />
Now let's amend the ggtree command and tell it to use the colors we defined:<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette)<br />
<br />
The order in which clades are colored is determined by the order of clades in the <tt>groupClade</tt> command. Every lineage in the tree not within a defined clade (i.e. not within amphipyrinae_clade or stiriini_clade) is automatically colored according to the first palette value. The first defined clade (amphipyrinae_clade) is colored according to the second palette value, and the second defined clade (stiriini_clade) is colored according to the third palette value.<br />
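<br />
If you would rather not rely on position, one option is to give the palette names. When <tt>scale_colour_manual</tt> receives a named vector, it matches colors to groups by name instead of by order. The sketch below assumes that the grouping levels are called "0" (everything outside the defined clades), "1" and "2"; check the levels in your own tree before relying on that. It also decodes each hexadecimal code into its red, green and blue components using base R:<br />
<br />
```r
# Assumption: the grouping levels are "0" (ungrouped), "1" and "2".
palette <- c("0" = "#000000", "1" = "#009E73", "2" = "#e5bc06")

# Decode each hexadecimal code into its red, green and blue components (0-255).
rgb_values <- t(col2rgb(palette))
rgb_values
```
<br />
Each pair of hexadecimal digits encodes one component, so #009E73 has no red, a good deal of green and a moderate amount of blue.<br />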
<br />
=====Clade Labels=====<br />
<br />
Let's add some labels to the two clades. It's relatively straightforward now that we've already defined the subtending nodes:<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae") +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini")<br />
<br />
OK we should move those labels so they're not directly over the tree:<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE)<br />
<br />
You might have noticed that adding labels caused the rest of the tree to squish together. <tt>ggtree</tt> will try to fit everything into whatever size graphics window you have open. Try playing around with expanding and contracting the graphics window to see this functionality in action. Don't worry about getting everything to display perfectly in the graphics window, because we will use the function <tt>ggsave</tt> to create a PDF -- with definable dimensions -- to control how big the plot is, and thus how the tree looks with its many elements. You may wish to go back and change some of the tree elements after seeing your figure in PDF form.<br />
<br />
=====Node Labels=====<br />
<br />
Let's add some node labels. You can add labels that show the number of the node, but what you would probably like to do is show nodal support values (e.g. bootstraps) which are stored as node labels. We can display the node labels using <tt>geom_label</tt>. <br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE) +<br />
geom_label(aes(label=label))<br />
<br />
You should see A LOT of node labels appear. They get redrawn when you change the size of the graphics window which is quite mesmerizing to watch. Let's subset the node labels in order to just show the ones we want and reduce some of the clutter. We'll first create a dataframe from the data within <tt>tree</tt>:<br />
<br />
q <- ggtree(tree)<br />
d <- q$data<br />
<br />
First let's select only internal nodes (we don't need to show the leaf node labels, as we've already done that with <tt>geom_tiplab2</tt>):<br />
<br />
d <- d[!d$isTip,]<br />
<br />
Now let's get rid of the root node:<br />
<br />
d <- d[which(d$node != d$parent),]  # the root is the only node listed as its own parent<br />
<br />
And finally keep only the nodes whose label (support value) is greater than 75:<br />
<br />
subset_labels <- d[as.double(d$label) > 75,]<br />
<br />
Note that the object <tt>tree</tt> still has all of its labels. All we did was plot <tt>tree</tt> into an object called <tt>q</tt>, copy the data stored in <tt>q</tt> into a data frame called <tt>d</tt>, and then whittle <tt>d</tt> down to <tt>subset_labels</tt>. Before, when we plotted the tree with node labels, we didn't specify which ones to label -- so <tt>ggtree</tt> labeled all of them. Right now the only argument to <tt>geom_label</tt> that we are using is <tt>aes</tt>. Now alter your <tt>geom_label</tt> call so that it uses the <tt>data</tt> argument to display only the subset of node labels you just created. If you get stuck, note that <tt>geom_label</tt> comes from <tt>ggplot2</tt>, so its arguments are described in the <tt>ggplot2</tt> documentation.<br />
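<br />
One thing to watch for when thresholding support values: node labels are stored as text, and some of them may be empty or missing, in which case the conversion to a number produces <tt>NA</tt> and the subset picks up rows full of <tt>NA</tt>. A slightly more defensive version of the filtering step, shown here on a made-up data frame rather than on the real tree data, is:<br />
<br />
```r
# Toy stand-in for the internal-node rows of the plot data (values invented).
d <- data.frame(node  = 5:9,
                label = c("", "100", "62", "88", NA),
                isTip = FALSE,
                stringsAsFactors = FALSE)

# Convert labels to numbers; empty or missing labels become NA.
support <- suppressWarnings(as.numeric(d$label))

# Keep a row only if it has a numeric support value above the threshold.
keep <- !is.na(support) & support > 75
subset_labels <- d[keep, ]
subset_labels
```
<br />
Only the rows with support values of 100 and 88 survive; the empty label, the missing label and the value of 62 are all dropped.<br />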
<br />
=====Scale Bar and Title=====<br />
<br />
Try adding a scale bar using the scale bar geom. I've added in some of the available arguments:<br />
<br />
geom_treescale(x=2,y=1,fontsize=5,linesize=1,offset=0.5)<br />
<br />
Add a title using <tt>ggtitle</tt>. Use it just like you would a <tt>geom</tt>:<br />
<br />
ggtitle("This is a Title")<br />
<br />
====Export Plot to PDF====<br />
<br />
<tt>ggsave</tt> cannot plot <tt>phylo</tt> objects (like <tt>tree</tt>) directly like <tt>ape</tt> can. You must first apply your <tt>ggtree</tt> function to your phylo object, and assign the result to a new variable. Let's call that variable <tt>tree_save</tt>:<br />
<br />
tree_save <- ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE)<br />
<br />
Now you can export <tt>tree_save</tt> to a PDF:<br />
<br />
ggsave(filename="moth_tree.pdf", plot=tree_save, width=30, height=30)<br />
<br />
If the layout of your tree just isn't quite what you wanted, go back and play around with the geoms and geom-like functions until the PDF is to your liking.<br />
<br />
====Cite ggtree====<br />
<br />
Remember to cite <tt>ggtree</tt> if you use it in a published work!<br />
<br />
citation("ggtree")<br />
<br />
==Running ggtree on your Computer==<br />
<br />
You will need to install the following packages:<br />
<br />
BiocInstaller<br />
Biostrings<br />
ape<br />
ggplot2<br />
ggtree<br />
phytools<br />
ggrepel<br />
stringr<br />
stringi<br />
abind<br />
treeio<br />
<br />
<br />
The package <tt>BiocInstaller</tt> is special. You can think of it as a ''meta''-package, as it is used to handle the [https://www.bioconductor.org/install/#why-biocLite installation and interoperability] of a suite of closely related open-source bioinformatics packages.<br />
<br />
Install BiocInstaller like so:<br />
<br />
source("https://bioconductor.org/biocLite.R")<br />
biocLite()<br />
<br />
You can, and probably should, install Bioconductor packages using BiocInstaller, and not through the regular <tt>install.packages("package_name")</tt> method. To install packages via Bioconductor:<br />
<br />
source("https://bioconductor.org/biocLite.R")<br />
biocLite("ape")<br />
<br />
Alternatively:<br />
<br />
install.packages("ape")<br />
<br />
Or install multiple packages like so:<br />
<br />
install.packages(c("ape", "Biostrings"))<br />
<br />
<br />
Now load all of the above packages like so:<br />
<br />
library("ape")<br />
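<br />
A note for anyone following these instructions with a recent version of R: starting with Bioconductor 3.8 (released in the fall of 2018), the <tt>biocLite</tt> script was replaced by the <tt>BiocManager</tt> package, which is installed from CRAN. A minimal sketch of the equivalent installation is given below. The helper function <tt>install_all</tt> is made up for this example, and <tt>BiocInstaller</tt> is left out of the lists because it is not needed with <tt>BiocManager</tt>:<br />
<br />
```r
# Packages from the list above, split by where they are hosted.
cran_pkgs <- c("ape", "ggplot2", "phytools", "ggrepel", "stringr", "stringi", "abind")
bioc_pkgs <- c("Biostrings", "ggtree", "treeio")

# Hypothetical helper: installs the CRAN packages, then the Bioconductor ones.
install_all <- function() {
  install.packages(cran_pkgs)
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(bioc_pkgs)
}

# install_all()   # uncomment to run; needs an internet connection
```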
<br />
== Getting Help ==<br />
<br />
The [https://groups.google.com/forum/#!forum/bioc-ggtree Google Group] for ggtree is fairly active. The lead author of <tt>ggtree</tt> chimes in regularly to answer people's questions -- just be sure you've read the documentation first!<br />
<br />
Speaking of documentation, there is the [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf <tt>ggtree</tt> manual] and lots of [http://www.bioconductor.org/packages/3.7/bioc/vignettes/ggtree/inst/doc/ggtree.html vignettes] showing how to do particular things in <tt>ggtree</tt>.<br />
<br />
== References ==<br />
<br />
Yu G, Smith D, Zhu H, Guan Y and Lam TT (2017). “ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.” Methods in Ecology and Evolution, 8, pp. 28-36. doi: 10.1111/2041-210X.12628, http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Ggtree&diff=39136Ggtree2018-04-13T16:50:45Z<p>Paul Lewis: /* Tip Labels */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan<br />
<br />
== Goals ==<br />
<br />
To introduce you to the R package [http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract ggtree] for plotting phylogenetic trees.<br />
<br />
== Introduction ==<br />
<br />
== Getting Started ==<br />
<br />
This tutorial is written with the cluster user in mind, but feel free to perform it with your own local version of <tt>R</tt> (>=3.4). There are instructions at the end of this tutorial on how to get your local version of <tt>R</tt> set up for this exercise.<br />
<br />
====Get Situated on the Cluster====<br />
<br />
Log on to the cluster as usual, but add the <tt>-Y</tt> flag, which enables trusted X11 forwarding so that graphics generated on the cluster can be displayed on your computer.<br />
<br />
ssh username@bbcsrv3.biotech.uconn.edu -Y<br />
<br />
Be sure to get off the head node to avoid litigation and subsequent incarceration:<br />
<br />
qlogin<br />
<br />
Navigate to the folder you want to be working in for the R portion of the lab and download the tree file we'll be working with:<br />
<br />
curl -OL http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/moths.txt<br />
<br />
For more information on the curl command and the options you can use with it, consult [https://en.wikipedia.org/wiki/CURL Wikipedia].<br />
<br />
====Start R and Load Packages====<br />
<br />
See what versions of R are available:<br />
module avail<br />
<br />
Load R version 3.4.4:<br />
module load R/3.4.4<br />
<br />
Start R:<br />
R<br />
<br />
You'll need to load the following packages:<br />
<br />
BiocInstaller<br />
Biostrings<br />
ape<br />
ggplot2<br />
ggtree<br />
phytools<br />
ggrepel<br />
stringr<br />
stringi<br />
abind<br />
treeio<br />
<br />
You can load packages like so:<br />
<br />
library("BiocInstaller")<br />
<br />
If the package <tt>easypackages</tt> were installed and loaded, you could load packages like so:<br />
<br />
libraries("BiocInstaller","Biostrings","ape","ggplot2","ggtree","phytools","ggrepel","stringr","stringi","abind","treeio")<br />
<br />
====Read in the Tree File====<br />
<br />
We're dealing with a tree in the Newick file format, which the function <tt>read.newick</tt> from the package <tt>treeio</tt> can handle:<br />
<br />
tree <- read.newick("moths.txt")<br />
<br />
R can handle more than just Newick-formatted tree files. To see which output formats from various phylogenetic programs R can read, check out [https://bioconductor.org/packages/release/bioc/html/treeio.html <tt>treeio</tt>]. The functionality within <tt>treeio</tt> used to be part of the <tt>ggtree</tt> package itself, but the authors recently split <tt>ggtree</tt> in two, with one part (<tt>ggtree</tt>) handling mostly plotting and the other part (<tt>treeio</tt>) handling mostly file input/output operations.<br />
<br />
Let's quickly plot the tree to see what it looks like using the <tt>plot</tt> function from the <tt>ape</tt> package:<br />
<br />
plot(tree)<br />
<br />
Notice the tree has all of its tips labeled. It's also a little cramped. You can expand the plot window to try to get the tree to display more legibly. We'll eventually use the function <tt>ggsave</tt> to control the dimensions of the plot when we finally export it to a PDF file. Don't worry about getting it to display well at the moment.<br />
<br />
Now plot the tree using the <tt>ggtree</tt> package:<br />
<br />
ggtree(tree)<br />
<br />
What happened to our tree!? The <tt>plot</tt> function from the <tt>ape</tt> package plotted the tree with tip labels, but <tt>ggtree</tt> plotted just the bare bones of the tree. <tt>ggtree</tt> by default plots almost nothing, assuming you will add what you want to your tree plot. The grammar/logic of <tt>ggtree</tt> is meant to model that of <tt>ggplot2</tt> and not the <tt>R</tt> language in general. The syntax of <tt>ggtree/ggplot2</tt> makes them easily extendable and particularly useful for graphics, but is by no means intuitive to someone used to <tt>R</tt> and plotting trees using <tt>ape</tt>.<br />
<br />
===Adding/Altering Tree Elements with Geoms and Geom-Like Functions===<br />
<br />
<tt>ggtree</tt> has a variety of functions available to you that allow you to add different elements to a tree. Many of them have the prefix <tt>geom_</tt> and are collectively referred to as <tt>geoms</tt>. We'll only go over some of them. You start with a bare bones tree and add elements to the tree, function by function, until you get the tree looking like you want it to. You'll see as we progress through this tutorial that visualizing trees in <tt>ggtree</tt> is a truly ''additive'' process.<br />
<br />
=====Tip Labels=====<br />
<br />
OK, this tree would be more useful with tip labels. Let's add them using <tt>geom_tiplab</tt>:<br />
<br />
ggtree(tree) + geom_tiplab()<br />
<br />
This tree is a little crowded. You can expand the graphics window vertically to get it all to fit, but it might be better to do a circular tree:<br />
<br />
ggtree(tree, layout="circular")<br />
<br />
OK that's a bit easier to work with, although the tip labels are missing because we did not add a tip label geom to this command. The tip labels in the earlier plot were nice but a little big. <tt>geom_tiplab</tt> has a bunch of arguments that you can play around with, including one for the text size. You can read more about the available arguments for a given function in [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf the <tt>ggtree</tt> manual]. Plot the circular tree again, this time with smaller tip labels:<br />
<br />
ggtree(tree, layout="circular") + geom_tiplab2(size=3.5)<br />
<br />
Notice we are using <tt>geom_tiplab2</tt> and not <tt>geom_tiplab</tt> to show labels on the circular tree. Don't ask me why there are two different tip label geoms for different tree layouts :)<br />
<br />
The tree is still a little crowded, but at this point just play around with the size of the graphics window so you can work with it. We'll finalize how the tree looks later on using the <tt>ggsave</tt> function.<br />
<br />
=====Clade Colors=====<br />
<br />
In order to color and label clades, we need to tell <tt>ggtree</tt> which node subtends each clade of interest. Just like with the plot function in ape, you can plot a tree with node numbers, see which nodes subtend the clades of interest, and then pass those node numbers to <tt>ggtree</tt>. Another way to get your node of interest is to use the <tt>findMRCA</tt> function (find '''m'''ost '''r'''ecent '''c'''ommon '''a'''ncestor) from the <tt>phytools</tt> package. For each clade of interest, we will pass the function two tip labels whose most recent common ancestor defines the clade. In their study, Keegan et al (in review) found that Amphipyrinae (as currently classified taxonomically) is polyphyletic -- astoundingly polyphyletic. Let's color two clades: one for what they found to be true Amphipyrinae, and one for a tribe (Stiriini) that is currently classified taxonomically in Amphipyrinae but that they show to be far removed phylogenetically, and which thus has no business being classified within Amphipyrinae.<br />
<br />
amphipyrinae_clade <- findMRCA(tree, c("*Redingtonia_alba_KLKDNA0031","MM01162_Amphipyra_perflua"))<br />
stiriini_clade <- findMRCA(tree, c("*Chrysoecia_scira_KLKDNA0002","*Annaphila_diva_KLKDNA0180"))<br />
<br />
You can't (as far as I know) tell <tt>ggtree</tt> directly, as in ape, that the lineages descending from a given node should all be a certain color. What we need to do is define a group that consists of the clades we want colored, and then tell ggtree to color the tree according to that group.<br />
<br />
tree <- groupClade(tree, node=c(amphipyrinae_clade, stiriini_clade), group_name = "group")<br />
<br />
In the above line of code, we apply the <tt>groupClade</tt> function to the object <tt>tree</tt>. Although we assign the result back to <tt>tree</tt>, we are not reducing the tree to only the Amphipyrinae and Stiriini clades; the whole tree is kept and the grouping information is simply attached to it. If you were to execute <tt>ggtree(tree, layout="circular") + geom_tiplab2(size=3.5)</tt> now, the plot would still look the same. We need to amend the command to tell it to style the tree by the grouping of clades we just made called "group":<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) +<br />
geom_tiplab2(size=3.5)<br />
<br />
As you can see the tree gets colored according to some default color scheme. We can define our own color scheme. Let's call it "palette":<br />
<br />
palette <- c("#000000", "#009E73","#e5bc06")<br />
<br />
The values in palette are colors, each written as a [https://en.wikipedia.org/wiki/Hexadecimal hexadecimal] code. You can Google one of these hexadecimal codes and a little interactive hexadecimal color picker will pop up. Feel free to pick two colors of your choosing to use in the palette -- but leave #000000 as it is. When you're designing a figure for publication, be sure to consider how easily your colors can be distinguished from each other by [http://www.somersault1824.com/tips-for-designing-scientific-figures-for-color-blind-readers/ colorblind] folks.<br />
<br />
Now let's amend the ggtree command and tell it to use the colors we defined:<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette)<br />
<br />
The order in which clades are colored is determined by the order of clades in the <tt>groupClade</tt> command. Every lineage in the tree not within a defined clade (i.e. not within amphipyrinae_clade or stiriini_clade) is automatically colored according to the first palette value. The first defined clade (amphipyrinae_clade) is colored according to the second palette value, and the second defined clade (stiriini_clade) is colored according to the third palette value.<br />
<br />
=====Clade Labels=====<br />
<br />
Let's add some labels to the two clades. It's relatively straightforward now that we've already defined the subtending nodes:<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae") +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini")<br />
<br />
OK we should move those labels so they're not directly over the tree:<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE)<br />
<br />
You might have noticed that adding labels caused the rest of the tree to squish together. <tt>ggtree</tt> will try to fit everything into whatever size graphics window you have open. Try playing around with expanding and contracting the graphics window to see this functionality in action. Don't worry about getting everything to display perfectly in the graphics window, because we will use the function <tt>ggsave</tt> to create a PDF -- with definable dimensions -- to control how big the plot is, and thus how the tree looks with its many elements. You may wish to go back and change some of the tree elements after seeing your figure in PDF form.<br />
<br />
=====Node Labels=====<br />
<br />
Let's add some node labels. You can add labels that show the number of the node, but what you would probably like to do is show nodal support values (e.g. bootstraps) which are stored as node labels. We can display the node labels using <tt>geom_label</tt>. <br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE) +<br />
geom_label(aes(label=label))<br />
<br />
You should see A LOT of node labels appear. They get redrawn when you change the size of the graphics window which is quite mesmerizing to watch. Let's subset the node labels in order to just show the ones we want and reduce some of the clutter. We'll first create a dataframe from the data within <tt>tree</tt>:<br />
<br />
q <- ggtree(tree)<br />
d <- q$data<br />
<br />
First let's select only internal nodes (we don't need to show the leaf node labels, as we've already done that with <tt>geom_tiplab2</tt>):<br />
<br />
d <- d[!d$isTip,]<br />
<br />
Now let's get rid of the root node:<br />
<br />
 d <- d[d$node != d$parent,]  # the root is the only node listed as its own parent<br />
<br />
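And finally keep only the node labels greater than 75 (<tt>which</tt> drops nodes whose label is empty or not a number, which would otherwise turn into rows of <tt>NA</tt>):<br />
<br />
 subset_labels <- d[which(as.double(d$label) > 75),]<br />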
<br />
Note that the object <tt>tree</tt> still has all of its labels. All we did was build a plot object from <tt>tree</tt> called <tt>q</tt>, pull its data into <tt>d</tt>, and then filter <tt>d</tt> down to <tt>subset_labels</tt>. Before, when we plotted the tree with node labels, we didn't specify which ones to label -- so <tt>ggtree</tt> labeled all of them. Now alter your <tt>geom_label</tt> call, using the <tt>data</tt> argument available to <tt>geom_label</tt>, so that it displays only the subset of node labels you just created. Right now the only argument to <tt>geom_label</tt> that we are using is the <tt>aes</tt> argument. Look in the <tt>ggtree</tt> manual to see how to specify the data passed to <tt>geom_label</tt>.<br />
<br />
=====Scale Bar and Title=====<br />
<br />
Try adding a scale bar using the scale bar geom. I've added in some of the available arguments:<br />
<br />
geom_treescale(x=2,y=1,fontsize=5,linesize=1,offset=0.5)<br />
<br />
Add a title using <tt>ggtitle</tt>. Use it just like you would a <tt>geom</tt>:<br />
<br />
ggtitle("This is a Title")<br />
<br />
====Export Plot to PDF====<br />
<br />
<tt>ggsave</tt> cannot save a <tt>phylo</tt> object (like <tt>tree</tt>) directly; it needs a plot object. You must first apply the <tt>ggtree</tt> function to your phylo object, and assign the result to a new variable. Let's call that variable <tt>tree_save</tt>:<br />
<br />
tree_save <- ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE)<br />
<br />
Now you can export <tt>tree_save</tt> to a PDF (<tt>width</tt> and <tt>height</tt> are in inches by default):<br />
<br />
 ggsave("moth_tree.pdf", plot=tree_save, width=30, height=30)<br />
<br />
If the layout of your tree just isn't quite what you wanted, go back and play around with the geoms and geom-like functions until the PDF is to your liking.<br />
<br />
====Cite ggtree====<br />
<br />
Remember to cite <tt>ggtree</tt> if you use it in a published work!<br />
<br />
citation("ggtree")<br />
<br />
==Running ggtree on your Computer==<br />
<br />
You will need to install the following packages:<br />
<br />
BiocInstaller<br />
Biostrings<br />
ape<br />
ggplot2<br />
ggtree<br />
phytools<br />
ggrepel<br />
stringr<br />
stringi<br />
abind<br />
treeio<br />
<br />
<br />
The package <tt>BiocInstaller</tt> is special. You can think of it as a ''meta''-package, as it is used to handle the [https://www.bioconductor.org/install/#why-biocLite installation and interoperability] of a suite of closely related open-source bioinformatics packages.<br />
<br />
Install BiocInstaller like so:<br />
<br />
source("https://bioconductor.org/biocLite.R")<br />
biocLite()<br />
<br />
You can, and probably should, install BioConductor packages using BiocInstaller, and not through the regular <tt>install.packages("package_name")</tt> method. To install packages via BioConductor:<br />
<br />
source("https://bioconductor.org/biocLite.R")<br />
biocLite("ape")<br />
<br />
Alternatively:<br />
<br />
install.packages("ape")<br />
<br />
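Or install multiple packages at once. Use <tt>biocLite</tt> when the list includes Bioconductor packages such as <tt>Biostrings</tt>, which <tt>install.packages</tt> will not find on CRAN:<br />
<br />
 biocLite(c("ape", "Biostrings"))<br />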
<br />
<br />
Now load all of the above packages like so:<br />
<br />
library("ape")<br />
<br />
== Getting Help ==<br />
<br />
The [https://groups.google.com/forum/#!forum/bioc-ggtree Google Group] for ggtree is fairly active. The lead author of <tt>ggtree</tt> chimes in regularly to answer people's questions -- just be sure you've read the documentation first!<br />
<br />
Speaking of documentation there is the [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf <tt>ggtree</tt> manual], and lots of [http://www.bioconductor.org/packages/3.7/bioc/vignettes/ggtree/inst/doc/ggtree.html vignettes] concerning how to do particular things in <tt>ggtree</tt>.<br />
<br />
== References ==<br />
<br />
Yu G, Smith D, Zhu H, Guan Y and Lam TT (2017). “ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.” Methods in Ecology and Evolution, 8, pp. 28-36. doi: 10.1111/2041-210X.12628, http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Ggtree&diff=39135Ggtree2018-04-13T16:50:21Z<p>Paul Lewis: /* Get Situated on the Cluster */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan<br />
<br />
== Goals ==<br />
<br />
To introduce you to the R package [http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract ggtree] for plotting phylogenetic trees.<br />
<br />
== Introduction ==<br />
<br />
== Getting Started ==<br />
<br />
This tutorial is written with the cluster user in mind, but feel free to perform it with your own local version of <tt>R</tt> (>=3.4). There are instructions at the end of this tutorial on how to get your local version of <tt>R</tt> set up for this exercise.<br />
<br />
====Get Situated on the Cluster====<br />
<br />
Log onto the cluster like normal but with an added flag to allow for any graphics to be displayed on your computer.<br />
<br />
ssh username@bbcsrv3.biotech.uconn.edu -Y<br />
<br />
Be sure to get off the head node to avoid litigation and subsequent incarceration:<br />
<br />
qlogin<br />
<br />
Navigate to the folder you want to be working in for the R portion of the lab and download the tree file we'll be working with:<br />
<br />
curl -OL http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/moths.txt<br />
<br />
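For more information on the curl command and the options you can use with it, consult [https://en.wikipedia.org/wiki/CURL Wikipedia]. Here <tt>-O</tt> saves the file under its remote name (<tt>moths.txt</tt>) and <tt>-L</tt> follows any redirects.<br />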
<br />
====Start R and Load Packages====<br />
<br />
See what versions of R are available:<br />
module avail<br />
<br />
Load R version 3.4.4<br />
module load R/3.4.4<br />
<br />
Start R<br />
R<br />
<br />
You'll need to load the following packages:<br />
<br />
BiocInstaller<br />
Biostrings<br />
ape<br />
ggplot2<br />
ggtree<br />
phytools<br />
ggrepel<br />
stringr<br />
stringi<br />
abind<br />
treeio<br />
<br />
You can load packages like so:<br />
<br />
library("BiocInstaller")<br />
<br />
If the package <tt>easypackages</tt> were installed and loaded, you could load packages like so:<br />
<br />
libraries("BiocInstaller","Biostrings","ape","ggplot2","ggtree","phytools","ggrepel","stringr","stringi","abind","treeio")<br />
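<br />
If <tt>easypackages</tt> is not available, base <tt>R</tt> can do much the same thing. The sketch below is one possible approach and not part of the original lab: it loops over the package names with <tt>lapply</tt>, and first reports any packages that are not installed instead of stopping at the first failure. The variable names <tt>pkgs</tt> and <tt>not_installed</tt> are made up for illustration:<br />
<br />
 pkgs <- c("BiocInstaller","Biostrings","ape","ggplot2","ggtree","phytools","ggrepel","stringr","stringi","abind","treeio")<br />
 # find packages that are not installed, without raising an error<br />
 not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]<br />
 if (length(not_installed) > 0) message("Not installed: ", paste(not_installed, collapse = ", "))<br />
 # character.only = TRUE tells library() that it is being handed a string<br />
 invisible(lapply(setdiff(pkgs, not_installed), library, character.only = TRUE))<br />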
<br />
====Read in the Tree File====<br />
<br />
We're dealing with a tree in the Newick file format which the function <tt>read.newick</tt> from the package <tt>treeio</tt> can handle:<br />
<br />
tree <- read.newick("moths.txt")<br />
<br />
R can handle more than just Newick formatted tree files. To see which other file formats, from various phylogenetic software, R can handle, check out [https://bioconductor.org/packages/release/bioc/html/treeio.html <tt>treeio</tt>]. The functionality within <tt>treeio</tt> used to be part of the <tt>ggtree</tt> package itself, but the authors recently split <tt>ggtree</tt> in two, with one part (<tt>ggtree</tt>) handling mostly plotting and the other part (<tt>treeio</tt>) handling mostly file input/output operations.<br />
<br />
Let's quickly plot the tree to see what it looks like using the <tt>plot</tt> function from the <tt>ape</tt> package:<br />
<br />
plot(tree)<br />
<br />
Notice the tree has all of its tips labeled. It's also a little cramped. You can expand the plot window to try to get the tree to display more legibly. We'll eventually use the function <tt>ggsave</tt> to control the dimensions of the plot when we finally export it to a PDF file. Don't worry about getting it to display well at the moment.<br />
<br />
Now plot the tree using the <tt>ggtree</tt> package:<br />
<br />
ggtree(tree)<br />
<br />
What happened to our tree!? The <tt>plot</tt> function from the <tt>ape</tt> package plotted the tree with tip labels, but <tt>ggtree</tt> plotted just the bare bones of the tree. <tt>ggtree</tt> by default plots almost nothing, assuming you will add what you want to your tree plot. The grammar/logic of <tt>ggtree</tt> is meant to model that of <tt>ggplot2</tt> and not the <tt>R</tt> language in general. The syntax of <tt>ggtree/ggplot2</tt> makes them easily extendable and particularly useful for graphics, but is by no means intuitive to someone used to <tt>R</tt> and plotting trees using <tt>ape</tt>.<br />
<br />
===Adding/Altering Tree Elements with Geoms and Geom-Like Functions===<br />
<br />
<tt>ggtree</tt> has a variety of functions available to you that allow you to add different elements to a tree. Many of them have the prefix <tt>geom_</tt> and are collectively referred to as <tt>geoms</tt>. We'll only go over some of them. You start with a bare bones tree and add elements to it, function by function, until you get the tree looking like you want it to. You'll see as we progress through this tutorial that visualizing trees in <tt>ggtree</tt> is a truly ''additive'' process.<br />
<br />
=====Tip Labels=====<br />
<br />
OK this tree would be more useful with tip labels. Let's add them using <tt>geom_tiplab</tt>:<br />
<br />
ggtree(tree) + geom_tiplab()<br />
<br />
This tree is a little crowded. You can expand the graphics window vertically to get it all to fit, but it might be better to do a circular tree:<br />
<br />
ggtree(tree, layout="circular")<br />
<br />
OK that's a bit easier to work with, though the tip labels are gone because we left <tt>geom_tiplab</tt> off the command. They were nice but a little big anyway. <tt>geom_tiplab</tt> has a bunch of arguments that you can play around with, including one for the text size. You can read more about the available arguments for a given function in [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf the <tt>ggtree</tt> manual]. Plot the tree again but with smaller labels:<br />
<br />
ggtree(tree, layout="circular") + geom_tiplab2(size=3.5)<br />
<br />
Notice we are using <tt>geom_tiplab2</tt> and not <tt>geom_tiplab</tt> to show labels on the circular tree. Don't ask me why there are two different tip label geoms for different tree layouts :)<br />
<br />
The tree is still a little crowded, but at this point just play around with the size of the graphics window so you can work with it. We'll finalize how the tree looks later on using the <tt>ggsave</tt> function.<br />
<br />
=====Clade Colors=====<br />
<br />
In order to label clades, we need to tell <tt>ggtree</tt> which nodes subtend each clade we want to label. Just like with the plot function in ape, you can plot a tree with node numbers, see which nodes subtend the clade of interest and then tell <tt>ggtree</tt> the nodes that define the clades you want to label. Another way to get your node of interest is to use the <tt>findMRCA</tt> function (find '''m'''ost '''r'''ecent '''c'''ommon '''a'''ncestor) from the <tt>phytools</tt> package. We will pass the function two tip labels as arguments that define each clade of interest. In their study, Keegan et al (in review) found the Amphipyrinae (as currently classified taxonomically) is polyphyletic -- astoundingly polyphyletic. Let's color two clades: one for what they found to be true Amphipyrinae, and one for a tribe (Stiriini) currently classified taxonomically in Amphipyrinae, that they show to be far removed phylogenetically and thus has no business being classified within Amphipyrinae.<br />
<br />
amphipyrinae_clade <- findMRCA(tree, c("*Redingtonia_alba_KLKDNA0031","MM01162_Amphipyra_perflua"))<br />
stiriini_clade <- findMRCA(tree, c("*Chrysoecia_scira_KLKDNA0002","*Annaphila_diva_KLKDNA0180"))<br />
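<br />
To check that <tt>findMRCA</tt> returned the node you meant, it can help to try it on a tree small enough to read by eye. The following is a minimal sketch on a made-up four-taxon tree (the names <tt>toy</tt> and <tt>ab_node</tt> are hypothetical, and this is not the moth tree); it also plots the node numbers, which is the other approach mentioned above:<br />
<br />
 library(ape)<br />
 library(phytools)<br />
 # a made-up four-taxon tree, not the moth tree<br />
 toy <- read.tree(text = "((A,B),(C,D));")<br />
 ab_node <- findMRCA(toy, c("A","B"))<br />
 # the clade below that node should contain exactly A and B<br />
 extract.clade(toy, ab_node)$tip.label<br />
 # plot the node numbers to confirm by eye<br />
 plot(toy)<br />
 nodelabels()<br />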
<br />
You can't (as far as I know) tell <tt>ggtree</tt> directly, as in ape, that the lineages descending from a given node should all be a certain color. What we need to do is define a group that consists of the clades we want colored, and tell ggtree that it should color the tree according to the group.<br />
<br />
tree <- groupClade(tree, node=c(amphipyrinae_clade, stiriini_clade), group_name = "group")<br />
<br />
In the above line of code, we apply the <tt>groupClade</tt> function to the object <tt>tree</tt> and assign the result back to <tt>tree</tt>. This does not reduce <tt>tree</tt> to only the Amphipyrinae and Stiriini clades; it keeps the whole tree and records which group each lineage belongs to. If you were to execute <tt>ggtree(tree, layout="circular") + geom_tiplab2(size=3.5)</tt> now, the tree would still look the same. We need to amend the command to tell it to style the tree by the grouping of clades we just made called "group":<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) +<br />
geom_tiplab2(size=3.5)<br />
<br />
As you can see the tree gets colored according to some default color scheme. We can define our own color scheme. Let's call it "palette":<br />
<br />
palette <- c("#000000", "#009E73","#e5bc06")<br />
<br />
Each value in palette is a color written as a [https://en.wikipedia.org/wiki/Hexadecimal hexadecimal] code. You can Google one of these codes and a little interactive color picker will pop up. Feel free to pick two colors of your choosing to use in the palette -- but leave #000000 as it is. When you're designing a figure for publication, be sure to consider how easily your colors can be distinguished from each other by [http://www.somersault1824.com/tips-for-designing-scientific-figures-for-color-blind-readers/ colorblind] folks.<br />
<br />
Now let's amend the ggtree command and tell it to use the colors we defined:<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette)<br />
<br />
The order in which clades are colored is determined by the order of the nodes passed to the <tt>groupClade</tt> command. Every lineage in the tree that is not within a defined clade (i.e. outside both amphipyrinae_clade and stiriini_clade) is automatically colored according to the first palette value. The first clade passed to <tt>groupClade</tt> (amphipyrinae_clade) is colored according to the second palette value, and the second clade (stiriini_clade) is colored according to the third palette value.<br />
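<br />
If you would rather not depend on ordering, a named palette ties each color to a group label explicitly. The sketch below uses a made-up eight-taxon tree in place of the moth tree, and it assumes that <tt>groupClade</tt> stores the grouping as an attribute named after <tt>group_name</tt> with labels "0", "1", "2" (which is how the versions I have looked at behave, but check your own output). The names <tt>toy</tt> and <tt>named_palette</tt> are hypothetical:<br />
<br />
 library(ape)<br />
 library(ggplot2)<br />
 library(ggtree)<br />
 # made-up eight-taxon tree standing in for the moth tree<br />
 toy <- read.tree(text = "(((A,B),(C,D)),((E,F),(G,H)));")<br />
 toy <- groupClade(toy, c(getMRCA(toy, c("A","B")), getMRCA(toy, c("E","F"))), group_name = "group")<br />
 # inspect the group labels before choosing colors<br />
 levels(attr(toy, "group"))<br />
 # names on the palette are matched to group labels, so order no longer matters<br />
 named_palette <- c("0" = "#000000", "1" = "#009E73", "2" = "#e5bc06")<br />
 ggtree(toy, aes(color = group)) + scale_colour_manual(values = named_palette)<br />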
<br />
=====Clade Labels=====<br />
<br />
Let's add some labels to the two clades. It's relatively straightforward now that we've already defined the subtending nodes:<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae") +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini")<br />
<br />
OK we should move those labels so they're not directly over the tree:<br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE)<br />
<br />
You might have noticed that adding labels caused the rest of the tree to squish together. <tt>ggtree</tt> will try to fit everything into whatever size graphics window you have open. Try playing around with expanding and contracting the graphics window to see this functionality in action. Don't worry about getting everything to display perfectly in the graphics window, because we will use the function <tt>ggsave</tt> to create a PDF -- with definable dimensions -- to control how big the plot is, and thus how the tree looks with its many elements. You may wish to go back and change some of the tree elements after seeing your figure in PDF form.<br />
<br />
=====Node Labels=====<br />
<br />
Let's add some node labels. You can add labels that show the number of the node, but what you would probably like to do is show nodal support values (e.g. bootstraps) which are stored as node labels. We can display the node labels using <tt>geom_label</tt>. <br />
<br />
ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE) +<br />
geom_label(aes(label=label))<br />
<br />
You should see A LOT of node labels appear. They get redrawn when you change the size of the graphics window which is quite mesmerizing to watch. Let's subset the node labels in order to just show the ones we want and reduce some of the clutter. We'll first create a dataframe from the data within <tt>tree</tt>:<br />
<br />
q <- ggtree(tree)<br />
d <- q$data<br />
<br />
First let's select only internal nodes (we don't need to show the leaf node labels, as we've already done that with <tt>geom_tiplab2</tt>):<br />
<br />
d <- d[!d$isTip,]<br />
<br />
Now let's get rid of the root node:<br />
<br />
 d <- d[d$node != d$parent,]  # the root is the only node listed as its own parent<br />
<br />
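And finally keep only the node labels greater than 75 (<tt>which</tt> drops nodes whose label is empty or not a number, which would otherwise turn into rows of <tt>NA</tt>):<br />
<br />
 subset_labels <- d[which(as.double(d$label) > 75),]<br />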
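<br />
A caution when filtering on <tt>as.double(d$label)</tt>: a label that is empty or not a number becomes <tt>NA</tt>, and an <tt>NA</tt> in a logical subscript produces a row of <tt>NA</tt>s rather than dropping the row. The sketch below uses a made-up data frame (not the moth tree data; the names <tt>toy</tt>, <tt>support</tt>, <tt>naive</tt> and <tt>clean</tt> are hypothetical) to show how <tt>which</tt> avoids this:<br />
<br />
 # made-up node table; one label is empty, as can happen for nodes without a support value<br />
 toy <- data.frame(node = 5:8, label = c("100", "", "60", "90"), stringsAsFactors = FALSE)<br />
 support <- suppressWarnings(as.double(toy$label))<br />
 # an NA in a logical subscript yields a row of NAs<br />
 naive <- toy[support > 75,]<br />
 # which() keeps only the positions where the test is TRUE<br />
 clean <- toy[which(support > 75),]<br />
 nrow(naive)<br />
 nrow(clean)<br />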
<br />
Note that the object <tt>tree</tt> still has all of its labels. All we did was build a plot object from <tt>tree</tt> called <tt>q</tt>, pull its data into <tt>d</tt>, and then filter <tt>d</tt> down to <tt>subset_labels</tt>. Before, when we plotted the tree with node labels, we didn't specify which ones to label -- so <tt>ggtree</tt> labeled all of them. Now alter your <tt>geom_label</tt> call, using the <tt>data</tt> argument available to <tt>geom_label</tt>, so that it displays only the subset of node labels you just created. Right now the only argument to <tt>geom_label</tt> that we are using is the <tt>aes</tt> argument. Look in the <tt>ggtree</tt> manual to see how to specify the data passed to <tt>geom_label</tt>.<br />
<br />
=====Scale Bar and Title=====<br />
<br />
Try adding a scale bar using the scale bar geom. I've added in some of the available arguments:<br />
<br />
geom_treescale(x=2,y=1,fontsize=5,linesize=1,offset=0.5)<br />
<br />
Add a title using <tt>ggtitle</tt>. Use it just like you would a <tt>geom</tt>:<br />
<br />
ggtitle("This is a Title")<br />
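<br />
Both pieces are added with <tt>+</tt> like everything else. As a rough sketch on a made-up tree with branch lengths (a scale bar only means something when the tree has branch lengths; the <tt>x</tt>, <tt>y</tt> and <tt>offset</tt> values here are guesses that suit this toy tree, not the moth tree):<br />
<br />
 library(ape)<br />
 library(ggplot2)<br />
 library(ggtree)<br />
 # made-up four-taxon tree with branch lengths<br />
 toy <- read.tree(text = "((A:1,B:1):1,(C:1.5,D:1.5):0.5);")<br />
 p <- ggtree(toy) + <br />
 geom_tiplab() + <br />
 geom_treescale(x=0, y=0.5, fontsize=5, linesize=1, offset=0.1) + <br />
 ggtitle("This is a Title")<br />
 p<br />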
<br />
====Export Plot to PDF====<br />
<br />
<tt>ggsave</tt> cannot save a <tt>phylo</tt> object (like <tt>tree</tt>) directly; it needs a plot object. You must first apply the <tt>ggtree</tt> function to your phylo object, and assign the result to a new variable. Let's call that variable <tt>tree_save</tt>:<br />
<br />
tree_save <- ggtree(tree, layout="circular",aes(color=group, linetype="solid")) + <br />
geom_tiplab2(size=3.5) + <br />
scale_colour_manual(values = palette) +<br />
geom_cladelabel(node=amphipyrinae_clade, label="Amphipyrinae", fontsize=6, offset=1.9, align=TRUE) +<br />
geom_cladelabel(node=stiriini_clade, label="Stiriini",fontsize=6, offset=2.1, align=TRUE)<br />
<br />
Now you can export <tt>tree_save</tt> to a PDF (<tt>width</tt> and <tt>height</tt> are in inches by default):<br />
<br />
 ggsave("moth_tree.pdf", plot=tree_save, width=30, height=30)<br />
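<br />
The dimensions passed to <tt>ggsave</tt> are read as inches unless you supply <tt>units</tt>, and <tt>ggsave</tt> refuses anything above 50 inches unless you also set <tt>limitsize=FALSE</tt>. A small sketch, using a throwaway scatterplot in place of <tt>tree_save</tt> and a temporary file so nothing in your working folder is overwritten (the names <tt>p</tt> and <tt>out_file</tt> are made up):<br />
<br />
 library(ggplot2)<br />
 # a throwaway plot standing in for tree_save<br />
 p <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x, y)) + geom_point()<br />
 out_file <- file.path(tempdir(), "size_test.pdf")<br />
 # units switches width and height from inches to centimeters<br />
 ggsave(out_file, plot = p, width = 20, height = 20, units = "cm")<br />
 file.exists(out_file)<br />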
<br />
If the layout of your tree just isn't quite what you wanted, go back and play around with the geoms and geom-like functions until the PDF is to your liking.<br />
<br />
====Cite ggtree====<br />
<br />
Remember to cite <tt>ggtree</tt> if you use it in a published work!<br />
<br />
citation("ggtree")<br />
<br />
==Running ggtree on your Computer==<br />
<br />
You will need to install the following packages:<br />
<br />
BiocInstaller<br />
Biostrings<br />
ape<br />
ggplot2<br />
ggtree<br />
phytools<br />
ggrepel<br />
stringr<br />
stringi<br />
abind<br />
treeio<br />
<br />
<br />
The package <tt>BiocInstaller</tt> is special. You can think of it as a ''meta''-package, as it is used to handle the [https://www.bioconductor.org/install/#why-biocLite installation and interoperability] of a suite of closely related open-source bioinformatics packages.<br />
<br />
Install BiocInstaller like so:<br />
<br />
source("https://bioconductor.org/biocLite.R")<br />
biocLite()<br />
<br />
You can, and probably should, install BioConductor packages using BiocInstaller, and not through the regular <tt>install.packages("package_name")</tt> method. To install packages via BioConductor:<br />
<br />
source("https://bioconductor.org/biocLite.R")<br />
biocLite("ape")<br />
<br />
Alternatively:<br />
<br />
install.packages("ape")<br />
<br />
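Or install multiple packages at once. Use <tt>biocLite</tt> when the list includes Bioconductor packages such as <tt>Biostrings</tt>, which <tt>install.packages</tt> will not find on CRAN:<br />
<br />
 biocLite(c("ape", "Biostrings"))<br />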
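<br />
The <tt>biocLite</tt> script belongs to older Bioconductor releases. On newer versions of <tt>R</tt> (3.5 and later, as far as I know), Bioconductor packages are installed with the <tt>BiocManager</tt> package from CRAN instead. A sketch, assuming such a setup and an internet connection:<br />
<br />
 # BiocManager itself comes from CRAN<br />
 install.packages("BiocManager")<br />
 # it then installs Bioconductor packages (and CRAN ones) by name<br />
 BiocManager::install(c("ggtree", "treeio", "Biostrings"))<br />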
<br />
<br />
Now load all of the above packages like so:<br />
<br />
library("ape")<br />
<br />
== Getting Help ==<br />
<br />
The [https://groups.google.com/forum/#!forum/bioc-ggtree Google Group] for ggtree is fairly active. The lead author of <tt>ggtree</tt> chimes in regularly to answer people's questions -- just be sure you've read the documentation first!<br />
<br />
Speaking of documentation there is the [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf <tt>ggtree</tt> manual], and lots of [http://www.bioconductor.org/packages/3.7/bioc/vignettes/ggtree/inst/doc/ggtree.html vignettes] concerning how to do particular things in <tt>ggtree</tt>.<br />
<br />
== References ==<br />
<br />
Yu G, Smith D, Zhu H, Guan Y and Lam TT (2017). “ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.” Methods in Ecology and Evolution, 8, pp. 28-36. doi: 10.1111/2041-210X.12628, http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_APE_Lab&diff=38953Phylogenetics: APE Lab2018-04-03T18:35:05Z<p>Paul Lewis: /* Phylogenetic Generalized Least Squares (PGLS) regression */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|This lab is an introduction to some of the capabilities of APE, a phylogenetic analysis package written for the R language. You may want to review the [[Phylogenetics: R Primer|R Primer]] lab if you've already forgotten everything you learned about R.<br />
|}<br />
<br />
== Installing APE and apTreeshape ==<br />
'''APE''' is a package largely written and maintained by Emmanuel Paradis, who has written a very nice book<ref>Paradis, E. 2006. Analysis of phylogenetics and evolution with R. Springer. ISBN: 0-387-32914-5</ref> explaining in detail how to use APE. APE is designed to be used inside the [http://www.r-project.org/ R] programming language, to which you were introduced earlier in the semester (see [[Phylogenetics: R Primer]]). APE can do an impressive array of analyses. For example, it is possible to estimate trees using neighbor-joining or maximum likelihood, estimate ancestral states (for either discrete or continuous data), perform Sanderson's penalized likelihood relaxed clock method to estimate divergence times, evaluate Felsenstein's independent contrasts, estimate birth/death rates, perform bootstrapping, and even automatically pull sequences from GenBank given a vector of accession numbers! APE also has impressive tree plotting capabilities, of which we will only scratch the surface today (flip through Chapter 4 of the Paradis book to see what more APE can do).<br />
<br />
'''apTreeshape''' is a different R package (written by Nicolas Bortolussi et al.) that we will also make use of today.<br />
<br />
To install APE and apTreeshape, start R and type the following at the R command prompt:<br />
> install.packages("ape")<br />
> install.packages("apTreeshape")<br />
Assuming you are connected to the internet, R should locate these packages and install them for you. After they are installed, you will need to load them into R in order to use them (note that no quotes are used this time):<br />
> library(ape)<br />
> library(apTreeshape)<br />
You should never again need to issue the <tt>install.packages</tt> command for APE and apTreeshape, but you will need to use the <tt>library</tt> command to load them whenever you want to use them.<br />
<br />
== Reading in trees from a file and exploring tree data structure ==<br />
Download [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/yule.tre this tree file] and save it as a file named <tt>yule.tre</tt> in a new folder somewhere on your computer. Tell R where this folder is using the <tt>setwd</tt> (set working directory) command. For example, I created a folder named <tt>apelab</tt> on my desktop, so I typed this to make that folder my working directory:<br />
> setwd("/Users/plewis/Desktop/apelab")<br />
Now you should be able to read in the tree using this ape command (the <tt>t</tt> is an arbitrary name I chose for the variable used to hold the tree; you could use <tt>tree</tt> if you want):<br />
> t <- read.nexus("yule.tre")<br />
We use <tt>read.nexus</tt> because the tree at hand is in NEXUS format, but APE has a variety of functions to read in different tree file types. If APE can't read your tree file, then give the package treeio a spin. APE stores trees as an object of type "phylo". <br />
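For instance, Newick trees are read with <tt>read.tree</tt>, which also accepts a tree description typed directly as text. The following sketch uses a made-up four-taxon tree (the names <tt>nwk</tt>, <tt>nex</tt> and <tt>tmp</tt> are hypothetical), writes it out in NEXUS format to a temporary file, and reads it back in to show that both routes end in the same kind of "phylo" object:<br />
 > library(ape)<br />
 > nwk <- read.tree(text="((A:1,B:1):1,(C:1.5,D:1.5):0.5);")<br />
 > tmp <- file.path(tempdir(), "toy.tre")<br />
 > write.nexus(nwk, file=tmp)<br />
 > nex <- read.nexus(tmp)<br />
 > class(nex)<br />
 > all.equal(nwk, nex)<br />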
<br />
==== Getting a tree summary ====<br />
Some basic information about the tree can be obtained by simply typing the name of the variable you used to store the tree:<br />
> t<br />
<br />
Phylogenetic tree with 20 tips and 19 internal nodes.<br />
<br />
Tip labels:<br />
B, C, D, E, F, G, ...<br />
<br />
Rooted; includes branch lengths.<br />
<br />
==== Obtaining vectors of tip and internal node labels ====<br />
The variable <tt>t</tt> has several attributes that can be queried by following the variable name with a dollar sign and then the name of the attribute. For example, the vector of tip labels can be obtained as follows:<br />
> t$tip.label<br />
[1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U"<br />
The internal node labels, if they exist, can be obtained this way:<br />
> t$node.label<br />
NULL<br />
The result above means that labels for the internal nodes were not stored with this tree.<br />
<br />
==== Obtaining the nodes attached to each edge ==== <br />
The nodes at the ends of all the edges in the tree can be had by asking for the edge attribute:<br />
> t$edge<br />
[,1] [,2]<br />
[1,] 21 22<br />
[2,] 22 23<br />
[3,] 23 1<br />
. . .<br />
. . .<br />
. . .<br />
[38,] 38 12 <br />
<br />
==== Obtaining a vector of edge lengths ==== <br />
The edge lengths can be printed thusly:<br />
> t$edge.length<br />
[1] 0.07193600 0.01755700 0.17661500 0.02632500 0.01009100 0.06893900 0.07126000 0.03970200 0.01912900<br />
[10] 0.01243000 0.01243000 0.03155800 0.05901300 0.08118600 0.08118600 0.00476400 0.14552600 0.07604800<br />
[19] 0.00070400 0.06877400 0.06877400 0.02423800 0.02848800 0.01675100 0.01675100 0.04524000 0.19417200<br />
[28] 0.07015000 0.12596600 0.06999200 0.06797400 0.00201900 0.00201900 0.12462600 0.07128300 0.00004969<br />
[37] 0.00004969 0.07133200<br />
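These attributes fit together in a predictable way: tips are numbered from 1 up to the number of tips, the root is the next number after that (21 in the edge matrix above), and there is one edge length per row of the edge matrix. A quick sketch on a made-up four-taxon tree (<tt>toy</tt> is a hypothetical name; the same calls should work on <tt>t</tt>):<br />
 > library(ape)<br />
 > toy <- read.tree(text="((A:1,B:1):1,(C:1.5,D:1.5):0.5);")<br />
 > Ntip(toy)<br />
 > Nnode(toy)<br />
 > Ntip(toy) + 1<br />
 > nrow(toy$edge) == length(toy$edge.length)<br />
 > is.ultrametric(toy)<br />
Here <tt>Ntip(toy) + 1</tt> is the node number of the root, and <tt>is.ultrametric</tt> asks whether every tip is the same distance from the root.<br />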
<br />
==== About this tree ==== <br />
The tree in the file <tt>yule.tre</tt> was obtained using PAUP from 10,000 nucleotide sites simulated from a Yule tree. The model used to generate the simulated data (HKY model, kappa = 4, base frequencies = 0.3 A, 0.2 C, 0.2 G, and 0.3 T, no rate heterogeneity) was also used in the analysis by PAUP (the final ML tree was made ultrametric by enforcing the clock constraint).<br />
<!-- I analyzed these data in BEAST for part of a lecture. See slide 22 and beyond in [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/lectures/DivTimeBayesianBEAST.pdf this PDF file] for details.--><br />
<br />
== Fun with plotting trees in APE ==<br />
<br />
You can plot the tree using all defaults with this ape command:<br />
> plot(t)<br />
<br />
Let's try changing a few defaults and plot the tree in a variety of ways. All of the following change just one default option, but feel free to combine these to create the plot you want.<br />
<br />
==== Left-facing, up-facing, or down-facing trees ====<br />
> plot(t, direction="l")<br />
> plot(t, direction="u")<br />
> plot(t, direction="d")<br />
The default is to plot the tree right-facing (<tt>direction="r"</tt>).<br />
<br />
==== Hide the taxon names ====<br />
> plot(t, show.tip.label=FALSE)<br />
The default behavior is to show the taxon names.<br />
<br />
==== Make the edges thicker ====<br />
> plot(t, edge.width=4)<br />
An edge width of 1 is the default. If you specify several edge widths, APE will cycle through them, edge by edge, as it draws the tree:<br />
> plot(t, edge.width=c(1,2,3,4))<br />
<br />
==== Color the edges ====<br />
> plot(t, edge.color="red")<br />
Black edges are the default. If you specify several edge colors, APE will cycle through them, edge by edge, as it draws the tree:<br />
> plot(t, edge.color=c("black","red","green","blue"))<br />
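Cycling through colors is rarely what you want in a figure. To color the edges of one particular group, you can build a vector with one color per edge and use APE's <tt>which.edge</tt> function to find the edges belonging to a set of tips. A minimal sketch on a made-up four-taxon tree (<tt>toy</tt> and <tt>ecol</tt> are hypothetical names):<br />
 > library(ape)<br />
 > toy <- read.tree(text="((A:1,B:1):1,(C:1.5,D:1.5):0.5);")<br />
 > ecol <- rep("black", nrow(toy$edge))<br />
 > ecol[which.edge(toy, c("A","B"))] <- "red"<br />
 > plot(toy, edge.color=ecol)<br />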
<br />
==== Make taxon labels smaller or larger ====<br />
> plot(t, cex=0.5)<br />
The cex parameter governs relative scaling of the taxon labels, with 1.0 being the default. Thus, the command above makes the taxon labels half the default size. To double the size, use<br />
> plot(t, cex=2.0)<br />
<br />
==== Plot tree as an unrooted or radial tree ====<br />
> plot(t, type="u")<br />
The default type is "p" (phylogram), but "c" (cladogram), "u" (unrooted), "r" (radial) are other options. Some of these options (e.g. "r") create very funky looking trees, leading me to think there is something about the tree description in the file <tt>yule.tre</tt> that APE is not expecting.<br />
<br />
==== Labeling internal nodes ====<br />
> plot(t)<br />
> nodelabels()<br />
This is primarily useful if you want to annotate one of the nodes:<br />
> plot(t)<br />
> nodelabels("Clade A", 22)<br />
> nodelabels("Clade B", 35)<br />
To put the labels inside a circle rather than a rectangle, use <tt>frame="c"</tt> rather than the default (<tt>frame="r"</tt>). To use a background color of white rather than the default "lightblue", use <tt>bg="white"</tt>:<br />
> plot(t)<br />
> nodelabels("Clade A", 22, frame="c", bg="white")<br />
> nodelabels("Clade B", 35, frame="c", bg="yellow")<br />
<br />
==== Adding a scale bar ====<br />
> plot(t)<br />
> add.scale.bar(length=0.05)<br />
The above commands add a scale bar to the bottom left of the plot. To add a scale going all the way across the bottom of the plot, try this:<br />
> plot(t)<br />
> axisPhylo()<br />
<br />
== Diversification analyses ==<br />
APE can perform some lineage-through-time type analyses. The tree read in from the file <tt>yule.tre</tt> that you already have in memory is perfect for testing APE's diversification analyses because we know (since it is based on simulated data) that this tree was generated under a pure-birth (Yule) model.<br />
<br />
==== Lineage through time plots ====<br />
This is a rather small tree, so a lineage through time (LTT) plot will be rather crude, but let's go through the motions anyway.<br />
> ltt.plot(t)<br />
LTT plots usually have a log scale for the number of lineages (y-axis), and this can be easily accomplished:<br />
> ltt.plot(t, log = "y")<br />
Now add a line extending from the point (t = -0.265, N = 2) to the point (t = 0, N = 20) using the command <tt>segments</tt> (note that <tt>segments</tt> is a core R command, not something added by the APE package):<br />
> segments(-0.265, 2, 0, 20, lty="dotted")<br />
The slope of this line should (ideally) be equal to the birth rate of the Yule process used to generate the tree, which was <math>\lambda=10</math>.<br />
<div style="background-color:#ccccff"><br />
* Calculate the slope of this line. Is it close to the birth rate 10? {{title|8.689|answer}}<br />
</div><br />
If you get something like 68 for the slope, then you forgot to take the natural log of 2 and 20. The plot uses a log scale for the y-axis, so the two endpoints of the dotted line are really (-0.265, log(2)) and (0, log(20)).<br />
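As a quick check of this arithmetic, here is a minimal base R sketch (the variable names are mine, not part of APE) that computes the slope both the wrong way and the right way:<br />

```r
# Endpoints of the dotted line drawn with segments()
t1 <- -0.265; n1 <- 2
t2 <- 0;      n2 <- 20

# Wrong: slope on the raw scale ignores the log-scaled y-axis
slope_raw <- (n2 - n1) / (t2 - t1)

# Right: slope on the natural-log scale
slope_log <- (log(n2) - log(n1)) / (t2 - t1)

round(slope_raw, 3)   # about 67.9
round(slope_log, 3)   # about 8.689
```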
<br />
==== Birth/death analysis ====<br />
Now let's perform a birth/death analysis. APE's <tt>birthdeath</tt> command estimates the birth and death rates using the node ages in a tree:<br />
> birthdeath(t)<br />
Estimation of Speciation and Extinction Rates<br />
with Birth-Death Models <br />
<br />
Phylogenetic tree: t <br />
Number of tips: 20 <br />
Deviance: -120.4538 <br />
Log-likelihood: 60.22689 <br />
Parameter estimates:<br />
d / b = 0 StdErr = 0 <br />
b - d = 8.674513 StdErr = 1.445897 <br />
(b: speciation rate, d: extinction rate)<br />
Profile likelihood 95% confidence intervals:<br />
d / b: [-1.193549, 0.5286254]<br />
b - d: [5.25955, 13.32028]<br />
See the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] to learn more about the <tt>birthdeath</tt> function. <br />
<div style="background-color:#ccccff"><br />
* What is the death rate estimated by APE? {{title|0.0|answer}}<br />
* Is the true diversification rate within one standard deviation of the estimated diversification rate? {{title|yes, one standard deviation each side of (10-0) is the interval from 8.554 to 11.446, and 8.674513 is inside that interval|answer}}<br />
* Are the true diversification and relative extinction values within the profile likelihood 95% confidence intervals? {{title|yes|answer}}<br />
</div><br />
A "profile" likelihood is obtained by varying one parameter in the model and re-estimating all the other parameters conditional on the current value of the focal parameter. This is, technically, not the correct way of getting a confidence interval, but is easier to compute and may be more stable for small samples than getting confidence intervals the correct way.<br />
<div style="background-color:#ccccff"><br />
* What is the correct way to interpret the 95% confidence interval for b - d: [5.25955, 13.32028]? Is it that there is 95% chance that the true value of b - d is in that interval? {{title|no, that is the definition of a Bayesian credible interval|answer}}<br />
* Or, does it mean that our estimate (8.674513) is within the middle 95% of values that would be produced if the true b - d value was in that interval? {{title|yes|answer}}<br />
</div><br />
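For comparison, a cruder interval can be sketched from the estimate and its standard error alone, assuming the estimator is approximately normally distributed (a Wald interval). This is not what <tt>birthdeath</tt> reports, and the variable names are mine:<br />

```r
est <- 8.674513   # b - d estimate reported by birthdeath
se  <- 1.445897   # its reported standard error

z <- qnorm(0.975)                 # about 1.96 for a 95% interval
wald <- c(lower = est - z * se, upper = est + z * se)
round(wald, 3)

# Does the interval cover the true diversification rate (10)?
unname(wald["lower"] < 10 & 10 < wald["upper"])
```

The Wald interval is symmetric about the estimate, whereas the profile likelihood interval reported by <tt>birthdeath</tt> is not.<br />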
<br />
==== Analyses involving tree shape ====<br />
The apTreeshape package (as the name implies) lets you perform analyses of tree shape (which measure how balanced or imbalanced a tree is). apTreeshape stores trees differently than APE, so you can't use a tree object that you created with APE in functions associated with apTreeshape. You can, however, convert a "phylo" object from APE to a "treeshape" object used by apTreeshape:<br />
> ts <- as.treeshape(t)<br />
Here, I'm assuming that <tt>t</tt> still refers to the tree you read in from the file <tt>yule.tre</tt> using the APE command <tt>read.nexus</tt>. We can now obtain a measure of '''tree imbalance''' known as Colless's index:<br />
> c <- colless(ts)<br />
> c<br />
[1] 44<br />
The formula for Colless's index is easy to understand. Each internal node branches into a left and right lineage. The absolute value of the difference between the number of left-hand descendants and right-hand descendants provides a measure of how imbalanced the tree is with respect to that particular node. Adding these imbalance measures up over all internal nodes yields Colless's overall tree imbalance index:<br />
<br />
<math>I_C = \sum_{j=1}^{n-1} |L_j - R_j|</math><br />
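To make the formula concrete, here is a minimal base R sketch that applies it to two four-tip trees. It assumes the tree is given as an edge matrix (parent in column 1, child in column 2) with tips numbered 1 to n and exactly two children per internal node; the function name <tt>colless_by_hand</tt> is hypothetical and is not part of apTreeshape:<br />

```r
# Colless's index from an edge matrix (parent in column 1, child in column 2),
# assuming tips are numbered 1..ntip and every internal node has two children.
colless_by_hand <- function(edge, ntip) {
  children_of <- function(node) edge[edge[, 1] == node, 2]
  # Number of tips descending from a node
  count_tips <- function(node) {
    if (node <= ntip) return(1)
    sum(sapply(children_of(node), count_tips))
  }
  internal <- unique(edge[, 1])
  sum(sapply(internal, function(node) {
    kids <- children_of(node)
    abs(count_tips(kids[1]) - count_tips(kids[2]))
  }))
}

# Ladder-like tree (((1,2),3),4): root = 5, other internal nodes 6 and 7
ladder <- rbind(c(5, 6), c(6, 7), c(7, 1), c(7, 2), c(6, 3), c(5, 4))
# Balanced tree ((1,2),(3,4)): root = 5, other internal nodes 6 and 7
balanced <- rbind(c(5, 6), c(6, 1), c(6, 2), c(5, 7), c(7, 3), c(7, 4))

colless_by_hand(ladder, 4)     # 0 + 1 + 2 = 3
colless_by_hand(balanced, 4)   # 0 + 0 + 0 = 0
```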
<br />
apTreeshape can do an analysis to assess whether the tree has the amount of imbalance one would expect from a Yule tree:<br />
> colless.test(ts, model = "yule", alternative="greater", n.mc = 1000)<br />
This generates 1000 trees from a Yule process and compares the Colless index from our tree (44) to the distribution of such indices obtained from the simulated trees. The p-value is the proportion of the 1000 trees generated from the null distribution that have indices greater than 44 (i.e. the proportion of Yule trees that are more ''im''balanced than our tree). If the p-value was 0.5, for example, then our tree would be right in the middle of the distribution expected for Yule trees. If the p-value was 0.01, however, it would mean that very few Yule trees are as imbalanced as our tree, which would make it hard to believe that our tree is a Yule tree.<br />
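The Monte Carlo logic can be sketched in base R without apTreeshape. This sketch assumes the standard property of the Yule model that, when n tips are split at a node, the number of tips on one side is uniform on 1, ..., n-1. The function name <tt>sim_yule_colless</tt> is hypothetical, and the p-value will differ a little from the one <tt>colless.test</tt> gives because the random draws differ:<br />

```r
# Colless index of one random Yule tree with n tips, built recursively
sim_yule_colless <- function(n) {
  if (n < 2) return(0)
  left <- sample.int(n - 1, 1)      # tips on one side, uniform on 1..n-1
  right <- n - left
  abs(left - right) + sim_yule_colless(left) + sim_yule_colless(right)
}

set.seed(1)                          # arbitrary seed, for repeatability
null_indices <- replicate(1000, sim_yule_colless(20))

observed <- 44                       # Colless index of our tree
p_greater <- mean(null_indices > observed)
p_greater
```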
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that a Yule process generated our tree? {{title|I got 0.288 so no, the Yule process can easily generate trees with the same level of imbalance as our tree|answer}}<br />
</div><br />
You can also test one other model with the <tt>colless.test</tt> function: the "proportional to distinguishable arrangements" (or PDA) model. This null model produces random trees by starting with three taxa joined to a single internal node, then building upon that by adding new taxa to randomly-chosen (discrete uniform distribution) edges that already exist in the (unrooted) tree. The edge to which a new taxon is added can be an internal edge as well as a terminal edge, which causes this process to produce trees with a different distribution of shapes than the Yule process, which only adds new taxa to the tips of a growing rooted tree.<br />
> colless.test(ts, model = "pda", alternative="greater", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a PDA tree? {{title|I got 0.912, so no, PDA trees are almost always more imbalanced (less balanced) than our tree|answer}}<br />
</div><br />
You might also wish to test whether our tree is more ''balanced'' than would be expected under the Yule or PDA models. apTreeshape lets you look at the other end of the distribution too:<br />
> colless.test(ts, model = "yule", alternative="less", n.mc = 1000)<br />
> colless.test(ts, model = "pda", alternative="less", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value for the first test above indicate that our tree is more balanced than would be expected were it a Yule tree? {{title|I got 0.716, so no, most Yule trees are more balanced than our tree|answer}}<br />
* Does the p-value for the second test above indicate that our tree is more balanced than would be expected were it a PDA tree? {{title|I got 0.064, so no, our tree is not significantly more balanced than a PDA tree, but it is getting close|answer}}<br />
</div><br />
<br />
You might want to see a histogram of the Colless index like that used to determine the p-values for the tests above. apTreeshape lets you generate 10 trees each with 20 tips under a Yule model as follows:<br />
> rtreeshape(10,20,model="yule")<br />
That spits out a summary of the 10 trees created, but what we really wanted was to know the Colless index for each of the trees generated. To do this, use the R command <tt>sapply</tt> to call the apTreeshape command <tt>colless</tt> for each tree generated by the <tt>rtreeshape</tt> command:<br />
> sapply(rtreeshape(10,20,model="yule"),FUN=colless)<br />
[1] 38 92 85 91 73 71 94 75 72 93<br />
<div style="background-color:#ccccff"><br />
* Why do you think your Colless indices differ from the ones above?{{title|In a Yule model, the time between speciation events is determined by drawing random numbers from an exponential distribution. If we all started with the same initial (seed) random number, then our Colless indices would be identical| answer}}<br />
</div><br />
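The effect of the seed is easy to see in base R. This is a minimal sketch using exponential waiting times; the seed value and the rate are arbitrary choices of mine:<br />

```r
# Same seed, same "random" waiting times
set.seed(123)                 # 123 is an arbitrary choice
first <- rexp(5, rate = 10)

set.seed(123)
second <- rexp(5, rate = 10)

identical(first, second)      # TRUE

# Without resetting the seed, the next draws differ
third <- rexp(5, rate = 10)
identical(first, third)       # FALSE
```

Calling <tt>set.seed</tt> just before <tt>rtreeshape</tt> should have the same effect, assuming apTreeshape draws its random numbers through R's own generator.<br />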
That's more like it! Now, generate 1000 Yule trees instead of just 10, and create a histogram using the standard R command <tt>hist</tt>:<br />
> yulecolless <- sapply(rtreeshape(1000,20,model="yule"),FUN=colless)<br />
> hist(yulecolless)<br />
Now create a histogram for PDA trees:<br />
> pdacolless <- sapply(rtreeshape(1000,20,model="pda"),FUN=colless)<br />
> hist(pdacolless)<br />
Use the following to compare the mean Colless index for the PDA trees to the Yule trees:<br />
> summary(yulecolless)<br />
> summary(pdacolless)<br />
<div style="background-color:#ccccff"><br />
* Which generates the most balanced trees, on average: Yule or PDA? {{title|Yule trees are more balanced, with mean Colless index 39 versus 81 for PDA|answer}}<br />
</div><br />
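The Yule mean can also be computed exactly rather than by simulation. This base R sketch assumes the standard property of the Yule model that a split of n tips puts a uniform number (1, ..., n-1) of tips on one side; the function name <tt>expected_yule_colless</tt> is hypothetical:<br />

```r
# Expected Colless index for Yule trees with 1..n_max tips, by recursion:
# E[C_n] = average over left = 1..n-1 of |n - 2*left| + E[C_left] + E[C_(n-left)]
expected_yule_colless <- function(n_max) {
  e <- numeric(n_max)               # e[1] = 0: a single tip has no imbalance
  if (n_max >= 2) {
    for (n in 2:n_max) {
      left <- 1:(n - 1)
      e[n] <- mean(abs(n - 2 * left) + e[left] + e[n - left])
    }
  }
  e
}

round(expected_yule_colless(20)[20], 2)   # about 38.58
```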
<br />
apTreeshape provides one more function (<tt>likelihood.test</tt>) that performs a likelihood ratio test of the PDA model against the Yule model null hypothesis. This test says that we cannot reject the null hypothesis of a Yule model in favor of the PDA model:<br />
> likelihood.test(ts)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a Yule tree? {{title|I got 0.4095684, so no, our tree is consistent with a Yule tree|answer}}<br />
</div><br />
<br />
== Independent contrasts ==<br />
APE can compute Felsenstein's independent contrasts, as well as several other methods for assessing phylogenetically-corrected correlations between traits that I did not discuss in lecture (autocorrelation, generalized least squares, mixed models and variance partitioning, and the very interesting Ornstein-Uhlenbeck model, which can be used to assess the correlation between a continuous character and a discrete habitat variable).<br />
<br />
Today, however, we will just play with independent contrasts and phylogenetic generalized least squares (PGLS) regression. Let's try to use APE's <tt>pic</tt> command to reproduce the Independent Contrasts example from lecture:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
| style="width:10%" | Var<br />
| style="width:10%" | X*<br />
| style="width:10%" | Y*<br />
|-<br />
| style="width:10%" | A-B <br />
| -6<br />
| -2<br />
| 4<br />
| -3<br />
| -1<br />
|-<br />
| C-D <br />
| -4<br />
| -2<br />
| 4<br />
| -2<br />
| -1<br />
|-<br />
| E-F<br />
| 10<br />
| -4<br />
| 9<br />
| 3.333<br />
| -1.333<br />
|}<br />
In the table, X and Y denote the raw contrasts, while X* and Y* denote the rescaled contrasts (raw contrasts divided by the square root of the variance). The correlation among the rescaled contrasts was -0.98916.<br />
<br />
==== Enter the tree ====<br />
Start by entering the tree:<br />
> t <- read.tree(text="((A:2,B:2)E:3.5,(C:2,D:2)F:3.5)G;")<br />
The argument name <tt>text</tt> is needed because we are entering the Newick tree description in the form of a string, not supplying a file name. Note that I have labeled the root node G and the two interior nodes E (ancestor of A and B) and F (ancestor of C and D).<br />
<br />
Plot the tree to make sure the tree definition worked:<br />
> plot(t)<br />
<br />
==== Enter the data ====<br />
Now we must tell APE the X and Y values. Do this by supplying vectors of numbers. We will tell APE which tips these numbers are associated with in the next step:<br />
> x <- c(27, 33, 18, 22)<br />
> y <- c(122, 124, 126, 128)<br />
<br />
Here's how we tell APE what taxa the numbers belong to:<br />
> names(x) <- c("A","B","C","D")<br />
> names(y) <- c("A","B","C","D")<br />
If you want to avoid repetition, you can enter the names for both x and y simultaneously like this:<br />
> names(x) <- names(y) <- c("A","B","C","D")<br />
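If you prefer, base R also lets you attach the names at the moment the vectors are created. A minimal sketch of two equivalent ways to do this:<br />

```r
# Name the elements when the vector is created
x <- c(A = 27, B = 33, C = 18, D = 22)

# Or attach names to an existing vector in one step
y <- setNames(c(122, 124, 126, 128), c("A", "B", "C", "D"))

x["C"]                               # look up a value by taxon name
identical(names(x), names(y))        # TRUE
```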
<br />
==== Compute independent contrasts ====<br />
Now compute the contrasts with the APE function <tt>pic</tt>:<br />
> cx <- pic(x,t)<br />
> cy <- pic(y,t)<br />
The variables cx and cy are arbitrary; you could use different names for these if you wanted. Let's see what values cx and cy hold:<br />
> cx<br />
G E F <br />
3.333333 -3.000000 -2.000000 <br />
> cy<br />
G E F <br />
-1.333333 -1.000000 -1.000000 <br />
In each case, the top row holds the node names in the tree and the bottom row holds the rescaled contrasts.<br />
<br />
==== Label interior nodes with the contrasts ====<br />
APE makes it fairly easy to label the tree with the contrasts:<br />
> plot(t)<br />
> nodelabels(round(cx,3), adj=c(0,-1), frame="n")<br />
> nodelabels(round(cy,3), adj=c(0,+1), frame="n")<br />
In the nodelabels command, we supply the numbers with which to label the nodes. The vectors cx and cy contain information about the nodes to label, so APE knows from this which numbers to place at which nodes in the tree. The round command simply rounds the contrasts to 3 decimal places. The <tt>adj</tt> setting adjusts the spacing so that the contrasts for X are not placed directly on top of the contrasts for Y. The command <tt>adj=c(0,-1)</tt> causes the labels to be horizontally displaced 0 lines and vertically displaced one line up (the -1 means go up 1 line) from where they would normally be plotted. The contrasts for Y are displaced vertically one line down from where they would normally appear. Finally, the <tt>frame="n"</tt> just says to not place a box or circle around the labels.<br />
<br />
You should find that the contrasts are the same as those shown as X* and Y* in the table above (as well as the summary slide in the Independent Contrasts lecture). <br />
<br />
Computing the correlation coefficient is as easy as:<br />
> cor(cx, cy)<br />
[1] -0.9891585<br />
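To see where these numbers come from, the contrasts can be worked out by hand in base R. This sketch hard-codes the four-taxon tree used here and is not how <tt>pic</tt> is implemented; the function name <tt>contrasts_by_hand</tt> is hypothetical:<br />

```r
x <- c(A = 27, B = 33, C = 18, D = 22)
y <- c(A = 122, B = 124, C = 126, D = 128)

# Rescaled contrasts for the tree ((A:2,B:2)E:3.5,(C:2,D:2)F:3.5)G
contrasts_by_hand <- function(v) {
  tip_len <- 2        # length of each terminal edge
  int_len <- 3.5      # length of the edges leading to E and F
  ab <- (v[["A"]] - v[["B"]]) / sqrt(tip_len + tip_len)
  cd <- (v[["C"]] - v[["D"]]) / sqrt(tip_len + tip_len)
  # With equal edge lengths, the ancestral estimates are simple averages
  e <- (v[["A"]] + v[["B"]]) / 2
  f <- (v[["C"]] + v[["D"]]) / 2
  # Edges to E and F are lengthened to reflect uncertainty in e and f
  ext_len <- int_len + (tip_len * tip_len) / (tip_len + tip_len)
  ef <- (e - f) / sqrt(ext_len + ext_len)
  c(G = ef, E = ab, F = cd)
}

cx <- contrasts_by_hand(x)
cy <- contrasts_by_hand(y)
round(cx, 3)
round(cy, 3)
cor(cx, cy)
```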
<br />
== Phylogenetic Generalized Least Squares (PGLS) regression ==<br />
<br />
Now let's reproduce the PGLS regression example given in lecture. Here are the data we used:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
|-<br />
| style="width:10%" | A<br />
| 2<br />
| 1<br />
|-<br />
| B<br />
| 3<br />
| 3<br />
|-<br />
| C<br />
| 1<br />
| 2<br />
|-<br />
| D<br />
| 4<br />
| 7<br />
|-<br />
| E<br />
| 5<br />
| 6<br />
|}<br />
<br />
Enter the data as we did for the Independent Contrasts example:<br />
> x <- c(2,3,1,4,5)<br />
> y <- c(1,3,2,7,6)<br />
> names(x) <- names(y) <- c("A","B","C","D","E")<br />
> df <- data.frame(x,y)<br />
<br />
In order to carry out generalized least squares regression, we will need the <tt>gls</tt> command, which is part of the <tt>nlme</tt> R package. Thus, you will need to load this package before you can use the <tt>gls</tt> command:<br />
> library(nlme)<br />
<br />
Let's first do an ordinary linear regression for comparison:<br />
> m0 <- gls(y ~ x)<br />
> summary(m0) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|-0.4|answer}}<br />
* What is the estimate of the slope? {{title|1.4|answer}}<br />
</div><br />
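These two estimates can be checked against the textbook least-squares formulas. A minimal base R sketch, with <tt>lm</tt> as a cross-check:<br />

```r
x <- c(2, 3, 1, 4, 5)
y <- c(1, 3, 2, 7, 6)

# Least-squares slope and intercept from the textbook formulas
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)
c(intercept = intercept, slope = slope)

# Cross-check against base R's lm()
coef(lm(y ~ x))
```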
<br />
Let's plot the regression line on the original data:<br />
> plot(x, y, pch=19, xlim=c(0,6),ylim=c(0,8))<br />
> text(x, y, labels=c("A", "B", "C", "D", "E"), pos=4, offset=1)<br />
> segments(0, -0.4, 6, -0.4 + 1.4*6, lwd=2, lty="solid", col="blue")<br />
You will have noticed that the '''first line''' plots the points using a filled circle (pch=19) and specifying that the x-axis should go from 0 to 6 and the y-axis should extend from 0 to 8. The '''second line''' labels the points with the taxon to make it easier to interpret the plot. Here, pos=4 says to put the labels to the right of each point (pos = 1, 2, 3 means below, left, and above, respectively) and offset=1 specifies how far away from the point each label should be. The '''third line''' draws the regression line using the intercept and slope values provided by gls, making the line width 2 (lwd=2) and solid (lty="solid") and blue (col="blue").<br />
<br />
To do PGLS, we will need to enter the tree with edge lengths:<br />
> t <- read.tree(text="(((A:1,B:1)F:1,C:2)G:1,(D:0.5,E:0.5)H:2.5)I;")<br />
<br />
You are ready to estimate the parameters of the PGLS regression model:<br />
> m1 <- gls(y ~ x, correlation=corBrownian(1,t), data=df)<br />
> summary(m1) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|1.7521186|answer}}<br />
* What is the estimate of the slope? {{title|0.7055085|answer}}<br />
* The <tt>corBrownian</tt> function specified for the correlation in the gls command comes from the APE package. What does <tt>corBrownian</tt> do? You might want to check out the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] {{title|it computes the variance-covariance matrix from the tree t assuming a Brownian motion model|answer}}<br />
* In <tt>corBrownian(1,t)</tt>, t is the tree, but what do you think the 1 signifies? {{title|It is the variance per unit time for the Brownian motion model|answer}}<br />
</div><br />
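The PGLS estimates can be reproduced by hand with a little matrix algebra. This base R sketch builds the Brownian motion covariance matrix directly from the edge lengths of the tree and then applies the usual generalized least squares formula; it is an illustration under those assumptions, not a description of what <tt>gls</tt> does internally:<br />

```r
taxa <- c("A", "B", "C", "D", "E")
x <- c(2, 3, 1, 4, 5)
y <- c(1, 3, 2, 7, 6)

# Brownian covariance = edge length shared on the paths from the root,
# read off the tree (((A:1,B:1)F:1,C:2)G:1,(D:0.5,E:0.5)H:2.5)I
V <- matrix(0, 5, 5, dimnames = list(taxa, taxa))
diag(V) <- 3                          # every tip is 3 units from the root
V["A", "B"] <- V["B", "A"] <- 2       # shared edges leading to G (1) and F (1)
V["A", "C"] <- V["C", "A"] <- 1       # shared edge leading to G (1)
V["B", "C"] <- V["C", "B"] <- 1       # shared edge leading to G (1)
V["D", "E"] <- V["E", "D"] <- 2.5     # shared edge leading to H (2.5)

# GLS estimate: (X' V^-1 X)^-1 X' V^-1 y
X <- cbind(intercept = 1, slope = x)
Vinv <- solve(V)
beta <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
round(drop(beta), 7)
```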
<br />
Assuming you still have the plot window available, let's add the PGLS regression line to the existing plot (if you've closed the plot window you will have to recreate the plot first):<br />
> segments(0, 1.7521186, 6, 1.7521186 + 0.7055085*6, lwd=2, lty="dotted", col="blue")<br />
<br />
== Literature Cited ==<br />
<references/><br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_APE_Lab&diff=38952Phylogenetics: APE Lab2018-04-03T18:34:28Z<p>Paul Lewis: /* Label interior nodes with the contrasts */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|This lab is an introduction to some of the capabilities of APE, a phylogenetic analysis package written for the R language. You may want to review the [[Phylogenetics: R Primer|R Primer]] lab if you've already forgotten everything you learned about R.<br />
|}<br />
<br />
== Installing APE and apTreeshape ==<br />
'''APE''' is a package largely written and maintained by Emmanuel Paradis, who has written a very nice book<ref>Paradis, E. 2006. Analysis of phylogenetics and evolution with R. Springer. ISBN: 0-387-32914-5</ref> explaining in detail how to use APE. APE is designed to be used inside the [http://www.r-project.org/ R] programming language, to which you were introduced earlier in the semester (see [[Phylogenetics: R Primer]]). APE can do an impressive array of analyses. For example, it is possible to estimate trees using neighbor-joining or maximum likelihood, estimate ancestral states (for either discrete or continuous data), perform Sanderson's penalized likelihood relaxed clock method to estimate divergence times, evaluate Felsenstein's independent contrasts, estimate birth/death rates, perform bootstrapping, and even automatically pull sequences from GenBank given a vector of accession numbers! APE also has impressive tree plotting capabilities, of which we will only scratch the surface today (flip through Chapter 4 of the Paradis book to see what more APE can do).<br />
<br />
'''apTreeshape''' is a different R package (written by Nicolas Bortolussi et al.) that we will also make use of today.<br />
<br />
To install APE and apTreeshape, start R and type the following at the R command prompt:<br />
> install.packages("ape")<br />
> install.packages("apTreeshape")<br />
Assuming you are connected to the internet, R should locate these packages and install them for you. After they are installed, you will need to load them into R in order to use them (note that no quotes are used this time):<br />
> library(ape)<br />
> library(apTreeshape)<br />
You should never again need to issue the <tt>install.packages</tt> command for APE and apTreeshape, but you will need to use the <tt>library</tt> command to load them whenever you want to use them.<br />
<br />
== Reading in trees from a file and exploring tree data structure ==<br />
Download [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/yule.tre this tree file] and save it as a file named <tt>yule.tre</tt> in a new folder somewhere on your computer. Tell R where this folder is using the <tt>setwd</tt> (set working directory) command. For example, I created a folder named <tt>apelab</tt> on my desktop, so I typed this to make that folder my working directory:<br />
> setwd("/Users/plewis/Desktop/apelab")<br />
Now you should be able to read in the tree using this ape command (the <tt>t</tt> is an arbitrary name I chose for the variable used to hold the tree; you could use <tt>tree</tt> if you want):<br />
> t <- read.nexus("yule.tre")<br />
We use <tt>read.nexus</tt> because the tree at hand is in NEXUS format, but APE has a variety of functions to read in different tree file types. If APE can't read your tree file, then give the package treeio a spin. APE stores trees as an object of type "phylo". <br />
<br />
==== Getting a tree summary ====<br />
Some basic information about the tree can be obtained by simply typing the name of the variable you used to store the tree:<br />
> t<br />
<br />
Phylogenetic tree with 20 tips and 19 internal nodes.<br />
<br />
Tip labels:<br />
B, C, D, E, F, G, ...<br />
<br />
Rooted; includes branch lengths.<br />
<br />
==== Obtaining vectors of tip and internal node labels ====<br />
The variable <tt>t</tt> has several attributes that can be queried by following the variable name with a dollar sign and then the name of the attribute. For example, the vector of tip labels can be obtained as follows:<br />
> t$tip.label<br />
[1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U"<br />
The internal node labels, if they exist, can be obtained this way:<br />
> t$node.label<br />
NULL<br />
The result above means that labels for the internal nodes were not stored with this tree.<br />
<br />
==== Obtaining the nodes attached to each edge ==== <br />
The nodes at the ends of all the edges in the tree can be had by asking for the edge attribute:<br />
> t$edge<br />
[,1] [,2]<br />
[1,] 21 22<br />
[2,] 22 23<br />
[3,] 23 1<br />
. . .<br />
. . .<br />
. . .<br />
[38,] 38 12 <br />
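Each row of this matrix is one edge, with the ancestral node in the first column and the descendant node in the second. The following base R sketch uses a tiny hand-made edge matrix (not the one from <tt>yule.tre</tt>) and assumes APE's usual numbering convention, in which the tips are numbered 1 to n and the root is n+1:<br />

```r
# Edge matrix for the tree ((A,B),C): tips are 1-3, the root is 4,
# and 5 is the ancestor of A and B (parent in column 1, child in column 2)
edge <- rbind(c(4, 5),
              c(5, 1),
              c(5, 2),
              c(4, 3))

root <- setdiff(edge[, 1], edge[, 2])   # a parent that is never a child
tips <- setdiff(edge[, 2], edge[, 1])   # children that are never parents
root
sort(tips)
```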
<br />
==== Obtaining a vector of edge lengths ==== <br />
The edge lengths can be printed thusly:<br />
> t$edge.length<br />
[1] 0.07193600 0.01755700 0.17661500 0.02632500 0.01009100 0.06893900 0.07126000 0.03970200 0.01912900<br />
[10] 0.01243000 0.01243000 0.03155800 0.05901300 0.08118600 0.08118600 0.00476400 0.14552600 0.07604800<br />
[19] 0.00070400 0.06877400 0.06877400 0.02423800 0.02848800 0.01675100 0.01675100 0.04524000 0.19417200<br />
[28] 0.07015000 0.12596600 0.06999200 0.06797400 0.00201900 0.00201900 0.12462600 0.07128300 0.00004969<br />
[37] 0.00004969 0.07133200<br />
<br />
==== About this tree ==== <br />
This tree in the file <tt>yule.tre</tt> was obtained using PAUP from 10,000 nucleotide sites simulated from a Yule tree. The model used to generate the simulated data (HKY model, kappa = 4, base frequencies = 0.3 A, 0.2 C, 0.2 G, and 0.3 T, no rate heterogeneity) was also used in the analysis by PAUP (the final ML tree was made ultrametric by enforcing the clock constraint).<br />
<!-- I analyzed these data in BEAST for part of a lecture. See slide 22 and beyond in [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/lectures/DivTimeBayesianBEAST.pdf this PDF file] for details.--><br />
<br />
== Fun with plotting trees in APE ==<br />
<br />
You can plot the tree using all defaults with this ape command:<br />
> plot(t)<br />
<br />
Let's try changing a few defaults and plot the tree in a variety of ways. All of the following change just one default option, but feel free to combine these to create the plot you want.<br />
<br />
==== Left-facing, up-facing, or down-facing trees ====<br />
> plot(t, direction="l")<br />
> plot(t, direction="u")<br />
> plot(t, direction="d")<br />
The default is to plot the tree right-facing (<tt>direction="r"</tt>).<br />
<br />
==== Hide the taxon names ====<br />
> plot(t, show.tip.label=FALSE)<br />
The default behavior is to show the taxon names.<br />
<br />
==== Make the edges thicker ====<br />
> plot(t, edge.width=4)<br />
An edge width of 1 is the default. If you specify several edge widths, APE will alternate them as it draws the tree:<br />
> plot(t, edge.width=c(1,2,3,4))<br />
<br />
==== Color the edges ====<br />
> plot(t, edge.color="red")<br />
Black edges are the default. If you specify several edge colors, APE will alternate them as it draws the tree:<br />
> plot(t, edge.color=c("black","red","green","blue"))<br />
<br />
==== Make taxon labels smaller or larger ====<br />
> plot(t, cex=0.5)<br />
The cex parameter governs relative scaling of the taxon labels, with 1.0 being the default. Thus, the command above makes the taxon labels half the default size. To double the size, use<br />
> plot(t, cex=2.0)<br />
<br />
==== Plot tree as an unrooted or radial tree ====<br />
> plot(t, type="u")<br />
The default type is "p" (phylogram), but "c" (cladogram), "u" (unrooted), "r" (radial) are other options. Some of these options (e.g. "r") create very funky looking trees, leading me to think there is something about the tree description in the file <tt>yule.tre</tt> that APE is not expecting.<br />
<br />
==== Labeling internal nodes ====<br />
> plot(t)<br />
> nodelabels()<br />
This is primarily useful if you want to annotate one of the nodes:<br />
> plot(t)<br />
> nodelabels("Clade A", 22)<br />
> nodelabels("Clade B", 35)<br />
To put the labels inside a circle rather than a rectangle, use <tt>frame="c"</tt> rather than the default (<tt>frame="r"</tt>). To use a background color of white rather than the default "lightblue", use <tt>bg="white"</tt>:<br />
> plot(t)<br />
> nodelabels("Clade A", 22, frame="c", bg="white")<br />
> nodelabels("Clade B", 35, frame="c", bg="yellow")<br />
<br />
==== Adding a scale bar ====<br />
> plot(t)<br />
> add.scale.bar(length=0.05)<br />
The above commands add a scale bar to the bottom left of the plot. To add a scale going all the way across the bottom of the plot, try this:<br />
> plot(t)<br />
> axisPhylo()<br />
<br />
== Diversification analyses ==<br />
APE can perform some lineage-through-time type analyses. The tree read in from the file <tt>yule.tre</tt> that you already have in memory is perfect for testing APE's diversification analyses because we know (since it is based on simulated data) that this tree was generated under a pure-birth (Yule) model.<br />
<br />
==== Lineage through time plots ====<br />
This is a rather small tree, so a lineage through time (LTT) plot will be rather crude, but let's go through the motions anyway.<br />
> ltt.plot(t)<br />
LTT plots usually have a log scale for the number of lineages (y-axis), and this can be easily accomplished:<br />
> ltt.plot(t, log = "y")<br />
Now add a line extending from the point (t = -0.265, N = 2) to the point (t = 0, N = 20) using the command <tt>segments</tt> (note that <tt>segments</tt> is a core R command, not something added by the APE package):<br />
> segments(-0.265, 2, 0, 20, lty="dotted")<br />
The slope of this line should (ideally) be equal to the birth rate of the Yule process used to generate the tree, which was <math>\lambda=10</math>.<br />
<div style="background-color:#ccccff"><br />
* Calculate the slope of this line. Is it close to the birth rate 10? {{title|8.689|answer}}<br />
</div><br />
If you get something like 68 for the slope, then you forgot to take the natural log of 2 and 20. The plot uses a log scale for the y-axis, so the two endpoints of the dotted line are really (-0.265, log(2)) and (0, log(20)).<br />
<br />
==== Birth/death analysis ====<br />
Now let's perform a birth/death analysis. APE's <tt>birthdeath</tt> command estimates the birth and death rates using the node ages in a tree:<br />
> birthdeath(t)<br />
Estimation of Speciation and Extinction Rates<br />
with Birth-Death Models <br />
<br />
Phylogenetic tree: t <br />
Number of tips: 20 <br />
Deviance: -120.4538 <br />
Log-likelihood: 60.22689 <br />
Parameter estimates:<br />
d / b = 0 StdErr = 0 <br />
b - d = 8.674513 StdErr = 1.445897 <br />
(b: speciation rate, d: extinction rate)<br />
Profile likelihood 95% confidence intervals:<br />
d / b: [-1.193549, 0.5286254]<br />
b - d: [5.25955, 13.32028]<br />
See the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] to learn more about the <tt>birthdeath</tt> function. <br />
<div style="background-color:#ccccff"><br />
* What is the death rate estimated by APE? {{title|0.0|answer}}<br />
* Is the true diversification rate within one standard deviation of the estimated diversification rate? {{title|yes, one standard deviation each side of (10-0) is the interval from 8.554 to 11.446, and 8.674513 is inside that interval|answer}}<br />
* Are the true diversification and relative extinction values within the profile likelihood 95% confidence intervals? {{title|yes|answer}}<br />
</div><br />
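Because the estimated death rate is zero, the estimate of b - d can be cross-checked with the usual maximum likelihood estimator for a pure-birth tree: the number of tips minus 2, divided by the sum of all edge lengths. This is a sketch of that arithmetic, not of what <tt>birthdeath</tt> does internally; the edge lengths are copied from the <tt>t$edge.length</tt> output above:<br />

```r
edge_lengths <- c(
  0.07193600, 0.01755700, 0.17661500, 0.02632500, 0.01009100, 0.06893900,
  0.07126000, 0.03970200, 0.01912900, 0.01243000, 0.01243000, 0.03155800,
  0.05901300, 0.08118600, 0.08118600, 0.00476400, 0.14552600, 0.07604800,
  0.00070400, 0.06877400, 0.06877400, 0.02423800, 0.02848800, 0.01675100,
  0.01675100, 0.04524000, 0.19417200, 0.07015000, 0.12596600, 0.06999200,
  0.06797400, 0.00201900, 0.00201900, 0.12462600, 0.07128300, 0.00004969,
  0.00004969, 0.07133200
)
n_tips <- 20

lambda_hat <- (n_tips - 2) / sum(edge_lengths)
lambda_hat    # close to the 8.674513 reported by birthdeath
```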
A "profile" likelihood is obtained by varying one parameter in the model and re-estimating all the other parameters conditional on the current value of the focal parameter. This is, technically, not the correct way of getting a confidence interval, but is easier to compute and may be more stable for small samples than getting confidence intervals the correct way.<br />
<div style="background-color:#ccccff"><br />
* What is the correct way to interpret the 95% confidence interval for b - d: [5.25955, 13.32028]? Is it that there is 95% chance that the true value of b - d is in that interval? {{title|no, that is the definition of a Bayesian credible interval|answer}}<br />
* Or, does it mean that our estimate (8.674513) is within the middle 95% of values that would be produced if the true b - d value was in that interval? {{title|yes|answer}}<br />
</div><br />
<br />
==== Analyses involving tree shape ====<br />
The apTreeshape package (as the name implies) lets you perform analyses of tree shape (which measure how balanced or imbalanced a tree is). apTreeshape stores trees differently than APE, so you can't use a tree object that you created with APE in functions associated with apTreeshape. You can, however, convert a "phylo" object from APE to a "treeshape" object used by apTreeshape:<br />
> ts <- as.treeshape(t)<br />
Here, I'm assuming that <tt>t</tt> still refers to the tree you read in from the file <tt>yule.tre</tt> using the APE command <tt>read.nexus</tt>. We can now obtain a measure of '''tree imbalance''' known as Colless's index:<br />
> c <- colless(ts)<br />
> c<br />
[1] 44<br />
The formula for Colless's index is easy to understand. Each internal node branches into a left and right lineage. The absolute value of the difference between the number of left-hand descendants and right-hand descendants provides a measure of how imbalanced the tree is with respect to that particular node. Adding these imbalance measures up over all internal nodes yields Colless's overall tree imbalance index:<br />
<br />
<math>I_C = \sum_{j=1}^{n-1} |L_j - R_j|</math><br />
<br />
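To make the formula concrete, here is a rough sketch that computes the index by hand from an edge matrix laid out like APE's <tt>t$edge</tt> (parent in column 1, child in column 2). The functions <tt>ntips</tt> and <tt>collessByHand</tt> are hypothetical helpers written for this illustration, not part of APE or apTreeshape, and they assume a strictly bifurcating tree:<br />
 > ntips <- function(edge, node) {<br />
 +     kids <- edge[edge[,1] == node, 2]          # children of this node<br />
 +     if (length(kids) == 0) return(1)           # no children, so this is a tip<br />
 +     sum(sapply(kids, function(k) ntips(edge, k)))<br />
 + }<br />
 > collessByHand <- function(edge) {<br />
 +     total <- 0<br />
 +     for (node in unique(edge[,1])) {           # loop over internal nodes<br />
 +         kids <- edge[edge[,1] == node, 2]<br />
 +         total <- total + abs(ntips(edge, kids[1]) - ntips(edge, kids[2]))<br />
 +     }<br />
 +     total<br />
 + }<br />
 > balanced <- matrix(c(5,6, 6,1, 6,2, 5,7, 7,3, 7,4), ncol=2, byrow=TRUE)  # toy tree ((1,2),(3,4))<br />
 > ladder <- matrix(c(5,6, 6,7, 7,1, 7,2, 6,3, 5,4), ncol=2, byrow=TRUE)    # toy tree (((1,2),3),4)<br />
 > collessByHand(balanced)<br />
 [1] 0<br />
 > collessByHand(ladder)<br />
 [1] 3<br />
The perfectly balanced tree scores 0, while the ladder-like tree scores 0 + 1 + 2 = 3, the largest value possible for 4 tips. If you are curious, try <tt>collessByHand(t$edge)</tt> on the APE tree and compare the result with the value reported by <tt>colless</tt>.<br />
<br />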
apTreeshape can do an analysis to assess whether the tree has the amount of imbalance one would expect from a Yule tree:<br />
> colless.test(ts, model = "yule", alternative="greater", n.mc = 1000)<br />
This generates 1000 trees from a Yule process and compares the Colless index from our tree (44) to the distribution of such indices obtained from the simulated trees. The p-value is the proportion of the 1000 trees generated from the null distribution that have indices greater than 44 (i.e. the proportion of Yule trees that are more ''im''balanced than our tree). If the p-value was 0.5, for example, then our tree would be right in the middle of the distribution expected for Yule trees. If the p-value was 0.01, however, it would mean that very few Yule trees are as imbalanced as our tree, which would make it hard to believe that our tree is a Yule tree.<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that a Yule process generated our tree? {{title|I got 0.288 so no, the Yule process can easily generate trees with the same level of imbalance as our tree|answer}}<br />
</div><br />
You can also test one other model with the <tt>colless.test</tt> function: the "proportional to distinguishable arrangements" (or PDA) model. This null model produces random trees by starting with three taxa joined to a single internal node, then building upon that by adding new taxa to randomly-chosen (discrete uniform distribution) edges that already exist in the (unrooted) tree. The edge to which a new taxon is added can be an internal edge as well as a terminal edge, which causes this process to produce trees with a different distribution of shapes than the Yule process, which only adds new taxa to the tips of a growing rooted tree.<br />
> colless.test(ts, model = "pda", alternative="greater", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a PDA tree? {{title|I got 0.912, so no, PDA trees are almost always more imbalanced (less balanced) than our tree|answer}}<br />
</div><br />
You might also wish to test whether our tree is more ''balanced'' than would be expected under the Yule or PDA models. apTreeshape lets you look at the other end of the distribution too:<br />
> colless.test(ts, model = "yule", alternative="less", n.mc = 1000)<br />
> colless.test(ts, model = "pda", alternative="less", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value for the first test above indicate that our tree is more balanced than would be expected were it a Yule tree? {{title|I got 0.716, so no, most Yule trees are more balanced than our tree|answer}}<br />
* Does the p-value for the second test above indicate that our tree is more balanced than would be expected were it a PDA tree? {{title|I got 0.064, so no, our tree is not significantly more balanced than a PDA tree, but it is getting close|answer}}<br />
</div><br />
<br />
You might want to see a histogram of the Colless index like that used to determine the p-values for the tests above. apTreeshape lets you generate 10 trees each with 20 tips under a Yule model as follows:<br />
> rtreeshape(10,20,model="yule")<br />
That spits out a summary of the 10 trees created, but what we really wanted was to know the Colless index for each of the trees generated. To do this, use the R command <tt>sapply</tt> to call the apTreeshape command <tt>colless</tt> for each tree generated by the <tt>rtreeshape</tt> command:<br />
> sapply(rtreeshape(10,20,model="yule"),FUN=colless)<br />
[1] 38 92 85 91 73 71 94 75 72 93<br />
<div style="background-color:#ccccff"><br />
* Why do you think your Colless indices differ from the ones above?{{title|In a Yule model, the time between speciation events is determined by drawing random numbers from an exponential distribution. If we all started with the same initial (seed) random number, then our Colless indices would be identical| answer}}<br />
</div><br />
That's more like it! Now, generate 1000 Yule trees instead of just 10, and create a histogram using the standard R command <tt>hist</tt>:<br />
> yulecolless <- sapply(rtreeshape(1000,20,model="yule"),FUN=colless)<br />
> hist(yulecolless)<br />
Now create a histogram for PDA trees:<br />
> pdacolless <- sapply(rtreeshape(1000,20,model="pda"),FUN=colless)<br />
> hist(pdacolless)<br />
Use the following to compare the mean Colless index for the PDA trees to the Yule trees:<br />
> summary(yulecolless)<br />
> summary(pdacolless)<br />
<div style="background-color:#ccccff"><br />
* Which generates the most balanced trees, on average: Yule or PDA? {{title|Yule trees are more balanced, with mean Colless index 39 versus 81 for PDA|answer}}<br />
</div><br />
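The simulated indices can also be used to approximate the p-values from the <tt>colless.test</tt> runs earlier. This is only a sketch: it assumes that the test works the way I described it above (the proportion of simulated trees with an index greater than, or less than, the observed value of 44), and apTreeshape may treat ties differently. The call to <tt>set.seed</tt> uses an arbitrary seed so that you can repeat the run, assuming apTreeshape draws from R's standard random number generator:<br />
 > library(apTreeshape)<br />
 > set.seed(1)                        # arbitrary seed, chosen only for repeatability<br />
 > yulecolless <- sapply(rtreeshape(1000,20,model="yule"),FUN=colless)<br />
 > pdacolless <- sapply(rtreeshape(1000,20,model="pda"),FUN=colless)<br />
 > mean(yulecolless > 44)             # compare with model="yule", alternative="greater"<br />
 > mean(pdacolless > 44)              # compare with model="pda", alternative="greater"<br />
 > mean(yulecolless < 44)             # compare with model="yule", alternative="less"<br />
 > mean(pdacolless < 44)              # compare with model="pda", alternative="less"<br />
Your proportions will not match the p-values exactly because each run uses a different set of 1000 simulated trees, but they should be in the same neighborhood.<br />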
<br />
apTreeshape provides one more function (<tt>likelihood.test</tt>) that performs a likelihood ratio test of the PDA model against the Yule model null hypothesis. This test says that we cannot reject the null hypothesis of a Yule model in favor of the PDA model:<br />
> likelihood.test(ts)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a Yule tree? {{title|I got 0.4095684, so no, our tree is consistent with a Yule tree|answer}}<br />
</div><br />
<br />
== Independent contrasts ==<br />
APE can compute Felsenstein's independent contrasts, as well as several other methods for assessing phylogenetically-corrected correlations between traits that I did not discuss in lecture (autocorrelation, generalized least squares, mixed models and variance partitioning, and the very interesting Ornstein-Uhlenbeck model, which can be used to assess the correlation between a continuous character and a discrete habitat variable).<br />
<br />
Today, however, we will just play with independent contrasts and phylogenetic generalized least squares (PGLS) regression. Let's try to use APE's <tt>pic</tt> command to reproduce the Independent Contrasts example from lecture:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
| style="width:10%" | Var<br />
| style="width:10%" | X*<br />
| style="width:10%" | Y*<br />
|-<br />
| style="width:10%" | A-B <br />
| -6<br />
| -2<br />
| 4<br />
| -3<br />
| -1<br />
|-<br />
| C-D <br />
| -4<br />
| -2<br />
| 4<br />
| -2<br />
| -1<br />
|-<br />
| E-F<br />
| 10<br />
| -4<br />
| 9<br />
| 3.333<br />
| -1.333<br />
|}<br />
In the table, X and Y denote the raw contrasts, while X* and Y* denote the rescaled contrasts (raw contrasts divided by the square root of the variance). The correlation among the rescaled contrasts was -0.98916.<br />
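As a quick check of the table, here is a sketch that rescales the raw contrasts by hand using nothing but core R (the variable names <tt>raw.x</tt>, <tt>raw.y</tt> and <tt>v</tt> are my own):<br />
 > raw.x <- c(-6, -4, 10)    # raw X contrasts for A-B, C-D, E-F<br />
 > raw.y <- c(-2, -2, -4)    # raw Y contrasts<br />
 > v <- c(4, 4, 9)           # variance of each contrast<br />
 > raw.x/sqrt(v)<br />
 [1] -3.000000 -2.000000  3.333333<br />
 > raw.y/sqrt(v)<br />
 [1] -1.000000 -1.000000 -1.333333<br />
 > cor(raw.x/sqrt(v), raw.y/sqrt(v))<br />
 [1] -0.9891585<br />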
<br />
==== Enter the tree ====<br />
Start by entering the tree:<br />
> t <- read.tree(text="((A:2,B:2)E:3.5,(C:2,D:2)F:3.5)G;")<br />
The argument <tt>text</tt> is needed because we are entering the Newick tree description in the form of a string, not supplying a file name. Note that I have labeled the root node G and the two interior nodes E (ancestor of A and B) and F (ancestor of C and D).<br />
<br />
Plot the tree to make sure the tree definition worked:<br />
> plot(t)<br />
<br />
==== Enter the data ====<br />
Now we must tell APE the X and Y values. Do this by supplying vectors of numbers. We will tell APE which tips these numbers are associated with in the next step:<br />
> x <- c(27, 33, 18, 22)<br />
> y <- c(122, 124, 126, 128)<br />
<br />
Here's how we tell APE what taxa the numbers belong to:<br />
> names(x) <- c("A","B","C","D")<br />
> names(y) <- c("A","B","C","D")<br />
If you want to avoid repetition, you can enter the names for both x and y simultaneously like this:<br />
> names(x) <- names(y) <- c("A","B","C","D")<br />
<br />
==== Compute independent contrasts ====<br />
Now compute the contrasts with the APE function <tt>pic</tt>:<br />
> cx <- pic(x,t)<br />
> cy <- pic(y,t)<br />
The variables cx and cy are arbitrary; you could use different names for these if you wanted. Let's see what values cx and cy hold:<br />
> cx<br />
G E F <br />
3.333333 -3.000000 -2.000000 <br />
> cy<br />
G E F <br />
-1.333333 -1.000000 -1.000000 <br />
The top row in each case holds the node name in the tree, the bottom row holds the rescaled contrasts.<br />
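To see where these numbers come from, here is a by-hand sketch of the X contrasts using only core R. It follows Felsenstein's recipe: each contrast is the difference between the two descendant values divided by the square root of the sum of the two branch lengths, and the branch below each interior node is lengthened to account for the uncertainty in the value estimated at that node. The names <tt>xE</tt>, <tt>xF</tt> and <tt>extra</tt> are my own:<br />
 > xE <- (27 + 33)/2                  # value estimated at node E (a plain average because both branches have length 2)<br />
 > xF <- (18 + 22)/2                  # value estimated at node F<br />
 > extra <- (2*2)/(2 + 2)             # length added to the branch below E, and to the branch below F<br />
 > (27 - 33)/sqrt(2 + 2)              # contrast at E<br />
 [1] -3<br />
 > (18 - 22)/sqrt(2 + 2)              # contrast at F<br />
 [1] -2<br />
 > (xE - xF)/sqrt(2*(3.5 + extra))    # contrast at G<br />
 [1] 3.333333<br />
The same steps applied to the Y values give -1, -1 and (123 - 127)/3 = -1.333333.<br />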
<br />
==== Label interior nodes with the contrasts ====<br />
APE makes it fairly easy to label the tree with the contrasts:<br />
> plot(t)<br />
> nodelabels(round(cx,3), adj=c(0,-1), frame="n")<br />
> nodelabels(round(cy,3), adj=c(0,+1), frame="n")<br />
In the <tt>nodelabels</tt> command, we supply the numbers with which to label the nodes. The vectors cx and cy are named by node, so APE knows from this which numbers to place at which nodes in the tree. The <tt>round</tt> function simply rounds the contrasts to 3 decimal places. The <tt>adj</tt> setting adjusts the spacing so that the contrasts for X are not placed directly on top of the contrasts for Y. The setting <tt>adj=c(0,-1)</tt> causes the labels to be horizontally displaced 0 lines and vertically displaced one line up (the -1 means go up 1 line) from where they would normally be plotted. The contrasts for Y are displaced vertically one line down from where they would normally appear. Finally, <tt>frame="n"</tt> just says not to place a box or circle around the labels.<br />
<br />
You should find that the contrasts are the same as those shown as X* and Y* in the table above (as well as the summary slide in the Independent Contrasts lecture). <br />
<br />
Computing the correlation coefficient is as easy as:<br />
> cor(cx, cy)<br />
[1] -0.9891585<br />
<br />
== Phylogenetic Generalized Least Squares (PGLS) regression ==<br />
<br />
Now let's reproduce the PGLS regression example given in lecture. Here are the data we used:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
|-<br />
| style="width:10%" | A<br />
| 2<br />
| 1<br />
|-<br />
| B<br />
| 3<br />
| 3<br />
|-<br />
| C<br />
| 1<br />
| 2<br />
|-<br />
| D<br />
| 4<br />
| 7<br />
|-<br />
| E<br />
| 5<br />
| 6<br />
|}<br />
<br />
Enter the data as we did for the Independent Contrasts example:<br />
> x <- c(2,3,1,4,5)<br />
> y <- c(1,3,2,7,6)<br />
> names(x) <- names(y) <- c("A","B","C","D","E")<br />
<br />
In order to carry out generalized least squares regression, we will need the <tt>gls</tt> command, which is part of the <tt>nlme</tt> R package. Thus, you will need to load this package before you can use the <tt>gls</tt> command:<br />
> library(nlme)<br />
<br />
Let's first do an ordinary linear regression for comparison:<br />
> m0 <- gls(y ~ x)<br />
> summary(m0) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|-0.4|answer}}<br />
* What is the estimate of the slope? {{title|1.4|answer}}<br />
</div><br />
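If you want to convince yourself that this really is ordinary least squares, here is a sketch that gets the same two numbers from the textbook formulas using core R only (<tt>slope</tt> and <tt>intercept</tt> are names I made up):<br />
 > x <- c(2,3,1,4,5)<br />
 > y <- c(1,3,2,7,6)<br />
 > slope <- cov(x, y)/var(x)<br />
 > intercept <- mean(y) - slope*mean(x)<br />
 > c(intercept, slope)<br />
 [1] -0.4  1.4<br />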
<br />
Let's plot the regression line on the original data:<br />
> plot(x, y, pch=19, xlim=c(0,6),ylim=c(0,8))<br />
> text(x, y, labels=c("A", "B", "C", "D", "E"), pos=4, offset=1)<br />
> segments(0, -0.4, 6, -0.4 + 1.4*6, lwd=2, lty="solid", col="blue")<br />
You will have noticed that the '''first line''' plots the points using a filled circle (pch=19) and specifying that the x-axis should go from 0 to 6 and the y-axis should extend from 0 to 8. The '''second line''' labels the points with the taxon to make it easier to interpret the plot. Here, pos=4 says to put the labels to the right of each point (pos = 1, 2, 3 means below, left, and above, respectively) and offset=1 specifies how far away from the point each label should be. The '''third line''' draws the regression line using the intercept and slope values provided by gls, making the line width 2 (lwd=2) and solid (lty="solid") and blue (col="blue").<br />
<br />
To do PGLS, we will need to enter the tree with edge lengths:<br />
> t <- read.tree(text="(((A:1,B:1)F:1,C:2)G:1,(D:0.5,E:0.5)H:2.5)I;")<br />
<br />
You are ready to estimate the parameters of the PGLS regression model:<br />
> m1 <- gls(y ~ x, correlation=corBrownian(1,t))<br />
> summary(m1) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|1.7521186|answer}}<br />
* What is the estimate of the slope? {{title|0.7055085|answer}}<br />
* The <tt>corBrownian</tt> function specified for the correlation in the gls command comes from the APE package. What does <tt>corBrownian</tt> do? You might want to check out the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] {{title|it computes the variance-covariance matrix from the tree t assuming a Brownian motion model|answer}}<br />
* In <tt>corBrownian(1,t)</tt>, t is the tree, but what do you think the 1 signifies? {{title|It is the variance per unit time for the Brownian motion model|answer}}<br />
</div><br />
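Here is a sketch of what is going on behind the scenes, using only core R. It assumes the Brownian motion model, under which the covariance between two tips equals the length of the path they share from the root, and the variance of each tip is its total distance from the root (3 for every tip in this tree). I typed in the matrix <tt>V</tt> by hand from the tree description; I believe APE's <tt>vcv</tt> function should return the same matrix. The names <tt>V</tt>, <tt>X</tt>, <tt>Vinv</tt> and <tt>beta</tt> are my own:<br />
 > x <- c(2,3,1,4,5)<br />
 > y <- c(1,3,2,7,6)<br />
 > V <- diag(3, 5)                                   # rows and columns in the order A, B, C, D, E<br />
 > V[1,2] <- V[2,1] <- 2                             # A and B share edges G and F (1 + 1)<br />
 > V[1,3] <- V[3,1] <- V[2,3] <- V[3,2] <- 1         # C shares only edge G with A and B<br />
 > V[4,5] <- V[5,4] <- 2.5                           # D and E share edge H<br />
 > X <- cbind(1, x)                                  # design matrix: intercept column and x<br />
 > Vinv <- solve(V)<br />
 > beta <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)   # generalized least squares estimator<br />
 > round(as.vector(beta), 7)<br />
 [1] 1.7521186 0.7055085<br />
Don't worry that <tt>t</tt> is also the name of our tree: when a name is used as a function call, R skips over objects that are not functions, so <tt>t(X)</tt> still means "transpose of X".<br />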
<br />
Assuming you still have the plot window available, let's add the PGLS regression line to the existing plot (if you've closed the plot window you will have to recreate the plot first):<br />
> segments(0, 1.7521186, 6, 1.7521186 + 0.7055085*6, lwd=2, lty="dotted", col="blue")<br />
<br />
== Literature Cited ==<br />
<references/><br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_APE_Lab&diff=38951Phylogenetics: APE Lab2018-04-03T18:33:54Z<p>Paul Lewis: /* Label interior nodes with the contrasts */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|This lab is an introduction to some of the capabilities of APE, a phylogenetic analysis package written for the R language. You may want to review the [[Phylogenetics: R Primer|R Primer]] lab if you've already forgotten everything you learned about R.<br />
|}<br />
<br />
== Installing APE and apTreeshape ==<br />
'''APE''' is a package largely written and maintained by Emmanuel Paradis, who has written a very nice book<ref>Paradis, E. 2006. Analysis of phylogenetics and evolution with R. Springer. ISBN: 0-387-32914-5</ref> explaining in detail how to use APE. APE is designed to be used inside the [http://www.r-project.org/ R] programming language, to which you were introduced earlier in the semester (see [[Phylogenetics: R Primer]]). APE can do an impressive array of analyses. For example, it is possible to estimate trees using neighbor-joining or maximum likelihood, estimate ancestral states (for either discrete or continuous data), perform Sanderson's penalized likelihood relaxed clock method to estimate divergence times, evaluate Felsenstein's independent contrasts, estimate birth/death rates, perform bootstrapping, and even automatically pull sequences from GenBank given a vector of accession numbers! APE also has impressive tree plotting capabilities, of which we will only scratch the surface today (flip through Chapter 4 of the Paradis book to see what more APE can do).<br />
<br />
'''apTreeshape''' is a different R package (written by Nicolas Bortolussi et al.) that we will also make use of today.<br />
<br />
To install APE and apTreeshape, start R and type the following at the R command prompt:<br />
> install.packages("ape")<br />
> install.packages("apTreeshape")<br />
Assuming you are connected to the internet, R should locate these packages and install them for you. After they are installed, you will need to load them into R in order to use them (note that no quotes are used this time):<br />
> library(ape)<br />
> library(apTreeshape)<br />
You should never again need to issue the <tt>install.packages</tt> command for APE and apTreeshape, but you will need to use the <tt>library</tt> command to load them whenever you want to use them.<br />
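If you are not sure whether the packages are already installed (on a lab computer, for example), one option is to install only what is missing. This is just a convenience sketch using the core R function <tt>requireNamespace</tt>; the loop variable <tt>pkg</tt> is my own name:<br />
 > for (pkg in c("ape", "apTreeshape")) if (!requireNamespace(pkg, quietly=TRUE)) install.packages(pkg)<br />
You still need the two <tt>library</tt> commands afterwards, because <tt>requireNamespace</tt> only checks that a package is available; it does not attach it.<br />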
<br />
== Reading in trees from a file and exploring tree data structure ==<br />
Download [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/yule.tre this tree file] and save it as a file named <tt>yule.tre</tt> in a new folder somewhere on your computer. Tell R where this folder is using the <tt>setwd</tt> (set working directory) command. For example, I created a folder named <tt>apelab</tt> on my desktop, so I typed this to make that folder my working directory:<br />
> setwd("/Users/plewis/Desktop/apelab")<br />
Now you should be able to read in the tree using this ape command (the <tt>t</tt> is an arbitrary name I chose for the variable used to hold the tree; you could use <tt>tree</tt> if you want):<br />
> t <- read.nexus("yule.tre")<br />
We use <tt>read.nexus</tt> because the tree at hand is in NEXUS format, but APE has a variety of functions to read in different tree file types. If APE can't read your tree file, then give the <tt>treeio</tt> package a spin. APE stores trees as an object of type "phylo". <br />
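For example, a tree in plain Newick (parenthetical) format is read with <tt>read.tree</tt> rather than <tt>read.nexus</tt>. The little three-taxon tree below is made up for illustration (it has nothing to do with <tt>yule.tre</tt>), and I've supplied it as a string using <tt>text=</tt> so that no file is needed:<br />
 > library(ape)<br />
 > nwk <- read.tree(text="((A:1,B:1):1,C:2);")<br />
 > class(nwk)<br />
 [1] "phylo"<br />
 > nwk$tip.label<br />
 [1] "A" "B" "C"<br />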
<br />
==== Getting a tree summary ====<br />
Some basic information about the tree can be obtained by simply typing the name of the variable you used to store the tree:<br />
> t<br />
<br />
Phylogenetic tree with 20 tips and 19 internal nodes.<br />
<br />
Tip labels:<br />
B, C, D, E, F, G, ...<br />
<br />
Rooted; includes branch lengths.<br />
<br />
==== Obtaining vectors of tip and internal node labels ====<br />
The variable <tt>t</tt> has several attributes that can be queried by following the variable name with a dollar sign and then the name of the attribute. For example, the vector of tip labels can be obtained as follows:<br />
> t$tip.label<br />
[1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U"<br />
The internal node labels, if they exist, can be obtained this way:<br />
> t$node.label<br />
NULL<br />
The result above means that labels for the internal nodes were not stored with this tree.<br />
<br />
==== Obtaining the nodes attached to each edge ==== <br />
The nodes at the ends of all the edges in the tree can be had by asking for the edge attribute:<br />
> t$edge<br />
[,1] [,2]<br />
[1,] 21 22<br />
[2,] 22 23<br />
[3,] 23 1<br />
. . .<br />
. . .<br />
. . .<br />
[38,] 38 12 <br />
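Each row of this matrix is one edge, with the ancestral node in the first column and the descendant node in the second. In a "phylo" object the tips are numbered first (1 to 20 here) and the root gets the next number (21 here). The sketch below uses a made-up edge matrix for a three-tip tree (not <tt>yule.tre</tt>) and core R only to show how tips and root can be picked out:<br />
 > edge <- matrix(c(4,5, 5,1, 5,2, 4,3), ncol=2, byrow=TRUE)   # toy tree ((1,2),3)<br />
 > setdiff(edge[,2], edge[,1])    # nodes that are never an ancestor: the tips<br />
 [1] 1 2 3<br />
 > setdiff(edge[,1], edge[,2])    # the node that is never a descendant: the root<br />
 [1] 4<br />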
<br />
==== Obtaining a vector of edge lengths ==== <br />
The edge lengths can be printed thusly:<br />
> t$edge.length<br />
[1] 0.07193600 0.01755700 0.17661500 0.02632500 0.01009100 0.06893900 0.07126000 0.03970200 0.01912900<br />
[10] 0.01243000 0.01243000 0.03155800 0.05901300 0.08118600 0.08118600 0.00476400 0.14552600 0.07604800<br />
[19] 0.00070400 0.06877400 0.06877400 0.02423800 0.02848800 0.01675100 0.01675100 0.04524000 0.19417200<br />
[28] 0.07015000 0.12596600 0.06999200 0.06797400 0.00201900 0.00201900 0.12462600 0.07128300 0.00004969<br />
[37] 0.00004969 0.07133200<br />
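The edge lengths are stored in the same order as the rows of <tt>t$edge</tt>, so the two can be used together. As a small sketch, the first three rows shown above (21 to 22, 22 to 23, and 23 to tip 1) trace a path from the root to a tip, and adding up the first three edge lengths gives the root-to-tip distance:<br />
 > sum(c(0.071936, 0.017557, 0.176615))<br />
 [1] 0.266108<br />
With the tree in memory, <tt>sum(t$edge.length[1:3])</tt> does the same thing. Because this tree is ultrametric, any other root-to-tip path should give (nearly) the same total.<br />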
<br />
==== About this tree ==== <br />
The tree in the file <tt>yule.tre</tt> was obtained using PAUP from 10,000 nucleotide sites simulated from a Yule tree. The model used to generate the simulated data (HKY model, kappa = 4, base frequencies = 0.3 A, 0.2 C, 0.2 G, and 0.3 T, no rate heterogeneity) was also used in the analysis by PAUP (the final ML tree was made ultrametric by enforcing the clock constraint).<br />
<!-- I analyzed these data in BEAST for part of a lecture. See slide 22 and beyond in [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/lectures/DivTimeBayesianBEAST.pdf this PDF file] for details.--><br />
<br />
== Fun with plotting trees in APE ==<br />
<br />
You can plot the tree using all defaults with this ape command:<br />
> plot(t)<br />
<br />
Let's try changing a few defaults and plot the tree in a variety of ways. All of the following change just one default option, but feel free to combine these to create the plot you want.<br />
<br />
==== Left-facing, up-facing, or down-facing trees ====<br />
> plot(t, direction="l")<br />
> plot(t, direction="u")<br />
> plot(t, direction="d")<br />
The default is to plot the tree right-facing (<tt>direction="r"</tt>).<br />
<br />
==== Hide the taxon names ====<br />
> plot(t, show.tip.label=FALSE)<br />
The default behavior is to show the taxon names.<br />
<br />
==== Make the edges thicker ====<br />
> plot(t, edge.width=4)<br />
An edge width of 1 is the default. If you specify several edge widths, APE will alternate them as it draws the tree:<br />
> plot(t, edge.width=c(1,2,3,4))<br />
<br />
==== Color the edges ====<br />
> plot(t, edge.color="red")<br />
Black edges are the default. If you specify several edge colors, APE will alternate them as it draws the tree:<br />
> plot(t, edge.color=c("black","red","green","blue"))<br />
<br />
==== Make taxon labels smaller or larger ====<br />
> plot(t, cex=0.5)<br />
The cex parameter governs relative scaling of the taxon labels, with 1.0 being the default. Thus, the command above makes the taxon labels half the default size. To double the size, use<br />
> plot(t, cex=2.0)<br />
<br />
==== Plot tree as an unrooted or radial tree ====<br />
> plot(t, type="u")<br />
The default type is "p" (phylogram), but "c" (cladogram), "u" (unrooted), "r" (radial) are other options. Some of these options (e.g. "r") create very funky looking trees, leading me to think there is something about the tree description in the file <tt>yule.tre</tt> that APE is not expecting.<br />
<br />
==== Labeling internal nodes ====<br />
> plot(t)<br />
> nodelabels()<br />
This is primarily useful if you want to annotate one of the nodes:<br />
> plot(t)<br />
> nodelabels("Clade A", 22)<br />
> nodelabels("Clade B", 35)<br />
To put the labels inside a circle rather than a rectangle, use <tt>frame="c"</tt> rather than the default (<tt>frame="r"</tt>). To use a background color of white rather than the default "lightblue", use <tt>bg="white"</tt>:<br />
> plot(t)<br />
> nodelabels("Clade A", 22, frame="c", bg="white")<br />
> nodelabels("Clade B", 35, frame="c", bg="yellow")<br />
<br />
==== Adding a scale bar ====<br />
> plot(t)<br />
> add.scale.bar(length=0.05)<br />
The above commands add a scale bar to the bottom left of the plot. To add a scale going all the way across the bottom of the plot, try this:<br />
> plot(t)<br />
> axisPhylo()<br />
<br />
== Diversification analyses ==<br />
APE can perform some lineage-through-time type analyses. The tree read in from the file <tt>yule.tre</tt> that you already have in memory is perfect for testing APE's diversification analyses because we know (since it is based on simulated data) that this tree was generated under a pure-birth (Yule) model.<br />
<br />
==== Lineage through time plots ====<br />
This is a rather small tree, so a lineage through time (LTT) plot will be rather crude, but let's go through the motions anyway.<br />
> ltt.plot(t)<br />
LTT plots usually have a log scale for the number of lineages (y-axis), and this can be easily accomplished:<br />
> ltt.plot(t, log = "y")<br />
Now add a line extending from the point (t = -0.265, N = 2) to the point (t = 0, N = 20) using the command <tt>segments</tt> (note that the "segments" command is a core R command, not something added by the APE package):<br />
> segments(-0.265, 2, 0, 20, lty="dotted")<br />
The slope of this line should (ideally) be equal to the birth rate of the Yule process used to generate the tree, which was <math>\lambda=10</math>.<br />
<div style="background-color:#ccccff"><br />
* Calculate the slope of this line. Is it close to the birth rate 10? {{title|8.689|answer}}<br />
</div><br />
If you get something like 68 for the slope, then you forgot to take the natural log of 2 and 20. The plot uses a log scale for the y-axis, so the two endpoints of the dotted line are really (-0.265, log(2)) and (0, log(20)).<br />
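Here is the calculation both ways, so you can see where each number comes from (rise over run, using the two endpoints of the dotted line):<br />
 > (log(20) - log(2))/(0 - (-0.265))    # y values on the log scale<br />
 [1] 8.689<br />
 > (20 - 2)/(0 - (-0.265))              # forgetting to take logs<br />
 [1] 67.92453<br />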
<br />
==== Birth/death analysis ====<br />
Now let's perform a birth/death analysis. APE's <tt>birthdeath</tt> command estimates the birth and death rates using the node ages in a tree:<br />
> birthdeath(t)<br />
Estimation of Speciation and Extinction Rates<br />
with Birth-Death Models <br />
<br />
Phylogenetic tree: t <br />
Number of tips: 20 <br />
Deviance: -120.4538 <br />
Log-likelihood: 60.22689 <br />
Parameter estimates:<br />
d / b = 0 StdErr = 0 <br />
b - d = 8.674513 StdErr = 1.445897 <br />
(b: speciation rate, d: extinction rate)<br />
Profile likelihood 95% confidence intervals:<br />
d / b: [-1.193549, 0.5286254]<br />
b - d: [5.25955, 13.32028]<br />
See the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] to learn more about the <tt>birthdeath</tt> function. <br />
<div style="background-color:#ccccff"><br />
* What is the death rate estimated by APE? {{title|0.0|answer}}<br />
* Is the true diversification rate within one standard deviation of the estimated diversification rate? {{title|yes, one standard deviation each side of (10-0) is the interval from 8.554 to 11.446, and 8.674513 is inside that interval|answer}}<br />
* Are the true diversification and relative extinction values within the profile likelihood 95% confidence intervals? {{title|yes|answer}}<br />
</div><br />
A "profile" likelihood is obtained by varying one parameter in the model and re-estimating all the other parameters conditional on the current value of the focal parameter. This is, technically, not the correct way of getting a confidence interval, but is easier to compute and may be more stable for small samples than getting confidence intervals the correct way.<br />
<div style="background-color:#ccccff"><br />
* What is the correct way to interpret the 95% confidence interval for b - d: [5.25955, 13.32028]? Is it that there is 95% chance that the true value of b - d is in that interval? {{title|no, that is the definition of a Bayesian credible interval|answer}}<br />
* Or, does it mean that our estimate (8.674513) is within the middle 95% of values that would be produced if the true b - d value was in that interval? {{title|yes|answer}}<br />
</div><br />
<br />
==== Analyses involving tree shape ====<br />
The apTreeshape package (as the name implies) lets you perform analyses of tree shape (which measure how balanced or imbalanced a tree is). apTreeshape stores trees differently than APE, so you can't use a tree object that you created with APE in functions associated with apTreeshape. You can, however, convert a "phylo" object from APE to a "treeshape" object used by apTreeshape:<br />
> ts <- as.treeshape(t)<br />
Here, I'm assuming that <tt>t</tt> still refers to the tree you read in from the file <tt>yule.tre</tt> using the APE command <tt>read.nexus</tt>. We can now obtain a measure of '''tree imbalance''' known as Colless's index:<br />
> c <- colless(ts)<br />
> c<br />
[1] 44<br />
The formula for Colless's index is easy to understand. Each internal node branches into a left and right lineage. The absolute value of the difference between the number of left-hand descendants and right-hand descendants provides a measure of how imbalanced the tree is with respect to that particular node. Adding these imbalance measures up over all internal nodes yields Colless's overall tree imbalance index:<br />
<br />
<math>I_C = \sum_{j=1}^{n-1} |L_j - R_j|</math><br />
<br />
apTreeshape can do an analysis to assess whether the tree has the amount of imbalance one would expect from a Yule tree:<br />
> colless.test(ts, model = "yule", alternative="greater", n.mc = 1000)<br />
This generates 1000 trees from a Yule process and compares the Colless index from our tree (44) to the distribution of such indices obtained from the simulated trees. The p-value is the proportion of the 1000 trees generated from the null distribution that have indices greater than 44 (i.e. the proportion of Yule trees that are more ''im''balanced than our tree). If the p-value was 0.5, for example, then our tree would be right in the middle of the distribution expected for Yule trees. If the p-value was 0.01, however, it would mean that very few Yule trees are as imbalanced as our tree, which would make it hard to believe that our tree is a Yule tree.<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that a Yule process generated our tree? {{title|I got 0.288 so no, the Yule process can easily generate trees with the same level of imbalance as our tree|answer}}<br />
</div><br />
You can also test one other model with the <tt>colless.test</tt> function: the "proportional to distinguishable arrangements" (or PDA) model. This null model produces random trees by starting with three taxa joined to a single internal node, then building upon that by adding new taxa to randomly-chosen (discrete uniform distribution) edges that already exist in the (unrooted) tree. The edge to which a new taxon is added can be an internal edge as well as a terminal edge, which causes this process to produce trees with a different distribution of shapes than the Yule process, which only adds new taxa to the tips of a growing rooted tree.<br />
> colless.test(ts, model = "pda", alternative="greater", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a PDA tree? {{title|I got 0.912, so no, PDA trees are almost always more imbalanced (less balanced) than our tree|answer}}<br />
</div><br />
You might also wish to test whether our tree is more ''balanced'' than would be expected under the Yule or PDA models. apTreeshape lets you look at the other end of the distribution too:<br />
> colless.test(ts, model = "yule", alternative="less", n.mc = 1000)<br />
> colless.test(ts, model = "pda", alternative="less", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value for the first test above indicate that our tree is more balanced than would be expected were it a Yule tree? {{title|I got 0.716, so no, most Yule trees are more balanced than our tree|answer}}<br />
* Does the p-value for the second test above indicate that our tree is more balanced than would be expected were it a PDA tree? {{title|I got 0.064, so no, our tree is not significantly more balanced than a PDA tree, but it is getting close|answer}}<br />
</div><br />
<br />
You might want to see a histogram of the Colless index like that used to determine the p-values for the tests above. apTreeshape lets you generate 10 trees each with 20 tips under a Yule model as follows:<br />
> rtreeshape(10,20,model="yule")<br />
That spits out a summary of the 10 trees created, but what we really wanted was to know the Colless index for each of the trees generated. To do this, use the R command <tt>sapply</tt> to call the apTreeshape command <tt>colless</tt> for each tree generated by the <tt>rtreeshape</tt> command:<br />
> sapply(rtreeshape(10,20,model="yule"),FUN=colless)<br />
[1] 38 92 85 91 73 71 94 75 72 93<br />
<div style="background-color:#ccccff"><br />
* Why do you think your Colless indices differ from the ones above?{{title|In a Yule model, the time between speciation events is determined by drawing random numbers from an exponential distribution. If we all started with the same initial (seed) random number, then our Colless indices would be identical| answer}}<br />
</div><br />
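The role of the seed can be seen with base R alone. The sketch below is only an illustration (it does not call apTreeshape): it draws exponential waiting times twice from the same seed, an arbitrary value picked for this example, and then once more without resetting the seed.<br />

```r
# Two draws that start from the same seed are identical; a further draw is not.
set.seed(2019)                 # arbitrary seed, chosen only for this example
first  <- rexp(5, rate = 10)   # five exponential waiting times
set.seed(2019)                 # go back to the same starting point
second <- rexp(5, rate = 10)
third  <- rexp(5, rate = 10)   # continues the random stream, so it differs

identical(first, second)       # TRUE
identical(first, third)        # FALSE
```

Calling <tt>set.seed</tt> just before <tt>rtreeshape</tt> should make the simulated trees repeatable in the same way, assuming the simulation draws from R's own random number generator.<br />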
That's more like it! Now, generate 1000 Yule trees instead of just 10, and create a histogram using the standard R command <tt>hist</tt>:<br />
> yulecolless <- sapply(rtreeshape(1000,20,model="yule"),FUN=colless)<br />
> hist(yulecolless)<br />
Now create a histogram for PDA trees:<br />
> pdacolless <- sapply(rtreeshape(1000,20,model="pda"),FUN=colless)<br />
> hist(pdacolless)<br />
Use the following to compare the mean Colless index for the PDA trees to the Yule trees:<br />
> summary(yulecolless)<br />
> summary(pdacolless)<br />
<div style="background-color:#ccccff"><br />
* Which generates the more balanced trees, on average: Yule or PDA? {{title|Yule trees are more balanced, with mean Colless index 39 versus 81 for PDA|answer}}<br />
</div><br />
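The Yule mean can also be checked without simulation. The following is a minimal sketch in base R that relies on a standard property of the Yule model: when a clade of n tips splits, the number of tips on the left side is uniformly distributed on 1, ..., n-1. The function name <tt>expected_colless_yule</tt> is made up for this example and is not part of apTreeshape.<br />

```r
# Expected Colless index under a Yule model, built up from small trees.
expected_colless_yule <- function(n) {
  e <- numeric(n)              # e[k] = expected Colless index for k tips
  if (n >= 3) for (k in 3:n) {
    left <- 1:(k - 1)          # possible sizes of the left subtree
    # average over all splits: imbalance at this node plus both subtrees
    e[k] <- mean(abs(k - 2 * left) + e[left] + e[k - left])
  }
  e[n]
}

expected_colless_yule(4)       # 2
expected_colless_yule(20)      # about 38.58
```

The value for 20 tips agrees with the mean of roughly 39 that <tt>summary(yulecolless)</tt> reports.<br />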
<br />
apTreeshape provides one more function (<tt>likelihood.test</tt>) that performs a likelihood ratio test of the PDA model against the Yule model null hypothesis. This test says that we cannot reject the null hypothesis of a Yule model in favor of the PDA model:<br />
> likelihood.test(ts)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a Yule tree? {{title|I got 0.4095684, so no, our tree is consistent with a Yule tree|answer}}<br />
</div><br />
<br />
== Independent contrasts ==<br />
APE can compute Felsenstein's independent contrasts, as well as several other methods for assessing phylogenetically-corrected correlations between traits that I did not discuss in lecture (autocorrelation, generalized least squares, mixed models and variance partitioning, and the very interesting Ornstein-Uhlenbeck model, which can be used to assess the correlation between a continuous character and a discrete habitat variable).<br />
<br />
Today, however, we will just play with independent contrasts and phylogenetic generalized least squares (PGLS) regression. Let's try to use APE's <tt>pic</tt> command to reproduce the Independent Contrasts example from lecture:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
| style="width:10%" | Var<br />
| style="width:10%" | X*<br />
| style="width:10%" | Y*<br />
|-<br />
| style="width:10%" | A-B <br />
| -6<br />
| -2<br />
| 4<br />
| -3<br />
| -1<br />
|-<br />
| C-D <br />
| -4<br />
| -2<br />
| 4<br />
| -2<br />
| -1<br />
|-<br />
| E-F<br />
| 10<br />
| -4<br />
| 9<br />
| 3.333<br />
| -1.333<br />
|}<br />
In the table, X and Y denote the raw contrasts, while X* and Y* denote the rescaled contrasts (raw contrasts divided by the square root of the variance). The correlation among the rescaled contrasts was -0.98916.<br />
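The numbers in the table can be reproduced by hand before turning to APE. The sketch below uses base R only and hard-codes the branch lengths of this four-taxon example (2 for each terminal edge, 3.5 for each edge below E and F); the helper <tt>contrasts_by_hand</tt> is a made-up name and is not a general implementation of the method.<br />

```r
# Tip values and branch lengths from the lecture example
x <- c(A = 27, B = 33, C = 18, D = 22)
y <- c(A = 122, B = 124, C = 126, D = 128)
tip_len  <- 2      # length of each terminal edge
stem_len <- 3.5    # length of the edges leading to E and F

contrasts_by_hand <- function(v) {
  # ancestral values at E and F (sister edges of equal length give a plain mean)
  e <- (v[["A"]] + v[["B"]]) / 2
  f <- (v[["C"]] + v[["D"]]) / 2
  # each stem edge is lengthened by tip_len * tip_len / (tip_len + tip_len)
  stem_adj <- stem_len + tip_len^2 / (2 * tip_len)
  raw <- c(AB = v[["A"]] - v[["B"]], CD = v[["C"]] - v[["D"]], EF = e - f)
  variance <- c(AB = 2 * tip_len, CD = 2 * tip_len, EF = 2 * stem_adj)
  raw / sqrt(variance)     # rescaled contrasts
}

xs <- contrasts_by_hand(x)   # -3, -2, 3.333
ys <- contrasts_by_hand(y)   # -1, -1, -1.333
cor(xs, ys)                  # -0.9891585
```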
<br />
==== Enter the tree ====<br />
Start by entering the tree:<br />
> t <- read.tree(text="((A:2,B:2)E:3.5,(C:2,D:2)F:3.5)G;")<br />
The argument <tt>text</tt> is needed because we are entering the Newick tree description in the form of a string, not supplying a file name. Note that I have labeled the root node G and the two interior nodes E (ancestor of A and B) and F (ancestor of C and D).<br />
<br />
Plot the tree to make sure the tree definition worked:<br />
> plot(t)<br />
<br />
==== Enter the data ====<br />
Now we must tell APE the X and Y values. Do this by supplying vectors of numbers. We will tell APE which tips these numbers are associated with in the next step:<br />
> x <- c(27, 33, 18, 22)<br />
> y <- c(122, 124, 126, 128)<br />
<br />
Here's how we tell APE what taxa the numbers belong to:<br />
> names(x) <- c("A","B","C","D")<br />
> names(y) <- c("A","B","C","D")<br />
If you want to avoid repetition, you can enter the names for both x and y simultaneously like this:<br />
> names(x) <- names(y) <- c("A","B","C","D")<br />
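Values and names can also be supplied in a single step. A small base R sketch showing two equivalent ways to build the same named vectors:<br />

```r
# Name the elements while creating the vector ...
x <- c(A = 27, B = 33, C = 18, D = 22)
# ... or attach names to an existing vector with setNames()
y <- setNames(c(122, 124, 126, 128), c("A", "B", "C", "D"))

names(x)                        # "A" "B" "C" "D"
identical(names(x), names(y))   # TRUE
x[["B"]]                        # 33, values can be looked up by taxon name
```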
<br />
==== Compute independent contrasts ====<br />
Now compute the contrasts with the APE function <tt>pic</tt>:<br />
> cx <- pic(x,t)<br />
> cy <- pic(y,t)<br />
The variables cx and cy are arbitrary; you could use different names for these if you wanted. Let's see what values cx and cy hold:<br />
> cx<br />
G E F <br />
3.333333 -3.000000 -2.000000 <br />
> cy<br />
G E F <br />
-1.333333 -1.000000 -1.000000 <br />
The top row in each case holds the node name in the tree; the bottom row holds the rescaled contrasts.<br />
<br />
==== Label interior nodes with the contrasts ====<br />
APE makes it fairly easy to label the tree with the contrasts:<br />
> plot(t)<br />
> nodelabels(round(cx,3), adj=c(0,-1), frame="n")<br />
> nodelabels(round(cy,3), adj=c(0,+1), frame="n")<br />
> df <- data.frame(x,y)<br />
In the nodelabels command, we supply the numbers with which to label the nodes. The vectors cx and cy contain information about the nodes to label, so APE knows from this which numbers to place at which nodes in the tree. The round command simply rounds the contrasts to 3 decimal places. The <tt>adj</tt> setting adjusts the spacing so that the contrasts for X are not placed directly on top of the contrasts for Y. The command <tt>adj=c(0,-1)</tt> causes the labels to be horizontally displaced 0 lines and vertically displaced one line up (the -1 means go up 1 line) from where they would normally be plotted. The contrasts for Y are displaced vertically one line down from where they would normally appear. Finally, the <tt>frame="n"</tt> just says to not place a box or circle around the labels.<br />
<br />
You should find that the contrasts are the same as those shown as X* and Y* in the table above (as well as the summary slide in the Independent Contrasts lecture). <br />
<br />
Computing the correlation coefficient is as easy as:<br />
> cor(cx, cy)<br />
[1] -0.9891585<br />
<br />
== Phylogenetic Generalized Least Squares (PGLS) regression ==<br />
<br />
Now let's reproduce the PGLS regression example given in lecture. Here are the data we used:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
|-<br />
| style="width:10%" | A<br />
| 2<br />
| 1<br />
|-<br />
| B<br />
| 3<br />
| 3<br />
|-<br />
| C<br />
| 1<br />
| 2<br />
|-<br />
| D<br />
| 4<br />
| 7<br />
|-<br />
| E<br />
| 5<br />
| 6<br />
|}<br />
<br />
Enter the data as we did for the Independent Contrasts example:<br />
> x <- c(2,3,1,4,5)<br />
> y <- c(1,3,2,7,6)<br />
> names(x) <- names(y) <- c("A","B","C","D","E")<br />
<br />
In order to carry out generalized least squares regression, we will need the <tt>gls</tt> command, which is part of the <tt>nlme</tt> R package. Thus, you will need to load this package before you can use the <tt>gls</tt> command:<br />
> library(nlme)<br />
<br />
Let's first do an ordinary linear regression for comparison:<br />
> m0 <- gls(y ~ x)<br />
> summary(m0) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|-0.4|answer}}<br />
* What is the estimate of the slope? {{title|1.4|answer}}<br />
</div><br />
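With no correlation structure specified, <tt>gls</tt> gives the same point estimates as ordinary least squares, so the answers can be cross-checked in base R. A short sketch using <tt>lm</tt> and the textbook formulas for slope and intercept:<br />

```r
x <- c(2, 3, 1, 4, 5)
y <- c(1, 3, 2, 7, 6)

fit <- lm(y ~ x)            # ordinary least squares
coef(fit)                   # intercept -0.4, slope 1.4

# The same estimates from the usual formulas
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
c(intercept, slope)         # -0.4  1.4
```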
<br />
Let's plot the regression line on the original data:<br />
> plot(x, y, pch=19, xlim=c(0,6),ylim=c(0,8))<br />
> text(x, y, labels=c("A", "B", "C", "D", "E"), pos=4, offset=1)<br />
> segments(0, -0.4, 6, -0.4 + 1.4*6, lwd=2, lty="solid", col="blue")<br />
You will have noticed that the '''first line''' plots the points using a filled circle (pch=19) and specifies that the x-axis should go from 0 to 6 and the y-axis from 0 to 8. The '''second line''' labels the points with the taxon to make it easier to interpret the plot. Here, pos=4 says to put the labels to the right of each point (pos = 1, 2, 3 means below, left, and above, respectively) and offset=1 specifies how far away from the point each label should be. The '''third line''' draws the regression line using the intercept and slope values provided by gls, making the line width 2 (lwd=2) and solid (lty="solid") and blue (col="blue").<br />
<br />
To do PGLS, we will need to enter the tree with edge lengths:<br />
> t <- read.tree(text="(((A:1,B:1)F:1,C:2)G:1,(D:0.5,E:0.5)H:2.5)I;")<br />
<br />
You are ready to estimate the parameters of the PGLS regression model:<br />
> m1 <- gls(y ~ x, correlation=corBrownian(1,t))<br />
> summary(m1) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|1.7521186|answer}}<br />
* What is the estimate of the slope? {{title|0.7055085|answer}}<br />
* The <tt>corBrownian</tt> function specified for the correlation in the gls command comes from the APE package. What does <tt>corBrownian</tt> do? You might want to check out the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] {{title|it computes the variance-covariance matrix from the tree t assuming a Brownian motion model|answer}}<br />
* In <tt>corBrownian(1,t)</tt>, t is the tree, but what do you think the 1 signifies? {{title|It is the variance per unit time for the Brownian motion model|answer}}<br />
</div><br />
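The PGLS estimates can be reproduced with plain matrix algebra, which shows what the Brownian motion covariance matrix contributes. This is a sketch in base R, not what <tt>gls</tt> does internally: the matrix <tt>V</tt> is typed in by hand from the Newick string above (each entry is the path length two tips share, measured from the root), and the estimator is the standard generalized least squares formula. APE's <tt>vcv(t)</tt> is expected to return the same matrix.<br />

```r
x <- c(A = 2, B = 3, C = 1, D = 4, E = 5)
y <- c(A = 1, B = 3, C = 2, D = 7, E = 6)

# Brownian-motion covariances: entry [i, j] is the path length shared by
# tips i and j, measured from the root
V <- matrix(0, 5, 5, dimnames = list(names(x), names(x)))
diag(V) <- 3                       # every tip is 3 units from the root
V["A", "B"] <- V["B", "A"] <- 2    # edges leading to G and to F
V["A", "C"] <- V["C", "A"] <- 1    # edge leading to G only
V["B", "C"] <- V["C", "B"] <- 1
V["D", "E"] <- V["E", "D"] <- 2.5  # edge leading to H

X    <- cbind(intercept = 1, slope = x)   # design matrix
Vinv <- solve(V)
# GLS estimator: (X' V^-1 X)^-1 X' V^-1 y
beta <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
beta    # intercept 1.7521186, slope 0.7055085
```

Multiplying <tt>V</tt> by a constant leaves <tt>beta</tt> unchanged, which is why the estimates do not depend on the overall scale of the covariance matrix.<br />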
<br />
Assuming you still have the plot window available, let's add the PGLS regression line to the existing plot (if you've closed the plot window you will have to recreate the plot first):<br />
> segments(0, 1.7521186, 6, 1.7521186 + 0.7055085*6, lwd=2, lty="dotted", col="blue")<br />
<br />
== Literature Cited ==<br />
<references/><br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_APE_Lab&diff=38948Phylogenetics: APE Lab2018-04-03T18:00:06Z<p>Paul Lewis: /* Birth/death analysis */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|This lab is an introduction to some of the capabilities of APE, a phylogenetic analysis package written for the R language. You may want to review the [[Phylogenetics: R Primer|R Primer]] lab if you've already forgotten everything you learned about R.<br />
|}<br />
<br />
== Installing APE and apTreeshape ==<br />
'''APE''' is a package largely written and maintained by Emmanuel Paradis, who has written a very nice book<ref>Paradis, E. 2006. Analysis of phylogenetics and evolution with R. Springer. ISBN: 0-387-32914-5</ref> explaining in detail how to use APE. APE is designed to be used inside the [http://www.r-project.org/ R] programming language, to which you were introduced earlier in the semester (see [[Phylogenetics: R Primer]]). APE can do an impressive array of analyses. For example, it is possible to estimate trees using neighbor-joining or maximum likelihood, estimate ancestral states (for either discrete or continuous data), perform Sanderson's penalized likelihood relaxed clock method to estimate divergence times, evaluate Felsenstein's independent contrasts, estimate birth/death rates, perform bootstrapping, and even automatically pull sequences from GenBank given a vector of accession numbers! APE also has impressive tree plotting capabilities, of which we will only scratch the surface today (flip through Chapter 4 of the Paradis book to see what more APE can do).<br />
<br />
'''apTreeshape''' is a different R package (written by Nicolas Bortolussi et al.) that we will also make use of today.<br />
<br />
To install APE and apTreeshape, start R and type the following at the R command prompt:<br />
> install.packages("ape")<br />
> install.packages("apTreeshape")<br />
Assuming you are connected to the internet, R should locate these packages and install them for you. After they are installed, you will need to load them into R in order to use them (note that no quotes are used this time):<br />
> library(ape)<br />
> library(apTreeshape)<br />
You should never again need to issue the <tt>install.packages</tt> command for APE and apTreeshape, but you will need to use the <tt>library</tt> command to load them whenever you want to use them.<br />
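If you would rather not remember which packages are already installed, a small helper can install only what is missing. This is a sketch; the function name <tt>ensure_package</tt> is made up for this example.<br />

```r
ensure_package <- function(pkg) {
  # install only when the package cannot be found
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  requireNamespace(pkg, quietly = TRUE)   # TRUE if the package is now available
}

# Usage (needs an internet connection the first time):
# ensure_package("ape");         library(ape)
# ensure_package("apTreeshape"); library(apTreeshape)
```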
<br />
== Reading in trees from a file and exploring tree data structure ==<br />
Download [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/yule.tre this tree file] and save it as a file named <tt>yule.tre</tt> in a new folder somewhere on your computer. Tell R where this folder is using the <tt>setwd</tt> (set working directory) command. For example, I created a folder named <tt>apelab</tt> on my desktop, so I typed this to make that folder my working directory:<br />
> setwd("/Users/plewis/Desktop/apelab")<br />
Now you should be able to read in the tree using this ape command (the <tt>t</tt> is an arbitrary name I chose for the variable used to hold the tree; you could use <tt>tree</tt> if you want):<br />
> t <- read.nexus("yule.tre")<br />
We use <tt>read.nexus</tt> because the tree at hand is in NEXUS format, but APE has a variety of functions to read in different tree file types. If APE can't read your tree file, then give the package treeio a spin. APE stores trees as an object of type "phylo". <br />
<br />
==== Getting a tree summary ====<br />
Some basic information about the tree can be obtained by simply typing the name of the variable you used to store the tree:<br />
> t<br />
<br />
Phylogenetic tree with 20 tips and 19 internal nodes.<br />
<br />
Tip labels:<br />
B, C, D, E, F, G, ...<br />
<br />
Rooted; includes branch lengths.<br />
<br />
==== Obtaining vectors of tip and internal node labels ====<br />
The variable <tt>t</tt> has several attributes that can be queried by following the variable name with a dollar sign and then the name of the attribute. For example, the vector of tip labels can be obtained as follows:<br />
> t$tip.label<br />
[1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U"<br />
The internal node labels, if they exist, can be obtained this way:<br />
> t$node.label<br />
NULL<br />
The result above means that labels for the internal nodes were not stored with this tree.<br />
<br />
==== Obtaining the nodes attached to each edge ==== <br />
The nodes at the ends of all the edges in the tree can be had by asking for the edge attribute:<br />
> t$edge<br />
[,1] [,2]<br />
[1,] 21 22<br />
[2,] 22 23<br />
[3,] 23 1<br />
. . .<br />
. . .<br />
. . .<br />
[38,] 38 12 <br />
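In a "phylo" object the tips are numbered 1 to n and the internal nodes follow, with the root numbered n + 1, which is why the first rows above start at node 21 for this 20-tip tree. The sketch below uses a made-up four-tip edge matrix (not the one from <tt>yule.tre</tt>) to show how the root and the tips can be picked out of such a matrix in base R.<br />

```r
# Hypothetical edge matrix for the tree ((t1,t2),(t3,t4)): parent, child
edge <- rbind(c(5, 6), c(6, 1), c(6, 2),
              c(5, 7), c(7, 3), c(7, 4))
n_tips <- 4

root <- setdiff(edge[, 1], edge[, 2])   # the only node that is never a child
tips <- setdiff(edge[, 2], edge[, 1])   # nodes that are never a parent
is_terminal <- edge[, 2] <= n_tips      # TRUE for edges that end in a tip

root               # 5
sort(tips)         # 1 2 3 4
sum(is_terminal)   # 4
```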
<br />
==== Obtaining a vector of edge lengths ==== <br />
The edge lengths can be printed thusly:<br />
> t$edge.length<br />
[1] 0.07193600 0.01755700 0.17661500 0.02632500 0.01009100 0.06893900 0.07126000 0.03970200 0.01912900<br />
[10] 0.01243000 0.01243000 0.03155800 0.05901300 0.08118600 0.08118600 0.00476400 0.14552600 0.07604800<br />
[19] 0.00070400 0.06877400 0.06877400 0.02423800 0.02848800 0.01675100 0.01675100 0.04524000 0.19417200<br />
[28] 0.07015000 0.12596600 0.06999200 0.06797400 0.00201900 0.00201900 0.12462600 0.07128300 0.00004969<br />
[37] 0.00004969 0.07133200<br />
<br />
==== About this tree ==== <br />
This tree in the file <tt>yule.tre</tt> was obtained using PAUP from 10,000 nucleotide sites simulated from a Yule tree. The model used to generate the simulated data (HKY model, kappa = 4, base frequencies = 0.3 A, 0.2 C, 0.2 G, and 0.3 T, no rate heterogeneity) was also used in the analysis by PAUP (the final ML tree was made ultrametric by enforcing the clock constraint).<br />
<!-- I analyzed these data in BEAST for part of a lecture. See slide 22 and beyond in [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/lectures/DivTimeBayesianBEAST.pdf this PDF file] for details.--><br />
<br />
== Fun with plotting trees in APE ==<br />
<br />
You can plot the tree using all defaults with this ape command:<br />
> plot(t)<br />
<br />
Let's try changing a few defaults and plot the tree in a variety of ways. All of the following change just one default option, but feel free to combine these to create the plot you want.<br />
<br />
==== Left-facing, up-facing, or down-facing trees ====<br />
> plot(t, direction="l")<br />
> plot(t, direction="u")<br />
> plot(t, direction="d")<br />
The default is to plot the tree right-facing (<tt>direction="r"</tt>).<br />
<br />
==== Hide the taxon names ====<br />
> plot(t, show.tip.label=FALSE)<br />
The default behavior is to show the taxon names.<br />
<br />
==== Make the edges thicker ====<br />
> plot(t, edge.width=4)<br />
An edge width of 1 is the default. If you specify several edge widths, APE will alternate them as it draws the tree:<br />
> plot(t, edge.width=c(1,2,3,4))<br />
<br />
==== Color the edges ====<br />
> plot(t, edge.color="red")<br />
Black edges are the default. If you specify several edge colors, APE will alternate them as it draws the tree:<br />
> plot(t, edge.color=c("black","red","green","blue"))<br />
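The alternation is ordinary R vector recycling. The base R sketch below shows how four colors would be spread over the 38 edges of this tree, assuming (as the behavior described above suggests) that the values are recycled in the row order of <tt>t$edge</tt>.<br />

```r
n_edges  <- 38                                  # number of rows in t$edge
palette4 <- c("black", "red", "green", "blue")
edge_cols <- rep_len(palette4, n_edges)         # one color per edge

edge_cols[1:6]     # "black" "red" "green" "blue" "black" "red"
table(edge_cols)   # black and red 10 times each, green and blue 9 times each
```

Under the same assumption, changing a single element (for example <tt>edge_cols[3] <- "orange"</tt>) and passing the whole vector as <tt>edge.color</tt> would be one way to highlight a particular edge.<br />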
<br />
==== Make taxon labels smaller or larger ====<br />
> plot(t, cex=0.5)<br />
The cex parameter governs relative scaling of the taxon labels, with 1.0 being the default. Thus, the command above makes the taxon labels half the default size. To double the size, use<br />
> plot(t, cex=2.0)<br />
<br />
==== Plot tree as an unrooted or radial tree ====<br />
> plot(t, type="u")<br />
The default type is "p" (phylogram), but "c" (cladogram), "u" (unrooted), "r" (radial) are other options. Some of these options (e.g. "r") create very funky looking trees, leading me to think there is something about the tree description in the file <tt>yule.tre</tt> that APE is not expecting.<br />
<br />
==== Labeling internal nodes ====<br />
> plot(t)<br />
> nodelabels()<br />
This is primarily useful if you want to annotate one of the nodes:<br />
> plot(t)<br />
> nodelabels("Clade A", 22)<br />
> nodelabels("Clade B", 35)<br />
To put the labels inside a circle rather than a rectangle, use <tt>frame="c"</tt> rather than the default (<tt>frame="r"</tt>). To use a background color of white rather than the default "lightblue", use <tt>bg="white"</tt>:<br />
> plot(t)<br />
> nodelabels("Clade A", 22, frame="c", bg="white")<br />
> nodelabels("Clade B", 35, frame="c", bg="yellow")<br />
<br />
==== Adding a scale bar ====<br />
> plot(t)<br />
> add.scale.bar(length=0.05)<br />
The above commands add a scale bar to the bottom left of the plot. To add a scale going all the way across the bottom of the plot, try this:<br />
> plot(t)<br />
> axisPhylo()<br />
<br />
== Diversification analyses ==<br />
APE can perform some lineage-through-time type analyses. The tree read in from the file <tt>yule.tre</tt> that you already have in memory is perfect for testing APE's diversification analyses because we know (since it is based on simulated data) that this tree was generated under a pure-birth (Yule) model.<br />
<br />
==== Lineage through time plots ====<br />
This is a rather small tree, so a lineage through time (LTT) plot will be rather crude, but let's go through the motions anyway.<br />
> ltt.plot(t)<br />
LTT plots usually have a log scale for the number of lineages (y-axis), and this can be easily accomplished:<br />
> ltt.plot(t, log = "y")<br />
Now add a line extending from the point (t = -0.265, N = 2) to the point (t = 0, N = 20) using the command <tt>segments</tt> (note that the "segments" command is a core R command, not something added by the APE package):<br />
> segments(-0.265, 2, 0, 20, lty="dotted")<br />
The slope of this line should (ideally) be equal to the birth rate of the Yule process used to generate the tree, which was <math>\lambda=10</math>.<br />
<div style="background-color:#ccccff"><br />
* Calculate the slope of this line. Is it close to the birth rate 10? {{title|8.689|answer}}<br />
</div><br />
If you get something like 68 for the slope, then you forgot to take the natural log of 2 and 20. The plot uses a log scale for the y-axis, so the two endpoints of the dotted line are really (-0.265, log(2)) and (0, log(20)).<br />
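The two calculations can be compared directly in base R. The variable names below are chosen for this example (and deliberately avoid <tt>t</tt>, which holds the tree):<br />

```r
t0 <- -0.265; n0 <- 2     # left end of the dotted line
t1 <- 0;      n1 <- 20    # right end of the dotted line

slope_log <- (log(n1) - log(n0)) / (t1 - t0)   # correct: uses natural logs
slope_raw <- (n1 - n0) / (t1 - t0)             # mistaken: ignores the log scale

slope_log   # about 8.689
slope_raw   # about 67.92
```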
<br />
==== Birth/death analysis ====<br />
Now let's perform a birth/death analysis. APE's <tt>birthdeath</tt> command estimates the birth and death rates using the node ages in a tree:<br />
> birthdeath(t)<br />
Estimation of Speciation and Extinction Rates<br />
with Birth-Death Models <br />
<br />
Phylogenetic tree: t <br />
Number of tips: 20 <br />
Deviance: -120.4538 <br />
Log-likelihood: 60.22689 <br />
Parameter estimates:<br />
d / b = 0 StdErr = 0 <br />
b - d = 8.674513 StdErr = 1.445897 <br />
(b: speciation rate, d: extinction rate)<br />
Profile likelihood 95% confidence intervals:<br />
d / b: [-1.193549, 0.5286254]<br />
b - d: [5.25955, 13.32028]<br />
See the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] to learn more about the <tt>birthdeath</tt> function. <br />
<div style="background-color:#ccccff"><br />
* What is the death rate estimated by APE? {{title|0.0|answer}}<br />
* Is the true diversification rate within one standard deviation of the estimated diversification rate? {{title|yes, one standard deviation each side of (10-0) is the interval from 8.554 to 11.446, and 8.674513 is inside that interval|answer}}<br />
* Are the true diversification and relative extinction values within the profile likelihood 95% confidence intervals? {{title|yes|answer}}<br />
</div><br />
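The comparisons asked for above are simple arithmetic on the printed output. A base R sketch, with the numbers copied from the <tt>birthdeath</tt> output shown earlier and the true values taken from the simulation settings (birth rate 10, no extinction):<br />

```r
est_div  <- 8.674513                 # estimated b - d
se_div   <- 1.445897                 # its standard error
true_div <- 10                       # true b - d (pure birth, d = 0)
true_rel <- 0                        # true d / b

ci_div <- c(5.25955, 13.32028)       # profile likelihood interval for b - d
ci_rel <- c(-1.193549, 0.5286254)    # profile likelihood interval for d / b

abs(true_div - est_div) <= se_div                  # TRUE (1.325 < 1.446)
true_div >= ci_div[1] && true_div <= ci_div[2]     # TRUE
true_rel >= ci_rel[1] && true_rel <= ci_rel[2]     # TRUE
```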
A "profile" likelihood is obtained by varying one parameter in the model and re-estimating all the other parameters conditional on the current value of the focal parameter. This is, technically, not the correct way of getting a confidence interval, but is easier to compute and may be more stable for small samples than getting confidence intervals the correct way.<br />
<div style="background-color:#ccccff"><br />
* What is the correct way to interpret the 95% confidence interval for b - d: [5.25955, 13.32028]? Is it that there is 95% chance that the true value of b - d is in that interval? {{title|no, that is the definition of a Bayesian credible interval|answer}}<br />
* Or, does it mean that our estimate (8.674513) is within the middle 95% of values that would be produced if the true b - d value was in that interval? {{title|yes|answer}}<br />
</div><br />
<br />
==== Analyses involving tree shape ====<br />
The apTreeshape package (as the name implies) lets you perform analyses of tree shape (which measure how balanced or imbalanced a tree is). apTreeshape stores trees differently than APE, so you can't use a tree object that you created with APE in functions associated with apTreeshape. You can, however, convert a "phylo" object from APE to a "treeshape" object used by apTreeshape:<br />
> ts <- as.treeshape(t)<br />
Here, I'm assuming that <tt>t</tt> still refers to the tree you read in from the file <tt>yule.tre</tt> using the APE command <tt>read.nexus</tt>. We can now obtain a measure of '''tree imbalance''' known as Colless's index:<br />
> c <- colless(ts)<br />
> c<br />
[1] 44<br />
The formula for Colless's index is easy to understand. Each internal node branches into a left and right lineage. The absolute value of the difference between the number of left-hand descendants and right-hand descendants provides a measure of how imbalanced the tree is with respect to that particular node. Adding these imbalance measures up over all internal nodes yields Colless's overall tree imbalance index:<br />
<br />
<math>I_C = \sum_{j=1}^{n-1} |L_j - R_j|</math><br />
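To make the formula concrete, here is a minimal sketch in base R that computes the index for two four-tip trees. The nested-list representation and the function names are made up for this example; apTreeshape stores trees differently.<br />

```r
# A tip is a character string; an internal node is a list of two subtrees.
n_tips <- function(node) {
  if (is.character(node)) 1 else n_tips(node[[1]]) + n_tips(node[[2]])
}

colless_index <- function(node) {
  if (is.character(node)) return(0)          # tips contribute nothing
  left <- node[[1]]; right <- node[[2]]
  # |L - R| at this node, plus the contributions of both subtrees
  abs(n_tips(left) - n_tips(right)) + colless_index(left) + colless_index(right)
}

balanced    <- list(list("A", "B"), list("C", "D"))   # ((A,B),(C,D))
caterpillar <- list(list(list("A", "B"), "C"), "D")   # (((A,B),C),D)

colless_index(balanced)      # 0
colless_index(caterpillar)   # 3  (0 + 1 + 2)
```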
<br />
apTreeshape can do an analysis to assess whether the tree has the amount of imbalance one would expect from a Yule tree:<br />
> colless.test(ts, model = "yule", alternative="greater", n.mc = 1000)<br />
This generates 1000 trees from a Yule process and compares the Colless index from our tree (44) to the distribution of such indices obtained from the simulated trees. The p-value is the proportion of the 1000 trees generated from the null distribution that have indices greater than 44 (i.e. the proportion of Yule trees that are more ''im''balanced than our tree). If the p-value was 0.5, for example, then our tree would be right in the middle of the distribution expected for Yule trees. If the p-value was 0.01, however, it would mean that very few Yule trees are as imbalanced as our tree, which would make it hard to believe that our tree is a Yule tree.<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that a Yule process generated our tree? {{title|I got 0.288 so no, the Yule process can easily generate trees with the same level of imbalance as our tree|answer}}<br />
</div><br />
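The logic of this Monte Carlo test can be sketched in base R. This is an illustration of the idea rather than apTreeshape's own code: it uses the fact that under a Yule model a clade of n tips splits so that the left side has a uniformly distributed number of tips (1 to n-1), and the function name <tt>sim_colless_yule</tt> and the seed are made up for this example.<br />

```r
# Colless index of one random Yule tree shape with n tips
sim_colless_yule <- function(n) {
  if (n < 2) return(0)
  left  <- sample.int(n - 1, 1)        # uniform on 1, ..., n-1
  right <- n - left
  abs(left - right) + sim_colless_yule(left) + sim_colless_yule(right)
}

set.seed(1)                                        # arbitrary seed
null_indices <- replicate(1000, sim_colless_yule(20))
observed <- 44                                     # Colless index of our tree

# proportion of simulated Yule trees that are more imbalanced than ours
p_greater <- mean(null_indices > observed)
p_greater
```

Your value of <tt>p_greater</tt> will not match the 0.288 quoted above exactly, because each run uses a different random sample of trees.<br />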
You can also test one other model with the <tt>colless.test</tt> function: the "proportional to distinguishable arrangements" (or PDA) model. This null model produces random trees by starting with three taxa joined to a single internal node, then building upon that by adding new taxa to randomly-chosen (discrete uniform distribution) edges that already exist in the (unrooted) tree. The edge to which a new taxon is added can be an internal edge as well as a terminal edge, which causes this process to produce trees with a different distribution of shapes than the Yule process, which only adds new taxa to the tips of a growing rooted tree.<br />
> colless.test(ts, model = "pda", alternative="greater", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a PDA tree? {{title|I got 0.912, so no, PDA trees are almost always more imbalanced (less balanced) than our tree|answer}}<br />
</div><br />
You might also wish to test whether our tree is more ''balanced'' than would be expected under the Yule or PDA models. apTreeshape lets you look at the other end of the distribution too:<br />
> colless.test(ts, model = "yule", alternative="less", n.mc = 1000)<br />
> colless.test(ts, model = "pda", alternative="less", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value for the first test above indicate that our tree is more balanced than would be expected were it a Yule tree? {{title|I got 0.716, so no, most Yule trees are more balanced than our tree|answer}}<br />
* Does the p-value for the second test above indicate that our tree is more balanced than would be expected were it a PDA tree? {{title|I got 0.064, so no, our tree is not significantly more balanced than a PDA tree, but it is getting close|answer}}<br />
</div><br />
<br />
You might want to see a histogram of the Colless index like that used to determine the p-values for the tests above. apTreeshape lets you generate 10 trees each with 20 tips under a Yule model as follows:<br />
> rtreeshape(10,20,model="yule")<br />
That spits out a summary of the 10 trees created, but what we really wanted was to know the Colless index for each of the trees generated. To do this, use the R command <tt>sapply</tt> to call the apTreeshape command <tt>colless</tt> for each tree generated by the <tt>rtreeshape</tt> command:<br />
> sapply(rtreeshape(10,20,model="yule"),FUN=colless)<br />
[1] 38 92 85 91 73 71 94 75 72 93<br />
<div style="background-color:#ccccff"><br />
* Do you expect your Colless indices to be identical to the ones above? Why or why not? {{title|It is highly unlikely that they will be identical. The indices are for trees generated at random under a Yule model, so each run produces a different sample| answer}}<br />
</div><br />
That's more like it! Now, generate 1000 Yule trees instead of just 10, and create a histogram using the standard R command <tt>hist</tt>:<br />
> yulecolless <- sapply(rtreeshape(1000,20,model="yule"),FUN=colless)<br />
> hist(yulecolless)<br />
Now create a histogram for PDA trees:<br />
> pdacolless <- sapply(rtreeshape(1000,20,model="pda"),FUN=colless)<br />
> hist(pdacolless)<br />
Use the following to compare the mean Colless index for the PDA trees to the Yule trees:<br />
> summary(yulecolless)<br />
> summary(pdacolless)<br />
<div style="background-color:#ccccff"><br />
* Which generates the more balanced trees, on average: Yule or PDA? {{title|Yule trees are more balanced, with mean Colless index 39 versus 81 for PDA|answer}}<br />
</div><br />
<br />
apTreeshape provides one more function (<tt>likelihood.test</tt>) that performs a likelihood ratio test of the PDA model against the Yule model null hypothesis. This test says that we cannot reject the null hypothesis of a Yule model in favor of the PDA model:<br />
> likelihood.test(ts)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a Yule tree? {{title|I got 0.4095684, so no, our tree is consistent with a Yule tree|answer}}<br />
</div><br />
<br />
== Independent contrasts ==<br />
APE can compute Felsenstein's independent contrasts, as well as several other methods for assessing phylogenetically-corrected correlations between traits that I did not discuss in lecture (autocorrelation, generalized least squares, mixed models and variance partitioning, and the very interesting Ornstein-Uhlenbeck model, which can be used to assess the correlation between a continuous character and a discrete habitat variable).<br />
<br />
Today, however, we will just play with independent contrasts and phylogenetic generalized least squares (PGLS) regression. Let's try to use APE's <tt>pic</tt> command to reproduce the Independent Contrasts example from lecture:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
| style="width:10%" | Var<br />
| style="width:10%" | X*<br />
| style="width:10%" | Y*<br />
|-<br />
| style="width:10%" | A-B <br />
| -6<br />
| -2<br />
| 4<br />
| -3<br />
| -1<br />
|-<br />
| C-D <br />
| -4<br />
| -2<br />
| 4<br />
| -2<br />
| -1<br />
|-<br />
| E-F<br />
| 10<br />
| -4<br />
| 9<br />
| 3.333<br />
| -1.333<br />
|}<br />
In the table, X and Y denote the raw contrasts, while X* and Y* denote the rescaled contrasts (raw contrasts divided by the square root of the variance). The correlation among the rescaled contrasts was -0.98916.<br />
<br />
==== Enter the tree ====<br />
Start by entering the tree:<br />
> t <- read.tree(text="((A:2,B:2)E:3.5,(C:2,D:2)F:3.5)G;")<br />
The argument <tt>text</tt> is needed because we are entering the Newick tree description in the form of a string, not supplying a file name. Note that I have labeled the root node G and the two interior nodes E (ancestor of A and B) and F (ancestor of C and D).<br />
<br />
Plot the tree to make sure the tree definition worked:<br />
> plot(t)<br />
<br />
==== Enter the data ====<br />
Now we must tell APE the X and Y values. Do this by supplying vectors of numbers. We will tell APE which tips these numbers are associated with in the next step:<br />
> x <- c(27, 33, 18, 22)<br />
> y <- c(122, 124, 126, 128)<br />
<br />
Here's how we tell APE what taxa the numbers belong to:<br />
> names(x) <- c("A","B","C","D")<br />
> names(y) <- c("A","B","C","D")<br />
If you want to avoid repetition, you can enter the names for both x and y simultaneously like this:<br />
> names(x) <- names(y) <- c("A","B","C","D")<br />
<br />
==== Compute independent contrasts ====<br />
Now compute the contrasts with the APE function <tt>pic</tt>:<br />
> cx <- pic(x,t)<br />
> cy <- pic(y,t)<br />
The variables cx and cy are arbitrary; you could use different names for these if you wanted. Let's see what values cx and cy hold:<br />
> cx<br />
G E F <br />
3.333333 -3.000000 -2.000000 <br />
> cy<br />
G E F <br />
-1.333333 -1.000000 -1.000000 <br />
The top row in each case holds the node name in the tree; the bottom row holds the rescaled contrasts.<br />
<br />
==== Label interior nodes with the contrasts ====<br />
APE makes it fairly easy to label the tree with the contrasts:<br />
> plot(t)<br />
> nodelabels(round(cx,3), adj=c(0,-1), frame="n")<br />
> nodelabels(round(cy,3), adj=c(0,+1), frame="n")<br />
In the nodelabels command, we supply the numbers with which to label the nodes. The vectors cx and cy contain information about the nodes to label, so APE knows from this which numbers to place at which nodes in the tree. The round command simply rounds the contrasts to 3 decimal places. The <tt>adj</tt> setting adjusts the spacing so that the contrasts for X are not placed directly on top of the contrasts for Y. The command <tt>adj=c(0,-1)</tt> causes the labels to be horizontally displaced 0 lines and vertically displaced one line up (the -1 means go up 1 line) from where they would normally be plotted. The contrasts for Y are displaced vertically one line down from where they would normally appear. Finally, the <tt>frame="n"</tt> just says to not place a box or circle around the labels.<br />
<br />
You should find that the contrasts are the same as those shown as X* and Y* in the table above (as well as the summary slide in the Independent Contrasts lecture). <br />
<br />
Computing the correlation coefficient is as easy as:<br />
> cor(cx, cy)<br />
[1] -0.9891585<br />
<br />
== Phylogenetic Generalized Least Squares (PGLS) regression ==<br />
<br />
Now let's reproduce the PGLS regression example given in lecture. Here are the data we used:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
|-<br />
| style="width:10%" | A<br />
| 2<br />
| 1<br />
|-<br />
| B<br />
| 3<br />
| 3<br />
|-<br />
| C<br />
| 1<br />
| 2<br />
|-<br />
| D<br />
| 4<br />
| 7<br />
|-<br />
| E<br />
| 5<br />
| 6<br />
|}<br />
<br />
Enter the data as we did for the Independent Contrasts example:<br />
> x <- c(2,3,1,4,5)<br />
> y <- c(1,3,2,7,6)<br />
> names(x) <- names(y) <- c("A","B","C","D","E")<br />
<br />
In order to carry out generalized least squares regression, we will need the <tt>gls</tt> command, which is part of the <tt>nlme</tt> R package. Thus, you will need to load this package before you can use the <tt>gls</tt> command:<br />
> library(nlme)<br />
<br />
Let's first do an ordinary linear regression for comparison:<br />
> m0 <- gls(y ~ x)<br />
> summary(m0) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|-0.4|answer}}<br />
* What is the estimate of the slope? {{title|1.4|answer}}<br />
</div><br />
<br />
Let's plot the regression line on the original data:<br />
> plot(x, y, pch=19, xlim=c(0,6),ylim=c(0,8))<br />
> text(x, y, labels=c("A", "B", "C", "D", "E"), pos=4, offset=1)<br />
> segments(0, -0.4, 6, -0.4 + 1.4*6, lwd=2, lty="solid", col="blue")<br />
You will have noticed that the '''first line''' plots the points using a filled circle (pch=19) and specifies that the x-axis should go from 0 to 6 and the y-axis from 0 to 8. The '''second line''' labels the points with the taxon to make it easier to interpret the plot. Here, pos=4 says to put the labels to the right of each point (pos = 1, 2, 3 means below, left, and above, respectively) and offset=1 specifies how far away from the point each label should be. The '''third line''' draws the regression line using the intercept and slope values provided by gls, making the line width 2 (lwd=2) and solid (lty="solid") and blue (col="blue").<br />
<br />
To do PGLS, we will need to enter the tree with edge lengths:<br />
> t <- read.tree(text="(((A:1,B:1)F:1,C:2)G:1,(D:0.5,E:0.5)H:2.5)I;")<br />
<br />
You are ready to estimate the parameters of the PGLS regression model:<br />
> m1 <- gls(y ~ x, correlation=corBrownian(1,t))<br />
> summary(m1) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|1.7521186|answer}}<br />
* What is the estimate of the slope? {{title|0.7055085|answer}}<br />
* The <tt>corBrownian</tt> function specified for the correlation in the gls command comes from the APE package. What does <tt>corBrownian</tt> do? You might want to check out the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] {{title|it computes the variance-covariance matrix from the tree t assuming a Brownian motion model|answer}}<br />
* In <tt>corBrownian(1,t)</tt>, t is the tree, but what do you think the 1 signifies? {{title|It is the variance per unit time for the Brownian motion model|answer}}<br />
</div><br />
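To see what the PGLS fit is doing under the hood, here is a hedged base R sketch that builds the Brownian motion covariance matrix by hand (each entry is the path length shared by two tips, measured from the root of the tree above) and then solves the usual generalized least squares equations. The names <tt>V</tt>, <tt>X</tt> and <tt>beta</tt> are my own; this is an illustration of the idea, not a description of how <tt>gls</tt> is implemented.<br />

```r
x <- c(A = 2, B = 3, C = 1, D = 4, E = 5)
y <- c(A = 1, B = 3, C = 2, D = 7, E = 6)
taxa <- names(x)

# Covariance between two tips = edge length shared on the way from the root.
# Every tip is 3 units from the root, so the diagonal is 3.
V <- matrix(0, 5, 5, dimnames = list(taxa, taxa))
diag(V) <- 3
V["A", "B"] <- V["B", "A"] <- 2    # shared edges I->G and G->F
V["A", "C"] <- V["C", "A"] <- 1    # shared edge I->G
V["B", "C"] <- V["C", "B"] <- 1    # shared edge I->G
V["D", "E"] <- V["E", "D"] <- 2.5  # shared edge I->H

# Design matrix with an intercept column
X <- cbind(Intercept = 1, x = x)

# GLS estimate: beta = (X' V^-1 X)^-1 X' V^-1 y
# (t() here is base R's transpose function)
Vinv <- solve(V)
beta <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
print(beta)
```

The two numbers printed should agree with the intercept and slope reported by <tt>summary(m1)</tt>.<br />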
<br />
Assuming you still have the plot window available, let's add the PGLS regression line to the existing plot (if you've closed the plot window you will have to recreate the plot first):<br />
> segments(0, 1.7521186, 6, 1.7521186 + 0.7055085*6, lwd=2, lty="dotted", col="blue")<br />
<br />
== Literature Cited ==<br />
<references/><br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_APE_Lab&diff=38947Phylogenetics: APE Lab2018-04-03T17:58:48Z<p>Paul Lewis: /* Birth/death analysis */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|This lab is an introduction to some of the capabilities of APE, a phylogenetic analysis package written for the R language. You may want to review the [[Phylogenetics: R Primer|R Primer]] lab if you've already forgotten everything you learned about R.<br />
|}<br />
<br />
== Installing APE and apTreeshape ==<br />
'''APE''' is a package largely written and maintained by Emmanuel Paradis, who has written a very nice book<ref>Paradis, E. 2006. Analysis of phylogenetics and evolution with R. Springer. ISBN: 0-387-32914-5</ref> explaining in detail how to use APE. APE is designed to be used inside the [http://www.r-project.org/ R] programming language, to which you were introduced earlier in the semester (see [[Phylogenetics: R Primer]]). APE can do an impressive array of analyses. For example, it is possible to estimate trees using neighbor-joining or maximum likelihood, estimate ancestral states (for either discrete or continuous data), perform Sanderson's penalized likelihood relaxed clock method to estimate divergence times, evaluate Felsenstein's independent contrasts, estimate birth/death rates, perform bootstrapping, and even automatically pull sequences from GenBank given a vector of accession numbers! APE also has impressive tree plotting capabilities, of which we will only scratch the surface today (flip through Chapter 4 of the Paradis book to see what more APE can do).<br />
<br />
'''apTreeshape''' is a different R package (written by Nicolas Bortolussi et al.) that we will also make use of today.<br />
<br />
To install APE and apTreeshape, start R and type the following at the R command prompt:<br />
> install.packages("ape")<br />
> install.packages("apTreeshape")<br />
Assuming you are connected to the internet, R should locate these packages and install them for you. After they are installed, you will need to load them into R in order to use them (note that no quotes are used this time):<br />
> library(ape)<br />
> library(apTreeshape)<br />
You should never again need to issue the <tt>install.packages</tt> command for APE and apTreeshape, but you will need to use the <tt>library</tt> command to load them whenever you want to use them.<br />
<br />
== Reading in trees from a file and exploring tree data structure ==<br />
Download [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/yule.tre this tree file] and save it as a file named <tt>yule.tre</tt> in a new folder somewhere on your computer. Tell R where this folder is using the <tt>setwd</tt> (set working directory) command. For example, I created a folder named <tt>apelab</tt> on my desktop, so I typed this to make that folder my working directory:<br />
> setwd("/Users/plewis/Desktop/apelab")<br />
Now you should be able to read in the tree using this ape command (the <tt>t</tt> is an arbitrary name I chose for the variable used to hold the tree; you could use <tt>tree</tt> if you want):<br />
> t <- read.nexus("yule.tre")<br />
We use <tt>read.nexus</tt> because the tree at hand is in NEXUS format, but APE has a variety of functions to read in different tree file types. If APE can't read your tree file, then give the package treeio a spin. APE stores trees as an object of type "phylo". <br />
<br />
==== Getting a tree summary ====<br />
Some basic information about the tree can be obtained by simply typing the name of the variable you used to store the tree:<br />
> t<br />
<br />
Phylogenetic tree with 20 tips and 19 internal nodes.<br />
<br />
Tip labels:<br />
B, C, D, E, F, G, ...<br />
<br />
Rooted; includes branch lengths.<br />
<br />
==== Obtaining vectors of tip and internal node labels ====<br />
The variable <tt>t</tt> has several attributes that can be queried by following the variable name with a dollar sign and then the name of the attribute. For example, the vector of tip labels can be obtained as follows:<br />
> t$tip.label<br />
[1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U"<br />
The internal node labels, if they exist, can be obtained this way:<br />
> t$node.label<br />
NULL<br />
The result above means that labels for the internal nodes were not stored with this tree.<br />
<br />
==== Obtaining the nodes attached to each edge ==== <br />
The nodes at the ends of all the edges in the tree can be had by asking for the edge attribute:<br />
> t$edge<br />
[,1] [,2]<br />
[1,] 21 22<br />
[2,] 22 23<br />
[3,] 23 1<br />
. . .<br />
. . .<br />
. . .<br />
[38,] 38 12 <br />
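The edge matrix is easier to read on a tiny tree. The sketch below uses a hypothetical three-tip tree (not <tt>yule.tre</tt>) and relies on APE's numbering convention, which you can also see in the output above: tips are numbered 1 to n and the root is numbered n+1 (21 for our 20-tip tree). The data frame column names are my own.<br />

```r
library(ape)

# A three-tip toy tree, entered as a Newick string
tr <- read.tree(text = "((A:1,B:1):1,C:2);")
n_tips <- length(tr$tip.label)

# Each row of tr$edge is one edge: column 1 is the ancestral node,
# column 2 is the descendant node. Rows line up with tr$edge.length.
edges <- data.frame(
  ancestor = tr$edge[, 1],
  descendant = tr$edge[, 2],
  length = tr$edge.length,
  descendant_is_tip = tr$edge[, 2] <= n_tips
)

# Show the tip name where the descendant is a tip (NA for internal nodes)
edges$tip_name <- ifelse(edges$descendant_is_tip,
                         tr$tip.label[edges$descendant], NA)
print(edges)
```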
<br />
==== Obtaining a vector of edge lengths ==== <br />
The edge lengths can be printed thusly:<br />
> t$edge.length<br />
[1] 0.07193600 0.01755700 0.17661500 0.02632500 0.01009100 0.06893900 0.07126000 0.03970200 0.01912900<br />
[10] 0.01243000 0.01243000 0.03155800 0.05901300 0.08118600 0.08118600 0.00476400 0.14552600 0.07604800<br />
[19] 0.00070400 0.06877400 0.06877400 0.02423800 0.02848800 0.01675100 0.01675100 0.04524000 0.19417200<br />
[28] 0.07015000 0.12596600 0.06999200 0.06797400 0.00201900 0.00201900 0.12462600 0.07128300 0.00004969<br />
[37] 0.00004969 0.07133200<br />
<br />
==== About this tree ==== <br />
The tree in the file <tt>yule.tre</tt> was obtained using PAUP from 10,000 nucleotide sites simulated from a Yule tree. The model used to generate the simulated data (HKY model, kappa = 4, base frequencies = 0.3 A, 0.2 C, 0.2 G, and 0.3 T, no rate heterogeneity) was also used in the analysis by PAUP (the final ML tree was made ultrametric by enforcing the clock constraint).<br />
<!-- I analyzed these data in BEAST for part of a lecture. See slide 22 and beyond in [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/lectures/DivTimeBayesianBEAST.pdf this PDF file] for details.--><br />
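If you want to confirm that a tree really is ultrametric, APE can check this for you. Here is a small sketch on a hypothetical three-tip tree; the same two calls should work on the tree you read from <tt>yule.tre</tt>, though I am only showing the toy case here.<br />

```r
library(ape)

# Toy clock-like tree: every tip is 2 units from the root
tr <- read.tree(text = "((A:1,B:1):1,C:2);")

# TRUE if all tips are the same distance from the root
print(is.ultrametric(tr))

# Node ages measured backward from the tips (only meaningful for ultrametric trees)
print(branching.times(tr))
```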
<br />
== Fun with plotting trees in APE ==<br />
<br />
You can plot the tree using all defaults with this ape command:<br />
> plot(t)<br />
<br />
Let's try changing a few defaults and plot the tree in a variety of ways. All of the following change just one default option, but feel free to combine these to create the plot you want.<br />
<br />
==== Left-facing, up-facing, or down-facing trees ====<br />
> plot(t, direction="l")<br />
> plot(t, direction="u")<br />
> plot(t, direction="d")<br />
The default is to plot the tree right-facing (<tt>direction="r"</tt>).<br />
<br />
==== Hide the taxon names ====<br />
> plot(t, show.tip.label=FALSE)<br />
The default behavior is to show the taxon names.<br />
<br />
==== Make the edges thicker ====<br />
> plot(t, edge.width=4)<br />
An edge width of 1 is the default. If you specify several edge widths, APE will alternate them as it draws the tree:<br />
> plot(t, edge.width=c(1,2,3,4))<br />
<br />
==== Color the edges ====<br />
> plot(t, edge.color="red")<br />
Black edges are the default. If you specify several edge colors, APE will alternate them as it draws the tree:<br />
> plot(t, edge.color=c("black","red","green","blue"))<br />
<br />
==== Make taxon labels smaller or larger ====<br />
> plot(t, cex=0.5)<br />
The cex parameter governs relative scaling of the taxon labels, with 1.0 being the default. Thus, the command above makes the taxon labels half the default size. To double the size, use<br />
> plot(t, cex=2.0)<br />
<br />
==== Plot tree as an unrooted or radial tree ====<br />
> plot(t, type="u")<br />
The default type is "p" (phylogram), but "c" (cladogram), "u" (unrooted), "r" (radial) are other options. Some of these options (e.g. "r") create very funky looking trees, leading me to think there is something about the tree description in the file <tt>yule.tre</tt> that APE is not expecting.<br />
<br />
==== Labeling internal nodes ====<br />
> plot(t)<br />
> nodelabels()<br />
This is primarily useful if you want to annotate one of the nodes:<br />
> plot(t)<br />
> nodelabels("Clade A", 22)<br />
> nodelabels("Clade B", 35)<br />
To put the labels inside a circle rather than a rectangle, use <tt>frame="c"</tt> rather than the default (<tt>frame="r"</tt>). To use a background color of white rather than the default "lightblue", use <tt>bg="white"</tt>:<br />
> plot(t)<br />
> nodelabels("Clade A", 22, frame="c", bg="white")<br />
> nodelabels("Clade B", 35, frame="c", bg="yellow")<br />
<br />
==== Adding a scale bar ====<br />
> plot(t)<br />
> add.scale.bar(length=0.05)<br />
The above commands add a scale bar to the bottom left of the plot. To add a scale going all the way across the bottom of the plot, try this:<br />
> plot(t)<br />
> axisPhylo()<br />
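The options in this section can be combined in a single call. Here is a sketch using a hypothetical five-tip tree entered as a string, so that it runs without <tt>yule.tre</tt>; the particular colors and sizes are arbitrary choices of mine.<br />

```r
library(ape)

# Toy ultrametric tree (every tip is 3 units from the root)
tr <- read.tree(text = "(((A:1,B:1):1,C:2):1,(D:1.5,E:1.5):1.5);")

# Thicker blue edges and slightly smaller tip labels in one plot call
plot(tr, edge.width = 2, edge.color = "blue", cex = 0.8)

# Number the internal nodes inside white circles
nodelabels(frame = "c", bg = "white")

# Scale bar representing 0.5 units of edge length
add.scale.bar(length = 0.5)
```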
<br />
== Diversification analyses ==<br />
APE can perform some lineage-through-time type analyses. The tree read in from the file <tt>yule.tre</tt> that you already have in memory is perfect for testing APE's diversification analyses because we know (since it is based on simulated data) that this tree was generated under a pure-birth (Yule) model.<br />
<br />
==== Lineage through time plots ====<br />
This is a rather small tree, so a lineage through time (LTT) plot will be rather crude, but let's go through the motions anyway.<br />
> ltt.plot(t)<br />
LTT plots usually have a log scale for the number of lineages (y-axis), and this can be easily accomplished:<br />
> ltt.plot(t, log = "y")<br />
Now add a line extending from the point (t = -0.265, N = 2) to the point (t = 0, N = 20) using the command <tt>segments</tt> (note that <tt>segments</tt> is a core R command, not something added by the APE package):<br />
> segments(-0.265, 2, 0, 20, lty="dotted")<br />
The slope of this line should (ideally) be equal to the birth rate of the Yule process used to generate the tree, which was <math>\lambda=10</math>.<br />
<div style="background-color:#ccccff"><br />
* Calculate the slope of this line. Is it close to the birth rate 10? {{title|8.689|answer}}<br />
</div><br />
If you get something like 68 for the slope, then you forgot to take the natural log of 2 and 20. The plot uses a log scale for the y-axis, so the two endpoints of the dotted line are really (-0.265, log(2)) and (0, log(20)).<br />
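Here is the slope calculation spelled out in base R, both the right way (on the log scale) and the wrong way (on the raw counts). The variable names are my own.<br />

```r
# Endpoints of the dotted line: (time, number of lineages)
time_start <- -0.265; n_start <- 2
time_end <- 0;        n_end <- 20

# Correct: the y-axis is on a log scale, so use natural logs of the counts
slope_log <- (log(n_end) - log(n_start)) / (time_end - time_start)

# Incorrect: forgetting the log gives a much larger number
slope_raw <- (n_end - n_start) / (time_end - time_start)

print(slope_log)  # about 8.689
print(slope_raw)  # about 67.9
```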
<br />
==== Birth/death analysis ====<br />
Now let's perform a birth/death analysis. APE's <tt>birthdeath</tt> command estimates the birth and death rates using the node ages in a tree:<br />
> birthdeath(t)<br />
Estimation of Speciation and Extinction Rates<br />
with Birth-Death Models <br />
<br />
Phylogenetic tree: t <br />
Number of tips: 20 <br />
Deviance: -120.4538 <br />
Log-likelihood: 60.22689 <br />
Parameter estimates:<br />
d / b = 0 StdErr = 0 <br />
b - d = 8.674513 StdErr = 1.445897 <br />
(b: speciation rate, d: extinction rate)<br />
Profile likelihood 95% confidence intervals:<br />
d / b: [-1.193549, 0.5286254]<br />
b - d: [5.25955, 13.32028]<br />
See the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] to learn more about the <tt>birthdeath</tt> function. The output indicates that it was correctly able to detect that the death rate was 0, and the estimated birth rate should be very close to the value you calculated for the slope of the dotted line in the LTT plot.<br />
<div style="background-color:#ccccff"><br />
* What is the death rate estimated by APE? {{title|0.0|answer}}<br />
* Is the true diversification rate within one standard deviation of the estimated diversification rate? {{title|yes, one standard deviation each side of (10-0) is the interval from 8.554 to 11.446, and 8.674513 is inside that interval|answer}}<br />
* Are the true diversification and relative extinction values within the profile likelihood 95% confidence intervals? {{title|yes|answer}}<br />
</div><br />
A "profile" likelihood is obtained by varying one parameter in the model and re-estimating all the other parameters conditional on the current value of the focal parameter. This is, technically, not the correct way of getting a confidence interval, but is easier to compute and may be more stable for small samples than getting confidence intervals the correct way.<br />
<div style="background-color:#ccccff"><br />
* What is the correct way to interpret the 95% confidence interval for b - d: [5.25955, 13.32028]? Is it that there is 95% chance that the true value of b - d is in that interval? {{title|no, that is the definition of a Bayesian credible interval|answer}}<br />
* Or, does it mean that our estimate (8.674513) is within the middle 95% of values that would be produced if the true b - d value was in that interval? {{title|yes|answer}}<br />
</div><br />
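Because the estimated death rate is 0, the <tt>b - d</tt> estimate can be checked against the simple pure-birth (Yule) estimator: the number of speciation events after the root split (number of tips minus 2) divided by the sum of all edge lengths. The sketch below applies that formula to the edge lengths printed earlier in this lab. Treat it as a sanity check under the pure-birth assumption, not as a description of what <tt>birthdeath</tt> does internally. APE also has a <tt>yule</tt> function that fits the pure-birth model directly.<br />

```r
# Edge lengths copied from the t$edge.length output shown earlier
edge_len <- c(
  0.07193600, 0.01755700, 0.17661500, 0.02632500, 0.01009100,
  0.06893900, 0.07126000, 0.03970200, 0.01912900, 0.01243000,
  0.01243000, 0.03155800, 0.05901300, 0.08118600, 0.08118600,
  0.00476400, 0.14552600, 0.07604800, 0.00070400, 0.06877400,
  0.06877400, 0.02423800, 0.02848800, 0.01675100, 0.01675100,
  0.04524000, 0.19417200, 0.07015000, 0.12596600, 0.06999200,
  0.06797400, 0.00201900, 0.00201900, 0.12462600, 0.07128300,
  0.00004969, 0.00004969, 0.07133200
)
n_tips <- 20

# Pure-birth estimate: speciation events after the root / total tree length
lambda_hat <- (n_tips - 2) / sum(edge_len)
print(lambda_hat)  # about 8.67
```

This lands very close to the 8.674513 reported by <tt>birthdeath</tt>; the small difference comes from rounding in the printed edge lengths.<br />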
<br />
==== Analyses involving tree shape ====<br />
The apTreeshape package (as the name implies) lets you perform analyses of tree shape (which measure how balanced or imbalanced a tree is). apTreeshape stores trees differently than APE, so you can't use a tree object that you created with APE in functions associated with apTreeshape. You can, however, convert a "phylo" object from APE to a "treeshape" object used by apTreeshape:<br />
> ts <- as.treeshape(t)<br />
Here, I'm assuming that <tt>t</tt> still refers to the tree you read in from the file <tt>yule.tre</tt> using the APE command <tt>read.nexus</tt>. We can now obtain a measure of '''tree imbalance''' known as Colless's index:<br />
> c <- colless(ts)<br />
> c<br />
[1] 44<br />
The formula for Colless's index is easy to understand. Each internal node branches into a left and right lineage. The absolute value of the difference between the number of left-hand descendants and right-hand descendants provides a measure of how imbalanced the tree is with respect to that particular node. Adding these imbalance measures up over all internal nodes yields Colless's overall tree imbalance index:<br />
<br />
<math>I_C = \sum_{j=1}^{n-1} |L_j - R_j|</math><br />
<br />
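To make the formula concrete, here is a small base R sketch that computes Colless's index for trees written as nested lists (a tip is a character string, an internal node is a list of two subtrees). This representation and the function names are my own and have nothing to do with how apTreeshape stores trees.<br />

```r
# Number of tips below a node
count_tips <- function(node) {
  if (is.character(node)) return(1)
  count_tips(node[[1]]) + count_tips(node[[2]])
}

# Colless's index: sum over internal nodes of |left tips - right tips|
colless_index <- function(node) {
  if (is.character(node)) return(0)
  left <- node[[1]]
  right <- node[[2]]
  abs(count_tips(left) - count_tips(right)) +
    colless_index(left) + colless_index(right)
}

balanced  <- list(list("A", "B"), list("C", "D"))   # ((A,B),(C,D))
pectinate <- list(list(list("A", "B"), "C"), "D")   # (((A,B),C),D)

print(colless_index(balanced))   # 0
print(colless_index(pectinate))  # 3
```

The perfectly balanced four-tip tree scores 0, while the comb-like (pectinate) tree scores 2 + 1 + 0 = 3, the largest value possible for four tips.<br />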
apTreeshape can do an analysis to assess whether the tree has the amount of imbalance one would expect from a Yule tree:<br />
> colless.test(ts, model = "yule", alternative="greater", n.mc = 1000)<br />
This generates 1000 trees from a Yule process and compares the Colless index from our tree (44) to the distribution of such indices obtained from the simulated trees. The p-value is the proportion of the 1000 trees generated from the null distribution that have indices greater than 44 (i.e. the proportion of Yule trees that are more ''im''balanced than our tree). If the p-value was 0.5, for example, then our tree would be right in the middle of the distribution expected for Yule trees. If the p-value was 0.01, however, it would mean that very few Yule trees are as imbalanced as our tree, which would make it hard to believe that our tree is a Yule tree.<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that a Yule process generated our tree? {{title|I got 0.288 so no, the Yule process can easily generate trees with the same level of imbalance as our tree|answer}}<br />
</div><br />
You can also test one other model with the <tt>colless.test</tt> function: the "proportional to distinguishable arrangements" (or PDA) model. This null model produces random trees by starting with three taxa joined to a single internal node, then building upon that by adding new taxa to randomly-chosen (discrete uniform distribution) edges that already exist in the (unrooted) tree. The edge to which a new taxon is added can be an internal edge as well as a terminal edge, which causes this process to produce trees with a different distribution of shapes than the Yule process, which only adds new taxa to the tips of a growing rooted tree.<br />
> colless.test(ts, model = "pda", alternative="greater", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a PDA tree? {{title|I got 0.912, so no, PDA trees are almost always less imbalanced (more balanced) than our tree|answer}}<br />
</div><br />
You might also wish to test whether our tree is more ''balanced'' than would be expected under the Yule or PDA models. apTreeshape lets you look at the other end of the distribution too:<br />
> colless.test(ts, model = "yule", alternative="less", n.mc = 1000)<br />
> colless.test(ts, model = "pda", alternative="less", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value for the first test above indicate that our tree is more balanced than would be expected were it a Yule tree? {{title|I got 0.716, so no, most Yule trees are more balanced than our tree|answer}}<br />
* Does the p-value for the second test above indicate that our tree is more balanced than would be expected were it a PDA tree? {{title|I got 0.064, so no, our tree is not significantly more balanced than a PDA tree, but it is getting close|answer}}<br />
</div><br />
<br />
You might want to see a histogram of the Colless index like that used to determine the p-values for the tests above. apTreeshape lets you generate 10 trees each with 20 tips under a Yule model as follows:<br />
> rtreeshape(10,20,model="yule")<br />
That spits out a summary of the 10 trees created, but what we really wanted was to know the Colless index for each of the trees generated. To do this, use the R command <tt>sapply</tt> to call the apTreeshape command <tt>colless</tt> for each tree generated by the <tt>rtreeshape</tt> command:<br />
> sapply(rtreeshape(10,20,model="yule"),FUN=colless)<br />
[1] 38 92 85 91 73 71 94 75 72 93<br />
<div style="background-color:#ccccff"><br />
* Do you expect your Colless indices to be identical to the ones above? Why or why not? {{title|It is highly unlikely that they will be identical. The indices are for trees generated at random under a Yule model, so each run produces a different set| answer}}<br />
</div><br />
That's more like it! Now, generate 1000 Yule trees instead of just 10, and create a histogram using the standard R command <tt>hist</tt>:<br />
> yulecolless <- sapply(rtreeshape(1000,20,model="yule"),FUN=colless)<br />
> hist(yulecolless)<br />
Now create a histogram for PDA trees:<br />
> pdacolless <- sapply(rtreeshape(1000,20,model="pda"),FUN=colless)<br />
> hist(pdacolless)<br />
Use the following to compare the mean Colless index for the PDA trees to the Yule trees:<br />
> summary(yulecolless)<br />
> summary(pdacolless)<br />
<div style="background-color:#ccccff"><br />
* Which generates the most balanced trees, on average: Yule or PDA? {{title|Yule trees are more balanced, with mean Colless index 39 versus 81 for PDA|answer}}<br />
</div><br />
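With simulated indices in hand, you can approximate the Monte Carlo p-values yourself. The sketch below is my own reconstruction of the idea behind <tt>colless.test</tt>, not its actual code; in particular, whether ties are counted (>= versus >) is an assumption on my part, and the seed is arbitrary.<br />

```r
library(apTreeshape)

set.seed(1)     # arbitrary seed so the run is repeatable
observed <- 44  # Colless index of our tree

# Colless index for each of 1000 simulated 20-tip Yule trees
sims <- sapply(rtreeshape(1000, 20, model = "yule"), FUN = colless)

# Proportion of simulated trees at least as imbalanced as ours
p_greater <- mean(sims >= observed)

# Proportion of simulated trees at least as balanced as ours
p_less <- mean(sims <= observed)

print(c(greater = p_greater, less = p_less))
```

Your numbers will wobble from run to run (and will not exactly match the p-values from <tt>colless.test</tt>), but they should be in the same neighborhood.<br />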
<br />
apTreeshape provides one more function (<tt>likelihood.test</tt>) that performs a likelihood ratio test of the PDA model against the Yule model null hypothesis. This test says that we cannot reject the null hypothesis of a Yule model in favor of the PDA model:<br />
> likelihood.test(ts)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a Yule tree? {{title|I got 0.4095684, so no, our tree is consistent with a Yule tree|answer}}<br />
</div><br />
<br />
== Independent contrasts ==<br />
APE can compute Felsenstein's independent contrasts, as well as several other methods for assessing phylogenetically-corrected correlations between traits that I did not discuss in lecture (autocorrelation, generalized least squares, mixed models and variance partitioning, and the very interesting Ornstein-Uhlenbeck model, which can be used to assess the correlation between a continuous character and a discrete habitat variable).<br />
<br />
Today, however, we will just play with independent contrasts and phylogenetic generalized least squares (PGLS) regression. Let's try to use APE's <tt>pic</tt> command to reproduce the Independent Contrasts example from lecture:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
| style="width:10%" | Var<br />
| style="width:10%" | X*<br />
| style="width:10%" | Y*<br />
|-<br />
| style="width:10%" | A-B <br />
| -6<br />
| -2<br />
| 4<br />
| -3<br />
| -1<br />
|-<br />
| C-D <br />
| -4<br />
| -2<br />
| 4<br />
| -2<br />
| -1<br />
|-<br />
| E-F<br />
| 10<br />
| -4<br />
| 9<br />
| 3.333<br />
| -1.333<br />
|}<br />
In the table, X and Y denote the raw contrasts, while X* and Y* denote the rescaled contrasts (raw contrasts divided by the square root of the variance). The correlation among the rescaled contrasts was -0.98916.<br />
<br />
==== Enter the tree ====<br />
Start by entering the tree:<br />
> t <- read.tree(text="((A:2,B:2)E:3.5,(C:2,D:2)F:3.5)G;")<br />
The argument <tt>text</tt> is needed because we are entering the Newick tree description in the form of a string, not supplying a file name. Note that I have labeled the root node G and the two interior nodes E (ancestor of A and B) and F (ancestor of C and D).<br />
<br />
Plot the tree to make sure the tree definition worked:<br />
> plot(t)<br />
<br />
==== Enter the data ====<br />
Now we must tell APE the X and Y values. Do this by supplying vectors of numbers. We will tell APE which tips these numbers are associated with in the next step:<br />
> x <- c(27, 33, 18, 22)<br />
> y <- c(122, 124, 126, 128)<br />
<br />
Here's how we tell APE what taxa the numbers belong to:<br />
> names(x) <- c("A","B","C","D")<br />
> names(y) <- c("A","B","C","D")<br />
If you want to avoid repetition, you can enter the names for both x and y simultaneously like this:<br />
> names(x) <- names(y) <- c("A","B","C","D")<br />
<br />
==== Compute independent contrasts ====<br />
Now compute the contrasts with the APE function <tt>pic</tt>:<br />
> cx <- pic(x,t)<br />
> cy <- pic(y,t)<br />
The variables cx and cy are arbitrary; you could use different names for these if you wanted. Let's see what values cx and cy hold:<br />
> cx<br />
G E F <br />
3.333333 -3.000000 -2.000000 <br />
> cy<br />
G E F <br />
-1.333333 -1.000000 -1.000000 <br />
The top row in each case holds the node name in the tree; the bottom row holds the rescaled contrasts.<br />
<br />
==== Label interior nodes with the contrasts ====<br />
APE makes it fairly easy to label the tree with the contrasts:<br />
> plot(t)<br />
> nodelabels(round(cx,3), adj=c(0,-1), frame="n")<br />
> nodelabels(round(cy,3), adj=c(0,+1), frame="n")<br />
In the nodelabels command, we supply the numbers with which to label the nodes. The vectors cx and cy contain information about the nodes to label, so APE knows from this which numbers to place at which nodes in the tree. The round command simply rounds the contrasts to 3 decimal places. The <tt>adj</tt> setting adjusts the spacing so that the contrasts for X are not placed directly on top of the contrasts for Y. The command <tt>adj=c(0,-1)</tt> causes the labels to be horizontally displaced 0 lines and vertically displaced one line up (the -1 means go up 1 line) from where they would normally be plotted. The contrasts for Y are displaced vertically one line down from where they would normally appear. Finally, the <tt>frame="n"</tt> just says to not place a box or circle around the labels.<br />
<br />
You should find that the contrasts are the same as those shown as X* and Y* in the table above (as well as the summary slide in the Independent Contrasts lecture). <br />
<br />
Computing the correlation coefficient is as easy as:<br />
> cor(cx, cy)<br />
[1] -0.9891585<br />
<br />
== Phylogenetic Generalized Least Squares (PGLS) regression ==<br />
<br />
Now let's reproduce the PGLS regression example given in lecture. Here are the data we used:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
|-<br />
| style="width:10%" | A<br />
| 2<br />
| 1<br />
|-<br />
| B<br />
| 3<br />
| 3<br />
|-<br />
| C<br />
| 1<br />
| 2<br />
|-<br />
| D<br />
| 4<br />
| 7<br />
|-<br />
| E<br />
| 5<br />
| 6<br />
|}<br />
<br />
Enter the data as we did for the Independent Contrasts example:<br />
> x <- c(2,3,1,4,5)<br />
> y <- c(1,3,2,7,6)<br />
> names(x) <- names(y) <- c("A","B","C","D","E")<br />
<br />
In order to carry out generalized least squares regression, we will need the <tt>gls</tt> command, which is part of the <tt>nlme</tt> R package. Thus, you will need to load this package before you can use the <tt>gls</tt> command:<br />
> library(nlme)<br />
<br />
Let's first do an ordinary linear regression for comparison:<br />
> m0 <- gls(y ~ x)<br />
> summary(m0) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|-0.4|answer}}<br />
* What is the estimate of the slope? {{title|1.4|answer}}<br />
</div><br />
<br />
Let's plot the regression line on the original data:<br />
> plot(x, y, pch=19, xlim=c(0,6),ylim=c(0,8))<br />
> text(x, y, labels=c("A", "B", "C", "D", "E"), pos=4, offset=1)<br />
> segments(0, -0.4, 6, -0.4 + 1.4*6, lwd=2, lty="solid", col="blue")<br />
You will have noticed that the '''first line''' plots the points using a filled circle (pch=19) and specifies that the x-axis should go from 0 to 6 and the y-axis from 0 to 8. The '''second line''' labels the points with the taxon to make it easier to interpret the plot. Here, pos=4 says to put the labels to the right of each point (pos = 1, 2, 3 means below, left, and above, respectively) and offset=1 specifies how far away from the point each label should be. The '''third line''' draws the regression line using the intercept and slope values provided by gls, making the line width 2 (lwd=2) and solid (lty="solid") and blue (col="blue").<br />
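If you would rather not retype the intercept and slope, you can pull them from a fitted model. The sketch below swaps in base R's <tt>lm</tt> for the ordinary regression (it should give the same estimates as <tt>gls</tt> without a correlation structure) and uses <tt>abline</tt> in place of <tt>segments</tt>; both substitutions are my choice, not part of the lecture example.<br />

```r
x <- c(2, 3, 1, 4, 5)
y <- c(1, 3, 2, 7, 6)

# Ordinary least squares fit
fit <- lm(y ~ x)
b <- coef(fit)   # named vector: "(Intercept)" and "x"
print(b)

# Plot the points and draw the fitted line from the stored coefficients
plot(x, y, pch = 19, xlim = c(0, 6), ylim = c(0, 8))
abline(a = b[["(Intercept)"]], b = b[["x"]], lwd = 2, col = "blue")
```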
<br />
To do PGLS, we will need to enter the tree with edge lengths:<br />
> t <- read.tree(text="(((A:1,B:1)F:1,C:2)G:1,(D:0.5,E:0.5)H:2.5)I;")<br />
<br />
You are ready to estimate the parameters of the PGLS regression model:<br />
> m1 <- gls(y ~ x, correlation=corBrownian(1,t))<br />
> summary(m1) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|1.7521186|answer}}<br />
* What is the estimate of the slope? {{title|0.7055085|answer}}<br />
* The <tt>corBrownian</tt> function specified for the correlation in the gls command comes from the APE package. What does <tt>corBrownian</tt> do? You might want to check out the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] {{title|it computes the variance-covariance matrix from the tree t assuming a Brownian motion model|answer}}<br />
* In <tt>corBrownian(1,t)</tt>, t is the tree, but what do you think the 1 signifies? {{title|It is the variance per unit time for the Brownian motion model|answer}}<br />
</div><br />
<br />
Assuming you still have the plot window available, let's add the PGLS regression line to the existing plot (if you've closed the plot window you will have to recreate the plot first):<br />
> segments(0, 1.7521186, 6, 1.7521186 + 0.7055085*6, lwd=2, lty="dotted", col="blue")<br />
<br />
== Literature Cited ==<br />
<references/><br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_APE_Lab&diff=38943Phylogenetics: APE Lab2018-04-03T17:54:10Z<p>Paul Lewis: </p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|This lab is an introduction to some of the capabilities of APE, a phylogenetic analysis package written for the R language. You may want to review the [[Phylogenetics: R Primer|R Primer]] lab if you've already forgotten everything you learned about R.<br />
|}<br />
<br />
== Installing APE and apTreeshape ==<br />
'''APE''' is a package largely written and maintained by Emmanuel Paradis, who has written a very nice book<ref>Paradis, E. 2006. Analysis of phylogenetics and evolution with R. Springer. ISBN: 0-387-32914-5</ref> explaining in detail how to use APE. APE is designed to be used inside the [http://www.r-project.org/ R] programming language, to which you were introduced earlier in the semester (see [[Phylogenetics: R Primer]]). APE can do an impressive array of analyses. For example, it is possible to estimate trees using neighbor-joining or maximum likelihood, estimate ancestral states (for either discrete or continuous data), perform Sanderson's penalized likelihood relaxed clock method to estimate divergence times, evaluate Felsenstein's independent contrasts, estimate birth/death rates, perform bootstrapping, and even automatically pull sequences from GenBank given a vector of accession numbers! APE also has impressive tree plotting capabilities, of which we will only scratch the surface today (flip through Chapter 4 of the Paradis book to see what more APE can do).<br />
<br />
'''apTreeshape''' is a different R package (written by Nicolas Bortolussi et al.) that we will also make use of today.<br />
<br />
To install APE and apTreeshape, start R and type the following at the R command prompt:<br />
> install.packages("ape")<br />
> install.packages("apTreeshape")<br />
Assuming you are connected to the internet, R should locate these packages and install them for you. After they are installed, you will need to load them into R in order to use them (note that no quotes are used this time):<br />
> library(ape)<br />
> library(apTreeshape)<br />
You should never again need to issue the <tt>install.packages</tt> command for APE and apTreeshape, but you will need to use the <tt>library</tt> command to load them whenever you want to use them.<br />
<br />
== Reading in trees from a file and exploring tree data structure ==<br />
Download [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/yule.tre this tree file] and save it as a file named <tt>yule.tre</tt> in a new folder somewhere on your computer. Tell R where this folder is using the <tt>setwd</tt> (set working directory) command. For example, I created a folder named <tt>apelab</tt> on my desktop, so I typed this to make that folder my working directory:<br />
> setwd("/Users/plewis/Desktop/apelab")<br />
Now you should be able to read in the tree using this ape command (the <tt>t</tt> is an arbitrary name I chose for the variable used to hold the tree; you could use <tt>tree</tt> if you want):<br />
> t <- read.nexus("yule.tre")<br />
We use <tt>read.nexus</tt> because the tree at hand is in NEXUS format, but APE has a variety of functions to read in different tree file types. If APE can't read your tree file, then give the package treeio a spin. APE stores trees as an object of type "phylo".<br />
<br />
==== Getting a tree summary ====<br />
Some basic information about the tree can be obtained by simply typing the name of the variable you used to store the tree:<br />
> t<br />
<br />
Phylogenetic tree with 20 tips and 19 internal nodes.<br />
<br />
Tip labels:<br />
B, C, D, E, F, G, ...<br />
<br />
Rooted; includes branch lengths.<br />
<br />
==== Obtaining vectors of tip and internal node labels ====<br />
The variable <tt>t</tt> has several attributes that can be queried by following the variable name with a dollar sign and then the name of the attribute. For example, the vector of tip labels can be obtained as follows:<br />
> t$tip.label<br />
[1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U"<br />
The internal node labels, if they exist, can be obtained this way:<br />
> t$node.label<br />
NULL<br />
The result above means that labels for the internal nodes were not stored with this tree.<br />
<br />
==== Obtaining the nodes attached to each edge ==== <br />
The nodes at the ends of all the edges in the tree can be had by asking for the edge attribute:<br />
> t$edge<br />
[,1] [,2]<br />
[1,] 21 22<br />
[2,] 22 23<br />
[3,] 23 1<br />
. . .<br />
. . .<br />
. . .<br />
[38,] 38 12 <br />
<br />
==== Obtaining a vector of edge lengths ==== <br />
The edge lengths can be printed thusly:<br />
> t$edge.length<br />
[1] 0.07193600 0.01755700 0.17661500 0.02632500 0.01009100 0.06893900 0.07126000 0.03970200 0.01912900<br />
[10] 0.01243000 0.01243000 0.03155800 0.05901300 0.08118600 0.08118600 0.00476400 0.14552600 0.07604800<br />
[19] 0.00070400 0.06877400 0.06877400 0.02423800 0.02848800 0.01675100 0.01675100 0.04524000 0.19417200<br />
[28] 0.07015000 0.12596600 0.06999200 0.06797400 0.00201900 0.00201900 0.12462600 0.07128300 0.00004969<br />
[37] 0.00004969 0.07133200<br />
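To see how the <tt>edge</tt> matrix and the <tt>edge.length</tt> vector fit together, here is a minimal base-R sketch that does not need APE at all. It hand-codes both structures for a hypothetical four-tip tree, <tt>((A:2,B:2):3.5,(C:2,D:2):3.5);</tt>, assuming APE's convention that the tips are numbered first and the root gets the next number. The names <tt>depth</tt> and <tt>tip_depths</tt> are my own, not part of APE.<br />

```r
# Hand-built edge matrix: column 1 is the ancestral node, column 2 the descendant.
# Tips are 1-4 (A-D), node 5 is the root, nodes 6 and 7 are the two cherries.
edge <- rbind(c(5, 6), c(6, 1), c(6, 2),
              c(5, 7), c(7, 3), c(7, 4))
edge.length <- c(3.5, 2, 2, 3.5, 2, 2)   # one length per row of edge

# Tips never appear as an ancestor; the root never appears as a descendant.
tips <- sort(setdiff(edge[, 2], edge[, 1]))
root <- setdiff(edge[, 1], edge[, 2])

# Walk from a node back to the root, summing edge lengths along the way.
depth <- function(node) {
  total <- 0
  while (node != root) {
    row   <- which(edge[, 2] == node)
    total <- total + edge.length[row]
    node  <- edge[row, 1]
  }
  total
}

tip_depths <- sapply(tips, depth)
print(tip_depths)   # every tip is 5.5 from the root, so this tree is ultrametric
```

The same two lookups (which rows have a given node in column 1 or column 2) are all you need to navigate a "phylo" object by hand.<br />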
<br />
==== About this tree ==== <br />
The tree in the file <tt>yule.tre</tt> was obtained using PAUP from 10,000 nucleotide sites simulated on a Yule tree. The model used to generate the simulated data (HKY model, kappa = 4, base frequencies = 0.3 A, 0.2 C, 0.2 G, and 0.3 T, no rate heterogeneity) was also used in the analysis by PAUP (the final ML tree was made ultrametric by enforcing the clock constraint).<br />
<!-- I analyzed these data in BEAST for part of a lecture. See slide 22 and beyond in [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/lectures/DivTimeBayesianBEAST.pdf this PDF file] for details.--><br />
<br />
== Fun with plotting trees in APE ==<br />
<br />
You can plot the tree using all defaults with this ape command:<br />
> plot(t)<br />
<br />
Let's try changing a few defaults and plot the tree in a variety of ways. All of the following change just one default option, but feel free to combine these to create the plot you want.<br />
<br />
==== Left-facing, up-facing, or down-facing trees ====<br />
> plot(t, direction="l")<br />
> plot(t, direction="u")<br />
> plot(t, direction="d")<br />
The default is to plot the tree right-facing (<tt>direction="r"</tt>).<br />
<br />
==== Hide the taxon names ====<br />
> plot(t, show.tip.label=FALSE)<br />
The default behavior is to show the taxon names.<br />
<br />
==== Make the edges thicker ====<br />
> plot(t, edge.width=4)<br />
An edge width of 1 is the default. If you specify several edge widths, APE will alternate them as it draws the tree:<br />
> plot(t, edge.width=c(1,2,3,4))<br />
<br />
==== Color the edges ====<br />
> plot(t, edge.color="red")<br />
Black edges are the default. If you specify several edge colors, APE will alternate them as it draws the tree:<br />
> plot(t, edge.color=c("black","red","green","blue"))<br />
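"Alternate" here is just ordinary R vector recycling: the values are handed out in the order of the rows of <tt>t$edge</tt> and reused once they run out. The sketch below spells that out in base R, assuming the 38 edges of the <tt>yule.tre</tt> tree; the variable names are only for illustration.<br />

```r
n_edges <- 38   # number of rows in t$edge for the yule.tre tree

# What recycling a short vector over all edges amounts to
widths <- rep(c(1, 2, 3, 4), length.out = n_edges)
colors <- rep(c("black", "red", "green", "blue"), length.out = n_edges)

# First few edges with the width and color each one would receive
head(data.frame(edge = seq_len(n_edges), width = widths, color = colors), 6)
```

Because the assignment follows edge order rather than clades, a vector built this way (one value per edge) is the usual route to coloring a particular lineage.<br />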
<br />
==== Make taxon labels smaller or larger ====<br />
> plot(t, cex=0.5)<br />
The cex parameter governs relative scaling of the taxon labels, with 1.0 being the default. Thus, the command above makes the taxon labels half the default size. To double the size, use<br />
> plot(t, cex=2.0)<br />
<br />
==== Plot tree as an unrooted or radial tree ====<br />
> plot(t, type="u")<br />
The default type is "p" (phylogram), but "c" (cladogram), "u" (unrooted), "r" (radial) are other options. Some of these options (e.g. "r") create very funky looking trees, leading me to think there is something about the tree description in the file <tt>yule.tre</tt> that APE is not expecting.<br />
<br />
==== Labeling internal nodes ====<br />
> plot(t)<br />
> nodelabels()<br />
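Called with no arguments, <tt>nodelabels</tt> labels every internal node with its node number. This is primarily useful for finding the number of a node that you want to annotate:<br />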
> plot(t)<br />
> nodelabels("Clade A", 22)<br />
> nodelabels("Clade B", 35)<br />
To put the labels inside a circle rather than a rectangle, use <tt>frame="c"</tt> rather than the default (<tt>frame="r"</tt>). To use a background color of white rather than the default "lightblue", use <tt>bg="white"</tt>:<br />
> plot(t)<br />
> nodelabels("Clade A", 22, frame="c", bg="white")<br />
> nodelabels("Clade B", 35, frame="c", bg="yellow")<br />
<br />
==== Adding a scale bar ====<br />
> plot(t)<br />
> add.scale.bar(length=0.05)<br />
The above commands add a scale bar to the bottom left of the plot. To add a scale going all the way across the bottom of the plot, try this:<br />
> plot(t)<br />
> axisPhylo()<br />
<br />
== Diversification analyses ==<br />
APE can perform some lineage-through-time type analyses. The tree read in from the file <tt>yule.tre</tt> that you already have in memory is perfect for testing APE's diversification analyses because we know (since it is based on simulated data) that this tree was generated under a pure-birth (Yule) model.<br />
<br />
==== Lineage through time plots ====<br />
This is a rather small tree, so a lineage through time (LTT) plot will be rather crude, but let's go through the motions anyway.<br />
> ltt.plot(t)<br />
LTT plots usually have a log scale for the number of lineages (y-axis), and this can be easily accomplished:<br />
> ltt.plot(t, log = "y")<br />
Now add a line extending from the point (t = -0.265, N = 2) to the point (t = 0, N = 20) using the command <tt>segments</tt> (note that <tt>segments</tt> is a core R command, not something added by the APE package):<br />
> segments(-0.265, 2, 0, 20, lty="dotted")<br />
The slope of this line should (ideally) be equal to the birth rate of the Yule process used to generate the tree, which was <math>\lambda=10</math>.<br />
<div style="background-color:#ccccff"><br />
* Calculate the slope of this line. Is it close to the birth rate 10? {{title|8.689|answer}}<br />
</div><br />
If you get something like 68 for the slope, then you forgot to take the natural log of 2 and 20. The plot uses a log scale for the y-axis, so the two endpoints of the dotted line are really (-0.265, log(2)) and (0, log(20)).<br />
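The two calculations can be laid side by side in base R. This is only arithmetic on the endpoints of the dotted line; the variable names are my own.<br />

```r
t0 <- -0.265; n0 <- 2     # left end of the dotted line
t1 <- 0;      n1 <- 20    # right end of the dotted line

slope_log <- (log(n1) - log(n0)) / (t1 - t0)   # slope on the log scale used by the plot
slope_raw <- (n1 - n0) / (t1 - t0)             # the mistaken calculation on the raw scale

round(slope_log, 3)   # 8.689
round(slope_raw, 1)   # 67.9
```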
<br />
==== Birth/death analysis ====<br />
Now let's perform a birth/death analysis. APE's <tt>birthdeath</tt> command estimates the birth and death rates using the node ages in a tree:<br />
> birthdeath(t)<br />
Estimation of Speciation and Extinction Rates<br />
with Birth-Death Models <br />
<br />
Phylogenetic tree: t <br />
Number of tips: 20 <br />
Deviance: -120.4538 <br />
Log-likelihood: 60.22689 <br />
Parameter estimates:<br />
d / b = 0 StdErr = 0 <br />
b - d = 8.674513 StdErr = 1.445897 <br />
(b: speciation rate, d: extinction rate)<br />
Profile likelihood 95% confidence intervals:<br />
d / b: [-1.193549, 0.5286254]<br />
b - d: [5.25955, 13.32028]<br />
The output indicates that it was correctly able to detect that the death rate (shown as deaths per birth) was 0, and the estimated birth rate should be very close to the value you calculated for the slope of the dotted line in the LTT plot.<br />
<div style="background-color:#ccccff"><br />
* What is the death rate estimated by APE? {{title|0.0|answer}}<br />
* Is the true diversification rate within one standard deviation of the estimated diversification rate? {{title|yes, one standard deviation each side of (10-0) is the interval from 8.554 to 11.446, and 8.674513 is inside that interval|answer}}<br />
* Are the true birth and death rates within the profile likelihood 95% confidence intervals? {{title|yes|answer}}<br />
</div><br />
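The <tt>birthdeath</tt> output reports the ratio d / b and the difference b - d rather than b and d themselves. The base-R sketch below shows how to back out the individual rates and how to check the one-standard-error question; the numbers are copied from the output above and the variable names are hypothetical.<br />

```r
d_over_b  <- 0          # "d / b" from the birthdeath output
b_minus_d <- 8.674513   # "b - d" from the birthdeath output
se        <- 1.445897   # StdErr reported for b - d

# Since b - d = b * (1 - d/b), the speciation rate follows directly
b <- b_minus_d / (1 - d_over_b)   # speciation rate
d <- b * d_over_b                 # extinction rate

# Is the true diversification rate (10 - 0) within one standard error of the estimate?
true_rate <- 10
within_one_se <- abs(true_rate - b_minus_d) <= se
within_one_se   # TRUE
```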
A "profile" likelihood is obtained by varying one parameter in the model and re-estimating all the other parameters conditional on the current value of the focal parameter. This is, technically, not the correct way of getting a confidence interval, but is easier to compute and may be more stable for small samples than getting confidence intervals the correct way.<br />
<div style="background-color:#ccccff"><br />
* What is the correct way to interpret the 95% confidence interval for b - d: [5.25955, 13.32028]? Is it that there is 95% chance that the true value of b - d is in that interval? {{title|no, that is the definition of a Bayesian credible interval|answer}}<br />
* Or, does it mean that our estimate (8.674513) is within the middle 95% of values that would be produced if the true b - d value was in that interval? {{title|yes|answer}}<br />
</div><br />
<br />
==== Analyses involving tree shape ====<br />
The apTreeshape package (as the name implies) lets you perform analyses of tree shape (which measure how balanced or imbalanced a tree is). apTreeshape stores trees differently than APE, so you can't use a tree object that you created with APE in functions associated with apTreeshape. You can, however, convert a "phylo" object from APE to a "treeshape" object used by apTreeshape:<br />
> ts <- as.treeshape(t)<br />
Here, I'm assuming that <tt>t</tt> still refers to the tree you read in from the file <tt>yule.tre</tt> using the APE command <tt>read.nexus</tt>. We can now obtain a measure of '''tree imbalance''' known as Colless's index:<br />
> c <- colless(ts)<br />
> c<br />
[1] 44<br />
The formula for Colless's index is easy to understand. Each internal node branches into a left and right lineage. The absolute value of the difference between the number of left-hand descendants and right-hand descendants provides a measure of how imbalanced the tree is with respect to that particular node. Adding these imbalance measures up over all internal nodes yields Colless's overall tree imbalance index:<br />
<br />
<math>I_C = \sum_{j=1}^{n-1} |L_j - R_j|</math><br />
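To make the formula concrete, here is a small base-R sketch that computes the index directly from an APE-style edge matrix. It is not how apTreeshape does it internally (I have not checked), and it assumes a strictly bifurcating rooted tree; <tt>colless_index</tt> and the two hand-coded example trees are my own inventions.<br />

```r
# Colless's index from an APE-style edge matrix (column 1 = ancestor, column 2 = descendant).
colless_index <- function(edge) {
  children <- function(node) edge[edge[, 1] == node, 2]
  n_tips <- function(node) {
    kids <- children(node)
    if (length(kids) == 0) return(1)            # a tip counts as one descendant
    sum(vapply(kids, n_tips, numeric(1)))
  }
  internal <- unique(edge[, 1])
  sum(vapply(internal, function(node) {
    kids <- children(node)                       # assumes exactly two children per node
    abs(n_tips(kids[1]) - n_tips(kids[2]))
  }, numeric(1)))
}

# (((A,B),C),(D,E)): tips 1-5, root 6, other internal nodes 7-9
balanced5 <- rbind(c(6, 7), c(7, 8), c(8, 1), c(8, 2), c(7, 3),
                   c(6, 9), c(9, 4), c(9, 5))
# (((A,B),C),D): a fully pectinate ("caterpillar") tree
ladder4 <- rbind(c(5, 6), c(6, 7), c(7, 1), c(7, 2), c(6, 3), c(5, 4))

colless_index(balanced5)   # 0 + 1 + 0 + 1 = 2
colless_index(ladder4)     # 0 + 1 + 2 = 3
```

The pectinate tree gives the largest possible value for its size, which is why large values of the index indicate imbalance.<br />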
<br />
apTreeshape can do an analysis to assess whether the tree has the amount of imbalance one would expect from a Yule tree:<br />
> colless.test(ts, model = "yule", alternative="greater", n.mc = 1000)<br />
This generates 1000 trees from a Yule process and compares the Colless index from our tree (44) to the distribution of such indices obtained from the simulated trees. The p-value is the proportion of the 1000 trees generated from the null distribution that have indices greater than 44 (i.e. the proportion of Yule trees that are more ''im''balanced than our tree). If the p-value was 0.5, for example, then our tree would be right in the middle of the distribution expected for Yule trees. If the p-value was 0.01, however, it would mean that very few Yule trees are as imbalanced as our tree, which would make it hard to believe that our tree is a Yule tree.<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that a Yule process generated our tree? {{title|I got 0.288 so no, the Yule process can easily generate trees with the same level of imbalance as our tree|answer}}<br />
</div><br />
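If you are curious what such a Monte Carlo test involves, the following base-R sketch imitates it without apTreeshape. It relies on one assumption that I am stating rather than deriving: under the Yule model, a node with n descendant tips sends 1, 2, ..., n - 1 of them to its left side with equal probability. The function <tt>yule_colless</tt> is hypothetical and is not what <tt>colless.test</tt> runs internally, so expect your numbers to differ a little from apTreeshape's.<br />

```r
# Colless index of one random Yule tree shape with n tips.
# Assumption: the number of tips on the left side of a node with n
# descendant tips is uniform on 1, ..., n - 1.
yule_colless <- function(n) {
  if (n < 2) return(0)
  left <- sample.int(n - 1, 1)
  abs(left - (n - left)) + yule_colless(left) + yule_colless(n - left)
}

set.seed(1)                                  # arbitrary seed, only for repeatability
sims <- replicate(1000, yule_colless(20))    # 1000 simulated trees with 20 tips each

mean(sims)        # average imbalance of the simulated Yule trees
mean(sims > 44)   # proportion more imbalanced than our tree (alternative = "greater")
```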
You can also test one other model with the <tt>colless.test</tt> function: the "proportional to distinguishable arrangements" (or PDA) model. This null model produces random trees by starting with three taxa joined to a single internal node, then building upon that by adding new taxa to randomly-chosen (discrete uniform distribution) edges that already exist in the (unrooted) tree. The edge to which a new taxon is added can be an internal edge as well as a terminal edge, which causes this process to produce trees with a different distribution of shapes than the Yule process, which only adds new taxa to the tips of a growing rooted tree.<br />
> colless.test(ts, model = "pda", alternative="greater", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a PDA tree? {{title|I got 0.912, so no, PDA trees are almost always more imbalanced (less balanced) than our tree|answer}}<br />
</div><br />
You might also wish to test whether our tree is more ''balanced'' than would be expected under the Yule or PDA models. apTreeshape lets you look at the other end of the distribution too:<br />
> colless.test(ts, model = "yule", alternative="less", n.mc = 1000)<br />
> colless.test(ts, model = "pda", alternative="less", n.mc = 1000)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value for the first test above indicate that our tree is more balanced than would be expected were it a Yule tree? {{title|I got 0.716, so no, most Yule trees are more balanced than our tree|answer}}<br />
* Does the p-value for the second test above indicate that our tree is more balanced than would be expected were it a PDA tree? {{title|I got 0.064, so no, our tree is not significantly more balanced than a PDA tree, but it is getting close|answer}}<br />
</div><br />
<br />
You might want to see a histogram of the Colless index like that used to determine the p-values for the tests above. apTreeshape lets you generate 10 trees each with 20 tips under a Yule model as follows:<br />
> rtreeshape(10,20,model="yule")<br />
That spits out a summary of the 10 trees created, but what we really wanted was to know the Colless index for each of the trees generated. To do this, use the R command <tt>sapply</tt> to call the apTreeshape command <tt>colless</tt> for each tree generated by the <tt>rtreeshape</tt> command:<br />
> sapply(rtreeshape(10,20,model="yule"),FUN=colless)<br />
[1] 38 92 85 91 73 71 94 75 72 93<br />
<div style="background-color:#ccccff"><br />
* Do you expect your Colless indices to be identical to the ones above? Why or why not? {{title|It is highly unlikely that they will be identical, because the trees are generated at random under the Yule model|answer}}<br />
</div><br />
That's more like it! Now, generate 1000 Yule trees instead of just 10, and create a histogram using the standard R command <tt>hist</tt>:<br />
> yulecolless <- sapply(rtreeshape(1000,20,model="yule"),FUN=colless)<br />
> hist(yulecolless)<br />
Now create a histogram for PDA trees:<br />
> pdacolless <- sapply(rtreeshape(1000,20,model="pda"),FUN=colless)<br />
> hist(pdacolless)<br />
Use the following to compare the mean Colless index for the PDA trees to the Yule trees:<br />
> summary(yulecolless)<br />
> summary(pdacolless)<br />
<div style="background-color:#ccccff"><br />
* Which generates the more balanced trees, on average: Yule or PDA? {{title|Yule trees are more balanced, with mean Colless index 39 versus 81 for PDA|answer}}<br />
</div><br />
<br />
apTreeshape provides one more function (<tt>likelihood.test</tt>) that performs a likelihood ratio test of the PDA model against the Yule model null hypothesis. This test says that we cannot reject the null hypothesis of a Yule model in favor of the PDA model:<br />
> likelihood.test(ts)<br />
<div style="background-color:#ccccff"><br />
* Does the p-value indicate that we should reject the hypothesis that our tree is a Yule tree? {{title|I got 0.4095684, so no, our tree is consistent with a Yule tree|answer}}<br />
</div><br />
<br />
== Independent contrasts ==<br />
APE can compute Felsenstein's independent contrasts, as well as several other methods for assessing phylogenetically-corrected correlations between traits that I did not discuss in lecture (autocorrelation, generalized least squares, mixed models and variance partitioning, and the very interesting Ornstein-Uhlenbeck model, which can be used to assess the correlation between a continuous character and a discrete habitat variable).<br />
<br />
Today, however, we will just play with independent contrasts and phylogenetic generalized least squares (PGLS) regression. Let's try to use APE's <tt>pic</tt> command to reproduce the Independent Contrasts example from lecture:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
| style="width:10%" | Var<br />
| style="width:10%" | X*<br />
| style="width:10%" | Y*<br />
|-<br />
| style="width:10%" | A-B <br />
| -6<br />
| -2<br />
| 4<br />
| -3<br />
| -1<br />
|-<br />
| C-D <br />
| -4<br />
| -2<br />
| 4<br />
| -2<br />
| -1<br />
|-<br />
| E-F<br />
| 10<br />
| -4<br />
| 9<br />
| 3.333<br />
| -1.333<br />
|}<br />
In the table, X and Y denote the raw contrasts, while X* and Y* denote the rescaled contrasts (raw contrasts divided by the square root of the variance). The correlation among the rescaled contrasts was -0.98916.<br />
<br />
==== Enter the tree ====<br />
Start by entering the tree:<br />
> t <- read.tree(text="((A:2,B:2)E:3.5,(C:2,D:2)F:3.5)G;")<br />
The argument <tt>text</tt> is needed because we are entering the Newick tree description in the form of a string, not supplying a file name. Note that I have labeled the root node G and the two interior nodes E (ancestor of A and B) and F (ancestor of C and D).<br />
<br />
Plot the tree to make sure the tree definition worked:<br />
> plot(t)<br />
<br />
==== Enter the data ====<br />
Now we must tell APE the X and Y values. Do this by supplying vectors of numbers. We will tell APE which tips these numbers are associated with in the next step:<br />
> x <- c(27, 33, 18, 22)<br />
> y <- c(122, 124, 126, 128)<br />
<br />
Here's how we tell APE what taxa the numbers belong to:<br />
> names(x) <- c("A","B","C","D")<br />
> names(y) <- c("A","B","C","D")<br />
If you want to avoid repetition, you can enter the names for both x and y simultaneously like this:<br />
> names(x) <- names(y) <- c("A","B","C","D")<br />
<br />
==== Compute independent contrasts ====<br />
Now compute the contrasts with the APE function <tt>pic</tt>:<br />
> cx <- pic(x,t)<br />
> cy <- pic(y,t)<br />
The variables cx and cy are arbitrary; you could use different names for these if you wanted. Let's see what values cx and cy hold:<br />
> cx<br />
G E F <br />
3.333333 -3.000000 -2.000000 <br />
> cy<br />
G E F <br />
-1.333333 -1.000000 -1.000000 <br />
The top row in each case holds the node name in the tree, the bottom row holds the rescaled contrasts.<br />
<br />
==== Label interior nodes with the contrasts ====<br />
APE makes it fairly easy to label the tree with the contrasts:<br />
> plot(t)<br />
> nodelabels(round(cx,3), adj=c(0,-1), frame="n")<br />
> nodelabels(round(cy,3), adj=c(0,+1), frame="n")<br />
In the nodelabels command, we supply the numbers with which to label the nodes. The vectors cx and cy contain information about the nodes to label, so APE knows from this which numbers to place at which nodes in the tree. The round command simply rounds the contrasts to 3 decimal places. The <tt>adj</tt> setting adjusts the spacing so that the contrasts for X are not placed directly on top of the contrasts for Y. The command <tt>adj=c(0,-1)</tt> causes the labels to be horizontally displaced 0 lines and vertically displaced one line up (the -1 means go up 1 line) from where they would normally be plotted. The contrasts for Y are displaced vertically one line down from where they would normally appear. Finally, the <tt>frame="n"</tt> just says to not place a box or circle around the labels.<br />
<br />
You should find that the contrasts are the same as those shown as X* and Y* in the table above (as well as the summary slide in the Independent Contrasts lecture). <br />
<br />
Computing the correlation coefficient is as easy as:<br />
> cor(cx, cy)<br />
[1] -0.9891585<br />
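It is worth convincing yourself that <tt>pic</tt> is doing nothing mysterious. The base-R sketch below redoes the calculation by hand for this particular tree; the function <tt>contrasts_by_hand</tt> is hypothetical and hard-codes the edge lengths (2 for every tip edge, 3.5 for the two internal edges), so it is a check on this example rather than a general implementation.<br />

```r
x <- c(A = 27, B = 33, C = 18, D = 22)
y <- c(A = 122, B = 124, C = 126, D = 128)

# Contrasts for one trait on the tree ((A:2,B:2)E:3.5,(C:2,D:2)F:3.5)G
contrasts_by_hand <- function(z) {
  v_tip <- 2                                          # every tip edge has length 2
  v_int <- 3.5 + (v_tip * v_tip) / (v_tip + v_tip)    # lengthened internal edges (4.5)
  e <- (z[["A"]] + z[["B"]]) / 2   # estimate at node E (equal tip edges, so a plain mean)
  f <- (z[["C"]] + z[["D"]]) / 2   # estimate at node F
  raw      <- c(G = e - f, E = z[["A"]] - z[["B"]], F = z[["C"]] - z[["D"]])
  variance <- c(G = 2 * v_int, E = 2 * v_tip, F = 2 * v_tip)
  raw / sqrt(variance)             # rescaled contrasts
}

cx <- contrasts_by_hand(x)
cy <- contrasts_by_hand(y)
round(cx, 3)    # G = 3.333, E = -3, F = -2
round(cy, 3)    # G = -1.333, E = -1, F = -1
cor(cx, cy)     # -0.9891585
```

The variances 4, 4, and 9 are the ones in the Var column of the table at the start of this section.<br />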
<br />
== Phylogenetic Generalized Least Squares (PGLS) regression ==<br />
<br />
Now let's reproduce the PGLS regression example given in lecture. Here are the data we used:<br />
{| class="wikitable" style="text-align:center"<br />
|- style="background-color: #eeeeee"<br />
| <br />
| style="width:10%" | X<br />
| style="width:10%" | Y<br />
|-<br />
| style="width:10%" | A<br />
| 2<br />
| 1<br />
|-<br />
| B<br />
| 3<br />
| 3<br />
|-<br />
| C<br />
| 1<br />
| 2<br />
|-<br />
| D<br />
| 4<br />
| 7<br />
|-<br />
| E<br />
| 5<br />
| 6<br />
|}<br />
<br />
Enter the data as we did for the Independent Contrasts example:<br />
> x <- c(2,3,1,4,5)<br />
> y <- c(1,3,2,7,6)<br />
> names(x) <- names(y) <- c("A","B","C","D","E")<br />
<br />
In order to carry out generalized least squares regression, we will need the <tt>gls</tt> command, which is part of the <tt>nlme</tt> R package. Thus, you will need to load this package before you can use the <tt>gls</tt> command:<br />
> library(nlme)<br />
<br />
Let's first do an ordinary linear regression for comparison:<br />
> m0 <- gls(y ~ x)<br />
> summary(m0) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|-0.4|answer}}<br />
* What is the estimate of the slope? {{title|1.4|answer}}<br />
</div><br />
<br />
Let's plot the regression line on the original data:<br />
> plot(x, y, pch=19, xlim=c(0,6),ylim=c(0,8))<br />
> text(x, y, labels=c("A", "B", "C", "D", "E"), pos=4, offset=1)<br />
> segments(0, -0.4, 6, -0.4 + 1.4*6, lwd=2, lty="solid", col="blue")<br />
You will have noticed that the '''first line''' plots the points using a filled circle (pch=19) and specifies that the x-axis should go from 0 to 6 and the y-axis from 0 to 8. The '''second line''' labels the points with the taxon names to make it easier to interpret the plot. Here, pos=4 says to put the labels to the right of each point (pos = 1, 2, 3 means below, left, and above, respectively) and offset=1 specifies how far away from the point each label should be. The '''third line''' draws the regression line using the intercept and slope values provided by gls, making the line width 2 (lwd=2), solid (lty="solid"), and blue (col="blue").<br />
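Typing the intercept and slope by hand invites typos, so you may prefer to pull them out of the fitted model. The sketch below uses <tt>lm</tt> from base R instead of <tt>gls</tt> (for an ordinary regression the coefficients should be the same) and computes the two endpoints that <tt>segments</tt> needs; the names <tt>fit</tt>, <tt>y_at_0</tt>, and <tt>y_at_6</tt> are my own.<br />

```r
x <- c(2, 3, 1, 4, 5)
y <- c(1, 3, 2, 7, 6)

fit <- lm(y ~ x)   # ordinary least squares from base R's stats package
b   <- coef(fit)   # named vector with elements "(Intercept)" and "x"

# Heights of the regression line at the left and right edges of the plot
y_at_0 <- b[["(Intercept)"]] + b[["x"]] * 0
y_at_6 <- b[["(Intercept)"]] + b[["x"]] * 6

round(c(intercept = b[["(Intercept)"]], slope = b[["x"]]), 4)   # -0.4 and 1.4
# With a plot open, segments(0, y_at_0, 6, y_at_6, lwd=2, col="blue") draws the same line
```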
<br />
To do PGLS, we will need to enter the tree with edge lengths:<br />
> t <- read.tree(text="(((A:1,B:1)F:1,C:2)G:1,(D:0.5,E:0.5)H:2.5)I;")<br />
<br />
You are ready to estimate the parameters of the PGLS regression model:<br />
> m1 <- gls(y ~ x, correlation=corBrownian(1,t))<br />
> summary(m1) <br />
<br />
<div style="background-color:#ccccff"><br />
* What is the estimate of the intercept? {{title|1.7521186|answer}}<br />
* What is the estimate of the slope? {{title|0.7055085|answer}}<br />
* The <tt>corBrownian</tt> function specified for the correlation in the gls command comes from the APE package. What does <tt>corBrownian</tt> do? You might want to check out the excellent [https://cran.r-project.org/web/packages/ape/ape.pdf APE manual] {{title|it computes the variance-covariance matrix from the tree t assuming a Brownian motion model|answer}}<br />
* In <tt>corBrownian(1,t)</tt>, t is the tree, but what do you think the 1 signifies? {{title|It is the variance per unit time for the Brownian motion model|answer}}<br />
</div><br />
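The PGLS estimates can also be reproduced with nothing but matrix algebra, which makes it clearer what the tree contributes. In the base-R sketch below I fill in the variance-covariance matrix by hand from the Newick string (each off-diagonal entry is the length of the path that two tips share from the root) and then apply the usual generalized least squares formula. The matrix name <tt>V</tt> and the hand-entered entries are my own reading of the tree, so treat this as a check on the example rather than a substitute for <tt>gls</tt>.<br />

```r
x <- c(A = 2, B = 3, C = 1, D = 4, E = 5)
y <- c(A = 1, B = 3, C = 2, D = 7, E = 6)
taxa <- names(x)

# Shared root-to-ancestor path lengths read off the tree
# (((A:1,B:1)F:1,C:2)G:1,(D:0.5,E:0.5)H:2.5)I
V <- diag(3, 5)                       # every tip is 3 units from the root
dimnames(V) <- list(taxa, taxa)
V["A", "B"] <- V["B", "A"] <- 2       # A and B share the edges leading to G (1) and F (1)
V["A", "C"] <- V["C", "A"] <- 1       # A and C share only the edge leading to G
V["B", "C"] <- V["C", "B"] <- 1
V["D", "E"] <- V["E", "D"] <- 2.5     # D and E share the edge leading to H

# GLS estimator: solve (X' V^-1 X) beta = X' V^-1 y
X    <- cbind(intercept = 1, slope = x)
Vinv <- solve(V)
beta <- solve(crossprod(X, Vinv %*% X), crossprod(X, Vinv %*% y))
round(beta[, 1], 7)   # intercept 1.7521186, slope 0.7055085
```

Multiplying <tt>V</tt> by any positive constant leaves <tt>beta</tt> unchanged, because the constant cancels in the formula.<br />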
<br />
Assuming you still have the plot window available, let's add the PGLS regression line to the existing plot (if you've closed the plot window you will have to recreate the plot first):<br />
> segments(0, 1.7521186, 6, 1.7521186 + 0.7055085*6, lwd=2, lty="dotted", col="blue")<br />
<br />
== Literature Cited ==<br />
<references/><br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BayesTraits_Lab&diff=38923Phylogenetics: BayesTraits Lab2018-03-30T18:54:23Z<p>Paul Lewis: </p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|In this lab you will learn how to use the program BayesTraits, written by Andrew Meade and Mark Pagel. BayesTraits can perform several analyses related to evaluating evolutionary correlation and ancestral state reconstruction in discrete morphological traits. This program is meant to replace the older programs Discrete and Multistate. In this lab, you will download and use BayesTraits entirely on your own laptop.<br />
|}<br />
<br />
== Download BayesTraits ==<br />
<br />
Go to [http://www.evolution.reading.ac.uk/ Mark Pagel's web site], click on the "Software" link, then click on the "Description and Downloads" link under "BayesTraits", and download the version specific to your platform. BayesTraits will unpack itself to a folder containing the program itself along with several tree and data files (e.g. <tt>Primates.txt</tt> and <tt>Primates.trees</tt>). I will hereafter refer to the folder containing these files as simply the '''BayesTraits folder'''. Go back to Mark Pagel's web site and '''download the manual''' for BayesTraits. This is a PDF file and should open in your browser window.<br />
<br />
== Download the tree and data files ==<br />
For this exercise, you will use data and trees used in the SIMMAP analyses presented in this paper (you should recognize the names of at least two of the authors of this paper):<br />
<br />
Jones C.S., Bakker F.T., Schlichting C.D., Nicotra A.B. 2009. Leaf shape evolution in the South African genus ''Pelargonium'' L'Her. (Geraniaceae). Evolution. 63:479–497.<br />
<br />
The data and trees were not made available in the online supplementary materials for this paper, but I have obtained permission to use them for this laboratory exercise.<br />
<!-- The links below are password-protected, so ask us for the username and password before clicking on the links: --><br />
<br />
:[http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/pelly.txt pelly.txt] This is the data file. It contains data for two traits (see below) for 154 taxa in the plant genus ''Pelargonium''.<br />
:[http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/pelly.tre pelly.tre] This is the tree file. It contains 99 trees sampled from an MCMC analysis of DNA sequences.<br />
<br />
<!--<br />
If you are using version 1 of BayesTraits, it will complain about basal polytomies in the trees. The following versions of the files provide a workaround (I deleted one taxon from both the data file and the tree file to eliminate the basal polytomy):<br />
:[http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/restricted/pellyrooted.txt pellyrooted.txt] This is the data file. It contains data for two traits (see below) for 154 taxa in the plant genus ''Pelargonium''.<br />
:[http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/restricted/pellyrooted.tre pellyrooted.tre] This is the tree file. It contains 99 trees sampled from an MCMC analysis of DNA sequences.<br />
--><br />
You should move these files to the aforementioned BayesTraits folder so that they can be easily found by the BayesTraits program.<br />
<br />
== Assessing the strength of association between two binary characters ==<br />
<br />
The first thing we will do is see if the two characters (leaf dissection and leaf venation) in <tt>pelly.txt</tt> are evolutionarily correlated. <br />
<br />
=== Trait 1: Leaf dissection ===<br />
The '''leaf dissection''' trait comprises two states (I've merged some states in the original data matrix to produce just 2 states): <br />
* 0 means leaves are ''entire'' (''unlobed'' or ''shallowly lobed'' in the original study), and <br />
* 1 means leaves are ''dissected'' (''lobed'', ''deeply lobed'', or ''dissected'' in the original study).<br />
<br />
=== Trait 2: Leaf venation ===<br />
The '''leaf venation''' trait comprises two states: <br />
* 0 means leaves are ''pinnately veined'' (one main vein runs down the long axis of the leaf blade), and <br />
* 1 means leaves are ''palmately veined'' (several major veins meet at the base of the leaf). <br />
<br />
To test whether these two traits are correlated, we will estimate the '''marginal likelihood''' under two models. The independence model assumes that the two traits evolve independently of one another. The dependence model allows the two traits to be correlated in their evolution. The model with the higher marginal likelihood is preferred. We discussed both of these models in lecture, along with the '''stepping-stone method''' that BayesTraits uses to estimate marginal likelihoods. You may wish to pull up those lectures, as well as the BayesTraits manual, to help answer the questions that you will encounter momentarily.<br />
<br />
=== Maximum Likelihood: Independence model ===<br />
<br />
'''If you are using Windows''', start BayesTraits by opening a console window, navigating to the BayesTraits folder, and typing the following:<br />
BayesTraitsV3 pelly.tre pelly.txt<br />
'''If you are using a Mac or Linux''', start BayesTraits by opening a terminal window, navigating to the BayesTraits folder, and typing the following:<br />
./BayesTraitsV3 pelly.tre pelly.txt<br />
<br />
You should see this selection appear:<br />
Please select the model of evolution to use.<br />
1) MultiState<br />
2) Discrete: Independent<br />
3) Discrete: Dependant<br />
4) Continuous: Random Walk (Model A)<br />
5) Continuous: Directional (Model B)<br />
6) Continuous: Regression<br />
7) Independent Contrast <br />
8) Independent Contrast: Correlation <br />
9) Independent Contrast: Regression<br />
10) Discrete: Covarion<br />
Press the 2 key and hit enter to select the Independent model. Now you should see these choices appear:<br />
Please Select the analysis method to use.<br />
1) Maximum Likelihood.<br />
2) MCMC<br />
Press the 1 key and hit enter to select maximum likelihood. Now you should see some output showing the choices you explicitly (or implicitly) made:<br />
Options:<br />
Model: Discete Independant<br />
Tree File Name: pelly.tre<br />
Data File Name: pelly.txt<br />
Log File Name: pelly.txt.log.txt<br />
Save Initial Trees: False<br />
Save Trees: False<br />
Summary: False<br />
Seed 3162959925<br />
Analsis Type: Maximum Likelihood<br />
ML attempt per tree: 10<br />
ML Max Evaluations: 20000<br />
ML Tolerance: 0.000001<br />
ML Algorithm: BOBYQA<br />
Rate Range: 0.000000 - 100.000000<br />
Precision: 64 bits<br />
Cores: 1<br />
No of Rates: 4<br />
Base frequency (PI's) None<br />
Character Symbols: 00,01,10,11<br />
Using a covarion model: False<br />
Restrictions:<br />
alpha1 None<br />
beta1 None <br />
alpha2 None<br />
beta2 None <br />
Tree Information<br />
Trees: 99<br />
Taxa: 154<br />
Sites: 1<br />
States: 4<br />
Now type <tt>run</tt> and hit enter to perform the analysis, which will consist of estimating the parameters of the independent model on each of the 99 trees contained in the pelly.tre file.<br />
Tree No Lh alpha1 beta1 alpha2 beta2 Root - P(0,0) Root - P(0,1) Root - P(1,0) Root - P(1,1)<br />
1 -157.362972 53.767527 34.523176 35.319157 20.707416 0.249998 0.250002 0.249998 0.250002<br />
2 -158.179984 53.313539 34.182683 36.038859 20.997536 0.249999 0.250001 0.249999 0.250001<br />
.<br />
.<br />
.<br />
98 -156.647307 52.357626 36.749282 27.270771 13.086248 0.250244 0.249756 0.250244 0.249756 <br />
99 -156.532925 52.321467 36.641688 27.402067 13.200124 0.250234 0.249767 0.250233 0.249766<br />
You will notice that BayesTraits created a new file: <tt>pelly.txt.log.txt</tt>. '''Rename this file''' <tt>ml-independant.txt</tt> so that it will not be overwritten the next time you run BayesTraits.<br />
<br />
Try to answer these questions using the output you have generated (you'll need to consult the BayesTraits manual, but ask us if anything doesn't make sense after giving it the ol' college try):<br />
<div style="background-color:#ccccff"><br />
* ''Which occurs at a faster rate: pinnate to palmate, or palmate to pinnate?'' {{title|the 0 (pinnate) to 1 (palmate) transition occurs at a faster rate|answer}}<br />
* ''Which occurs at a faster rate: entire to dissected, or dissected to entire?'' {{title|the 0 (entire) to 1 (dissected) transition occurs at a faster rate|answer}}<br />
* ''What do you think Root - P(1,1) means (i.e. the last column of numbers)?'' {{title|this is the probability that leaves were both dissected and palmately veined at the root of the tree|answer}}<br />
</div><br />
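Comparing rates by eye across 99 trees is tedious, so it can help to average each rate column over all trees. The following is a minimal sketch, not part of the lab materials: it assumes the results table in the log file is tab-delimited and begins at the line starting with <tt>Tree No</tt>, and the function names <tt>read_rate_table</tt> and <tt>column_means</tt> are made up for illustration. The sample text holds only the four rows (and the first six columns) shown above.<br />

```python
def read_rate_table(log_text):
    """Return (header, rows) for the table that starts at the 'Tree No' line.

    Assumption: columns are separated by tab characters.
    """
    lines = log_text.splitlines()
    # Locate the header line of the results table.
    start = next(i for i, line in enumerate(lines) if line.startswith("Tree No"))
    header = [h.strip() for h in lines[start].split("\t") if h.strip()]
    rows = []
    for line in lines[start + 1:]:
        fields = [f.strip() for f in line.split("\t") if f.strip()]
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(header, map(float, fields))))
    return header, rows


def column_means(rows, names):
    """Average the named columns over all rows (one row per tree)."""
    return {name: sum(r[name] for r in rows) / len(rows) for name in names}


# Excerpt of the independence-model output shown above (trees 1, 2, 98, 99).
sample = "\n".join([
    "Tree No\tLh\talpha1\tbeta1\talpha2\tbeta2",
    "1\t-157.362972\t53.767527\t34.523176\t35.319157\t20.707416",
    "2\t-158.179984\t53.313539\t34.182683\t36.038859\t20.997536",
    "98\t-156.647307\t52.357626\t36.749282\t27.270771\t13.086248",
    "99\t-156.532925\t52.321467\t36.641688\t27.402067\t13.200124",
])

header, rows = read_rate_table(sample)
means = column_means(rows, ["alpha1", "beta1", "alpha2", "beta2"])
for name, value in means.items():
    print(f"{name}: {value:.3f}")
```

To use this on your own results, read <tt>ml-independant.txt</tt> into a string with <tt>open(...).read()</tt> and pass that string to <tt>read_rate_table</tt> in place of <tt>sample</tt>.<br />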
<br />
=== Maximum Likelihood: Dependence model ===<br />
<br />
Run BayesTraits again, this time typing 3 on the first screen to choose the dependence model and again typing 1 on the second screen to select maximum likelihood. You should see this output showing the options selected:<br />
Options:<br />
Model: Discete Dependent<br />
Tree File Name: pelly.tre<br />
Data File Name: pelly.txt<br />
Log File Name: pelly.txt.log.txt<br />
Summary: False<br />
Seed 3601265953<br />
Analsis Type: Maximum Likelihood<br />
ML attempt per tree: 10<br />
Precision: 64 bits<br />
Cores: 1<br />
No of Rates: 8<br />
Base frequency (PI's) None<br />
Character Symbols: 00,01,10,11<br />
Using a covarion model: False<br />
Restrictions:<br />
q12 None<br />
q13 None<br />
q21 None<br />
q24 None<br />
q31 None<br />
q34 None<br />
q42 None<br />
q43 None<br />
Tree Information<br />
Trees: 99<br />
Taxa: 154<br />
Sites: 1<br />
States: 4<br />
Run the analysis. Here is an example of the output produced after you type <tt>run</tt> to start the analysis. The column headers don't quite line up with the columns, but you can fix this in a text editor or by copying and pasting the table-like output from the log file into a spreadsheet program:<br />
Tree No Lh q12 q13 q21 q24 q31 q34 q42 q43 Root - P(0,0) Root - P(0,1) Root - P(1,0) Root - P(1,1)<br />
1 -151.930254 66.451053 37.783888 0.000000 62.220033 23.997490 23.299393 46.110432 36.632979 0.24999 0.249981 0.250026 0.250000<br />
2 -152.925691 67.152271 38.611193 0.000000 60.925185 24.514488 23.937433 45.313366 37.199310 0.24999 0.249983 0.250023 0.250001<br />
.<br />
.<br />
.<br />
98 -150.816306 36.534843 27.359325 0.000000 66.563262 19.823546 24.944519 63.940577 31.074092 0.250048 0.249750 0.250304 0.249898<br />
99 -150.712705 37.316351 27.260833 0.000000 64.364694 20.107653 25.004246 60.945163 31.658536 0.250030 0.249779 0.250272 0.249919<br />
'''Before doing anything else, rename the file''' <tt>pelly.txt.log.txt</tt> to <tt>ml-dependant.txt</tt> so that it will not be overwritten the next time you run BayesTraits.<br />
<br />
Try to answer these questions using the output you have generated:<br />
<div style="background-color:#ccccff"><br />
* ''What type of joint evolutionary transitions seem to often have very low rates (look for an abundance of zeros in a column)?'' {{title|q21, which involves entire leaves changing from palmate to pinnate, and q43, which involves dissected leaves changing from palmate to pinnate|answer}}<br />
* ''What type of joint evolutionary transitions seem to often have very high rates (look for columns with rates in the hundreds)?'' {{title|q12, which involves entire leaves changing from pinnate to palmate, and q13, which involves pinnate leaves changing from entire to dissected|answer}}<br />
</div><br />
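The two maximum likelihood sessions above follow the same pattern: a model number, a method number, then <tt>run</tt>. If you would rather not retype these each time, the keystrokes can be assembled in a script. This is only a sketch under the assumption that BayesTraits accepts the same responses on standard input that you would otherwise type; <tt>build_commands</tt> and <tt>run_bayestraits</tt> are hypothetical helper names, and the executable name is the one used above for Mac and Linux.<br />

```python
import subprocess


def build_commands(model, method, extra=()):
    """Return the keystrokes for one session: model number, method number,
    any extra commands, then 'run', each followed by a newline."""
    lines = [str(model), str(method), *extra, "run"]
    return "\n".join(lines) + "\n"


def run_bayestraits(tree_file, data_file, commands, exe="./BayesTraitsV3"):
    """Start BayesTraits and feed it the commands on standard input.

    Assumption: the program reads its menu choices from standard input.
    """
    return subprocess.run([exe, tree_file, data_file],
                          input=commands, text=True, check=True)


ml_independent = build_commands(2, 1)  # 2 = Discrete: Independent, 1 = ML
ml_dependent = build_commands(3, 1)    # 3 = Discrete: Dependent, 1 = ML
print(ml_dependent)

# Uncomment to launch the program from inside the BayesTraits folder:
# run_bayestraits("pelly.tre", "pelly.txt", ml_dependent)
```

Remember that each run still writes to <tt>pelly.txt.log.txt</tt>, so the log file must be renamed between runs whether you type the commands or script them.<br />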
<br />
=== Bayesian MCMC: Dependence model ===<br />
<br />
Run BayesTraits again, typing 3 on the first screen to choose the dependence model and this time typing 2 on the second screen to select MCMC. You should see this output showing the options selected:<br />
Options:<br />
Model: Discete Dependent<br />
Tree File Name: pelly.tre<br />
Data File Name: pelly.txt<br />
Log File Name: pelly.txt.log.txt<br />
Summary: False<br />
Seed 3792635164<br />
Precision: 64 bits<br />
Cores: 1<br />
Analysis Type: MCMC<br />
Sample Period: 1000<br />
Iterations: 1010000<br />
Burn in: 10000<br />
MCMC ML Start: False<br />
Schedule File: pelly.txt.log.txt.Schedule.txt<br />
Rate Dev: AutoTune<br />
No of Rates: 8<br />
Base frequency (PI's) None<br />
Character Symbols: 00,01,10,11<br />
Using a covarion model: False<br />
Restrictions:<br />
q12 None<br />
q13 None<br />
q21 None<br />
q24 None<br />
q31 None<br />
q34 None<br />
q42 None<br />
q43 None <br />
Prior Information:<br />
Prior Categories: 100<br />
q12 uniform 0.00 100.00<br />
q13 uniform 0.00 100.00<br />
q21 uniform 0.00 100.00<br />
q24 uniform 0.00 100.00<br />
q31 uniform 0.00 100.00<br />
q34 uniform 0.00 100.00<br />
q42 uniform 0.00 100.00<br />
q43 uniform 0.00 100.00<br />
Tree Information<br />
Trees: 99<br />
Taxa: 154<br />
Sites: 1<br />
States: 4<br />
'''Before typing run''' type the following command, which tells BayesTraits to change all priors from the default Uniform(0,100) to an Exponential distribution with mean 30:<br />
pa exp 30<br />
Also type the following to ask BayesTraits to perform a stepping-stone analysis:<br />
stones 100 10000<br />
Now run the analysis. This will estimate 100 ratios to bridge the gap between posterior and prior, using a sample size of 10000 for each "stone".<br />
Here is an example of the output produced after you type <tt>run</tt> to start the analysis:<br />
Iteration Lh Tree No q12 q13 q21 q24 q31 q34 q42 q43 Root - P(0,0) Root - P(0,1) Root - P(1,0) Root - P(1,1)<br />
11000 -155.195365 78 14.423234 34.800270 8.845985 45.927148 12.622435 50.476188 52.844895 32.149168 0.250068 0.249969 0.249994 0.249968<br />
 12000 -154.161705 82 64.601017 12.382781 9.259134 51.796365 12.002095 23.744903 30.316089 21.865930 0.249936 0.249957 0.250095 0.250012
 .
.<br />
.<br />
1009000 -154.343996 30 33.555198 50.086092 11.294490 38.518607 24.461032 47.295157 43.477964 21.726938 0.250057 0.249939 0.250045 0.249959<br />
1010000 -154.195259 87 29.584898 35.410909 2.003582 61.981073 16.976124 14.895266 49.111354 14.419644 0.251115 0.247854 0.252551 0.248480<br />
'''Before doing anything else, rename the file''' <tt>pelly.txt.log.txt</tt> to <tt>mcmc-dependent.txt</tt>, and <tt>pelly.txt.log.Stones.txt</tt> to <tt>mcmc-dependent.Stones.txt</tt> so that they will not be overwritten the next time you run BayesTraits.<br />
<br />
You will notice a column not present in the likelihood analysis named '''Tree No''' that shows which of the 99 trees in the supplied <tt>pelly.tre</tt> tree file was chosen at random to be used for that particular sample point. BayesTraits is trying to ''mimic'' sampling trees from the posterior distribution here; it cannot ''actually'' sample trees from the posterior because we have given it only data for two morphological characters, which would not provide nearly enough information to estimate the phylogeny for 154 taxa.<br />
<br />
Try to answer these questions using the output you have generated:<br />
<div style="background-color:#ccccff"><br />
* ''What is the log marginal likelihood estimated using the stepping-stone method? This value is listed on the last line of the file <tt>mcmc-dependent.Stones.txt</tt> (your value may differ from mine slightly)'' {{title|I got -160.567444 |answer}}<br />
</div><br />
<br />
=== Bayesian MCMC: Independence model ===<br />
<br />
Run BayesTraits again, this time specifying the Independent model, and again using MCMC, <tt>pa exp 30</tt>, and <tt>stones 100 10000</tt>. Rename the output file from <tt>pelly.txt.log.txt</tt> to <tt>mcmc-independent.txt</tt>. Also rename <tt>pelly.txt.log.Stones.txt</tt> to <tt>mcmc-independent.Stones.txt</tt>.<br />
<div style="background-color:#ccccff"><br />
* ''What is the estimated log marginal likelihood for this analysis using the stepping-stone method?'' {{title|I got -162.693620|answer}}<br />
* ''Which is the better model (dependent or independent) according to these estimates of marginal likelihood?'' {{title|the dependent model has a slightly higher marginal likelihood and is thus preferred|answer}}<br />
</div><br />
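The comparison of the two marginal likelihoods can be put on a single scale as a log Bayes factor. The sketch below uses the two stepping-stone estimates quoted in the answers above (yours will differ slightly). The factor of 2 follows the convention I believe the BayesTraits manual uses, so check the manual before relying on it; <tt>log_bayes_factor</tt> is a made-up helper name.<br />

```python
def log_bayes_factor(log_ml_complex, log_ml_simple):
    """Twice the difference in log marginal likelihood (complex minus simple).

    Assumption: the factor of 2 matches the convention in the BayesTraits manual.
    """
    return 2.0 * (log_ml_complex - log_ml_simple)


# Stepping-stone estimates quoted in this lab (your values will vary).
log_ml_dependent = -160.567444
log_ml_independent = -162.693620

log_bf = log_bayes_factor(log_ml_dependent, log_ml_independent)
print(f"log Bayes factor (dependent vs. independent) = {log_bf:.3f}")
```

A positive value favors the dependent model; the manual gives guidance on how large the value should be before the support is considered strong.<br />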
<br />
=== Bayesian Reversible-jump MCMC ===<br />
<br />
Run BayesTraits again, specifying the Dependent model and MCMC. This time, also request the reversible-jump approach using the command<br />
rj exp 30<br />
This command also sets the prior, here an exponential distribution with mean 30. Type <tt>run</tt> to start; when the run finishes, rename the output file <tt>pelly.txt.log.txt</tt> to <tt>rjmcmc-dependent.txt</tt>. <br />
<br />
The reversible-jump approach carries out an MCMC analysis in which the number of model parameters (the dimension of the model) potentially changes from one iteration to the next. The full model allows each of the 8 rate parameters to be estimated separately, while other models restrict the values of some rate parameters to equal the values of other rate parameters. The output contains a column titled '''Model string''' that summarizes the model in a string of 8 symbols corresponding to the 8 rate parameters q12, q13, q21, q24, q31, q34, q42, and q43. For example, the model string "'1 0 Z 0 1 1 0 2" sets q21 to zero (Z), q13=q24=q42 (parameter group 0), q12=q31=q34 (parameter group 1), and q43 has its own non-zero value distinct from parameter groups 0 and 1. <br />
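Reading a model string becomes easier once it is unpacked into groups of rate parameters. The following is a small illustrative sketch (the function name <tt>decode_model_string</tt> is made up); it assumes the eight symbols appear in the order q12, q13, q21, q24, q31, q34, q42, q43 given above and that a leading apostrophe may be present.<br />

```python
RATE_NAMES = ["q12", "q13", "q21", "q24", "q31", "q34", "q42", "q43"]


def decode_model_string(model_string):
    """Map each symbol in a model string to the rate parameters that share it."""
    symbols = model_string.lstrip("'").split()
    if len(symbols) != len(RATE_NAMES):
        raise ValueError("expected 8 space-separated symbols")
    groups = {}
    for name, symbol in zip(RATE_NAMES, symbols):
        groups.setdefault(symbol, []).append(name)
    return groups


groups = decode_model_string("'1 0 Z 0 1 1 0 2")
for symbol, names in sorted(groups.items()):
    label = "fixed at zero" if symbol == "Z" else f"group {symbol}"
    print(f"{label}: {' = '.join(names)}")
```

Parameters listed under the same group share a single estimated rate, so the number of distinct non-Z symbols is the number of free rate parameters in that model.<br />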
<br />
You could copy the "spreadsheet" part of the output file into Excel and sort by the model string column, but let's instead use Python to summarize the output file. Download [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/btsummary.py btsummary.py] and run it as follows:<br />
python btsummary.py<br />
This should produce counts of model strings. (If it doesn't, check to make sure your output file is named <tt>rjmcmc-dependent.txt</tt> because btsummary.py tries to open a file by that name.) Answer the following questions using the counts provided by btsummary.py.<br />
<div style="background-color:#ccccff"><br />
* ''Which model string is most common?'' {{title|I got 0 0 Z 0 0 0 0 0 with count 979|answer}}<br />
* ''What does this model imply?'' {{title|all rates are the same except q21, which is forced to have rate zero. q21 equals 0 implies that entire,palmate leaves never change to entire,pinnate|answer}}<br />
</div><br />
<br />
Notice that many (but not all) model strings have Z for q21. One way to estimate the marginal posterior probability of the hypothesis that q21=0 is to sum the counts for all model strings that have Z in the third position (the one corresponding to q21) and divide by the total number of model strings sampled. It is easy to modify btsummary.py to do the counting for us: open btsummary.py and locate the line containing the [https://en.wikipedia.org/wiki/Regular_expression regular expression] search that pulls out all the model strings from the BayesTraits output file:<br />
model_list = re.findall("'[Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9]", stuff, re.M | re.S)<br />
The re.findall function performs a regular expression search of the text stored in the variable stuff looking for strings that have a series of 8 space-separated characters, each of which is either the character Z or a digit between 0 and 9 (inclusive). Copy this line, then comment out one copy by starting the line with the hash (#) character:<br />
#model_list = re.findall("'[Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9]", stuff, re.M | re.S)<br />
model_list = re.findall("'[Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9]", stuff, re.M | re.S)<br />
Now modify the copy such that it counts only models with Z in the third position of the model string.<br />
Rerun btsummary.py, and now the total matches should equal the number of model strings sampled in which q21=0.<br />
<div style="background-color:#ccccff"><br />
* ''So what is the estimated marginal posterior probability that q21=0?'' {{title|I got 0.995|answer}}<br />
* ''Why is the term marginal appropriate here (as in marginal posterior probability)?'' {{title|We are estimating the sum of all joint posteriors in which q21 equals 0|answer}}<br />
</div><br />
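The same estimate can be computed without editing the regular expression, by extracting every model string and checking one position. This is an independent sketch rather than a description of how btsummary.py works internally; <tt>zero_rate_fraction</tt> is a hypothetical name, and the four model strings in <tt>toy_output</tt> are made up purely to make the example runnable.<br />

```python
import re

# Same pattern as the search quoted above: an apostrophe followed by
# 8 space-separated symbols, each either Z or a digit.
MODEL_PATTERN = r"'[Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9]"


def zero_rate_fraction(text, position):
    """Fraction of sampled model strings with Z at the given 0-based position."""
    models = [m.lstrip("'").split() for m in re.findall(MODEL_PATTERN, text)]
    if not models:
        raise ValueError("no model strings found")
    hits = sum(1 for symbols in models if symbols[position] == "Z")
    return hits / len(models)


# Made-up stand-in for the contents of rjmcmc-dependent.txt (illustrative only).
toy_output = "\n".join([
    "'0 0 Z 0 0 0 0 0",
    "'0 0 Z 0 0 0 0 0",
    "'1 0 Z 0 1 1 0 2",
    "'0 0 0 0 0 0 0 0",
])

# q21 is the third rate parameter, i.e. position 2 when counting from 0.
fraction = zero_rate_fraction(toy_output, position=2)
print(f"estimated P(q21 = 0) = {fraction:.3f}")
```

To apply it to your own run, replace <tt>toy_output</tt> with the text read from <tt>rjmcmc-dependent.txt</tt>. Changing <tt>position</tt> gives the corresponding estimate for any of the other seven rates.<br />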
<br />
== Estimating ancestral states ==<br />
<br />
[[File:Xerophytevenation.png|right]] The Jones et al. 2009 study estimated ancestral states using SIMMAP. In particular, they found that the most recent common ancestor (MRCA) of the xerophytic (dry-adapted) clade of pelargoniums almost certainly had pinnate venation (see red circle in figure on right). Let's see what BayesTraits says.<br />
<br />
Start BayesTraits in the usual way, specifying 1 (Multistate) on the first screen and 2 (MCMC) on the second. After the options are output, type the following commands in, one line at a time, finishing with the run command:<br />
pa exp 30<br />
addtag xero alternans104 rapaceum130<br />
addmrca xero xero<br />
run<br />
The addmrca command tells BayesTraits to add columns of numbers to the output that display the probabilities of each state for each character in the most recent common ancestor of the taxa listed in the addtag command (2 taxa are sufficient to define the MRCA, but more taxa may be included). The column headers for the last four columns of output should be<br />
xero - S(0) - P(0) <-- character 0 (dissection), probability of state 0 (unlobed)<br />
xero - S(0) - P(1) <-- character 0 (dissection), probability of state 1 (dissected)<br />
xero - S(1) - P(0) <-- character 1 (venation), probability of state 0 (pinnate)<br />
xero - S(1) - P(1) <-- character 1 (venation), probability of state 1 (palmate)<br />
<div style="background-color:#ccccff"><br />
* ''Which state is most common at the xerophyte MRCA node for leaf venation?'' {{title|pinnate venation; xero - S(1) - P(0)|answer}}<br />
* ''Which state is most common at the xerophyte MRCA node for leaf dissection?'' {{title|dissected; xero - S(0) - P(1)|answer}}<br />
</div><br />
<br />
That concludes the introduction to BayesTraits. A glance through the manual will convince you that there is much more to this program than we have time to cover in a single lab period, but you should know enough now to explore the rest on your own if you need these features. <br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_BayesTraits_Lab&diff=38922Phylogenetics: BayesTraits Lab2018-03-30T18:53:22Z<p>Paul Lewis: </p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|In this lab you will learn how to use the program BayesTraits, written by Andrew Meade and Mark Pagel. BayesTraits can perform several analyses related to evaluating evolutionary correlation and ancestral state reconstruction in discrete morphological traits. This program is meant to replace the older programs Discrete and Multistate. In this lab, you will download and use BayesTraits entirely on your own laptop.<br />
|}<br />
<br />
== Download BayesTraits ==<br />
<br />
To download BayesTraits, go to [http://www.evolution.reading.ac.uk/ Mark Pagel's web site], click on the "Software" link, then click on the "Description and Downloads" link under "BayesTraits". Download the version specific to your platform. BayesTraits will unpack itself to a folder containing the program itself along with several tree and data files (e.g. <tt>Primates.txt</tt> and <tt>Primates.trees</tt>). I will hereafter refer to the folder containing these files as simply the '''BayesTraits folder'''. Go back to Mark Pagel's web site and '''download the manual''' for BayesTraits. This is a PDF file and should open in your browser window.<br />
<br />
== Download the tree and data files ==<br />
For this exercise, you will use data and trees used in the SIMMAP analyses presented in this paper (you should recognize the names of at least two of the authors of this paper):<br />
<br />
Jones C.S., Bakker F.T., Schlichting C.D., Nicotra A.B. 2009. Leaf shape evolution in the South African genus ''Pelargonium'' L'Her. (Geraniaceae). Evolution. 63:479–497.<br />
<br />
The data and trees were not made available in the online supplementary materials for this paper, but I have obtained permission to use them for this laboratory exercise.<br />
<!-- The links below are password-protected, so ask us for the username and password before clicking on the links: --><br />
<br />
:[http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/pelly.txt pelly.txt] This is the data file. It contains data for two traits (see below) for 154 taxa in the plant genus ''Pelargonium''.<br />
:[http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/pelly.tre pelly.tre] This is the tree file. It contains 99 trees sampled from an MCMC analysis of DNA sequences.<br />
<br />
<!--<br />
If you are using version 1 of BayesTraits, it will complain about basal polytomies in the trees. The following versions of the files provide a workaround (I deleted one taxon from both the data file and the tree file to eliminate the basal polytomy):<br />
:[http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/restricted/pellyrooted.txt pellyrooted.txt] This is the data file. It contains data for two traits (see below) for 154 taxa in the plant genus ''Pelargonium''.<br />
:[http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/restricted/pellyrooted.tre pellyrooted.tre] This is the tree file. It contains 99 trees sampled from an MCMC analysis of DNA sequences.<br />
--><br />
You should move these files to the aforementioned BayesTraits folder so that they can be easily found by the BayesTraits program.<br />
<br />
== Assessing the strength of association between two binary characters ==<br />
<br />
The first thing we will do is see if the two characters (leaf dissection and leaf venation) in <tt>pelly.txt</tt> are evolutionarily correlated. <br />
<br />
=== Trait 1: Leaf dissection ===<br />
The '''leaf dissection''' trait comprises two states (I've merged some states in the original data matrix to produce just 2 states): <br />
* 0 means leaves are ''entire'' (''unlobed'' or ''shallowly lobed'' in the original study), and <br />
* 1 means leaves are ''dissected'' (''lobed'', ''deeply lobed'', or ''dissected'' in the original study).<br />
<br />
=== Trait 2: Leaf venation ===<br />
The '''leaf venation''' trait comprises two states: <br />
* 0 means leaves are ''pinnately veined'' (one main vein runs down the long axis of the leaf blade), and <br />
* 1 means leaves are ''palmately veined'' (several major veins meet at the base of the leaf). <br />
<br />
To test whether these two traits are correlated, we will estimate the '''marginal likelihood''' under two models. The independence model assumes that the two traits evolve independently of one another. The dependence model allows the two traits to be correlated in their evolution. The model with the higher marginal likelihood is preferred. We discussed both of these models in lecture, along with the '''stepping-stone method''' that BayesTraits uses to estimate marginal likelihoods. You may wish to pull up those lectures, as well as the BayesTraits manual, to help answer the questions that you will encounter momentarily.<br />
<br />
=== Maximum Likelihood: Independence model ===<br />
<br />
'''If you are using Windows''', start BayesTraits by opening a console window, navigating to the BayesTraits folder, and typing the following:<br />
BayesTraitsV3 pelly.tre pelly.txt<br />
'''If you are using a Mac or Linux''', start BayesTraits by opening a terminal window, navigating to the BayesTraits folder, and typing the following:<br />
./BayesTraitsV3 pelly.tre pelly.txt<br />
<br />
You should see this selection appear:<br />
Please select the model of evolution to use.<br />
1) MultiState<br />
2) Discrete: Independent<br />
3) Discrete: Dependant<br />
4) Continuous: Random Walk (Model A)<br />
5) Continuous: Directional (Model B)<br />
6) Continuous: Regression<br />
7) Independent Contrast <br />
8) Independent Contrast: Correlation <br />
9) Independent Contrast: Regression<br />
10) Discrete: Covarion<br />
Press the 2 key and hit enter to select the Independent model. Now you should see these choices appear:<br />
Please Select the analysis method to use.<br />
1) Maximum Likelihood.<br />
2) MCMC<br />
Press the 1 key and hit enter to select maximum likelihood. Now you should see some output showing the choices you explicitly (or implicitly) made:<br />
Options:<br />
Model: Discete Independant<br />
Tree File Name: pelly.tre<br />
Data File Name: pelly.txt<br />
Log File Name: pelly.txt.log.txt<br />
Save Initial Trees: False<br />
Save Trees: False<br />
Summary: False<br />
Seed 3162959925<br />
Analsis Type: Maximum Likelihood<br />
ML attempt per tree: 10<br />
ML Max Evaluations: 20000<br />
ML Tolerance: 0.000001<br />
ML Algorithm: BOBYQA<br />
Rate Range: 0.000000 - 100.000000<br />
Precision: 64 bits<br />
Cores: 1<br />
No of Rates: 4<br />
Base frequency (PI's) None<br />
Character Symbols: 00,01,10,11<br />
Using a covarion model: False<br />
Restrictions:<br />
alpha1 None<br />
beta1 None <br />
alpha2 None<br />
beta2 None <br />
Tree Information<br />
Trees: 99<br />
Taxa: 154<br />
Sites: 1<br />
States: 4<br />
Now type <tt>run</tt> and hit enter to perform the analysis, which will consist of estimating the parameters of the independent model on each of the 99 trees contained in the pelly.tre file.<br />
Tree No Lh alpha1 beta1 alpha2 beta2 Root - P(0,0) Root - P(0,1) Root - P(1,0) Root - P(1,1)<br />
1 -157.362972 53.767527 34.523176 35.319157 20.707416 0.249998 0.250002 0.249998 0.250002<br />
2 -158.179984 53.313539 34.182683 36.038859 20.997536 0.249999 0.250001 0.249999 0.250001<br />
.<br />
.<br />
.<br />
98 -156.647307 52.357626 36.749282 27.270771 13.086248 0.250244 0.249756 0.250244 0.249756 <br />
99 -156.532925 52.321467 36.641688 27.402067 13.200124 0.250234 0.249767 0.250233 0.249766<br />
You will notice that BayesTraits created a new file: <tt>pelly.txt.log.txt</tt>. '''Rename this file''' <tt>ml-independant.txt</tt> so that it will not be overwritten the next time you run BayesTraits.<br />
<br />
Try to answer these questions using the output you have generated (you'll need to consult the BayesTraits manual, but ask us if anything doesn't make sense after giving it the ol' college try):<br />
<div style="background-color:#ccccff"><br />
* ''Which occurs at a faster rate: pinnate to palmate, or palmate to pinnate?'' {{title|the 0 (pinnate) to 1 (palmate) transition occurs at a faster rate|answer}}<br />
* ''Which occurs at a faster rate: entire to dissected, or dissected to entire?'' {{title|the 0 (entire) to 1 (dissected) transition occurs at a faster rate|answer}}<br />
* ''What do you think Root - P(1,1) means (i.e. the last column of numbers)?'' {{title|this is the probability that leaves were both dissected and palmately veined at the root of the tree|answer}}<br />
</div><br />
<br />
=== Maximum Likelihood: Dependence model ===<br />
<br />
Run BayesTraits again, this time typing 3 on the first screen to choose the dependence model and again typing 1 on the second screen to select maximum likelihood. You should see this output showing the options selected:<br />
Options:<br />
Model: Discete Dependent<br />
Tree File Name: pelly.tre<br />
Data File Name: pelly.txt<br />
Log File Name: pelly.txt.log.txt<br />
Summary: False<br />
Seed 3601265953<br />
Analsis Type: Maximum Likelihood<br />
ML attempt per tree: 10<br />
Precision: 64 bits<br />
Cores: 1<br />
No of Rates: 8<br />
Base frequency (PI's) None<br />
Character Symbols: 00,01,10,11<br />
Using a covarion model: False<br />
Restrictions:<br />
q12 None<br />
q13 None<br />
q21 None<br />
q24 None<br />
q31 None<br />
q34 None<br />
q42 None<br />
q43 None<br />
Tree Information<br />
Trees: 99<br />
Taxa: 154<br />
Sites: 1<br />
States: 4<br />
Run the analysis. Here is an example of the output produced after you type <tt>run</tt> to start the analysis. The column headers don't quite line up with the columns, but you can fix this in a text editor or by copying and pasting the table-like output from the log file into a spreadsheet program:<br />
Tree No Lh q12 q13 q21 q24 q31 q34 q42 q43 Root - P(0,0) Root - P(0,1) Root - P(1,0) Root - P(1,1)<br />
1 -151.930254 66.451053 37.783888 0.000000 62.220033 23.997490 23.299393 46.110432 36.632979 0.24999 0.249981 0.250026 0.250000<br />
2 -152.925691 67.152271 38.611193 0.000000 60.925185 24.514488 23.937433 45.313366 37.199310 0.24999 0.249983 0.250023 0.250001<br />
.<br />
.<br />
.<br />
98 -150.816306 36.534843 27.359325 0.000000 66.563262 19.823546 24.944519 63.940577 31.074092 0.250048 0.249750 0.250304 0.249898<br />
99 -150.712705 37.316351 27.260833 0.000000 64.364694 20.107653 25.004246 60.945163 31.658536 0.250030 0.249779 0.250272 0.249919<br />
'''Before doing anything else, rename the file''' <tt>pelly.txt.log.txt</tt> to <tt>ml-dependant.txt</tt> so that it will not be overwritten the next time you run BayesTraits.<br />
<br />
Try to answer these questions using the output you have generated:<br />
<div style="background-color:#ccccff"><br />
* ''What type of joint evolutionary transitions seem to often have very low rates (look for an abundance of zeros in a column)?'' {{title|q21, which involves entire leaves changing from palmate to pinnate, and q43, which involves dissected leaves changing from palmate to pinnate|answer}}<br />
* ''What type of joint evolutionary transitions seem to often have very high rates (look for columns with rates in the hundreds)?'' {{title|q12, which involves entire leaves changing from pinnate to palmate, and q13, which involves pinnate leaves changing from entire to dissected|answer}}<br />
</div><br />
<br />
=== Bayesian MCMC: Dependence model ===<br />
<br />
Run BayesTraits again, typing 3 on the first screen to choose the dependence model and this time typing 2 on the second screen to select MCMC. You should see this output showing the options selected:<br />
Options:<br />
Model: Discete Dependent<br />
Tree File Name: pelly.tre<br />
Data File Name: pelly.txt<br />
Log File Name: pelly.txt.log.txt<br />
Summary: False<br />
Seed 3792635164<br />
Precision: 64 bits<br />
Cores: 1<br />
Analysis Type: MCMC<br />
Sample Period: 1000<br />
Iterations: 1010000<br />
Burn in: 10000<br />
MCMC ML Start: False<br />
Schedule File: pelly.txt.log.txt.Schedule.txt<br />
Rate Dev: AutoTune<br />
No of Rates: 8<br />
Base frequency (PI's) None<br />
Character Symbols: 00,01,10,11<br />
Using a covarion model: False<br />
Restrictions:<br />
q12 None<br />
q13 None<br />
q21 None<br />
q24 None<br />
q31 None<br />
q34 None<br />
q42 None<br />
q43 None <br />
Prior Information:<br />
Prior Categories: 100<br />
q12 uniform 0.00 100.00<br />
q13 uniform 0.00 100.00<br />
q21 uniform 0.00 100.00<br />
q24 uniform 0.00 100.00<br />
q31 uniform 0.00 100.00<br />
q34 uniform 0.00 100.00<br />
q42 uniform 0.00 100.00<br />
q43 uniform 0.00 100.00<br />
Tree Information<br />
Trees: 99<br />
Taxa: 154<br />
Sites: 1<br />
States: 4<br />
'''Before typing run''' type the following command, which tells BayesTraits to change all priors from the default Uniform(0,100) to an Exponential distribution with mean 30:<br />
pa exp 30<br />
Also type the following to ask BayesTraits to perform a stepping-stone analysis:<br />
stones 100 10000<br />
Now run the analysis. This will estimate 100 ratios to bridge the gap between posterior and prior, using a sample size of 10000 for each &quot;stone&quot;.<br />
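<br />
To see what switching priors does, the following minimal Python sketch compares the default Uniform(0,100) density with an Exponential density of mean 30. It assumes only what the text states (that <tt>exp 30</tt> means an exponential with mean 30); the function and variable names are illustrative and are not part of BayesTraits.<br />

```python
import math

# Assumption: "exp 30" denotes an exponential prior with mean 30,
# as described in the text, so the rate parameter is 1/30.
PRIOR_MEAN = 30.0
UNIFORM_UPPER = 100.0  # upper bound of the default Uniform(0, 100) prior


def exponential_pdf(x, mean):
    """Density at x of an exponential distribution with the given mean."""
    return math.exp(-x / mean) / mean


def exponential_tail(x, mean):
    """Probability that an exponential variable with the given mean exceeds x."""
    return math.exp(-x / mean)


# The uniform prior has the same density everywhere between 0 and 100.
uniform_density = 1.0 / UNIFORM_UPPER
tail_beyond_100 = exponential_tail(UNIFORM_UPPER, PRIOR_MEAN)

for rate in (1.0, 30.0, 100.0):
    print(f"q = {rate:6.1f}  exponential density = {exponential_pdf(rate, PRIOR_MEAN):.5f}"
          f"  uniform density = {uniform_density:.5f}")
print(f"P(q > 100) under the exponential prior = {tail_beyond_100:.4f}")
```

The exponential prior puts more density on small rates than the uniform prior does, while still allowing (with small probability) rates above 100, which the uniform prior excludes entirely.<br />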
Here is an example of the output produced after you type <tt>run</tt> to start the analysis:<br />
Iteration Lh Tree No q12 q13 q21 q24 q31 q34 q42 q43 Root - P(0,0) Root - P(0,1) Root - P(1,0) Root - P(1,1)<br />
11000 -155.195365 78 14.423234 34.800270 8.845985 45.927148 12.622435 50.476188 52.844895 32.149168 0.250068 0.249969 0.249994 0.249968<br />
12000 -154.161705 82 64.601017 12.382781 9.259134 51.796365 12.002095 23.744903 30.316089 21.865930 0.249936 0.249957 0.250095 0.250012 .<br />
.<br />
.<br />
1009000 -154.343996 30 33.555198 50.086092 11.294490 38.518607 24.461032 47.295157 43.477964 21.726938 0.250057 0.249939 0.250045 0.249959<br />
1010000 -154.195259 87 29.584898 35.410909 2.003582 61.981073 16.976124 14.895266 49.111354 14.419644 0.251115 0.247854 0.252551 0.248480<br />
'''Before doing anything else, rename the file''' <tt>pelly.txt.log.txt</tt> to <tt>mcmc-dependent.txt</tt>, and <tt>pelly.txt.log.Stones.txt</tt> to <tt>mcmc-dependent.Stones.txt</tt> so that they will not be overwritten the next time you run BayesTraits.<br />
<br />
You will notice a column not present in the likelihood analysis named '''Tree No''' that shows which of the 99 trees in the supplied <tt>pelly.tre</tt> treefile was chosen at random to be used for that particular sample point. BayesTraits is trying to ''mimic'' sampling trees from the posterior distribution here; it cannot ''actually'' sample trees from the posterior because we have given it only data for two morphological characters, which would not provide nearly enough information to estimate the phylogeny for 154 taxa.<br />
<br />
Try to answer these questions using the output you have generated:<br />
<div style="background-color:#ccccff"><br />
* ''What is the log marginal likelihood estimated using the stepping-stone method? This value is listed on the last line of the file <tt>mcmc-dependent.Stones.txt</tt> (your value may differ from mine slightly)'' {{title|I got -160.567444 |answer}}<br />
</div><br />
<br />
=== Bayesian MCMC: Independence model ===<br />
<br />
Run BayesTraits again, this time specifying the Independent model, and again using MCMC, <tt>pa exp 30</tt>, and <tt>stones 100 10000</tt>. Rename the output file from <tt>pelly.txt.log.txt</tt> to <tt>mcmc-independent.txt</tt>. Also rename <tt>pelly.txt.log.Stones.txt</tt> to <tt>mcmc-independent.Stones.txt</tt>.<br />
<div style="background-color:#ccccff"><br />
* ''What is the estimated log marginal likelihood for this analysis using the stepping-stone method?'' {{title|I got -162.693620|answer}}<br />
* ''Which is the better model (dependent or independent) according to these estimates of marginal likelihood?'' {{title|the dependent model has a slightly higher marginal likelihood, estimated by either method, and is thus preferred|answer}}<br />
</div><br />
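<br />
The comparison can be made explicit as a log Bayes factor. This minimal Python sketch uses the two stepping-stone estimates quoted in the answers above (your own values will differ slightly); the conversion to a posterior model probability assumes equal prior probability for the two models, which is an assumption made here for illustration.<br />

```python
import math

# Stepping-stone estimates of the log marginal likelihood quoted in the answers above.
log_ml_dependent = -160.567444
log_ml_independent = -162.693620

# Log Bayes factor in favor of the dependent model.
log_bf = log_ml_dependent - log_ml_independent
bayes_factor = math.exp(log_bf)

# Assumption: both models have prior probability 0.5.
posterior_dependent = bayes_factor / (1.0 + bayes_factor)

print(f"log Bayes factor (dependent vs. independent) = {log_bf:.6f}")
print(f"Bayes factor = {bayes_factor:.2f}")
print(f"posterior probability of dependent model = {posterior_dependent:.3f}")
```

A positive log Bayes factor favors the dependent model; here the difference is only about 2.1 log units, consistent with the answer that the dependent model is slightly preferred.<br />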
<br />
=== Bayesian Reversible-jump MCMC ===<br />
<br />
Run BayesTraits again, specifying the Dependent model and MCMC, and this time select the reversible-jump approach using the command<br />
rj exp 30<br />
The previous command also sets the prior. Type <tt>run</tt> to start, then when it finishes rename the output file to <tt>rjmcmc-dependent.txt</tt>.<br />
<br />
The reversible-jump approach carries out an MCMC analysis in which the number of model parameters (the dimension of the model) potentially changes from one iteration to the next. The full model allows each of the 8 rate parameters to be estimated separately, while other models restrict the values of some rate parameters to equal the values of other rate parameters. The output contains a column titled '''Model string''' that summarizes the model in a string of 8 symbols corresponding to the 8 rate parameters q12, q13, q21, q24, q31, q34, q42, and q43. For example, the model string "'1 0 Z 0 1 1 0 2" sets q21 to zero (Z), q13=q24=q42 (parameter group 0), q12=q31=q34 (parameter group 1), and q43 has its own non-zero value distinct from parameter groups 0 and 1. <br />
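<br />
The way a model string is read can be sketched in a few lines of Python. The function name <tt>decode_model_string</tt> is hypothetical (it is not part of BayesTraits or btsummary.py); it simply groups the 8 rate parameters by the symbol each one is assigned.<br />

```python
# Order of the rate parameters in the model string, as given in the text.
RATE_NAMES = ["q12", "q13", "q21", "q24", "q31", "q34", "q42", "q43"]


def decode_model_string(model_string):
    """Group rate parameter names by the symbol assigned to each in the model string."""
    # Drop the leading apostrophe, then split on whitespace.
    symbols = model_string.lstrip("'").split()
    if len(symbols) != len(RATE_NAMES):
        raise ValueError(f"expected {len(RATE_NAMES)} symbols, got {len(symbols)}")
    groups = {}
    for name, symbol in zip(RATE_NAMES, symbols):
        groups.setdefault(symbol, []).append(name)
    return groups


groups = decode_model_string("'1 0 Z 0 1 1 0 2")
for symbol, names in sorted(groups.items()):
    label = "fixed at zero" if symbol == "Z" else f"parameter group {symbol}"
    print(f"{label}: {', '.join(names)}")
```

Parameters that share a symbol share a single rate, so the number of distinct non-Z symbols is the number of free rate parameters in that model (3 in this example).<br />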
<br />
You could copy the "spreadsheet" part of the output file into Excel and sort by the model string column, but let's instead use Python to summarize the output file. Download the [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/btsummary.py btsummary.py] file and run it as follows:<br />
python btsummary.py<br />
This should produce counts of model strings. (If it doesn't, check to make sure your output file is named <tt>rjmcmc-dependent.txt</tt> because btsummary.py tries to open a file by that name.) Answer the following questions using the counts provided by btsummary.py.<br />
<div style="background-color:#ccccff"><br />
* ''Which model string is most common?'' {{title|I got 0 0 Z 0 0 0 0 0 with count 979|answer}}<br />
* ''What does this model imply?'' {{title|all rates are the same except q21, which is forced to have rate zero. q21 equals 0 implies that entire,palmate leaves never change to entire,pinnate|answer}}<br />
</div><br />
<br />
Notice that many (but not all) model strings have Z for q21. One way to estimate the marginal posterior probability of the hypothesis that q21=0 is to sum the counts for all model strings that have Z in that third position corresponding to q21. It is easy to modify btsummary.py to do this for us: open btsummary.py and locate the line containing the [https://en.wikipedia.org/wiki/Regular_expression regular expression] search that pulls out all the model strings from the BayesTraits output file:<br />
model_list = re.findall("'[Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9]", stuff, re.M | re.S)<br />
The re.findall function performs a regular expression search of the text stored in the variable stuff looking for strings that have a series of 8 space-separated characters, each of which is either the character Z or a digit between 0 and 9 (inclusive). Copy this line, then comment out one copy by starting the line with the hash (#) character:<br />
#model_list = re.findall("'[Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9]", stuff, re.M | re.S)<br />
model_list = re.findall("'[Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9] [Z0-9]", stuff, re.M | re.S)<br />
Now modify the copy such that it counts only models with Z in the third position of the model string.<br />
Rerun btsummary.py, and now the total matches should equal the number of model strings sampled in which q21=0.<br />
<div style="background-color:#ccccff"><br />
* ''So what is the estimated marginal posterior probability that q21=0?'' {{title|I got 0.995|answer}}<br />
* ''Why is the term marginal appropriate here (as in marginal posterior probability)?'' {{title|We are estimating the sum of all joint posteriors in which q21 equals 0|answer}}<br />
</div><br />
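<br />
An alternative to editing the regular expression is to tally every model string and then add up the counts of those with Z in the third position. The Python sketch below illustrates that idea on a small made-up sample; the text in <tt>stuff</tt> is hypothetical and merely stands in for the contents of <tt>rjmcmc-dependent.txt</tt>, and this is not the code in btsummary.py.<br />

```python
import re
from collections import Counter

# Hypothetical stand-in for the contents of rjmcmc-dependent.txt.
# Real output has many more columns; only the model string matters here.
stuff = "\n".join([
    "11000\t-155.2\t'0 0 Z 0 0 0 0 0",
    "12000\t-154.9\t'0 0 Z 0 0 0 0 0",
    "13000\t-155.7\t'1 0 Z 0 1 1 0 2",
    "14000\t-156.1\t'0 0 0 0 0 0 0 0",
    "15000\t-154.8\t'0 0 Z 0 0 0 0 0",
])

# An apostrophe followed by 8 space-separated symbols, each Z or a digit.
model_list = re.findall(r"'[Z0-9](?: [Z0-9]){7}", stuff)
counts = Counter(model_list)
total = sum(counts.values())

# q21 is the third rate parameter, so look at index 2 of each model string.
zero_q21 = sum(n for model, n in counts.items()
               if model.lstrip("'").split()[2] == "Z")
fraction = zero_q21 / total

for model, n in counts.most_common():
    print(f"{n:6d}  {model}")
print(f"estimated P(q21 = 0) = {zero_q21}/{total} = {fraction:.3f}")
```

Summing over every model string that has q21=0, regardless of what the other seven symbols are, is exactly the marginalization described in the answer above.<br />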
<br />
== Estimating ancestral states ==<br />
<br />
[[File:Xerophytevenation.png|right]] The Jones et al. 2009 study estimated ancestral states using SIMMAP. In particular, they found that the most recent common ancestor (MRCA) of the xerophytic (dry-adapted) clade of pelargoniums almost certainly had pinnate venation (see red circle in figure on right). Let's see what BayesTraits says.<br />
<br />
Start BayesTraits in the usual way, specifying 1 (Multistate) on the first screen and 2 (MCMC) on the second. After the options are output, type the following commands in, one line at a time, finishing with the run command:<br />
pa exp 30<br />
addtag xero alternans104 rapaceum130<br />
addmrca xero xero<br />
run<br />
The addmrca command tells BayesTraits to add columns of numbers to the output that display the probabilities of each state for each character in the most recent common ancestor of the taxa listed in the addtag command (2 taxa are sufficient to define the MRCA, but more taxa may be included). The column headers for the last four columns of output should be<br />
xero - S(0) - P(0) <-- character 0 (dissection), probability of state 0 (unlobed)<br />
xero - S(0) - P(1) <-- character 0 (dissection), probability of state 1 (dissected)<br />
xero - S(1) - P(0) <-- character 1 (venation), probability of state 0 (pinnate)<br />
xero - S(1) - P(1) <-- character 1 (venation), probability of state 1 (palmate)<br />
<div style="background-color:#ccccff"><br />
* ''Which state is most common at the xerophyte MRCA node for leaf venation?'' {{title|pinnate venation; xero - S(1) - P(0)|answer}}<br />
* ''Which state is most common at the xerophyte MRCA node for leaf dissection?'' {{title|dissected; xero - S(0) - P(1)|answer}}<br />
</div><br />
<br />
That concludes the introduction to BayesTraits. A glance through the manual will convince you that there is much more to this program than we have time to cover in a single lab period, but you should know enough now to explore the rest on your own if you need these features.<br />
<br />
[[Category: Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Morphology_and_Partitioning_in_MrBayes&diff=38817Phylogenetics: Morphology and Partitioning in MrBayes2018-03-23T17:19:16Z<p>Paul Lewis: /* Estimating the marginal likelihood using the stepping-stone method */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab is to learn how to analyze discrete morphological character data in MrBayes, to learn how to combine morphological with molecular data in a partitioned analysis in which each data type is assigned an appropriate model of evolution, and learn how to estimate the marginal likelihood of a model for purposes of model comparison and selection.<br />
|}<br />
<br />
== The Nylander et al. study ==<br />
<br />
The data for this lab comes from a paper by Nylander et al. (2004) that has already become a landmark study in combining data within a Bayesian framework. The full citation is:<br />
<br />
''Nylander, J., F. Ronquist, J. P. Huelsenbeck, and J. Nieves-Aldrey. 2004. Bayesian phylogenetic analysis of combined data. Systematic Biology 53:47-67''<br />
<br />
If you have access, you can [http://sysbio.oxfordjournals.org/content/53/1/47.full.pdf+html download the pdf of this paper].<br />
<br />
== Downloading the data file ==<br />
The data from the paper is available from [http://www.treebase.org/treebase-web/search/studySearch.html TreeBase]. While you can download a nexus file containing all the data (search by Study Accession number for 1070 or by Author for Nylander), you will have trouble executing this file in MrBayes due to the use of CHARACTERS blocks rather than DATA blocks, so I have done the transformation for you. Download the file by [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/nylander.nex clicking here] and save a copy to your local hard drive. This <tt>nylander.nex</tt> data file contains three of the five data sets analyzed by Nylander et al. (2004).<br />
<br />
== Creating a command file ==<br />
Rather than adding a MRBAYES block to the data file, create a separate file (on your local computer) named '''go.nex''' that contains a MRBAYES block:<br />
<br />
#nexus<br />
<br />
begin mrbayes;<br />
exe nylander.nex;<br />
charset morph = 1-166;<br />
charset 18S = 167-1320;<br />
charset COI = 1321-2397; <br />
end;<br />
<br />
The first line of the MRBAYES block executes the data file <tt>nylander.nex</tt>. Each of the last three lines defines a charset, a meaningful set of characters. Each charset identifies one of the sources of data used in the Nylander et al. (2004) study. The '''first charset''' is named <tt>morph</tt>. While you can use any name you like to identify charsets, this name is appropriate because these are the discrete morphological characters. The '''second charset''' is named <tt>18S</tt> because it contains 18S rRNA gene sequences. The '''third charset''' is named <tt>COI</tt>. This is a protein-coding gene in the mitochondrion that encodes part of the electron transport chain. <br />
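<br />
As a quick sanity check on the charset ranges, this minimal Python sketch computes the number of characters in each charset and confirms that the ranges are contiguous and non-overlapping. The ranges are copied from the MRBAYES block above; the dictionary and variable names are illustrative only.<br />

```python
# Charset ranges (first site, last site), copied from the MRBAYES block above.
charsets = {
    "morph": (1, 166),
    "18S": (167, 1320),
    "COI": (1321, 2397),
}

# Number of characters in each charset (ranges are inclusive).
sizes = {name: last - first + 1 for name, (first, last) in charsets.items()}

# Check that each range starts right after the previous one ends.
ordered = sorted(charsets.values())
contiguous = all(prev_last + 1 == next_first
                 for (_, prev_last), (next_first, _) in zip(ordered, ordered[1:]))

for name, size in sizes.items():
    print(f"{name}: {size} characters")
print(f"total: {sum(sizes.values())} characters, contiguous: {contiguous}")
```

These counts are for the full data matrix, before any characters are excluded.<br />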
<br />
In this lab, you will build on this MRBAYES block, but be sure to keep the three charset lines above any other commands you add. Eventually you will perform an analysis on the cluster, but '''keep working on your own laptop''' until you have built the entire MRBAYES block.<br />
<br />
== Starting a log file ==<br />
<br />
It is always a good idea to start a log file so that you can review the output produced by MrBayes at a later time. Add the following line to your MRBAYES block (after the charset definitions):<br />
log start file=output.txt replace;<br />
Here, I've named my log file <tt>output.txt</tt> and instructed MrBayes to replace this file without asking if it already exists.<br />
<br />
== Reducing the size of the data set ==<br />
<br />
In order to have time for an analysis during the lab period, we will need to delete some taxa:<br />
delete 1-11 19-21 26-32;<br />
In addition, three morphological characters seem to cause problems for some models, so we will exclude these three characters:<br />
exclude 29 56 58;<br />
<br />
== Creating a data partition ==<br />
<br />
Your next task is to tell MrBayes how to partition the data. Used as a verb, partition means to erect walls or dividers. The correct term for the data between two partitions (i.e. dividers) is subset, but data subsets are often, confusingly, also called partitions! Even more confusing, the entire collection of subsets is known as a partition! Add the following lines to the end of your MRBAYES block to tell MrBayes that you want each of the 3 defined charsets to be a separate component (subset) of the partition:<br />
<br />
partition mine = 3:morph,18S,COI;<br />
<br />
The first number (3) is the number of subsets composing the partition. The name by which the partition is known in MrBayes is up to you: here I've chosen the name "mine".<br />
<br />
Just defining a partition is not enough to get MrBayes to use it! You must tell MrBayes that you want to use the partition named "mine". This seems a little redundant, but the idea is that you can set up several different partitions and then easily turn them on or off simply by changing the partition named in the set command:<br />
<br />
set partition=mine;<br />
<br />
== Specifying a model for the morphological data ==<br />
<br />
The main reason for creating a partition is so that we can apply a different model to each subset. The first subset corresponds to the morph charset, so we need to apply a model appropriate to morphology to this subset.<br />
<br />
=== The lset command ===<br />
<br />
For each subset, you will create an <tt>lset</tt> and <tt>prset</tt> command. At the end of your existing mrbayes block, type in this <tt>lset</tt> command:<br />
<br />
lset applyto=(1) coding=variable; <br />
<br />
The <tt>applyto=(1)</tt> statement says that these settings apply only to the first (1) subset. The <tt>coding=variable</tt> statement instructs MrBayes to make the likelihood conditional on character variability, rather than the ordinary likelihood, which assumes that there will be an appropriate number of constant characters present.<br />
<br />
=== The prset command ===<br />
<br />
Now, we need to specify the priors for the parameters in the morphology model. The morphology model we are using is quite simple, so the only parameters are really just the branch lengths and the state frequencies.<br />
<br />
We will use a simple symmetric model in which the frequency of each state is equal. The state frequencies are parameters that are present in the likelihood function. MrBayes allows you to assume that state frequencies follow a discrete Dirichlet distribution, much like relative rates are assumed to follow a discrete Gamma distribution in the "+G" models of among-site rate heterogeneity. Note that if a character has only two states (e.g. 0 = absent and 1 = present), this discrete Dirichlet distribution is equivalent to a discrete Beta distribution. To force the state frequencies to be equal, the Dirichlet density should be a spike concentrated right at the value 1/(number of states). To specify this, set <tt>symdirihyperpr</tt> to <tt>fixed(infinity)</tt>:<br />
prset applyto=(1) symdirihyperpr=fixed(infinity) ratepr=variable;<br />
<!-- To allow more flexibility, create an assymetric model using a discrete Beta distribution to allow heterogeneity across characters in the state frequencies. (Note that this will be a Dirichlet distribution rather than a Beta distribution if a character has more than 2 states.) The Beta distribution used is always symmetrical, but you have some control over the variance. Recall that a Beta(2,2) has its mode at 0.5 but gently arcs from 0 to 1, whereas a Beta(100,100) distribution is peaked strongly over the value 0.5. We will create a model in which there is a hyperparameter governing the parameters of the Beta distribution and we will place an Exponential(1.0) hyperprior on that hyperparameter.<br />
[prset applyto=(1) symdirihyperpr=exponential(1.0) ratepr=variable;]<br />
Note that I have commented out this second <tt>prset</tt> command. This is because our first run will explore the pure symmetric model (where state frequencies are always exactly equal to 0.5 for 2-state characters), then we will do a second run to explore the asymmetric model (which will involve commenting out the first <tt>prset</tt> command and uncommenting the second).--><br />
<br />
You may be curious about the <tt>ratepr=variable</tt> statement. All data subsets will be using the same branch lengths, and because morphological characters are probably evolving at a different rate than DNA sequences, this statement causes MrBayes to give this subset its own relative rate. If the estimated value of this parameter ends up being 1.5, it means that the morphological characters evolve about one and a half times faster than the average character/site in the entire data set. The average substitution rate is what is reflected in the estimated branch lengths. '''It is always a good idea to include the <tt>ratepr=variable</tt> statement in each subset of a partitioned model when using MrBayes'''.<br />
<br />
== Specifying a model for the 18S and COI data ==<br />
<br />
For the 18S sequences, ideally we would use a secondary structure model, but unfortunately I don't know which sites form stems and loops for these data. Thus, we will have to make do with using a standard DNA sequence model (the GTR+I+G model) for this subset. For the COI data, we would (again, in an ideal world) specify a codon model because this is a protein-coding gene. However, we don't have the time in a 2 hour lab to allow a codon model to run to convergence, so we will also be applying the standard GTR+I+G model to the COI data. We will, however, allow the 18S and COI to occupy different subsets of the partition so that they can have different relative rates of substitution.<br />
<br />
=== The lset command ===<br />
<br />
The lset command is similar to ones you have used before with MrBayes:<br />
<br />
lset applyto=(2,3) nst=6 rates=invgamma;<br />
<br />
The <tt>nst=6</tt> tells MrBayes to use a 6-substitution-rate model (i.e. the GTR model) and the <tt>rates=invgamma</tt> part says to add invariable sites and discrete gamma rate heterogeneity onto the GTR base model. The <tt>applyto=(2,3)</tt> statement means that the remainder of the lset command only applies to the second (18S) and third (COI) subsets.<br />
<br />
=== The prset command ===<br />
<br />
Priors need to be specified for all the parameters of the GTR+I+G model except branch lengths and the gamma shape parameter. Because these (branch length and gamma shape priors) are being applied to all subsets, we'll specify these later.<br />
<br />
prset applyto=(2,3) revmatpr=dirichlet(1,2,1,1,2,1) statefreqpr=dirichlet(2,2,2,2) pinvarpr=uniform(0,1) ratepr=variable;<br />
<br />
This specifies a Dirichlet(1,2,1,1,2,1) prior for the 6 GTR relative substitution rate parameters. This prior is not flat, but it also is not very informative: it specifies a slight preference for transitions over transversions. For the base frequencies, we're specifying a vague Dirichlet(2,2,2,2) prior. Again, slightly informative, attaching very weak rubber bands to each relative base frequency, keeping the relative frequencies from straying too far from 0.25. The <tt>pinvarpr=uniform(0,1)</tt> applies a flat, uniform prior to the pinvar parameter. The <tt>applyto=(2,3)</tt> statement applies its prior settings to the second (18S data) and third (COI) data subsets. The last <tt>ratepr=variable</tt> statement in each case ensures that each subset will have a relative rate parameter that allows it to evolve at a different rate (potentially) than other subsets.<br />
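<br />
To see what these Dirichlet priors imply, the following minimal Python sketch computes the prior mean of each component and the prior standard deviation of a base frequency, using the standard formulas for the Dirichlet distribution. The helper names are illustrative, and the ordering of the six exchangeabilities (AC, AG, AT, CG, CT, GT) is assumed here to match the order in which MrBayes lists them.<br />

```python
import math


def dirichlet_mean(alpha):
    """Prior mean of each component of a Dirichlet(alpha) distribution."""
    total = sum(alpha)
    return [a / total for a in alpha]


def dirichlet_sd(alpha):
    """Prior standard deviation of each component of a Dirichlet(alpha) distribution."""
    total = sum(alpha)
    return [math.sqrt(a * (total - a) / (total ** 2 * (total + 1))) for a in alpha]


# Assumed order of the GTR exchangeabilities; AG and CT are the transitions.
exchange_names = ["AC", "AG", "AT", "CG", "CT", "GT"]
revmat_mean = dirichlet_mean([1, 2, 1, 1, 2, 1])
statefreq_mean = dirichlet_mean([2, 2, 2, 2])
statefreq_sd = dirichlet_sd([2, 2, 2, 2])

for name, mean in zip(exchange_names, revmat_mean):
    print(f"{name}: prior mean relative rate = {mean:.3f}")
print(f"base frequency: prior mean = {statefreq_mean[0]:.2f}, prior sd = {statefreq_sd[0]:.3f}")
```

Each transition has twice the prior mean of each transversion, and the base frequencies are centered on 0.25 with a fairly wide spread, which is what "weak rubber bands" means in practice.<br />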
<!--== Specifying a model for the COI data ==<br />
<br />
The COI gene is protein-coding, which means we can apply a codon model here if we like. Groups 1 and 2 will use a codon model for this subset, while group 3 will try using GTR+I+G for comparison. <br />
<br />
Using a codon model implies that the data begin at the first codon position. This wasn't true for the COI data I downloaded from TreeBase (Nylander et al. did not apply codon models, so they didn't have to worry about this). You'll notice that I've commented out the first nucleotide site for all the COI sequences because that represents the third codon position (the first two positions of this codon were not provided). Now the first nucleotide site corresponds to the first codon position of the first codon in the dataset. Another tricky thing here is that this gene does not use the universal genetic code. Instead, it uses the invertebrate mitochondrial code!<br />
<br />
=== An important aside about NCBI resources ===<br />
<br />
How do you find out about these things? The [http://www.ncbi.nlm.nih.gov/ NCBI (National Center for Biotechnology Information) web site] is invaluable for these sorts of things. Go to this web site now and try the following: Under ''Search'', choose <tt>'''nucleotide'''</tt>, and in the ''for'' box type <tt>'''Synophromorpha'''</tt> (one of the organisms for which we have COI data). After clicking the ''Go'' button, you should see three entries, the last of which is for the COI gene sequence. Click on the link labeled <tt>'''AY368911'''</tt> to bring up the GenBank record for this sequence. At the bottom of the page that appears, you will see the DNA sequence and, just above this, and indented, you will find the amino acid sequence. Note the line that is labeled <tt>/transl_table=5</tt>. This tells you which genetic code applies. Clicking the link labeled with the number 5 will take you to some information about the invertebrate mitochondrial code (translation table 5) used here. Note also the line labeled <tt>/codon_start=2</tt>. This tells you that the sequence of codons in the DNA sequence begins at the second site. This is how I knew to comment out the first site.<br />
<br />
=== The lset and prset commands ===<br />
<br />
For groups 1 and 2, use these commands to specify the model for subset 3:<br />
<br />
lset applyto=(3) nucmodel=codon nst=2 code=metmt rates=gamma;<br />
prset applyto=(3) tratiopr=beta(2,1) statefreqpr=dirichlet(2) ratepr=variable;<br />
<br />
Some of this (the <tt>applyto=(3)</tt>, <tt>rates=gamma</tt>, and <tt>ratepr=variable</tt> parts at least) should look familiar by now. Let's go through the remaining statements one by one. The <tt>nucmodel=codon</tt> says to use a codon model. The <tt>nst=2</tt> says to use an HKY-like codon model (one that has a kappa parameter governing the ratio of the transition rate to the transversion rate in addition to the omega parameter that governs the ratio of the nonsynonymous to synonymous rates). The <tt>code=metmt</tt> says to use the invertebrate mitochondrial genetic code when interpreting codons. <br />
<br />
In the prset command, the <tt>tratio=beta(2,1)</tt> applies a beta(2,1) prior to the kappa parameter. This is again a vague prior that slightly prefers a higher rate of transitions than transversions. The <tt>statefreqpr=dirichlet(2)</tt> applies a vague Dirichlet prior to the base frequencies (the single 2 is a sort of shorthand for the lazy allowed by MrBayes; we could have specified <tt>statefreqpr=Dirichlet(2,2,2,2)</tt>). Finally, <tt>ratepr=variable</tt> allows this subset to have its own relative rate.<br />
<br />
If you are in group 3, you should apply the same lset and prset commands we used for subset 2:<br />
<br />
lset applyto=(3) nst=6 rates=invgamma;<br />
prset applyto=(3) revmatpr=dirichlet(1,2,1,1,2,1) <br />
statefreqpr=dirichlet(2,2,2,2) pinvarpr=uniform(0,1) <br />
ratepr=variable;--><br />
<br />
== Specifying branch length and gamma shape priors ==<br />
<br />
The branch length and gamma shape priors apply to all subsets, so we can specify them like this:<br />
<br />
prset applyto=(all) brlenspr=unconstrained:exponential(1.0) shapepr=exponential(1.0);<br />
<br />
This says to apply an exponential distribution with mean 1 to all branch lengths and to all shape parameters.<br />
<br />
== Unlinking parameters ==<br />
<br />
By default, MrBayes tries to create the simplest model possible. If you have specified the same model parameters (and the same priors) for more than one subset, MrBayes will "link" these parameters across subsets. This means that it will assume that the values of these linked parameters apply to all subsets involved in the linking. If you want the gamma shape parameter (for example) to take on potentially different values for the morphology, 18S and COI subsets, you have to make sure MrBayes does not automatically link the shape parameter across all three subsets.<br />
<br />
unlink shape=(all) statefreq=(all) revmat=(all);<br />
<br />
The above command ensures that different values will be used for the shape parameter, state frequencies and GTR relative rates for all data subsets.<br />
<!-- == Specifying a starting tree ==<br />
<br />
Ordinarily, MrBayes runs begin with a random tree topology. If several independent runs are performed, this allows you to see if all of them converged on similar parameter values. Because time is limited in this lap, however, we will be skimping somewhat. For example, you will notice (below) that we will not be using heated chains or multiple runs, and we will specify (something close to) the maximum likelihood tree as the starting tree. To do this, add the following line to your MrBayes block:<br />
<br />
usertree = (1,2,(((((((3,5),11),((6,8),(7,9))),((29,30),31)),(((((12,13),(14,15)),17),16),(18,((22,(23,24)),25)))),((4,10),((19,21),(20,(26,(27,28)))))),32));<br />
<br />
The tree above has a log-likelihood equal to -18708 and was found by running PAUP* on just the sequence data. I used one TBR search based on HKY+G to get a good starting tree for a second TBR search using the GTR+I+G model. The first, preliminary search was allowed to explore 5000 rearrangements, with a reconnection limit of only 4. The second search started with the best tree from the first search, also limiting the search to 5000 rearrangements and a reconlimit of 4. My intention was not to find the maximum likelihood tree, but to get reasonably close so that our MrBayes runs could start in the heart of the posterior rather than being forced to wander in the wilderness for a long time at the start of the run.<br />
<br />
Note that MrBayes ignores branch lengths in your usertree (which is why I didn't bother providing them in the tree description above).--><br />
<br />
== Finishing the MrBayes block ==<br />
<br />
Finish off the MrBayes block by adding <tt>mcmc</tt>, <tt>sump</tt>, and <tt>quit</tt> commands as follows:<br />
<br />
mcmc ngen=1000000 samplefreq=1000 printfreq=10000 nchains=1 nruns=1;<br />
sump;<br />
quit;<br />
<br />
This says to run a single Markov chain for 1,000,000 generations, saving a sample of the tree and model parameters every 1000 generations and showing progress every 10000 generations. Ordinarily, it would be a good idea to let MrBayes use the default values for nchains and nruns (4 and 2, respectively), but I am skimping here due to the time constraints imposed by a 2 hour lab period.<br />
<br />
== Running MrBayes on the cluster ==<br />
First, log in to <tt>bbcsrv3.biotech.uconn.edu</tt> and '''upload''' the <tt>go.nex</tt> and <tt>nylander.nex</tt> files.<br />
<br />
Second, use the <tt>mkdir</tt> command to create a directory named <tt>mbpart</tt>, and use the <tt>mv</tt> command to move <tt>go.nex</tt> and <tt>nylander.nex</tt> into that directory:<br />
mkdir mbpart<br />
mv go.nex mbpart<br />
mv nylander.nex mbpart<br />
<br />
Third, create (in this newly-created <tt>mbpart</tt> directory) a file named <tt>runmb.sh</tt> that looks like this:<br />
#$ -S /bin/bash<br />
#$ -cwd<br />
#$ -m ea<br />
#$ -M change.me@uconn.edu <br />
#$ -N nylander<br />
mb go.nex<br />
'''Be sure to change the email address!''' Otherwise some person at UConn named Change Me will be getting a lot of strange emails!<br />
<br />
Feel free to also change the job name. I chose <tt>nylander</tt> as the name of the job, but you may want to see something else showing up as the name of the job when the email comes in (or when you use <tt>qstat</tt> to check on your run).<br />
<br />
Fourth, <tt>cd</tt> into the <tt>mbpart</tt> directory and start the run using the <tt>qsub</tt> command:<br />
<br />
cd mbpart<br />
qsub runmb.sh<br />
<br />
You can periodically check the status of your run using <br />
<br />
qstat<br />
<br />
You will find the output files that MrBayes generates in the <tt>mbpart</tt> directory.<br />
<br />
== Analyzing the output ==<br />
Try to answer these questions based on the file <tt>output.txt</tt>. As usual, you can peek at my answer by hovering over the "answer" link.<br />
<div style="background-color: #ccccff"> <br />
* ''Find the section labeled "Active parameters". Can you tell from this table that separate gamma shape parameters will be used for 18S and COI genes?'' {{title|yes, parameter 6 is the shape parameter for 18S and parameter 7 is the shape parameter for COI|answer}}<br />
* ''Can you tell that the same topology will be applied to all three partition subsets?'' {{title|yes, parameter 10 is the tree topology that applies to all 3 subsets|answer}}<br />
* ''Why is there a warning that "There are 25 characters incompatible with the specified coding bias"?'' {{title|25 morphological characters are constant across all taxa as a result of taxon deletion and we assured the model that it could assume all morphological characters are variable|answer}}<br />
* ''What is the log of the harmonic mean of the likelihood for this model? (write this down for later reference)'' {{title|-8558.91|answer}}<br />
* ''What is the AT content (sum of the nucleotide frequencies of A and T) in the COI gene?'' {{title|The very high AT content 0.77 is typical of mitochondrial and chloroplast genes and provides a good reason to put such genes in their own partition subset|answer}}<br />
* ''Which subset of characters evolves the fastest on average? Which evolves most slowly? (hint: look at the subset relative rate parameters m{1}, m{2} and m{3})'' {{title|morphology evolves fastest (relative rate 3.16), 18S evolves slowest (relative rate 0.29) and COI is intermediate (relative rate 1.48)|answer}}<br />
* ''Approximately how much faster does a typical morphological character evolve compared to a typical 18S site?'' {{title|3.16/0.29, or about 10.9 times faster|answer}}<br />
* ''Why does the simple average of the three relative rates equal 1.64 and not 1.0?'' {{title|The mean relative rate is indeed 1, but you must use a weighted average rather than a simple average because the three subsets contain different numbers of characters: 141 morphological characters, 1154 18S sites and 1077 COI sites. Compute the weighted average yourself to check.|answer}}<br />
</div><br />
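<br />
The weighted-average check suggested in the last answer can be sketched in a few lines of Python. This is only an illustration: the subset sizes and relative rates are the values quoted in the answers above (your own run will differ slightly), and the variable names are arbitrary.<br />

```python
# Subset sizes and relative rates quoted in the answers above
# (values from your own run will differ slightly).
sizes = {"morph": 141, "18S": 1154, "COI": 1077}
rates = {"morph": 3.16, "18S": 0.29, "COI": 1.48}

# Simple (unweighted) average of the three relative rates
simple_mean = sum(rates.values()) / len(rates)

# Average weighted by the number of characters in each subset
total_chars = sum(sizes.values())
weighted_mean = sum(sizes[s] * rates[s] for s in sizes) / total_chars

print(f"simple mean   = {simple_mean:.2f}")
print(f"weighted mean = {weighted_mean:.2f}")
print(f"morph/18S     = {rates['morph'] / rates['18S']:.1f}")
```

The weighted mean comes out very close to 1 because the two large molecular subsets dominate the average, while the small morphological subset contributes little despite its high rate.<br />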
<br />
Now open Tracer and load your nylander.nex.p file into the program. Click on the Trace tab to see a plot of parameter values against State (which is how Tracer refers to the generation or iteration).<br />
<div style="background-color: #ccccff"> <br />
* ''On the left, in the ESS (Effective Sample Size) column, are any values red (<100) or yellow (<200)?'' {{title|No, but some are close, for example the C to G GTR exchangeability, and the shape and pinvar parameters for 18S and COI are all nearly yellow|answer}}<br />
* ''Looking at the trace plot, can you suggest something you might do in a future run to increase the ESS of this parameter?'' {{title|all of these exhibit long waves in the trace plot, suggesting that proposals are not bold enough|answer}}<br />
* ''On the left, select m{1}, m{2} and m{3} at the same time (hold down the shift key as you click them), then click the "Marginal Prob Distribution" tab and, finally, use the "Legend" drop-down control to place a legend at the upper right. Which of the 3 subset relative rates are you least certain about (i.e. which has the largest marginal posterior variance)?'' {{title|clearly the morphological relative rate is most variable and 18S is least variable|answer}}<br />
* ''Now do the same for pi(A){3}, pi(C){3}, pi(G){3}, and pi(T){3}. This should make the AT bias at the COI gene clear.''<br />
</div><br />
<br />
<!-- <br />
== Asymmetric model ==<br />
<br />
Now move your output files into a new directory (to keep them from being overwritten) and start a second run after changing the prset command for subset 1 (the morphology partition) to the one commented out before. This will enable an asymmetric morphology model in which each character is allowed to have a different forward/reverse substitution rate (which entails also having unequal state frequencies). Once the run is finished, try answering the following questions using the log file and Tracer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the log of the harmonic mean of the likelihood for this model?'' {{title|I got -8556.26, but your answer may vary because you are using a different random number seed|answer}}<br />
* ''According to the harmonic mean method, does this model fit better than the symmetric model on average?'' {{title|I found that this model fits better because -8556.26 is larger than -8558.91|answer}}<br />
* ''In Tracer, are there more or fewer quantities with low (red or yellow) ESS values?'' {{title|I found much more yellow and red, in part because the run was the same length despite having |answer}}<br />
* ''Take a look at the trace plot of pi_166(0) in Tracer. Can you diagnose the problem here? What could you do to improve your next run?'' {{title|Problem is that steps are too bold and chain is getting stuck in one place for long periods of time. Answer is to make this move less bold using the MrBayes propset command|answer}}<br />
</div><br />
--><br />
<br />
== Using the marginal likelihood to choose a model ==<br />
In your first run, you wrote down the harmonic mean of the likelihoods sampled during the MCMC analysis. This value is an estimate of the (log of the) '''marginal likelihood''' (the denominator on the right side of Bayes' Rule). It turns out that the harmonic mean estimate is always an overestimate of the quantity it is supposed to be estimating, and a variety of better ways of estimating marginal likelihoods have been invented recently. MrBayes provides one of these better methods, known as the '''stepping-stone''' method, which you have heard (will hear) about in lecture.<br />
<br />
Why estimate the marginal likelihood, you ask? The marginal likelihood turns out to be one of the primary ways to compare models in Bayesian statistics. In the Bayesian framework, the effects of the prior have to be included because model performance is affected by the choice of prior distributions: if you choose a prior that presents an opinion very different than the opinion provided by your data, the resulting tug-of-war between prior and likelihood must somehow be reflected in the criterion you are using to choose a model. If you use AIC or BIC to choose a model, you are only getting the opinion of the data because these methods (as well as likelihood ratio tests) base their results only on the peak of the likelihood curve and the number of estimated parameters. <br />
<br />
The marginal likelihood is a weighted average of the likelihood over all combinations of parameter values allowed by the prior. The weights for the weighted average are provided by the prior. Thus, a model will do very well if both prior and likelihood are high for the same parameter combinations, and will be lower for models in which the prior is high where the likelihood is low, or vice versa. <br />
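<br />
To make the "weighted average" idea concrete, here is a minimal Python sketch using a toy coin-flipping model rather than a phylogenetic one. Everything in it (the data, the Beta priors, the function names) is an assumption chosen for illustration; it simply integrates likelihood times prior over a fine grid.<br />

```python
import math

def binomial_likelihood(theta, heads, flips):
    """Probability of the data given the probability of heads, theta."""
    return math.comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

def beta_density(theta, a, b):
    """Beta(a, b) prior density for theta."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta**(a - 1) * (1 - theta)**(b - 1)

def marginal_likelihood(heads, flips, a, b, ngrid=20000):
    """Average the likelihood over the prior (midpoint rule on a grid)."""
    width = 1.0 / ngrid
    total = 0.0
    for i in range(ngrid):
        theta = (i + 0.5) * width
        total += binomial_likelihood(theta, heads, flips) * beta_density(theta, a, b) * width
    return total

heads, flips = 8, 10  # toy data: 8 heads in 10 flips
agree = marginal_likelihood(heads, flips, 8, 2)     # prior centered near 0.8
flat = marginal_likelihood(heads, flips, 1, 1)      # flat prior
conflict = marginal_likelihood(heads, flips, 2, 8)  # prior centered near 0.2

print(f"prior agrees with data:    {agree:.4f}")
print(f"flat prior:                {flat:.4f}")
print(f"prior conflicts with data: {conflict:.4f}")
```

The model whose prior is high where the likelihood is high gets the largest marginal likelihood, and the model whose prior is concentrated where the likelihood is low gets the smallest.<br />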
<br />
== Estimating the marginal likelihood using the stepping-stone method ==<br />
<br />
Let's ask MrBayes to estimate the marginal likelihood using the stepping-stone method, and then compare that estimate with the one provided by the harmonic mean method. Like likelihoods, marginal likelihoods are always used on the log scale, so whenever I say ''marginal likelihood'' I probably mean ''log marginal likelihood''. We need to make very few changes to our MrBayes block to perform a stepping-stone analysis.<br />
<br />
First, copy your go.nex file (along with the data file and the qsub script) into a new directory so that this run will not overwrite files from your previous run. I've chosen to call the new directory mbss, but you are free to use any name you like:<br />
cd # this will take you back to your home directory<br />
mkdir mbss # create a new directory<br />
cp mbpart/go.nex mbss # copy go.nex to the new directory<br />
cp mbpart/nylander.nex mbss # copy the data file to the new directory<br />
cp mbpart/runmb.sh mbss # copy the qsub script into the new directory<br />
cd mbss # move into the new directory<br />
<br />
Second, replace your existing <tt>mcmc</tt> command in ~/mbss/go.nex with the following two commands:<br />
<br />
mcmc ngen=1100000 samplefreq=100 printfreq=10000 nchains=1 nruns=1;<br />
ss alpha=0.3 nsteps=10;<br />
<br />
The only changes I've made to the <tt>mcmc</tt> command are to increase <tt>ngen</tt> from 1000000 to 1100000 and to reduce <tt>samplefreq</tt> from 1000 to 100. <br />
<br />
Go ahead and submit this to the cluster:<br />
<br />
cd ~/mbss<br />
qsub runmb.sh<br />
<br />
MrBayes is now running an MCMC analysis that begins exploring the posterior distribution but ends up exploring the prior distribution. The target distribution changes in a series of steps. The name stepping-stone arises from a stream-crossing metaphor: if a stream is too wide to jump, you can keep your feet dry by making smaller jumps to successive "stepping stones" until the stream is crossed. In marginal likelihood estimation, we are trying to estimate the area under the posterior kernel (posterior kernel is statistical jargon for unnormalized posterior density). We know that the area under the prior density is 1, so if we estimated the ratio of the area under the posterior kernel to the area under the prior, that would be identical to the marginal likelihood. Unfortunately, there is a huge difference between these two areas, and the ratio is thus difficult to estimate directly. Instead we estimate a series of ratios that, when multiplied together, equal the desired ratio. These intermediate ratios represent smaller differences and are thus easier to estimate accurately. To estimate these "stepping-stone" ratios, MrBayes needs to obtain samples from distributions intermediate between the prior and posterior.<br />
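<br />
The product-of-ratios idea can be sketched on a toy problem where the right answer is known. The Python below is not what MrBayes does internally: it assumes a coin-flipping model with a flat prior (so each intermediate distribution can be sampled directly instead of by MCMC), it walks from the prior toward the posterior rather than the other way around, and the power spacing formula and all names are assumptions made for illustration.<br />

```python
import math
import random

random.seed(1)

HEADS, FLIPS = 8, 10  # toy data: 8 heads in 10 flips

def likelihood(theta):
    """Binomial likelihood of the toy data."""
    return math.comb(FLIPS, HEADS) * theta**HEADS * (1 - theta)**(FLIPS - HEADS)

def sample_power_posterior(beta, n):
    """Draw n values of theta from likelihood^beta times a Uniform(0,1) prior.
    For this toy model that distribution happens to be a Beta distribution."""
    a = 1 + beta * HEADS
    b = 1 + beta * (FLIPS - HEADS)
    return [random.betavariate(a, b) for _ in range(n)]

def stepping_stone(nsteps=10, alpha=0.3, samples_per_step=5000):
    """Estimate the log marginal likelihood as a sum of log ratios."""
    # Powers running from 0 (prior) to 1 (posterior); assumed spacing rule
    betas = [(k / nsteps) ** (1 / alpha) for k in range(nsteps + 1)]
    log_ml = 0.0
    for k in range(1, nsteps + 1):
        delta = betas[k] - betas[k - 1]
        thetas = sample_power_posterior(betas[k - 1], samples_per_step)
        # Each ratio is the mean of likelihood^delta over the samples
        ratio = sum(likelihood(t) ** delta for t in thetas) / samples_per_step
        log_ml += math.log(ratio)
    return log_ml

log_ml_estimate = stepping_stone()
log_ml_exact = math.log(1 / (FLIPS + 1))  # known answer for a flat prior
print(f"stepping-stone estimate: {log_ml_estimate:.3f}")
print(f"exact value:             {log_ml_exact:.3f}")
```

Each individual ratio compares two very similar distributions, which is why it can be estimated well from a modest number of samples.<br />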
<br />
MrBayes will, in our case, divide the 11000 samples (1100000 generations divided by 100 generations/sample) by 11 steps to yield 1000 samples per step. This is a little confusing because we specified <tt>nsteps=10</tt>, not <tt>nsteps=11</tt>. By default, MrBayes uses the first step as burnin, so you need to specify <tt>ngen</tt> and <tt>samplefreq</tt> in such a way that you get the desired number of samples per step. The formula is ngen = (nsteps+1)*(samplefreq)*(samples/step). In our case, ngen = (10+1)*(100)*(1000) = 1100000.<br />
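<br />
The formula above is easy to wrap in a small helper if you want to try other settings. This is just the arithmetic from the paragraph above written in Python; the function name is made up for illustration.<br />

```python
def ss_ngen(nsteps, samplefreq, samples_per_step):
    """Generations needed so that every step (plus the extra
    burn-in step) receives the requested number of samples."""
    return (nsteps + 1) * samplefreq * samples_per_step

ngen = ss_ngen(nsteps=10, samplefreq=100, samples_per_step=1000)
total_samples = ngen // 100

print(ngen)           # generations to request with ngen=
print(total_samples)  # total samples across all 11 steps
```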
<br />
=== Spacing the stepping stones ===<br />
[[File:Alpha-1.0.png|thumb|right|Evenly spaced powers (here, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0) result in a big gap between the prior and the next power posterior density]]<br />
[[File:Alpha-0.17.png|thumb|right|A strongly skewed spacing of powers (here, 0.0, 0.000077, 0.0046, 0.050, 0.27, 1.0) results in a more even spacing of power posterior densities (and better estimates)]]<br />
The setting <tt>alpha=0.3</tt> controls how evenly the stepping stones are spaced in the stream: if alpha were set to 1, the beta values (the powers to which the likelihood is raised) would all be the same distance apart. This does not translate into evenly-spaced stepping stones, as you can see in the top figure on the right, and setting alpha to be smaller than 1 (the bottom figure on the right uses alpha = 0.17) usually results in more evenly spaced stones (i.e. power posterior density curves) and thus more accurate estimates of ratios, especially those close to the prior end of the spectrum.<br />
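<br />
The powers listed in the two figure captions can be reproduced with a one-line rule: raise evenly spaced values between 0 and 1 to the power 1/alpha. The Python sketch below assumes that rule (it matches the captions, but I have not checked it against the MrBayes source code), and the function name is hypothetical.<br />

```python
def ss_powers(nsteps, alpha):
    """Powers (beta values) from 0 (prior) to 1 (posterior).
    Assumed rule: evenly spaced values raised to the power 1/alpha."""
    return [(k / nsteps) ** (1.0 / alpha) for k in range(nsteps + 1)]

even = ss_powers(5, 1.0)      # top figure: evenly spaced powers
skewed = ss_powers(5, 0.17)   # bottom figure: strongly skewed powers
lab = ss_powers(10, 0.3)      # settings used in this lab

print([f"{b:.2g}" for b in even])
print([f"{b:.2g}" for b in skewed])
print([f"{b:.2g}" for b in lab])
```

With alpha less than 1, most of the powers are crowded near 0, which is where the power posterior changes most rapidly as it moves away from the prior.<br />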
<br />
=== Interpreting the results ===<br />
Once the run finishes, try to answer the following questions based on what you find in the <tt>output.txt</tt> file:<br />
<div style="background-color: #ccccff"> <br />
* ''What is the estimated log marginal likelihood using the stepping-stone method? Search for "Marginal likelihood (in natural log units) estimated using stepping-stone sampling..."'' {{title|I got -8665.57|answer}}<br />
* ''How does this estimate compare to that obtained using the harmonic mean approach?'' {{title|it is 106.66 log units smaller|answer}}<br />
</div><br />
Assuming that the stepping-stone estimate is closer to reality, it is clear that the harmonic mean estimator greatly overestimates how well the model is performing.<br />
<br />
== Challenge ==<br />
Your challenge for the last part of the lab is to estimate the marginal likelihood using the stepping-stone method for a "combined" model that partitions the data into only 2 subsets: morphology (1-166) and molecular (167-2397). You should create a new directory for this so that you do not overwrite files you've already created. <br />
<div style="background-color: #ccccff"> <br />
* ''What is the log marginal likelihood for this "combined" partition scheme?'' {{title|I got -8875.89|answer}}<br />
* ''Based on marginal likelihoods, is it better to divide the molecular data into separate 18S and COI subsets, or is it better to treat them as one combined subset?'' {{title|It is better to divide them: the log marginal likelihood for the separate scheme is -8665.57, which is 210.32 log units better than the log marginal likelihood for the combined scheme, -8875.89|answer}}<br />
* ''Assuming you found that the 3-subset partition scheme is better than the 2-subset partitioning, why do you think applying a separate model to 18S and COI improves the marginal likelihood?'' {{title|For one thing, the strong AT bias in COI compared to 18S argues for allowing these two data subsets to have models that differ, at least in base frequency. Also, 18S clearly evolves much more slowly than COI, so using the same relative rate for both leads to a model that doesn't fit either one particularly well.|answer}}<br />
</div><br />
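<br />
If you end up comparing more than two partition schemes, it helps to tabulate the log marginal likelihoods and their differences. The short Python sketch below uses the stepping-stone values quoted in the answers above (yours will differ somewhat); the dictionary keys are just descriptive labels.<br />

```python
# Stepping-stone log marginal likelihoods quoted in the answers above
log_ml = {
    "3 subsets (morph, 18S, COI)": -8665.57,
    "2 subsets (morph, molecular)": -8875.89,
}

# The preferred model is the one with the largest log marginal likelihood
best = max(log_ml, key=log_ml.get)
for scheme, value in log_ml.items():
    diff = value - log_ml[best]
    print(f"{scheme}: {value:.2f} (difference from best: {diff:.2f})")

print("preferred scheme:", best)
```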
<br />
== That's it for today ==<br />
<br />
This has been an unusual lab because we have not even looked at any phylogenetic trees resulting from these analyses! My purpose here is to show you how to set up a partitioned run with a mixture of morphological characters and sequence data, and how to estimate the marginal likelihood for a model accurately using the stepping-stone method. The [http://sysbio.oxfordjournals.org/content/53/1/47.full.pdf+html original paper] is worth reading if you end up running your own Bayesian analysis of combined morphological and molecular data. Nylander et al. do a great job of discussing the many issues that arise from partitioning the data and combining morphological characters with DNA sequences, and you should now have the background to understand a paper like this. The stepping-stone method is described in [http://sysbio.oxfordjournals.org/content/60/2/150.short this paper].<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_Morphology_and_Partitioning_in_MrBayes&diff=38815Phylogenetics: Morphology and Partitioning in MrBayes2018-03-23T17:01:29Z<p>Paul Lewis: /* The prset command */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab is to learn how to analyze discrete morphological character data in MrBayes, to learn how to combine morphological with molecular data in a partitioned analysis in which each data type is assigned an appropriate model of evolution, and to learn how to estimate the marginal likelihood of a model for purposes of model comparison and selection.<br />
|}<br />
<br />
== The Nylander et al. study ==<br />
<br />
The data for this lab comes from a paper by Nylander et al. (2004) that has already become a landmark study in combining data within a Bayesian framework. The full citation is:<br />
<br />
''Nylander, J., F. Ronquist, J. P. Huelsenbeck, and J. Nieves-Aldrey. 2004. Bayesian phylogenetic analysis of combined data. Systematic Biology 53:47-67''<br />
<br />
If you have access, you can [http://sysbio.oxfordjournals.org/content/53/1/47.full.pdf+html download the pdf of this paper].<br />
<br />
== Downloading the data file ==<br />
The data from the paper is available from [http://www.treebase.org/treebase-web/search/studySearch.html TreeBase]. While you can download a nexus file containing all the data (search by Study Accession number for 1070 or by Author for Nylander), you will have trouble executing this file in MrBayes due to the use of CHARACTERS blocks rather than DATA blocks, so I have done the transformation for you. Download the file by [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/nylander.nex clicking here] and save a copy to your local hard drive. This <tt>nylander.nex</tt> data file contains three of the five data sets analyzed by Nylander et al. (2004).<br />
<br />
== Creating a command file ==<br />
Rather than adding a MRBAYES block to the data file, create a separate file (on your local computer) named '''go.nex''' that contains a MRBAYES block:<br />
<br />
#nexus<br />
<br />
begin mrbayes;<br />
exe nylander.nex;<br />
charset morph = 1-166;<br />
charset 18S = 167-1320;<br />
charset COI = 1321-2397; <br />
end;<br />
<br />
The first line of the MRBAYES block executes the data file <tt>nylander.nex</tt>. Each of the last three lines defines a charset, a meaningful set of characters. Each charset identifies one of the sources of data used in the Nylander et al. (2004) study. The '''first charset''' is named <tt>morph</tt>. While you can use any name you like to identify charsets, this name is appropriate because these are the discrete morphological characters. The '''second charset''' is named <tt>18S</tt> because it contains 18S rRNA gene sequences. The '''third charset''' is named <tt>COI</tt>. This is a protein-coding gene in the mitochondrion that encodes part of the electron transport chain. <br />
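<br />
Before going further, it is worth a quick sanity check that the three charsets do not overlap and together cover every column in the data file. The Python sketch below does that using the ranges from the MRBAYES block above; it is an optional aid and not part of the MrBayes analysis.<br />

```python
# Character ranges copied from the charset commands above (1-based, inclusive)
charsets = {"morph": (1, 166), "18S": (167, 1320), "COI": (1321, 2397)}

# Number of characters in each charset
sizes = {name: last - first + 1 for name, (first, last) in charsets.items()}

# Each range should start right after the previous one ends
ordered = sorted(charsets.values())
contiguous = all(
    ordered[i][1] + 1 == ordered[i + 1][0] for i in range(len(ordered) - 1)
)

print(sizes)
print("total characters:", sum(sizes.values()))
print("no gaps or overlaps:", contiguous)
```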
<br />
In this lab, you will build on this MRBAYES block, but be sure to keep the three charset lines above any other commands you add. Eventually you will perform an analysis on the cluster, but '''keep working on your own laptop''' until you have built the entire MRBAYES block.<br />
<br />
== Starting a log file ==<br />
<br />
It is always a good idea to start a log file so that you can review the output produced by MrBayes at a later time. Add the following line to your MRBAYES block (after the charset definitions):<br />
log start file=output.txt replace;<br />
Here, I've named my log file <tt>output.txt</tt> and instructed MrBayes to replace this file without asking if it already exists.<br />
<br />
== Reducing the size of the data set ==<br />
<br />
In order to have time for an analysis during the lab period, we will need to delete some taxa:<br />
delete 1-11 19-21 26-32;<br />
In addition, three morphological characters seem to cause problems for some models, so we will exclude these three characters:<br />
exclude 29 56 58;<br />
<br />
== Creating a data partition ==<br />
<br />
Your next task is to tell MrBayes how to partition the data. Used as a verb, partition means to erect walls or dividers. The correct term for the data between two partitions (i.e. dividers) is subset, but data subsets are often, confusingly, also called partitions! Even more confusing, the entire collection of subsets is known as a partition! Add the following lines to the end of your MRBAYES block to tell MrBayes that you want each of the 3 defined charsets to be a separate component (subset) of the partition:<br />
<br />
partition mine = 3:morph,18S,COI;<br />
<br />
The first number (3) is the number of subsets composing the partition. The name by which the partition is known in MrBayes is up to you: here I've chosen the name "mine".<br />
<br />
Just defining a partition is not enough to get MrBayes to use it! You must tell MrBayes that you want to use the partition named "mine". This seems a little redundant, but the idea is that you can set up several different partitions and then easily turn them on or off simply by changing the partition named in the set command:<br />
<br />
set partition=mine;<br />
<br />
== Specifying a model for the morphological data ==<br />
<br />
The main reason for creating a partition is so that we can apply a different model to each subset. The first subset corresponds to the morph charset, so we need to apply a model appropriate to morphology to this subset.<br />
<br />
=== The lset command ===<br />
<br />
For each subset, you will create an <tt>lset</tt> and <tt>prset</tt> command. At the end of your existing mrbayes block, type in this <tt>lset</tt> command:<br />
<br />
lset applyto=(1) coding=variable; <br />
<br />
The <tt>applyto=(1)</tt> statement says that these settings apply only to the first (1) subset. The <tt>coding=variable</tt> statement instructs MrBayes to make the likelihood conditional on character variability, rather than the ordinary likelihood, which assumes that there will be an appropriate number of constant characters present.<br />
<br />
=== The prset command ===<br />
<br />
Now, we need to specify the priors for the parameters in the morphology model. The morphology model we are using is quite simple, so the only parameters are really just the branch lengths and the state frequencies.<br />
<br />
We will use a simple symmetric model in which the frequency of each state is equal. The state frequencies are parameters that are present in the likelihood function. MrBayes allows you to assume that state frequencies follow a discrete Dirichlet distribution, much like relative rates are assumed to follow a discrete Gamma distribution in the "+G" models of among-site rate heterogeneity. Note that if a character has only two states (e.g. 0 = absent and 1 = present), this discrete Dirichlet distribution is equivalent to a discrete Beta distribution. To force the state frequencies to be equal, the Dirichlet density should be a spike concentrated right at the value 1/(number of states). To specify this, set <tt>symdirihyperpr</tt> to <tt>fixed(infinity)</tt>:<br />
prset applyto=(1) symdirihyperpr=fixed(infinity) ratepr=variable;<br />
<!-- To allow more flexibility, create an assymetric model using a discrete Beta distribution to allow heterogeneity across characters in the state frequencies. (Note that this will be a Dirichlet distribution rather than a Beta distribution if a character has more than 2 states.) The Beta distribution used is always symmetrical, but you have some control over the variance. Recall that a Beta(2,2) has its mode at 0.5 but gently arcs from 0 to 1, whereas a Beta(100,100) distribution is peaked strongly over the value 0.5. We will create a model in which there is a hyperparameter governing the parameters of the Beta distribution and we will place an Exponential(1.0) hyperprior on that hyperparameter.<br />
[prset applyto=(1) symdirihyperpr=exponential(1.0) ratepr=variable;]<br />
Note that I have commented out this second <tt>prset</tt> command. This is because our first run will explore the pure symmetric model (where state frequencies are always exactly equal to 0.5 for 2-state characters), then we will do a second run to explore the asymmetric model (which will involve commenting out the first <tt>prset</tt> command and uncommenting the second).--><br />
<br />
You may be curious about the <tt>ratepr=variable</tt> statement. All data subsets will be using the same branch lengths, and because morphological characters are probably evolving at a different rate than DNA sequences, this statement causes MrBayes to give this subset its own relative rate. If the estimated value of this parameter ends up being 1.5, it means that the morphological characters evolve about one and a half times faster than the average character/site in the entire data set. The average substitution rate is what is reflected in the estimated branch lengths. '''It is always a good idea to include the <tt>ratepr=variable</tt> statement in each subset of a partitioned model when using MrBayes'''.<br />
<br />
== Specifying a model for the 18S and COI data ==<br />
<br />
For the 18S sequences, ideally we would use a secondary structure model, but unfortunately I don't know which sites form stems and loops for these data. Thus, we will have to make do with using a standard DNA sequence model (the GTR+I+G model) for this subset. For the COI data, we would (again, in an ideal world) specify a codon model because this is a protein-coding gene. However, we don't have the time in a 2 hour lab to allow a codon model to run to convergence, so we will also be applying the standard GTR+I+G model to the COI data. We will, however, allow the 18S and COI to occupy different subsets of the partition so that they can have different relative rates of substitution.<br />
<br />
=== The lset command ===<br />
<br />
The lset command is similar to ones you have used before with MrBayes:<br />
<br />
lset applyto=(2,3) nst=6 rates=invgamma;<br />
<br />
The <tt>nst=6</tt> tells MrBayes to use a 6-substitution-rate model (i.e. the GTR model) and the <tt>rates=invgamma</tt> part says to add invariable sites and discrete gamma rate heterogeneity onto the GTR base model. The <tt>applyto=(2,3)</tt> statement means that the remainder of the lset command only applies to the second (18S) and third (COI) subsets.<br />
<br />
=== The prset command ===<br />
<br />
Priors need to be specified for all the parameters of the GTR+I+G model except branch lengths and the gamma shape parameter. Because these (branch length and gamma shape priors) are being applied to all subsets, we'll specify these later.<br />
<br />
prset applyto=(2,3) revmatpr=dirichlet(1,2,1,1,2,1) statefreqpr=dirichlet(2,2,2,2) pinvarpr=uniform(0,1) ratepr=variable;<br />
<br />
This specifies a Dirichlet(1,2,1,1,2,1) prior for the 6 GTR relative substitution rate parameters. This prior is not flat, but it is also not very informative: it specifies a slight preference for transitions over transversions. For the base frequencies, we're specifying a vague Dirichlet(2,2,2,2) prior. Again, this is only slightly informative: it attaches very weak rubber bands to each relative base frequency, keeping the relative frequencies from straying too far from 0.25. The <tt>pinvarpr=uniform(0,1)</tt> applies a flat, uniform prior to the pinvar parameter. The <tt>applyto=(2,3)</tt> statement applies these prior settings to the second (18S) and third (COI) data subsets. As before, the final <tt>ratepr=variable</tt> statement ensures that each subset will have a relative rate parameter that allows it to evolve at a (potentially) different rate than other subsets.<br />
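<br />
To see what these Dirichlet priors imply, you can compute their prior means and standard deviations directly from the parameters. The Python sketch below uses the standard formulas for the Dirichlet distribution; the function name is made up, and the labels assume the usual ordering of the six exchangeabilities (A-C, A-G, A-T, C-G, C-T, G-T).<br />

```python
import math

def dirichlet_mean_sd(params):
    """Prior mean and standard deviation of each Dirichlet component."""
    total = sum(params)
    means = [a / total for a in params]
    sds = [math.sqrt(a * (total - a) / (total**2 * (total + 1))) for a in params]
    return means, sds

# Assumed ordering of exchangeabilities; A-G and C-T are the transitions
revmat_labels = ["A-C", "A-G", "A-T", "C-G", "C-T", "G-T"]
revmat_means, revmat_sds = dirichlet_mean_sd([1, 2, 1, 1, 2, 1])
freq_means, freq_sds = dirichlet_mean_sd([2, 2, 2, 2])

for label, m, s in zip(revmat_labels, revmat_means, revmat_sds):
    print(f"{label}: prior mean {m:.3f}, sd {s:.3f}")
for base, m, s in zip("ACGT", freq_means, freq_sds):
    print(f"pi({base}): prior mean {m:.3f}, sd {s:.3f}")
```

The two transition exchangeabilities have twice the prior mean of the transversions, and the standard deviations are large relative to the means, which is what "not very informative" means in practice.<br />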
<!--== Specifying a model for the COI data ==<br />
<br />
The COI gene is protein-coding, which means we can apply a codon model here if we like. Groups 1 and 2 will use a codon model for this subset, while group 3 will try using GTR+I+G for comparison. <br />
<br />
Using a codon model implies that the data begin at the first codon position. This wasn't true for the COI data I downloaded from TreeBase (Nylander et al. did not apply codon models, so they didn't have to worry about this). You'll notice that I've commented out the first nucleotide site for all the COI sequences because that represents the third codon position (the first two positions of this codon were not provided). Now the first nucleotide site corresponds to the first codon position of the first codon in the dataset. Another tricky thing here is that this gene does not use the universal genetic code. Instead, it uses the invertebrate mitochondrial code!<br />
<br />
=== An important aside about NCBI resources ===<br />
<br />
How do you find out about these things? The [http://www.ncbi.nlm.nih.gov/ NCBI (National Center for Biotechnology Information) web site] is invaluable for these sorts of things. Go to this web site now and try the following: Under ''Search'', choose <tt>'''nucleotide'''</tt>, and in the ''for'' box type <tt>'''Synophromorpha'''</tt> (one of the organisms for which we have COI data). After clicking the ''Go'' button, you should see three entries, the last of which is for the COI gene sequence. Click on the link labeled <tt>'''AY368911'''</tt> to bring up the GenBank record for this sequence. At the bottom of the page that appears, you will see the DNA sequence and, just above this, and indented, you will find the amino acid sequence. Note the line that is labeled <tt>/transl_table=5</tt>. This tells you which genetic code applies. Clicking the link labeled with the number 5 will take you to some information about the invertebrate mitochondrial code (translation table 5) used here. Note also the line labeled <tt>/codon_start=2</tt>. This tells you that the sequence of codons in the DNA sequence begins at the second site. This is how I knew to comment out the first site.<br />
<br />
=== The lset and prset commands ===<br />
<br />
For groups 1 and 2, use these commands to specify the model for subset 3:<br />
<br />
lset applyto=(3) nucmodel=codon nst=2 code=metmt rates=gamma;<br />
prset applyto=(3) tratiopr=beta(2,1) statefreqpr=dirichlet(2) ratepr=variable;<br />
<br />
Some of this (the <tt>applyto=(3)</tt>, <tt>rates=gamma</tt>, and <tt>ratepr=variable</tt> parts at least) should look familiar by now. Let's go through the remaining statements one by one. The <tt>nucmodel=codon</tt> says to use a codon model. The <tt>nst=2</tt> says to use an HKY-like codon model (one that has a kappa parameter governing the ratio of the transition rate to the transversion rate in addition to the omega parameter that governs the ratio of the nonsynonymous to synonymous rates). The <tt>code=metmt</tt> says to use the invertebrate mitochondrial genetic code when interpreting codons. <br />
<br />
In the prset command, the <tt>tratio=beta(2,1)</tt> applies a beta(2,1) prior to the kappa parameter. This is again a vague prior that slightly prefers a higher rate of transitions than transversions. The <tt>statefreqpr=dirichlet(2)</tt> applies a vague Dirichlet prior to the base frequencies (the single 2 is a sort of shorthand for the lazy allowed by MrBayes; we could have specified <tt>statefreqpr=Dirichlet(2,2,2,2)</tt>). Finally, <tt>ratepr=variable</tt> allows this subset to have its own relative rate.<br />
<br />
If you are in group 3, you should apply the same lset and prset commands we used for subset 2:<br />
<br />
lset applyto=(3) nst=6 rates=invgamma;<br />
prset applyto=(3) revmatpr=dirichlet(1,2,1,1,2,1) <br />
statefreqpr=dirichlet(2,2,2,2) pinvarpr=uniform(0,1) <br />
ratepr=variable;--><br />
<br />
== Specifying branch length and gamma shape priors ==<br />
<br />
The branch length and gamma shape priors apply to all subsets, so we can specify them like this:<br />
<br />
prset applyto=(all) brlenspr=unconstrained:exponential(1.0) shapepr=exponential(1.0);<br />
<br />
This says to apply an exponential distribution with mean 1 to all branch lengths and to all shape parameters.<br />
<br />
== Unlinking parameters ==<br />
<br />
By default, MrBayes tries to create the simplest model possible. If you have specified the same model parameters (and the same priors) for more than one subset, MrBayes will "link" these parameters across subsets. This means that it will assume that the values of these linked parameters apply to all subsets involved in the linking. If you want the gamma shape parameter (for example) to take on potentially different values for the morphology, 18S and COI subsets, you have to make sure MrBayes does not automatically link the shape parameter across all three subsets.<br />
<br />
unlink shape=(all) statefreq=(all) revmat=(all);<br />
<br />
The above command ensures that different values will be used for the shape parameter, state frequencies and GTR relative rates for all data subsets.<br />
<!-- == Specifying a starting tree ==<br />
<br />
Ordinarily, MrBayes runs begin with a random tree topology. If several independent runs are performed, this allows you to see if all of them converged on similar parameter values. Because time is limited in this lap, however, we will be skimping somewhat. For example, you will notice (below) that we will not be using heated chains or multiple runs, and we will specify (something close to) the maximum likelihood tree as the starting tree. To do this, add the following line to your MrBayes block:<br />
<br />
usertree = (1,2,(((((((3,5),11),((6,8),(7,9))),((29,30),31)),(((((12,13),(14,15)),17),16),(18,((22,(23,24)),25)))),((4,10),((19,21),(20,(26,(27,28)))))),32));<br />
<br />
The tree above has a log-likelihood equal to -18708 and was found by running PAUP* on just the sequence data. I used one TBR search based on HKY+G to get a good starting tree for a second TBR search using the GTR+I+G model. The first, preliminary search was allowed to explore 5000 rearrangements, with a reconnection limit of only 4. The second search started with the best tree from the first search, also limiting the search to 5000 rearrangements and a reconlimit of 4. My intention was not to find the maximum likelihood tree, but to get reasonably close so that our MrBayes runs could start in the heart of the posterior rather than being forced to wander in the wilderness for a long time at the start of the run.<br />
<br />
Note that MrBayes ignores branch lengths in your usertree (which is why I didn't bother providing them in the tree description above).--><br />
<br />
== Finishing the MrBayes block ==<br />
<br />
Finish off the MrBayes block by adding <tt>mcmc</tt> and <tt>sump</tt> commands followed by a <tt>quit</tt> command, as follows:<br />
<br />
mcmc ngen=1000000 samplefreq=1000 printfreq=10000 nchains=1 nruns=1;<br />
sump;<br />
quit;<br />
<br />
This says to run a single Markov chain for 1,000,000 generations, saving a sample of the tree and model parameters every 1,000 generations and showing progress every 10,000 generations. Ordinarily, it would be a good idea to let MrBayes use the default values for nchains and nruns (4 and 2, respectively), but I am skimping here because of the time constraints imposed by a 2-hour lab period.<br />
<br />
== Running MrBayes on the cluster ==<br />
First, log in to <tt>bbcsrv3.biotech.uconn.edu</tt> and '''upload''' the <tt>go.nex</tt> and <tt>nylander.nex</tt> files.<br />
<br />
Second, use the <tt>mkdir</tt> command to create a directory named <tt>mbpart</tt>, and use the <tt>mv</tt> command to move <tt>go.nex</tt> and <tt>nylander.nex</tt> into that directory:<br />
mkdir mbpart<br />
mv go.nex mbpart<br />
mv nylander.nex mbpart<br />
<br />
Third, create (in this newly-created <tt>mbpart</tt> directory) a file named <tt>runmb.sh</tt> that looks like this:<br />
#$ -S /bin/bash<br />
#$ -cwd<br />
#$ -m ea<br />
#$ -M change.me@uconn.edu <br />
#$ -N nylander<br />
mb go.nex<br />
'''Be sure to change the email address!''' Otherwise some person at UConn named Change Me will be getting a lot of strange emails!<br />
<br />
Feel free to also change the job name. I chose <tt>nylander</tt> as the name of the job, but you may want to see something else showing up as the name of the job when the email comes in (or when you use <tt>qstat</tt> to check on your run).<br />
<br />
Fourth, <tt>cd</tt> into the <tt>mbpart</tt> directory and start the run using the <tt>qsub</tt> command:<br />
<br />
cd mbpart<br />
qsub runmb.sh<br />
<br />
You can periodically check the status of your run using <br />
<br />
qstat<br />
<br />
You will find the output files that MrBayes generates in the <tt>mbpart</tt> directory.<br />
<br />
== Analyzing the output ==<br />
Try to answer these questions based on the file <tt>output.txt</tt>. As usual, you can peek at my answer by hovering over the "answer" link.<br />
<div style="background-color: #ccccff"> <br />
* ''Find the section labeled "Active parameters". Can you tell from this table that separate gamma shape parameters will be used for the 18S and COI genes?'' {{title|yes, parameter 6 is the shape parameter for 18S and parameter 7 is the shape parameter for COI|answer}}<br />
* ''Can you tell that the same topology will be applied to all three partition subsets?'' {{title|yes, parameter 10 is the tree topology that applies to all 3 subsets|answer}}<br />
* ''Why is there a warning that "There are 25 characters incompatible with the specified coding bias"?'' {{title|25 morphological characters are constant across all taxa as a result of taxon deletion and we assured the model that it could assume all morphological characters are variable|answer}}<br />
* ''What is the log of the harmonic mean of the likelihood for this model? (write this down for later reference)'' {{title|-8558.91|answer}}<br />
* ''What is the AT content (sum of the nucleotide frequencies of A and T) in the COI gene?'' {{title|The very high AT content 0.77 is typical of mitochondrial and chloroplast genes and provides a good reason to put such genes in their own partition subset|answer}}<br />
* ''Which subset of characters evolves the fastest on average? Which evolves most slowly? (hint: look at the subset relative rate parameters m{1}, m{2} and m{3})'' {{title|morphology evolves fastest (relative rate 3.16), 18S evolves slowest (relative rate 0.29) and COI is intermediate (relative rate 1.48)|answer}}<br />
* ''Approximately how much faster does a typical morphological character evolve compared to a typical 18S site?'' {{title|3.16/0.29, or about 10.9 times faster|answer}}<br />
* ''Why does the simple average of the three relative rates equal 1.64 and not 1.0?'' {{title|The mean relative rate is indeed 1, but you must use a weighted average rather than a simple average because the three subsets contain different numbers of characters: 141 morphological characters, 1154 18S sites and 1077 COI sites. Compute the weighted average yourself to check.|answer}}<br />
</div><br />
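<br />
If you would like to check the relative-rate arithmetic from the questions above, here is a minimal Python sketch (Python is not otherwise used in this lab, and the variable and function names are my own). It uses only the subset sizes and relative rates quoted in the answers, so your numbers will differ slightly if your run used a different seed:<br />

```python
# Subset sizes and relative rates quoted in the answers above:
# name -> (number of characters, relative rate)
subsets = {
    "morphology": (141, 3.16),
    "18S": (1154, 0.29),
    "COI": (1077, 1.48),
}

def simple_mean(data):
    """Unweighted average of the relative rates."""
    rates = [rate for _, rate in data.values()]
    return sum(rates) / len(rates)

def weighted_mean(data):
    """Average of the relative rates weighted by the number of characters."""
    total = sum(n for n, _ in data.values())
    return sum(n * rate for n, rate in data.values()) / total

print(f"simple mean      = {simple_mean(subsets):.2f}")
print(f"weighted mean    = {weighted_mean(subsets):.2f}")
print(f"morphology / 18S = {subsets['morphology'][1] / subsets['18S'][1]:.1f}")
```

The weighted mean comes out at 1.00 (to two decimal places), whereas the simple mean is 1.64.<br />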
<br />
Now open Tracer and load your nylander.nex.p file into the program. Click on the Trace tab to see a plot of parameter values against State (which is how Tracer refers to the generation or iteration).<br />
<div style="background-color: #ccccff"> <br />
* ''On the left, in the ESS (Effective Sample Size) column, are any values red (<100) or yellow (<200)?'' {{title|No, but some are close, for example the C to G GTR exchangeability, and the shape and pinvar parameters for 18S and COI are all nearly yellow|answer}}<br />
* ''Looking at the trace plot, can you suggest something you might do in a future run to increase the ESS of this parameter?'' {{title|all of these exhibit long waves in the trace plot, suggesting that proposals are not bold enough|answer}}<br />
* ''On the left, select m{1}, m{2} and m{3} at the same time (hold down the shift key as you click them), then click the "Marginal Prob Distribution" tab and, finally, use the "Legend" drop-down control to place a legend at the upper right. Which of the 3 subset relative rates are you least certain about (i.e. has the largest marginal posterior variance)?'' {{title|clearly the morphological relative rate is most variable and 18S is least variable|answer}}<br />
* ''Now do the same for pi(A){3}, pi(C){3}, pi(G){3}, and pi(T){3}. This should make the AT bias at the COI gene clear.''<br />
</div><br />
<br />
<!-- <br />
== Asymmetric model ==<br />
<br />
Now move your output files into a new directory (to keep them from being overwritten) and start a second run after changing the prset command for subset 1 (the morphology partition) to the one commented out before. This will enable an asymmetric morphology model in which each character is allowed to have a different forward/reverse substitution rate (which entails also having unequal state frequencies). Once the run is finished, try answering the following questions using the log file and Tracer:<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the log of the harmonic mean of the likelihood for this model?'' {{title|I got -8556.26, but your answer may vary because you are using a different random number seed|answer}}<br />
* ''According to the harmonic mean method, does this model fit better than the symmetric model on average?'' {{title|I found that this model fits better because -8556.26 is larger than -8558.91|answer}}<br />
* ''In Tracer, are there more or fewer quantities with low (red or yellow) ESS values?'' {{title|I found much more yellow and red, in part because the run was the same length despite having |answer}}<br />
* ''Take a look at the trace plot of pi_166(0) in Tracer. Can you diagnose the problem here? What could you do to improve your next run?'' {{title|Problem is that steps are too bold and chain is getting stuck in one place for long periods of time. Answer is to make this move less bold using the MrBayes propset command|answer}}<br />
</div><br />
--><br />
<br />
== Using the marginal likelihood to choose a model ==<br />
In your first run, you wrote down the log of the harmonic mean of the likelihoods sampled during the MCMC analysis. This value is an estimate of the (log of the) '''marginal likelihood''' (the denominator on the right side of Bayes' Rule). It turns out that the harmonic mean estimator tends to overestimate the quantity it is supposed to be estimating, and a variety of better ways of estimating marginal likelihoods have been invented recently. MrBayes provides one of these better methods, known as the '''stepping-stone''' method, which you have heard (will hear) about in lecture.<br />
<br />
Why estimate the marginal likelihood, you ask? The marginal likelihood turns out to be one of the primary ways to compare models in Bayesian statistics. In the Bayesian framework, the effects of the prior have to be included because model performance is affected by the choice of prior distributions: if you choose a prior that presents an opinion very different from the opinion provided by your data, the resulting tug-of-war between prior and likelihood must somehow be reflected in the criterion you are using to choose a model. If you use AIC or BIC to choose a model, you are only getting the opinion of the data because these methods (as well as likelihood ratio tests) base their results only on the peak of the likelihood curve and the number of estimated parameters. <br />
<br />
The marginal likelihood is a weighted average of the likelihood over all combinations of parameter values allowed by the prior. The weights for the weighted average are provided by the prior. Thus, the marginal likelihood will be high for models in which both prior and likelihood are high for the same parameter combinations, and lower for models in which the prior is high where the likelihood is low, or vice versa. <br />
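<br />
To make the "weighted average" idea concrete, here is a minimal Python sketch using a toy coin-flipping model rather than a phylogenetic one (the data, the two Beta priors, and all function names are assumptions chosen purely for illustration). It averages the binomial likelihood over a grid of values of p, weighting each value by its prior density, and checks the result against the known closed-form answer:<br />

```python
import math

def log_beta_function(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_density(p, a, b):
    """Beta(a, b) prior density at p (0 < p < 1)."""
    return math.exp((a - 1) * math.log(p) + (b - 1) * math.log(1 - p)
                    - log_beta_function(a, b))

def likelihood(p, heads, flips):
    """Binomial probability of the data given p."""
    return math.comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

def marginal_likelihood(heads, flips, a, b, ngrid=20000):
    """Prior-weighted average of the likelihood (midpoint rule on a grid)."""
    width = 1.0 / ngrid
    total = 0.0
    for i in range(ngrid):
        p = (i + 0.5) * width
        total += likelihood(p, heads, flips) * beta_density(p, a, b) * width
    return total

def exact_marginal_likelihood(heads, flips, a, b):
    """Closed-form answer for the Beta-binomial model, used as a check."""
    return math.comb(flips, heads) * math.exp(
        log_beta_function(a + heads, b + flips - heads) - log_beta_function(a, b))

heads, flips = 7, 10                      # toy data: 7 heads in 10 flips
for a, b in [(1.0, 1.0), (2.0, 18.0)]:    # flat prior vs. prior centered near p = 0.1
    approx = marginal_likelihood(heads, flips, a, b)
    exact = exact_marginal_likelihood(heads, flips, a, b)
    print(f"Beta({a:g},{b:g}) prior: approx = {approx:.6f}, exact = {exact:.6f}")
```

The prior centered near p = 0.1 disagrees with data that favor p near 0.7, so its marginal likelihood is far lower than that of the flat prior.<br />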
<br />
== Estimating the marginal likelihood using the stepping-stone method ==<br />
<br />
Let's ask MrBayes to estimate the marginal likelihood using the stepping-stone method, and then compare that estimate with the one provided by the harmonic mean method. Like likelihoods, marginal likelihoods are always used on the log scale, so whenever I say ''marginal likelihood'' I probably meant to say ''log marginal likelihood''. We need to make very few changes to our MrBayes block to perform a stepping-stone analysis.<br />
<br />
First, copy your go.nex file into a new directory so that this run will not overwrite files from your previous run. I've chosen to call the new directory mbss, but you are free to use any name you like:<br />
cd # this will take you back to your home directory<br />
mkdir mbss # create a new directory<br />
cp mbpart/go.nex mbss # copy go.nex to the new directory<br />
cp mbpart/nylander.nex mbss # copy the data file to the new directory<br />
cp mbpart/runmb.sh mbss # copy the qsub script into the new directory<br />
cd mbss # move into the new directory<br />
<br />
Second, replace your existing <tt>mcmc</tt> command in go.nex with the following two commands:<br />
<br />
mcmc ngen=1100000 samplefreq=100 printfreq=10000 nchains=1 nruns=1;<br />
ss alpha=0.3 nsteps=10;<br />
<br />
The only thing I've done to the <tt>mcmc</tt> command is increase <tt>ngen</tt> from 1000000 up to 1100000 and reduce <tt>samplefreq</tt> from 1000 down to 100. <br />
<br />
Go ahead and submit this to the cluster:<br />
qsub runmb.sh<br />
<br />
MrBayes is now running an MCMC analysis that begins exploring the posterior distribution but ends up exploring the prior distribution. The target distribution changes in a series of steps. The name stepping-stone arises from a stream-crossing metaphor: if a stream is too wide to jump, you can keep your feet dry by making smaller jumps to successive "stepping stones" until the stream is crossed. In marginal likelihood estimation, we are trying to estimate the area under the posterior kernel (posterior kernel is statistical jargon for unnormalized posterior density). We know that the area under the prior density is 1, so if we estimated the ratio of the area under the posterior kernel to the area under the prior, that would be identical to the marginal likelihood. Unfortunately, there is a huge difference between these two areas, and the ratio is thus difficult to estimate directly. Instead we estimate a series of ratios that, when multiplied together, equal the desired ratio. These intermediate ratios represent smaller differences and are thus easier to estimate accurately. To estimate these "stepping-stone" ratios, MrBayes needs to obtain samples from distributions intermediate between the prior and posterior.<br />
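<br />
The product-of-ratios idea can be sketched in a few lines of Python for a toy coin-flipping model (this is not how MrBayes is implemented; the model, the sample sizes, and the function names are all assumptions for illustration). In this toy model each intermediate distribution is a Beta distribution that can be sampled directly, so no MCMC is needed, and the order in which the stones are visited does not matter. Each ratio is estimated by averaging the likelihood, raised to the difference between successive powers, over samples from the distribution at the lower power:<br />

```python
import math
import random

def log_likelihood(p, heads, flips):
    """Log of the binomial probability of the data given p."""
    return (math.log(math.comb(flips, heads))
            + heads * math.log(p) + (flips - heads) * math.log(1.0 - p))

def stepping_stone_log_marginal(heads, flips, a=1.0, b=1.0, nsteps=10,
                                alpha=0.3, samples_per_step=2000, seed=1):
    """Toy stepping-stone estimate of the log marginal likelihood."""
    rng = random.Random(seed)
    # Powers run from 0 (prior) to 1 (posterior), bunched toward the prior end.
    powers = [(k / nsteps) ** (1.0 / alpha) for k in range(nsteps + 1)]
    log_marginal = 0.0
    for k in range(1, nsteps + 1):
        prev, curr = powers[k - 1], powers[k]
        total = 0.0
        for _ in range(samples_per_step):
            # The power posterior at power prev is Beta(a + prev*heads, b + prev*tails).
            p = rng.betavariate(a + prev * heads, b + prev * (flips - heads))
            p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard against log(0)
            total += math.exp((curr - prev) * log_likelihood(p, heads, flips))
        # Multiplying the ratios is the same as summing their logs.
        log_marginal += math.log(total / samples_per_step)
    return log_marginal

estimate = stepping_stone_log_marginal(heads=7, flips=10)
exact = math.log(1.0 / 11.0)   # known answer for a flat prior and 10 flips
print(f"stepping-stone estimate = {estimate:.3f}, exact = {exact:.3f}")
```

For this toy problem the exact answer is known, so you can see how close the product of the ten estimated ratios comes to it.<br />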
<br />
MrBayes will, in our case, divide the 11000 samples (1100000 generations divided by 100 generations/sample) by 11 steps to yield 1000 samples per step. This is a little confusing because we specified <tt>nsteps=10</tt>, not <tt>nsteps=11</tt>. By default, MrBayes uses the first step as burnin, so you need to specify <tt>ngen</tt> and <tt>samplefreq</tt> in such a way that you get the desired number of samples per step. The formula is ngen = (nsteps+1)*(samplefreq)*(samples/step). In our case, ngen = (10+1)*(100)*(1000) = 1100000.<br />
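<br />
The bookkeeping in the formula above is easy to get wrong, so here is a small Python sketch of it (the function names are my own, not MrBayes commands). It assumes, as described above, that one extra step's worth of samples is used as burn-in:<br />

```python
def ss_ngen(nsteps, samplefreq, samples_per_step):
    """ngen needed to obtain the desired number of samples in every step."""
    return (nsteps + 1) * samplefreq * samples_per_step

def ss_samples_per_step(ngen, samplefreq, nsteps):
    """Samples available per step once the burn-in step is accounted for."""
    total_samples = ngen // samplefreq
    return total_samples // (nsteps + 1)

print(ss_ngen(nsteps=10, samplefreq=100, samples_per_step=1000))
print(ss_samples_per_step(ngen=1100000, samplefreq=100, nsteps=10))
```

The first line printed is the value of <tt>ngen</tt> used in this lab (1100000), and the second is the resulting number of samples per step (1000).<br />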
<br />
=== Spacing the stepping stones ===<br />
[[File:Alpha-1.0.png|thumb|right|Evenly spaced powers (here, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0) result in a big gap between the prior and the next power posterior density]]<br />
[[File:Alpha-0.17.png|thumb|right|A strongly skewed spacing of powers (here, 0.0, 0.000077, 0.0046, 0.050, 0.27, 1.0) results in a more even spacing of power posterior densities (and better estimates)]]<br />
The setting <tt>alpha=0.3</tt> controls how evenly the stepping stones are spaced across the stream: if alpha were set to 1, the beta values (the powers to which the likelihood is raised) would all be the same distance apart. This does not translate into evenly spaced stepping stones, as you can see in the top figure on the right. Setting alpha to be smaller than 1 (the bottom figure on the right uses alpha = 0.17) usually results in more evenly spaced stones (i.e. power posterior density curves) and thus more accurate estimates of ratios, especially those close to the prior end of the spectrum.<br />
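<br />
The powers listed in the two figure captions are consistent with taking evenly spaced values between 0 and 1 and raising each to the power 1/alpha. The Python sketch below assumes that rule (the function name is my own, and I have not checked it against the MrBayes source code):<br />

```python
def power_schedule(nsteps, alpha):
    """Powers (beta values) from 0 (prior) to 1 (posterior), assuming
    beta_k = (k / nsteps) ** (1 / alpha)."""
    return [(k / nsteps) ** (1.0 / alpha) for k in range(nsteps + 1)]

for alpha in (1.0, 0.17):
    powers = ", ".join(f"{b:.2g}" for b in power_schedule(5, alpha))
    print(f"alpha = {alpha}: {powers}")
```

With alpha = 1.0 the powers are evenly spaced, and with alpha = 0.17 most of them are crowded toward the prior end, as in the bottom figure.<br />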
<br />
=== Interpreting the results ===<br />
Once the run finishes, try to answer the following questions based on what you find in the <tt>output.txt</tt> file:<br />
<div style="background-color: #ccccff"> <br />
* ''What is the estimated log marginal likelihood using the stepping-stone method? Search for "Marginal likelihood (in natural log units) estimated using stepping-stone sampling..."'' {{title|I got -8665.57|answer}}<br />
* ''How does this estimate compare to that obtained using the harmonic mean approach?'' {{title|it is 106.66 log units smaller|answer}}<br />
</div><br />
Assuming that the stepping-stone estimate is closer to reality, it is clear that the harmonic mean estimator greatly overestimates how well the model is performing.<br />
<br />
== Challenge ==<br />
Your challenge for the last part of the lab is to estimate the marginal likelihood using the stepping-stone method for a "combined" model that partitions the data into only 2 subsets: morphology (1-166) and molecular (167-2397). You should create a new directory for this so that you do not overwrite files you've already created. <br />
<div style="background-color: #ccccff"> <br />
* ''What is the log marginal likelihood for this "combined" partition scheme?'' {{title|I got -8875.89|answer}}<br />
* ''Based on marginal likelihoods, is it better to divide the molecular data into separate 18S and COI subsets, or is it better to treat them as one combined subset?'' {{title|It is better to divide them: the log marginal likelihood for the separate scheme is -8665.57, which is 210.32 log units better than the log marginal likelihood for the combined scheme, -8875.89|answer}}<br />
* ''Assuming you found that the 3-subset partition scheme is better than the 2-subset partitioning, why do you think applying a separate model to 18S and COI improves the marginal likelihood?'' {{title|For one thing, the strong AT bias in COI compared to 18S argues for allowing these two data subsets to have models that differ, at least in base frequency. Also, 18S clearly evolves much more slowly than COI, so using the same relative rate for both leads to a model that doesn't fit either one particularly well.|answer}}<br />
</div><br />
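<br />
Comparing models this way amounts to subtracting log marginal likelihoods. The short Python sketch below does this using the values I obtained (your values will differ somewhat, and the dictionary keys and function name are my own labels):<br />

```python
# Log marginal likelihood estimates quoted in the answers in this lab.
log_marginal = {
    "harmonic_3_subsets": -8558.91,
    "ss_3_subsets": -8665.57,
    "ss_2_subsets": -8875.89,
}

def log_bayes_factor(model_a, model_b):
    """Log marginal likelihood of model_a minus that of model_b
    (positive values favor model_a)."""
    return log_marginal[model_a] - log_marginal[model_b]

print(f"harmonic mean minus stepping-stone (3 subsets): "
      f"{log_bayes_factor('harmonic_3_subsets', 'ss_3_subsets'):.2f}")
print(f"3 subsets minus 2 subsets (stepping-stone):     "
      f"{log_bayes_factor('ss_3_subsets', 'ss_2_subsets'):.2f}")
```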
<br />
== That's it for today ==<br />
<br />
This has been an unusual lab because we have not even looked at any phylogenetic trees resulting from these analyses! My purpose here is to show you how to set up a partitioned run with a mixture of morphological characters and sequence data, and how to estimate the marginal likelihood for a model accurately using the stepping-stone method. The [http://sysbio.oxfordjournals.org/content/53/1/47.full.pdf+html original paper] is worth reading if you end up running your own Bayesian analysis of combined morphological and molecular data. Nylander et al. do a great job of discussing the many issues that arise from partitioning the data and combining morphological characters with DNA sequences, and you should now have the background to understand a paper like this. The stepping-stone method is described in [http://sysbio.oxfordjournals.org/content/60/2/150.short this paper].<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Prospective_Student_Mark_Stukel&diff=38787Prospective Student Mark Stukel2018-03-20T22:09:45Z<p>Paul Lewis: </p>
<hr />
<div>==Sunday, March 25, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|12:08 ||Arrives BDL || Chris Simon Pickup<br />
|-<br />
|1:30 ||tour of Storrs || <br />
|-<br />
|6:00 ||Dinner || 17 Silver Falls Lane<br />
|-<br />
|-<br />
|}<br />
<br />
==Monday, March 26, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|9:00 || Paul Lewis || TLS 164<br />
|-<br />
|9:30 || Annette Evans || BioPharm 322<br />
|-<br />
|10:00 || Don Les|| BioPharm 305c<br />
|-<br />
|10:30 || Charlie Henry || TLS 479/481 <br />
|-<br />
|11:00 || Molecular Systematics Class || <br />
|-<br />
|12:15 || Entomeet-Wagner-Henry-Simon Labs|| <br />
|-<br />
|1:30 || Cera Fisher || BioPharm 318<br />
|-<br />
|2:00 || || <br />
|-<br />
|2:30 || Molecular Systematics Lab Minipresentation || Biopharm 3rd floor fishbowl <br />
|-<br />
|3:00 || || <br />
|-<br />
|3:30 || || <br />
|-<br />
|4:30 ||Simon Lab meeting || Biopharm 3rd floor fishbowl <br />
|-<br />
|6:00 || Dinner with Simon Lab || Willington Pizza<br />
|-<br />
|-<br />
|}<br />
<br />
==Tuesday, March 27, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|11:00 ||Phylogenetics Class || TLS 181<br />
|-<br />
|12:00 ||Lunch || <br />
|-<br />
|1:00 || || <br />
|-<br />
|1:30 || David Wagner || TLS 471<br />
|-<br />
|2:00 || || <br />
|-<br />
|3:00 || || <br />
|-<br />
|3:30 || || <br />
|-<br />
|4:00 || Drive to airport || <br />
|-</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Phylogenetics:_MrBayes_Lab&diff=38689Phylogenetics: MrBayes Lab2018-03-09T13:51:51Z<p>Paul Lewis: /* Specifying the prior on kappa */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]<br />
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span><br />
|-<br />
|The goal of this lab exercise is to introduce you to one of the most important computer programs for conducting Bayesian phylogenetic analyses, [http://mrbayes.csit.fsu.edu/ MrBayes]. We will be using MrBayes v3.2.4 x64 on the cluster in this lab. You will also learn how to use the program [http://tree.bio.ed.ac.uk/software/tracer/ Tracer] to analyze MrBayes' output.<br />
|}<br />
<br />
== Getting started ==<br />
=== (q)login to the cluster ===<br />
Log in to the Bioinformatics Facility Dell cluster:<br />
ssh username@bbcsrv3.biotech.uconn.edu<br />
Then use the qlogin command to find a free node:<br />
qlogin<br />
Once you are transferred to a free node, type<br />
module load mrbayes/3.2.6<br />
This makes a more recent version of MrBayes available to you (without this module command, you will be using the slightly older version MrBayes 3.2.4).<br />
<br />
=== Create a directory ===<br />
Use the unix mkdir command to create a directory to play in today:<br />
mkdir mblab<br />
<br />
=== Download and save the data file ===<br />
Save the contents of the file [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/algaemb.nex algaemb.nex] to a file in the mblab folder. One easy way to do this is to cd into the mblab folder, then use the curl command ("Copy URL") to download the file:<br />
cd mblab<br />
curl http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/data/algaemb.nex > algaemb.nex<br />
<br />
Use nano to look at this file. This is the same 16S data you have used before, but the original file had to be changed to accommodate MrBayes' eccentricities. While MrBayes does a fair job of reading Nexus files, it chokes on certain constructs. The information about what I had to change in order to get it to work is in a comment at the top of the file (this might be helpful in converting other nexus files to work with MrBayes in the future).<br />
<br />
== Creating a MRBAYES block ==<br />
The next step is to set up an MCMC analysis. There are three commands in particular that you will need to know about in order to set up a typical run: <tt>lset</tt>, <tt>prset</tt> and <tt>mcmc</tt>. The command <tt>mcmcp</tt> is identical to <tt>mcmc</tt> except that it does not actually start a run. For each of these commands you can obtain online information by typing <tt>help</tt> followed by the command name: for example, <tt>help prset</tt>. Start MrBayes interactively by simply typing <tt>mb</tt> to see how to get help:<br />
mb<br />
This will start MrBayes. At the "MrBayes>" prompt, type <tt>help</tt>:<br />
MrBayes> help<br />
To quit MrBayes, type <tt>quit</tt> at the "MrBayes>" prompt. You will need to quit MrBayes in order to build up the MRBAYES block in your data file. (I often open up 2 terminal windows so that I can keep MrBayes going in one of them for the purpose of accessing help.)<br />
<br />
Create a MRBAYES block in your Nexus data file. MrBayes does not have a built-in editor, so you will need to use the nano editor to edit the <tt>algaemb.nex</tt> data file. Use Ctrl-/, Ctrl-v to jump to the bottom of the file in nano, then add the following at the ''very bottom'' of the file to begin creating the MRBAYES block:<br />
begin mrbayes;<br />
set autoclose=yes;<br />
end; <br />
Note that I refer to this block as a MRBAYES block (upper case), but the MrBayes program does not care about case, so using mrbayes (lower case) works just fine. <!-- The <tt>seed=1</tt> and <tt>swapseed=1</tt> statements in the <tt>set</tt> command tell MrBayes to set the pseudorandom number seeds to a particular value. That way you will get the same run I did when I wrote the answers to many of the questions that follow. --> The <tt>autoclose=yes</tt> statement in the <tt>set</tt> command tells MrBayes that we will not want to continue the run beyond the 10,000 generations specified. If you leave this out, MrBayes will ask you whether you wish to continue running the chains after the specified number of generations is finished.<br />
<br />
Add subsequent commands (described below) after the <tt>set</tt> command and before the <tt>end;</tt> line. Note that commands in MrBayes are (intentionally) similar to those in PAUP*, but the differences can be frustrating. For instance, <tt>lset ?</tt> in PAUP* gives you information about the current likelihood settings, but this does not work in MrBayes. Instead, you type <tt>help lset</tt>. Also, the <tt>lset</tt> command in MrBayes has many options not present in PAUP*, and ''vice versa''.<br />
<br />
=== Specifying the prior on branch lengths ===<br />
[[Image:Exp10.png|right|thumb|Exponential(10) density function]]<br />
begin mrbayes;<br />
set autoclose=yes;<br />
'''prset brlenspr=unconstrained:exp(10.0);'''<br />
end;<br />
The <tt>prset</tt> command above specifies that branch lengths are to be unconstrained (i.e. a molecular clock is not assumed) and the prior distribution to be applied to each branch length is an exponential distribution with mean 1/10. Note that the value you specify for <tt>unconstrained:exp</tt> is the ''inverse'' of the mean.<br clear="right"/><br />
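<br />
To convince yourself that the value given to <tt>exp</tt> is a rate (the inverse of the mean), here is a minimal Python sketch (Python is not part of this lab; the function name and the seed are my own choices). It draws values from an Exponential(10) distribution and compares their average with 1/10:<br />

```python
import random

def exponential_prior_mean(rate):
    """Mean of an exponential distribution specified by its rate."""
    return 1.0 / rate

rng = random.Random(42)   # fixed seed so the sketch is repeatable
rate = 10.0               # the value inside exp(10.0)
draws = [rng.expovariate(rate) for _ in range(100000)]
sample_mean = sum(draws) / len(draws)

print(f"theoretical mean = {exponential_prior_mean(rate):.3f}")
print(f"simulated mean   = {sample_mean:.3f}")
```

Both numbers should be close to 0.1, the prior mean branch length implied by <tt>exp(10.0)</tt>.<br />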
<br />
=== Specifying the prior on the gamma shape parameter ===<br />
[[Image:Exp1.png|right|thumb|Exponential(1) density function]]<br />
begin mrbayes;<br />
set autoclose=yes;<br />
prset brlenspr=unconstrained:exp(10.0);<br />
'''prset shapepr=exp(1.0);'''<br />
end;<br />
The second <tt>prset</tt> command specifies an exponential distribution with mean 1.0 for the shape parameter of the gamma distribution we will use to model rate heterogeneity. Note that we have not yet told MrBayes that we wish to assume that substitution rates are variable - we will do that using the <tt>lset</tt> command below.<br clear="right"/><br />
<br />
=== Specifying the prior on kappa ===<br />
[[Image:Uniform01.png|left|thumb|Beta(1,1) density for transition and transversion rates]][[Image:Exp1ratio.png|right|thumb|BetaPrime(1,1) density for kappa]]<br />
begin mrbayes;<br />
set autoclose=yes;<br />
prset brlenspr=unconstrained:exp(10.0);<br />
prset shapepr=exp(1.0);<br />
'''prset tratiopr=beta(1.0,1.0);'''<br />
end;<br />
The command above says to use a Beta(1,1) distribution as the prior for the transition/transversion rate ratio. <br />
<div style="background-color: #ccccff"><br />
''Based on what you know about Beta distributions, does it make sense to use a Beta distribution as a prior for the transition/transversion rate ratio?'' {{title|it is strange because tratio ranges from 0 to infinity and the Beta distribution has support only from 0 to 1|answer}}<br />
</div><br />
Allow me to explain the use of a Beta distribution for tratio as best I can. Recall that the kappa parameter is the ratio <math>\alpha/\beta</math>, where <math>\alpha</math> is the rate of transitions and <math>\beta</math> is the rate of transversions. Rather than allowing you to place a prior directly on the ratio <math>\alpha/\beta</math>, which ranges from 0 to infinity, MrBayes asks you to instead place a joint (Beta) prior on <math>\alpha/(\alpha + \beta)</math> and <math>\beta/(\alpha + \beta)</math>. Here, <math>\alpha/(\alpha + \beta)</math> and <math>\beta/(\alpha + \beta)</math> act like <math>p</math> and <math>1-p</math> in the familiar coin flipping experiment. The reasoning behind this is esoteric, but is the same as the reasoning behind the (now commonplace) use of Dirichlet priors for the GTR relative rates, which is explained nicely in Zwickl, D., and Holder, M. T. 2004. Model parameterization, prior distributions, and the general time-reversible model in Bayesian phylogenetics. Systematic Biology 53(6):877–888. <br />
<br />
You might wonder what the Beta(1,1) distribution (figure on the left) implies about kappa. Transforming the Beta density into the density of <math>\alpha/\beta</math> results in the plot on the right. This density for kappa is very close, but not identical, to an exponential(1) distribution. This is known as the [http://en.wikipedia.org/wiki/Beta_prime_distribution Beta Prime distribution], and has support [0, infinity), which is appropriate for a ratio such as kappa. The Beta Prime distribution is somewhat peculiar, however, when both parameters are 1 (as they are in this case): in this case, the mean is not defined, which is to say that we cannot predict the mean of a sample of kappa values drawn from this distribution. It is not essential for a prior distribution to have a well-defined mean, so even though this is a little weird it nevertheless works pretty well.<br clear="both"/><br />
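<br />
The transformation described above can be checked numerically. If p = α/(α + β) is uniform on (0, 1), then kappa = p/(1 − p) has cumulative distribution function kappa/(1 + kappa) and density 1/(1 + kappa)<sup>2</sup>. The Python sketch below (function names and seed are my own) simulates kappa this way and compares the simulated median with the theoretical median of 1:<br />

```python
import random

def kappa_cdf(x):
    """P(kappa <= x) when alpha/(alpha + beta) has a Beta(1,1) prior."""
    return x / (1.0 + x)

def kappa_density(x):
    """Implied prior density of kappa (the derivative of kappa_cdf)."""
    return 1.0 / (1.0 + x) ** 2

rng = random.Random(1)            # fixed seed so the sketch is repeatable
draws = []
for _ in range(100000):
    p = rng.random()              # uniform on [0, 1), i.e. Beta(1,1)
    draws.append(p / (1.0 - p))   # transform to kappa = alpha/beta
draws.sort()
median = draws[len(draws) // 2]

print(f"simulated median of kappa = {median:.3f}")
print(f"theoretical P(kappa <= 1) = {kappa_cdf(1.0):.3f}")
```

The median is well behaved even though the mean is not: the sample mean of <tt>draws</tt> changes wildly from one seed to the next because it is dominated by a few enormous values.<br />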
<br />
=== Specifying a prior on base frequencies ===<br />
begin mrbayes;<br />
set autoclose=yes;<br />
prset brlenspr=unconstrained:exp(10.0);<br />
prset shapepr=exp(1.0);<br />
prset tratiopr=beta(1.0,1.0);<br />
'''prset statefreqpr=dirichlet(1.0,1.0,1.0,1.0);'''<br />
end;<br />
The above command states that a flat Dirichlet distribution is to be used for base frequencies. The Dirichlet distribution is like the Beta distribution, except that it is applicable to ''combinations'' of parameters. Like the Beta distribution, the distribution is symmetrical if all the parameters of the distribution are equal, and the distribution is flat if all the parameters of the distribution are equal to 1.0. Using the command above specifies a flat Dirichlet prior, which says that any combination of base frequencies will be given equal prior weight. This means that (0.01, 0.99, 0.0, 0.0) is just as probable, ''a priori'', as (0.25, 0.25, 0.25, 0.25). If you wanted base frequencies to not stray much from (0.25, 0.25, 0.25, 0.25), you could specify, say, <tt>statefreqpr=dirichlet(10.0,10.0,10.0,10.0)</tt> instead.<br />
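<br />
To see what "flat" means here, the Python sketch below evaluates the log density of a Dirichlet distribution at a few sets of base frequencies (the function name and the third frequency vector are my own, added for illustration):<br />

```python
import math

def dirichlet_log_density(freqs, params):
    """Log density of a Dirichlet(params) distribution at the point freqs."""
    log_norm = math.lgamma(sum(params)) - sum(math.lgamma(a) for a in params)
    # Terms with parameter 1 contribute nothing, so skip them (this also
    # avoids taking log(0) when a frequency is exactly zero).
    log_kernel = sum((a - 1.0) * math.log(f)
                     for a, f in zip(params, freqs) if a != 1.0)
    return log_norm + log_kernel

flat = (1.0, 1.0, 1.0, 1.0)
informative = (10.0, 10.0, 10.0, 10.0)

equal = (0.25, 0.25, 0.25, 0.25)
extreme = (0.01, 0.99, 0.0, 0.0)
lopsided = (0.7, 0.1, 0.1, 0.1)   # hypothetical frequencies for comparison

print("flat prior, equal frequencies:   ", dirichlet_log_density(equal, flat))
print("flat prior, extreme frequencies: ", dirichlet_log_density(extreme, flat))
print("Dirichlet(10,10,10,10), equal:   ", dirichlet_log_density(equal, informative))
print("Dirichlet(10,10,10,10), lopsided:", dirichlet_log_density(lopsided, informative))
```

Under the flat prior the two log densities are identical, whereas under Dirichlet(10,10,10,10) the equal frequencies have a much higher log density than the lopsided ones.<br />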
<br />
=== The lset command ===<br />
begin mrbayes;<br />
set autoclose=yes;<br />
prset brlenspr=unconstrained:exp(10.0);<br />
prset shapepr=exp(1.0);<br />
prset tratiopr=beta(1.0,1.0);<br />
prset statefreqpr=dirichlet(1.0,1.0,1.0,1.0);<br />
'''lset nst=2 rates=gamma ngammacat=4;'''<br />
end;<br />
We are finished setting priors now, so the <tt>lset</tt> command above finishes our specification of the model by telling MrBayes that we would like a 2-parameter substitution matrix (i.e. the rate matrix has only two substitution rates, the transition rate and the transversion rate). It also specifies that we would like rates to vary across sites according to a gamma distribution with 4 categories.<br />
<br />
=== Specifying MCMC options ===<br />
begin mrbayes;<br />
set autoclose=yes;<br />
prset brlenspr=unconstrained:exp(10.0);<br />
prset shapepr=exp(1.0);<br />
prset tratiopr=beta(1.0,1.0);<br />
prset statefreqpr=dirichlet(1.0,1.0,1.0,1.0);<br />
lset nst=2 rates=gamma ngammacat=4;<br />
'''mcmcp ngen=10000 samplefreq=10 printfreq=100 nruns=1 nchains=3 savebrlens=yes;'''<br />
end;<br />
The <tt>mcmcp</tt> command above specifies most of the remaining details of the analysis. <br />
<br />
<tt>ngen=10000</tt> tells MrBayes that its robots should each take 10,000 steps. You should ordinarily use much larger values for <tt>ngen</tt> than this (the default is 1 million steps). We're keeping it small here because we do not have a lot of time and the purpose of this lab is to learn how to use MrBayes, not produce a publishable result. <br />
<br />
<tt>samplefreq=10</tt> says to only save parameter values and the tree topology every 10 steps. <br />
<br />
<tt>printfreq=100</tt> says that we would like a progress report every 100 steps. <br />
<br />
<tt>nruns=1</tt> says to just do one independent run. MrBayes performs two separate analyses by default.<br />
<br />
<tt>nchains=3</tt> says that we would like to have 2 heated chains running in addition to the cold chain. <br />
<br />
Finally, <tt>savebrlens=yes</tt> tells MrBayes that we would like it to save branch lengths when it saves the sampled tree topologies.<br />
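<br />
The consequences of these settings can be tallied with a little arithmetic, sketched here in Python (the function and key names are my own, and the sample count does not include any sample recorded for the starting state):<br />

```python
def mcmc_summary(ngen, samplefreq, printfreq, nruns, nchains):
    """Simple bookkeeping implied by the mcmcp settings."""
    return {
        "samples_per_run": ngen // samplefreq,
        "progress_reports": ngen // printfreq,
        "heated_chains_per_run": nchains - 1,
        "total_chains": nruns * nchains,
    }

settings = mcmc_summary(ngen=10000, samplefreq=10, printfreq=100,
                        nruns=1, nchains=3)
for name, value in settings.items():
    print(f"{name}: {value}")
```

For the command above this gives 1000 samples, 100 progress reports, 2 heated chains and 3 chains in total.<br />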
<br />
=== Specifying an outgroup ===<br />
begin mrbayes;<br />
set autoclose=yes;<br />
prset brlenspr=unconstrained:exp(10.0);<br />
prset shapepr=exp(1.0);<br />
prset tratiopr=beta(1.0,1.0);<br />
prset statefreqpr=dirichlet(1.0,1.0,1.0,1.0);<br />
lset nst=2 rates=gamma ngammacat=4;<br />
mcmcp ngen=10000 samplefreq=10 printfreq=100 nruns=1 nchains=3 savebrlens=yes;<br />
'''outgroup Anacystis_nidulans;'''<br />
end;<br />
The <tt>outgroup</tt> command merely affects the display of trees. It says we want trees to be rooted between the taxon <tt>Anacystis_nidulans</tt> and everything else.<br />
<br />
== Running MrBayes and interpreting the results ==<br />
Now save the file and start MrBayes (from within the <tt>mblab</tt> directory) by typing<br />
mb<br />
Once it starts, type the following at the <tt>MrBayes&gt;</tt> prompt<br />
exe algaemb.nex<br />
Then type<br />
mcmc<br />
This command starts the run. While MrBayes runs, it shows one-line progress reports. The first column is the iteration (generation) number. The next three columns show the log-likelihoods of the separate chains that are running, with the cold chain indicated by square brackets rather than parentheses. The last complete column is a prediction of the time remaining until the run completes. The columns consisting of only -- are simply separators; they have no meaning.<br />
<div style="background-color: #ccccff"><br />
* ''Do you see evidence that the 3 chains are swapping with each other?'' {{title|yes, the square brackets move around indicating swaps have taken place|answer}}<br />
</div><br />
<br />
The section entitled ''Chain swap information:'' reports the number of times each of the three chains attempted to swap with one of the other chains (three values in lower left, below the main diagonal) and the proportion of time such attempts were successful (three values in upper right, above the main diagonal). <br />
<div style="background-color: #ccccff"><br />
* ''How many times did MrBayes attempt to swap chains per generation? Use the information in the lower diagonal of the chain swap information table for this, in conjunction with the number of total generations you specified in the MRBAYES block'' {{title|MrBayes attempts to swap a random pair of chains once per generation, as indicated by the fact that the total number of swap attempts equals the number of generations|answer}}<br />
</div><br />
<br />
When the run has finished, MrBayes will report (in the section entitled ''Acceptance rates for the moves in the "cold" chain:'') various statistics about the run, such as the percentage of time it was able to accept proposed changes of various sorts. These percentages should, ideally, all be between about 20% and 50%, but as long as they are not extreme (e.g. 1% or 99%) then things went well. Even if there are low acceptance rates for some proposal types, this may not be important if there are other proposal types that operate on the same parameters. For example, note that ExtSPR, ExtTBR, NNI and ParsSPR all operate on Tau, which is the tree topology. As long as these proposals are collectively effective, the fact that one of them is accepting at a very low rate is not of concern.<br />
<div style="background-color: #ccccff"><br />
* ''What explanation could you offer if the acceptance rate was very low, e.g. 1%?'' {{title|proposals are too bold and tend to propose places too far down the hill that are then rejected|answer}}<br />
* ''What explanation could you offer if the acceptance rate was very high, e.g. 99%?'' {{title|proposals are not bold enough and tend to propose places very close to the current position, which are accepted with high probability because they cannot be very far downhill|answer}}<br />
</div><br />
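If you would like to see the link between boldness and acceptance rate for yourself, the following R sketch runs a toy random-walk Metropolis sampler on a standard normal target using three different step sizes. This is not what MrBayes does internally; the target, the step sizes and the function name <tt>metropolis_accept_rate</tt> are all made up for illustration.<br />

```r
# Toy random-walk Metropolis sampler on a standard normal target.
# Returns the fraction of proposals that were accepted.
metropolis_accept_rate <- function(step_sd, n_iter = 5000, seed = 1) {
  set.seed(seed)
  x <- 0
  n_accept <- 0
  for (i in seq_len(n_iter)) {
    proposal <- x + rnorm(1, mean = 0, sd = step_sd)
    # log of the ratio (target density at proposal) / (target density at current point)
    log_ratio <- dnorm(proposal, log = TRUE) - dnorm(x, log = TRUE)
    if (log(runif(1)) < log_ratio) {
      x <- proposal
      n_accept <- n_accept + 1
    }
  }
  n_accept / n_iter
}

# Timid, moderate and very bold proposals (step sizes chosen arbitrarily)
rates <- sapply(c(timid = 0.01, moderate = 2.4, bold = 100), metropolis_accept_rate)
print(round(rates, 3))
```

The timid sampler accepts nearly everything but barely moves, while the bold sampler rejects nearly everything and so also barely moves.<br />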
Below is the acceptance information for my run:<br />
Acceptance rates for the moves in the "cold" chain:<br />
With prob. (last 100) chain accepted proposals by move<br />
38.3 % ( 32 %) Dirichlet(Tratio)<br />
21.7 % ( 18 %) Dirichlet(Pi)<br />
NA NA Slider(Pi)<br />
49.3 % ( 52 %) Multiplier(Alpha)<br />
9.0 % ( 14 %) ExtSPR(Tau,V)<br />
2.3 % ( 5 %) ExtTBR(Tau,V)<br />
10.3 % ( 17 %) NNI(Tau,V)<br />
9.8 % ( 8 %) ParsSPR(Tau,V)<br />
46.3 % ( 39 %) Multiplier(V)<br />
29.8 % ( 28 %) Nodeslider(V)<br />
19.6 % ( 22 %) TLMultiplier(V)<br />
In the above table, 49.3% of proposals to change the gamma shape parameter (denoted Alpha by MrBayes) were accepted. This makes it sound as if the gamma shape parameter was changed quite often, but to get the full picture, you need to scroll up to the beginning of the output and examine this section: <br />
The MCMC sampler will use the following moves:<br />
With prob. Chain will use move<br />
2.00 % Dirichlet(Tratio)<br />
1.00 % Dirichlet(Pi)<br />
1.00 % Slider(Pi)<br />
2.00 % Multiplier(Alpha)<br />
10.00 % ExtSPR(Tau,V)<br />
10.00 % ExtTBR(Tau,V)<br />
10.00 % NNI(Tau,V)<br />
10.00 % ParsSPR(Tau,V)<br />
40.00 % Multiplier(V)<br />
10.00 % Nodeslider(V)<br />
4.00 % TLMultiplier(V)<br />
This says that an attempt to change the gamma shape parameter will only be made in 2% of the iterations. <br />
<div style="background-color: #ccccff"><br />
* ''How many times did MrBayes attempt to modify the gamma shape parameter?'' {{title|2% of 10000 is 200 times|answer}}<br />
* ''How many times did MrBayes actually modify the gamma shape parameter?'' {{title|49.3% of 2% of 10000 is 99 times|answer}}<br />
</div><br />
<br />
The fact that MrBayes modified the gamma shape parameter fewer than 100 times out of a run involving 10000 iterations brings up a couple of important points. First, in each iteration, MrBayes chooses a move (i.e. proposal) at random to try. Each move is associated with a "Rel. prob." (relative probability). Using the <tt>showmoves</tt> command shows the following list of moves that were used in this particular analysis:<br />
1 -- Move = Dirichlet(Tratio)<br />
Type = Dirichlet proposal<br />
Parameter = Tratio [param. 1] (Transition and transversion rates)<br />
Tuningparam = alpha (Dirichlet parameter)<br />
alpha = 49.010 [chain 1]<br />
49.502 [chain 2]<br />
49.502 [chain 3]<br />
Targetrate = 0.250<br />
Rel. prob. = 1.0<br />
<br />
2 -- Move = Dirichlet(Pi)<br />
Type = Dirichlet proposal<br />
Parameter = Pi [param. 2] (Stationary state frequencies)<br />
Tuningparam = alpha (Dirichlet parameter)<br />
alpha = 101.005 [chain 1]<br />
101.005 [chain 2]<br />
100.000 [chain 3]<br />
Targetrate = 0.250<br />
Rel. prob. = 0.5<br />
<br />
3 -- Move = Slider(Pi)<br />
Type = Sliding window<br />
Parameter = Pi [param. 2] (Stationary state frequencies)<br />
Tuningparam = delta (Sliding window size)<br />
delta = 0.202 [chain 1]<br />
0.202 [chain 2]<br />
0.200 [chain 3]<br />
Targetrate = 0.250<br />
Rel. prob. = 0.5<br />
<br />
4 -- Move = Multiplier(Alpha)<br />
Type = Multiplier<br />
Parameter = Alpha [param. 3] (Shape of scaled gamma distribution of site rates)<br />
Tuningparam = lambda (Multiplier tuning parameter)<br />
lambda = 0.827 [chain 1]<br />
0.827 [chain 2]<br />
0.819 [chain 3]<br />
Targetrate = 0.250<br />
Rel. prob. = 1.0<br />
<br />
5 -- Move = ExtSPR(Tau,V)<br />
Type = Extending SPR<br />
Parameters = Tau [param. 5] (Topology)<br />
V [param. 6] (Branch lengths)<br />
Tuningparam = p_ext (Extension probability)<br />
lambda (Multiplier tuning parameter)<br />
p_ext = 0.500<br />
lambda = 0.098<br />
Rel. prob. = 5.0<br />
<br />
6 -- Move = ExtTBR(Tau,V)<br />
Type = Extending TBR<br />
Parameters = Tau [param. 5] (Topology)<br />
V [param. 6] (Branch lengths)<br />
Tuningparam = p_ext (Extension probability)<br />
lambda (Multiplier tuning parameter)<br />
p_ext = 0.500<br />
lambda = 0.098<br />
Rel. prob. = 5.0<br />
<br />
7 -- Move = NNI(Tau,V)<br />
Type = NNI move<br />
Parameters = Tau [param. 5] (Topology)<br />
V [param. 6] (Branch lengths)<br />
Rel. prob. = 5.0<br />
<br />
8 -- Move = ParsSPR(Tau,V)<br />
Type = Parsimony-biased SPR<br />
Parameters = Tau [param. 5] (Topology)<br />
V [param. 6] (Branch lengths)<br />
Tuningparam = warp (parsimony warp factor)<br />
lambda (multiplier tuning parameter)<br />
r (reweighting probability)<br />
warp = 0.100<br />
lambda = 0.098<br />
r = 0.050<br />
Rel. prob. = 5.0<br />
<br />
9 -- Move = Multiplier(V)<br />
Type = Random brlen hit with multiplier<br />
Parameter = V [param. 6] (Branch lengths)<br />
Tuningparam = lambda (Multiplier tuning parameter)<br />
lambda = 2.048<br />
Targetrate = 0.250<br />
Rel. prob. = 20.0<br />
<br />
10 -- Move = Nodeslider(V)<br />
Type = Node slider (uniform on possible positions)<br />
Parameter = V [param. 6] (Branch lengths)<br />
Tuningparam = lambda (Multiplier tuning parameter)<br />
lambda = 0.191<br />
Rel. prob. = 5.0<br />
<br />
11 -- Move = TLMultiplier(V)<br />
Type = Whole treelength hit with multiplier<br />
Parameter = V [param. 6] (Branch lengths)<br />
Tuningparam = lambda (Multiplier tuning parameter)<br />
lambda = 1.332 [chain 1]<br />
1.345 [chain 2]<br />
1.332 [chain 3]<br />
Targetrate = 0.250<br />
Rel. prob. = 2.0<br />
<br />
Use 'Showmoves allavailable=yes' to see a list of all available moves<br />
Summing the 11 relative probabilities yields 1 + 0.5 + 0.5 + 1 + 5 + 5 + 5 + 5 + 20 + 5 + 2 = 50. To get the probability of using one of these moves in any particular iteration, MrBayes divides the relative probability for the move by this sum. Thus, move 4, whose job is to update the gamma shape parameter (called Alpha by MrBayes) will be chosen with probability 1/50 = 0.02. This is where the "2.00 % Multiplier(Alpha)" line comes from in the move probability table spit out just before the run started.<br />
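<br />
The same arithmetic can be done in R. This sketch copies the relative probabilities from the <tt>showmoves</tt> output above; the acceptance rate of 49.3% is the one from my run, so your numbers may differ, and because moves are chosen at random the attempt count is an expected value rather than an exact one.<br />

```r
# Relative probabilities copied from the showmoves output above
rel_prob <- c("Dirichlet(Tratio)" = 1, "Dirichlet(Pi)" = 0.5, "Slider(Pi)" = 0.5,
              "Multiplier(Alpha)" = 1, "ExtSPR(Tau,V)" = 5, "ExtTBR(Tau,V)" = 5,
              "NNI(Tau,V)" = 5, "ParsSPR(Tau,V)" = 5, "Multiplier(V)" = 20,
              "Nodeslider(V)" = 5, "TLMultiplier(V)" = 2)

# Probability of choosing each move in any one iteration
move_prob <- rel_prob / sum(rel_prob)
print(round(100 * move_prob, 2))

# Expected number of attempts and acceptances for the gamma shape move
ngen <- 10000
attempts <- ngen * move_prob[["Multiplier(Alpha)"]]
accepted <- 0.493 * attempts   # 49.3% acceptance rate taken from my run
print(c(attempts = attempts, accepted = accepted))
```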
<br />
Second, note that MrBayes places a lot of emphasis on modifying the tree topology and branch lengths (in this case 94% of proposals), but puts little effort (in this case only 6%) into updating other model parameters. You can change the percent effort for a particular move using the <tt>propset</tt> command. For example, to increase the effort devoted to updating the gamma shape parameter, you could ('''but don't do this now!''') issue the following command either at the MrBayes prompt or in a MRBAYES block:<br />
propset Multiplier(Alpha)$prob=10<br />
This will change the relative probability of the "Multiplier(Alpha)" move from its default value 1 to the value you specified (10). You can also change tuning parameters for moves using the <tt>propset</tt> command. Before doing that, however, we need to see if the boldness of any moves needs to be changed.<br />
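<br />
Here is a rough R sketch of what that <tt>propset</tt> command would do to the allocation of effort, assuming MrBayes simply renormalizes the relative probabilities after the change (my reading of how the move probabilities are computed, not something I have checked in the source code). The short move names are my own labels.<br />

```r
# Relative probabilities from showmoves, with shortened (made-up) labels
rel_prob <- c(Tratio = 1, Pi_dirichlet = 0.5, Pi_slider = 0.5, Alpha = 1,
              ExtSPR = 5, ExtTBR = 5, NNI = 5, ParsSPR = 5,
              Multiplier_V = 20, Nodeslider = 5, TLMultiplier = 2)
tree_moves <- c("ExtSPR", "ExtTBR", "NNI", "ParsSPR",
                "Multiplier_V", "Nodeslider", "TLMultiplier")

# Share of effort spent on topology and branch lengths by default
share_tree <- sum(rel_prob[tree_moves]) / sum(rel_prob)

# Effect of: propset Multiplier(Alpha)$prob=10
new_rel <- rel_prob
new_rel["Alpha"] <- 10
new_alpha_prob <- new_rel[["Alpha"]] / sum(new_rel)
new_share_tree <- sum(new_rel[tree_moves]) / sum(new_rel)

print(round(100 * c(tree_before = share_tree, tree_after = new_share_tree,
                    alpha_after = new_alpha_prob), 1))
```

Note that raising the weight of one move necessarily lowers the share of every other move, since the probabilities must still sum to 1.<br />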
<br />
=== The sump command ===<br />
MrBayes saves information in several files. Only two of these will concern us today. One of them will be called <tt>algaemb.nex.p</tt>. This is the file in which the sampled parameter values were saved. This file is saved as a tab-delimited text file so it is possible to read it into a variety of programs that can be used for summarization or plotting. We will examine this file graphically in a moment, but first let's get MrBayes to summarize its contents for us.<br />
<br />
At the MrBayes prompt, type the command <tt>sump</tt>. This will generate a crude graph showing the log-likelihood as a function of time. Note that the log-likelihood starts out low on the left (you started from a random tree, remember), then quickly climbs to a range of values just below -3176.<br />
<br />
Below the graph, MrBayes provides the arithmetic mean and harmonic mean of the marginal likelihood. The '''harmonic mean''' has often been used in estimating '''Bayes factors''', which are in turn useful for deciding which among different models fits the data best on average. We will talk about how to use this value in lecture, where you will also get some dire warnings about Bayes factors calculated in this way.<br />
<br />
The table at the end is quite useful. It shows the posterior mean, median, variance and 95% credible interval for each parameter in your model based on the samples taken during the run. The credible interval shows the range of values of a parameter that account for the middle 95% of its marginal posterior distribution. If the credible interval for kappa is 3.8 to 6.8, then you can say that there is a 95% chance that kappa is between 3.8 and 6.8 given your data and the assumed model. The parameter TL represents the sum of all the branch lengths. Rather than report every branch length individually, MrBayes just keeps track of their sum.<br />
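<br />
If you ever want to reproduce this kind of summary yourself, the R sketch below computes the mean, variance, median and the middle-95% interval described above from a vector of sampled values. The values here are simulated stand-ins, '''not''' output from a real run, and the function name <tt>summarize_samples</tt> is my own. I am also not claiming this matches MrBayes' interval calculation exactly.<br />

```r
# Summarize a vector of sampled parameter values.
# The interval is the equal-tailed one: the middle 95% of the sampled values.
summarize_samples <- function(values) {
  c(mean     = mean(values),
    variance = var(values),
    median   = median(values),
    lower    = unname(quantile(values, 0.025)),
    upper    = unname(quantile(values, 0.975)))
}

# Simulated stand-in for one column of sampled values (NOT from a real run).
# With a real run you would take a column of the .p file after discarding burn-in.
set.seed(42)
fake_samples <- rgamma(5000, shape = 50, rate = 10)
print(round(summarize_samples(fake_samples), 3))
```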
<br />
Look at the output of the sump command and answer these questions:<br />
<div style="background-color: #ccccff"> <br />
* ''What is the total number of samples saved from the posterior distribution?'' {{title|1001|answer}}<br />
* ''How many iterations (generations) did you specify in your MRBAYES block?'' {{title|10000|answer}}<br />
* ''Explain why the two numbers above are different.'' {{title|sampled only every 10th iteration, which yields 1000 samples; the 1 additional sample represents the starting state|answer}}<br />
* ''What proportion of the sampled values did MrBayes automatically exclude as burn-in?'' {{title|25%, as indicated by the statement: Based on a total of 751 samples out of a total of 1001 samples|answer}}<br />
* ''Which value in the parameter column had the largest effective sample size (ESS)?'' {{title|TL|answer}}<br />
* ''Would you conclude from the ESS column that a longer run is necessary?'' {{title|yes, I found that only 2 parameters have ESS values greater than 100|answer}}<br />
</div><br />
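The sample counts reported by <tt>sump</tt> follow from the settings in the MRBAYES block. Here is that bookkeeping as an R sketch; rounding the burn-in down to a whole number of samples is my assumption, chosen because it reproduces the counts MrBayes reported for my run.<br />

```r
ngen <- 10000        # generations, from the mcmcp command
samplefreq <- 10     # save every 10th generation
burnin_frac <- 0.25  # fraction MrBayes discarded as burn-in

n_samples <- ngen / samplefreq + 1          # the extra 1 is the starting state
n_burnin  <- floor(burnin_frac * n_samples) # ASSUMED rounding rule
n_kept    <- n_samples - n_burnin
print(c(total = n_samples, burnin = n_burnin, kept = n_kept))
```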
<br />
=== The sumt command ===<br />
Now type the command <tt>sumt</tt>. This will summarize the trees that have been saved in the file <tt>algaemb.nex.t</tt>. <br />
<br />
The output of this command includes a bipartition (=split) table, showing posterior probabilities for every split found in any tree sampled during the run. After the bipartition table, MrBayes shows a majority-rule consensus tree (labeled ''Clade credibility values'') containing all splits that had posterior probability 0.5 or above. <br />
<br />
If you chose to save branch lengths (and we did), MrBayes shows a second tree (labeled ''Phylogram'') in which each branch is displayed in such a way that branch lengths are proportional to their posterior mean. MrBayes keeps a running sum of the branch lengths for particular splits it finds in trees as it reads the file <tt>algaemb.nex.t</tt>. Before displaying this tree, it divides the sum for each split by the total number of times it encountered the split to get a simple average branch length for each split. It then draws the tree so that branch lengths are proportional to these mean branch lengths.<br />
<br />
Finally, the last thing the <tt>sumt</tt> command does is tell you how many tree topologies are in credible sets of various sizes. For example, in my run, it said that the 99% credible set contained 16 trees. What does this tell us? MrBayes orders tree topologies from most frequent to least frequent (where frequency refers to the number of times they appear in <tt>algaemb.nex.t</tt>). To construct the 99% credible set of trees, it begins by adding the most frequent tree to the set. If that tree accounts for 99% or more of the posterior probability (i.e. at least 99% of all the trees in the <tt>algaemb.nex.t</tt> file have this topology), then MrBayes would say that the 99% credible set contains 1 tree. If the most frequent tree topology was not that frequent, then MrBayes would add the next most frequent tree topology to the set. If the combined posterior probability of both trees was at least 0.99, it would say that the 99% credible set contains 2 trees. In our case, it had to add the top 16 trees to get the total posterior probability up to 99%. <br />
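<br />
The procedure just described is easy to sketch in R. The topology counts below are hypothetical (they are not from my run), and <tt>credible_set_size</tt> is a name I made up for this illustration.<br />

```r
# Number of tree topologies needed to reach a given cumulative posterior probability.
# counts: how many times each distinct topology was sampled.
credible_set_size <- function(counts, level = 0.99) {
  sorted <- sort(counts, decreasing = TRUE)   # most frequent topology first
  cum_prob <- cumsum(sorted) / sum(sorted)    # cumulative posterior probability
  which(cum_prob >= level)[1]                 # first position reaching the level
}

# HYPOTHETICAL counts for six distinct topologies among 1000 sampled trees
counts <- c(600, 250, 100, 30, 15, 5)
print(credible_set_size(counts, level = 0.99))
print(credible_set_size(counts, level = 0.50))
```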
<br />
Type <tt>quit</tt> (or just <tt>q</tt>) to quit MrBayes now.<br />
<br />
== Using Tracer to summarize MCMC results ==<br />
The Java program [http://tree.bio.ed.ac.uk/software/tracer/ Tracer] is very useful for summarizing the results of Bayesian phylogenetic analyses. Tracer was written to accompany the program [http://tree.bio.ed.ac.uk/software/beast/ Beast], but it works well with the output file produced by MrBayes as well. This lab was written using Tracer version 1.6.<br />
<br />
To use Tracer on your own computer to view files created on the cluster, you need to get the file on the cluster downloaded to your laptop. Download (using Cyberduck, FileZilla, Fugu, scp, or whatever has been working) the file <tt>algaemb.nex.p</tt>.<br />
<br />
After starting Tracer, choose ''File > Import Trace File...'' to choose a parameter sample file to display (you can also do this by clicking the + button under the list of trace files in the upper left corner of the main window). Select the <tt>algaemb.nex.p</tt> file in your working folder, then click the ''Open'' button to read it. '''Important:''' you will need to change the format to "All Files" instead of "BEAST log (*.log) Files" before Tracer will allow you to select it.<br />
<br />
You should now see 8 rows of values in the table labeled ''Traces'' on the left side of the main window. The first row (''LnL'') is selected by default, and Tracer shows a histogram of log-likelihood values on the right, with summary statistics above the histogram.<br />
<br />
A histogram is perhaps not the most useful plot to make with the LnL values. Click the Trace tab to see a trace plot (plot of the log-likelihood through time).<br />
<br />
Tracer determines the burn-in period using an undocumented algorithm. You may wish to be more conservative than Tracer. Opinions vary about burn-in. Some Bayesians feel it is important to exclude the first few samples because it is obvious that the chains have not reached stationarity at this point. Other Bayesians feel that if you are worried about the effect of the earliest samples, then you definitely have not run your chains long enough! You might be interested in reading [http://www.stat.umn.edu/~charlie/mcmc/burn.html Charlie Geyer's rant] on burn-in some time. <br />
<br />
Because our MrBayes run was just to learn how to run MrBayes and not to do a serious analysis, the trace plot of the log-likelihood will reflect the fact that in this case the burn-in period should be at least 20% of the run! A longer run is also indicated by all the ESS values shown in <span style="color:red; font-weight:bold">red</span> in the Traces panel. Tracer shows an ESS in red if it is less than 200, which it treats as the minimal effective sample size.<br />
<div style="background-color: #ccccff"> <br />
* ''What is the effective sample size for TL?'' {{title|14|answer}}<br />
* ''What did MrBayes report as the effective sample size for TL?'' {{title|155.18|answer}}<br />
* ''Why is there a difference? Hint: compare the burn-in for both.'' {{title|Tracer excluded 10% of samples as burn-in, while MrBayes excluded 25%|answer}}<br />
* ''Explain why the ESS reported by MrBayes is higher than that reported by Tracer even though fewer samples were included by MrBayes.'' {{title|MrBayes cut out all of the initial climb out of randomness, leaving only samples that were much less autocorrelated|answer}}<br />
</div><br />
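To see why autocorrelation lowers the effective sample size, here is a generic R sketch that estimates ESS as N divided by (1 + 2 times the sum of the autocorrelations), truncating the sum at the first non-positive autocorrelation. This is a textbook-style estimator with a truncation rule I chose for simplicity; I am not claiming it is the calculation that either Tracer or MrBayes performs. Both series are simulated.<br />

```r
# Generic ESS estimate: N / (1 + 2 * sum of positive-lag autocorrelations),
# truncating the sum at the first non-positive autocorrelation (an ASSUMED rule).
ess <- function(x) {
  n <- length(x)
  rho <- acf(x, lag.max = n - 1, plot = FALSE)$acf[-1]  # drop lag 0
  first_nonpos <- which(rho <= 0)[1]
  if (!is.na(first_nonpos)) rho <- rho[seq_len(first_nonpos - 1)]
  n / (1 + 2 * sum(rho))
}

set.seed(7)
independent <- rnorm(1000)                                        # uncorrelated draws
sticky <- as.numeric(arima.sim(model = list(ar = 0.9), n = 1000)) # strongly autocorrelated
print(round(c(independent = ess(independent), sticky = ess(sticky))))
```

Both series contain 1000 values, but the autocorrelated one carries far less independent information.<br />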
<br />
'''Before going further!!!''' Change the burn-in used by Tracer from 1000 to 2500 so that the burn-in includes all of the initial climb out of randomness evident in the trace plot of LnL.<br />
<br />
Click the Estimates tab again at the top, then click the row labeled kappa on the left. <br />
<div style="background-color: #ccccff"> <br />
* ''What is the posterior mean of kappa?'' {{title|4.8487|answer}}<br />
* ''What is the 95% credible interval for kappa?'' {{title|3.8506 to 5.8384|answer}}<br />
</div><br />
<br />
Click the row labeled ''alpha'' on the left. This is the shape parameter of the gamma distribution governing rates across sites. <br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the posterior mean of alpha?'' {{title|0.2498|answer}}<br />
* ''What is the 95% credible interval for alpha?'' {{title|0.1663 to 0.3145|answer}}<br />
* ''Is there rate heterogeneity among sites, or are all sites evolving at nearly the same rate?'' {{title|a gamma shape parameter considerably less than 1 indicates substantial rate heterogeneity|answer}}<br />
</div><br />
<br />
Click on the row labeled ''TL'' on the left (the Tree Length).<br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the posterior mean tree length?'' {{title|0.646|answer}}<br />
* ''What is the mean edge length? (Hint: divide the tree length by the number of edges, which is 2n-3 if n is the number of taxa.)'' {{title|0.646 divided by 13 equals 0.0497|answer}}<br />
</div><br />
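If you would rather let R do this division, here is a minimal sketch. The tree length is the posterior mean from my run, so substitute your own value.<br />

```r
n_taxa <- 8
n_edges <- 2 * n_taxa - 3    # number of edges in an unrooted binary tree
tree_length <- 0.646         # posterior mean of TL from my run; yours will differ
mean_edge <- tree_length / n_edges
print(c(edges = n_edges, mean_edge_length = round(mean_edge, 4)))
```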
<br />
=== Scatterplots of pairs of parameters ===<br />
Note that Tracer lets you easily create scatterplots of combinations of parameters. Simply select two parameters (you will have to hold down the Ctrl or Cmd key to select multiple items) and then click on the Joint-Marginal tab.<br />
<br />
=== Marginal densities ===<br />
Try selecting all four base frequencies and then clicking the Marginal Prob Distribution tab. This will show (estimated) marginal probability density plots for all four frequencies at once. Note that KDE is selected underneath the plot in the Display drop-down list. KDE stands for "Kernel Density Estimation" and represents a common non-parametric method for smoothing histograms into estimates of probability density functions.<br />
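<br />
R's built-in <tt>density</tt> function performs kernel density estimation, so you can compare a histogram and its KDE yourself. The sketch below uses simulated values as a stand-in for sampled base frequencies; I am not assuming Tracer uses the same kernel or bandwidth as R's defaults.<br />

```r
# Simulated stand-in for sampled values of one base frequency (NOT real output)
set.seed(1)
x <- rbeta(2000, 30, 70)

kde <- density(x)   # Gaussian kernel and automatic bandwidth by default

# Histogram on the density scale with the KDE curve drawn on top
hist(x, breaks = 30, freq = FALSE, main = "Histogram vs. KDE", xlab = "value")
lines(kde, lwd = 2)

# A density estimate should integrate to about 1 (trapezoid rule)
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
print(round(area, 3))
```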
<br />
== Running MrBayes with no data ==<br />
Why would you want to run MrBayes with no data? Here's a possible reason. You discover by reading the text that results from typing <tt>help prset</tt> that MrBayes assumes, by default, the following branch length prior: <tt>exp(10)</tt>. What does the 10 mean here? Is this an exponential distribution with mean 10 or is 10 the "rate" parameter (a common way to parameterize the exponential distribution)? If 10 is correctly interpreted as the rate parameter, then the mean of the distribution is 1/rate, or 0.1. Even good documentation such as that provided for MrBayes does not explicitly spell out everything you might want to know, but running MrBayes without data can provide answers, at least to questions concerning prior distributions.<br />
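<br />
Before running MrBayes, it helps to know what each interpretation predicts. R's <tt>rexp</tt> function is parameterized by the rate, so this sketch shows what to expect if the 10 is a rate, and contrasts the tree length (sum of 13 branch lengths for our 8 taxa) expected under each interpretation.<br />

```r
set.seed(1)
n_edges <- 13   # 2n - 3 edges for n = 8 taxa

# If 10 is the RATE, branch lengths should average 1/10
draws <- rexp(100000, rate = 10)
sample_mean <- mean(draws)

# Expected tree length under each interpretation of exp(10)
tl_if_rate <- n_edges * (1 / 10)  # 10 is the rate, so the mean is 0.1
tl_if_mean <- n_edges * 10        # 10 is the mean

print(round(c(sample_mean = sample_mean, tl_if_rate = tl_if_rate, tl_if_mean = tl_if_mean), 3))
```

The two predictions differ by a factor of 100, so a data-free run should make it obvious which interpretation MrBayes uses.<br />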
<br />
Also, it is not possible to place prior distributions directly on some quantities of interest. For example, while you can specify a flat prior on topologies, it is not possible to place a prior on a particular split you are interested in. This is because the prior distribution of splits is '''induced''' by the prior you place on topologies. Running a Bayesian MCMC program without data is a good way to make sure you know what priors you are actually placing on the quantities of interest.<br />
<br />
If there is no information in the data, the posterior distribution equals the prior distribution. An MCMC analysis in such cases provides an approximation of the prior. MrBayes makes it easy to run the MCMC analysis without data. (For programs that don't make it easy, simply create a data set containing just one site for which each taxon has missing data.) <br />
<br />
Start by deleting the output from the earlier run of the <tt>algaemb.nex</tt> data file:<br />
rm -f algaemb.nex.*<br />
The above command will leave the data file <tt>algaemb.nex</tt> behind, but delete files with names based on the data file but which append other characters to the filename, such as <tt>algaemb.nex.p</tt> and <tt>algaemb.nex.t</tt>. The -f means "force" (i.e. don't ask, just delete). It goes without saying that you should not use <tt>rm -f</tt> if you are tired!<br />
<br />
If the only file remaining in the <tt>mblab</tt> directory is the <tt>algaemb.nex</tt> data file, type the following to start the data-free analysis:<br />
mb -i algaemb.nex<br />
MrBayes> mcmc data=no ngen=1000000 samplefreq=100<br />
Note that I have increased the number of generations to 1 million because the run will go very fast. Sampling every 100th generation will give us a sample of size 10000 to work with.<br />
<br />
<div style="background-color: #ccccff"><br />
* ''Consulting Bayes' formula, what value of the likelihood would cause the posterior to equal the prior?'' {{title|1.0|answer}}<br />
* ''Is this the value that MrBayes reports for the log-likelihood in this case?'' {{title|yes, the log-likelihood is 0.0, which corresponds to a likelihood equal to 1.0|answer}}<br />
</div><br />
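To spell out the reasoning behind these two questions, write Bayes' formula for a parameter <math>\theta</math> and data <math>D</math>:<br />
<br />
<math>p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}</math><br />
<br />
If the likelihood <math>p(D \mid \theta)</math> equals 1 for every value of <math>\theta</math> (i.e. the log-likelihood is 0), the numerator is just <math>p(\theta)</math> and the denominator is <math>\int p(\theta)\, d\theta = 1</math>, so <math>p(\theta \mid D) = p(\theta)</math>. In fact any likelihood that is constant in <math>\theta</math> would cancel in the same way; 1 is simply the constant that results when there are no data.<br />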
<br />
=== Checking the shape parameter prior ===<br />
Import the output file <tt>algaemb.nex.p</tt> in Tracer. Look first at the histogram of alpha, the shape parameter of the gamma distribution.<br />
<br />
<div style="background-color: #ccccff"><br />
* ''What is the mean you expected for alpha based on the <tt>prset shapepr=exp(1.0)</tt> command in the <tt>algaemb.nex</tt> file?'' {{title|1.0|answer}}<br />
* ''What is the posterior mean actually estimated by MrBayes (and presented by Tracer)?'' {{title|0.9856|answer}}<br />
* ''An exponential distribution always starts high and approaches zero as you move to the right along the x-axis. The highest point of the exponential density function is 1/mean. If you look at the approximated density plot (click on the ''Marginal Density'' tab), does it appear to approach 1/mean at the value alpha=0.0?'' {{title|yes, but changing Display from KDE to Histogram may make it clearer|answer}}<br />
</div><br />
<br />
=== Checking the branch length prior ===<br />
Now look at the histogram of TL, the tree length. <br />
<br />
<div style="background-color: #ccccff"> <br />
* ''What is the posterior mean of TL, as reported by Tracer?'' {{title|1.2944|answer}}<br />
* ''What value did you expect based on the <tt>prset brlenspr=unconstrained:exp(10)</tt> command?'' {{title|prior mean edge length 0.1 multiplied by 13 edges equals 1.3|answer}}<br />
* ''Does the approximated posterior distribution of TL appear to be an exponential distribution?'' {{title|no, it has a mode to the right of 1 whereas an exponential distribution peaks at 0|answer}}<br />
</div><br />
<br />
The second and third questions are a bit tricky, so I'll just give you the explanation. Please make sure this explanation makes sense to you, however, and ask us to explain further if it doesn't make sense. We told MrBayes to place an exponential prior with mean 0.1 on each branch. There are 13 branches in an 8-taxon, unrooted tree. Thus, 13 times 0.1 equals 1.3, which should be close to the posterior mean you obtained for TL. That part is fairly straightforward. <br />
<br />
The marginal distribution of TL does not look at all like an exponential distribution, despite the fact that TL is the sum of 13 exponentially distributed branch lengths. It turns out that the sum of <math>n</math> independent Exponential(<math>\lambda</math>) random variables has a Gamma(<math>n</math>, <math>1/\lambda</math>) distribution, where <math>n</math> is the shape and <math>1/\lambda</math> is the scale. In our case the tree length is a sum of 13 independent Exponential(10) random variables, which has a Gamma(13, 0.1) distribution. Such a Gamma distribution has a mean of 1.3 and a peak (mode) at 1.2. If you want to visualize this, fire up R and type the following command:<br />
curve(dgamma(theta, shape=13, scale=0.1), from=0, to=2, xname="theta")<br />
<div style="background-color: #ccccff"> <br />
* ''How does the Gamma(13, 0.1) density compare to the distribution of TL as shown by Tracer? (Be sure to click the "Marginal Density" tab in Tracer)'' {{title|looks the same|answer}}<br />
</div><br />
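You can also convince yourself of the Gamma result by simulation. This R sketch draws 13 independent Exponential(rate 10) branch lengths many times, sums them, and overlays the Gamma(13, 0.1) density on the histogram of the sums; the number of replicates is arbitrary.<br />

```r
set.seed(1)
n_edges <- 13
rate <- 10
n_rep <- 20000   # arbitrary number of simulated "trees"

# Each row holds 13 independent exponential branch lengths; row sums are tree lengths
tl <- rowSums(matrix(rexp(n_rep * n_edges, rate = rate), ncol = n_edges))

hist(tl, breaks = 50, freq = FALSE, main = "Simulated tree lengths", xlab = "TL")
curve(dgamma(x, shape = n_edges, scale = 1 / rate), add = TRUE, lwd = 2)

gamma_mean <- n_edges / rate         # shape * scale
gamma_mode <- (n_edges - 1) / rate   # (shape - 1) * scale
print(round(c(simulated_mean = mean(tl), gamma_mean = gamma_mean, gamma_mode = gamma_mode), 3))
```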
<br />
=== Other output files produced by MrBayes ===<br />
That's it for the lab today. You can look at plots of the other parameters if you like. You should also spend some time opening the other output files MrBayes produces in a text editor to make sure you understand what information is saved in these files. Note that some of MrBayes' output files are actually Nexus tree files, which you can open in [http://tree.bio.ed.ac.uk/software/figtree/ FigTree]. For example, <tt>algaemb.nex.t</tt> contains the sampled trees; however, if there are many trees in <tt>algaemb.nex.t</tt>, be prepared for a long wait while FigTree loads the file.<br />
<br />
<br />
[[Category:Phylogenetics]]</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Ggtree&diff=38673Ggtree2018-03-07T22:03:26Z<p>Paul Lewis: /* Getting Started */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan<br />
<br />
== Goals ==<br />
<br />
To introduce you to the R package [http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract ggtree] for plotting phylogenetic trees.<br />
<br />
== Introduction ==<br />
<br />
== Getting Started ==<br />
<br />
Download the tree file [http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/moths.txt moths.txt] and save in a convenient place on your hard drive.<br />
<br />
====Installing Packages====<br />
<br />
Open a terminal, start <tt>R</tt>, and install the packages we will be using. We'll be using the packages:<br />
<br />
BiocInstaller<br />
ape<br />
Biostrings<br />
ggplot2<br />
ggtree<br />
phytools<br />
ggrepel<br />
stringr<br />
stringi<br />
abind<br />
treeio<br />
<br />
You can install a package like so:<br />
<br />
install.packages("BiocInstaller")<br />
<br />
Many of the above packages are part of the [https://bioconductor.org/packages/release/bioc/ Bioconductor project] (like ggtree and treeio). You can find extensive documentation on their website for packages associated with their project.<br />
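<br />
Bioconductor packages are not hosted on CRAN, so they are installed through Bioconductor's own installer rather than a plain <tt>install.packages</tt> call. On recent versions of R the installer is the <tt>BiocManager</tt> package, which replaced <tt>BiocInstaller</tt>; which one you need depends on your R version, so treat the following as a sketch and check the Bioconductor installation page if it fails.<br />

```r
# One-time setup; requires an internet connection.
# Assumes a recent R version where BiocManager is the Bioconductor installer.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Bioconductor packages from the list above
BiocManager::install(c("Biostrings", "ggtree", "treeio"))

# CRAN packages from the list above
install.packages(c("ape", "ggplot2", "phytools", "ggrepel",
                   "stringr", "stringi", "abind"))
```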
<br />
====Read in the Tree File====<br />
<br />
We're dealing with a tree in the Newick file format which the function <tt>read.newick</tt> from the package <tt>treeio</tt> can handle:<br />
<br />
library(treeio)<br />
tree <- read.newick("moths.txt")<br />
<br />
R can handle more than just Newick-formatted tree files. To see which other file formats, produced by various phylogenetic software, R can read, check out [https://bioconductor.org/packages/release/bioc/html/treeio.html <tt>treeio</tt>]. Note: the functionality within <tt>treeio</tt> used to be part of the <tt>ggtree</tt> package itself, but the authors recently split <tt>ggtree</tt> in two, with one part (<tt>ggtree</tt>) handling mostly plotting and the other part (<tt>treeio</tt>) handling mostly file input/output operations.<br />
<br />
Let's quickly plot the tree to see what it looks like using the regular old <tt>plot</tt> function from the <tt>graphics</tt> package:<br />
<br />
plot(tree)<br />
<br />
Notice the tree has all of its tips labeled. It's also a little cramped. You can expand the plot window to try to get the tree to display more legibly. We'll eventually use the function <tt>ggsave</tt> (from the <tt>ggplot2</tt> package) to control the dimensions of the plot when we finally export it to a PDF file. But until then, expand the plot window to get the tree to display reasonably well.<br />
<br />
Now plot the tree using the <tt>ggtree</tt> package:<br />
<br />
ggtree(tree)<br />
<br />
What happened to our tree!? The <tt>plot</tt> function from the <tt>graphics</tt> package simply, but stubbornly, plots your tree without much ability to alter aesthetics. <tt>ggtree</tt> by default plots almost nothing, assuming you will add what you want to your tree plot. You can add elements to the plot using <tt>geoms</tt>, just the same way that you would add elements to plots using the package <tt>ggplot2</tt>. The use of <tt>geoms</tt> makes plotting easily extensible, but it is by no means normal <tt>R</tt> syntax. To see the <tt>geoms</tt> available to <tt>ggtree</tt> check out its [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf reference manual on BioConductor].<br />
<br />
====Adding/Altering Tree Elements with Geoms====<br />
<br />
=====Tip Labels=====<br />
<br />
OK, this tree would be more useful with tip labels. Let's add them using <tt>geom_tiplab</tt>:<br />
<br />
ggtree(tree)+geom_tiplab()<br />
<br />
Those tip labels are nice but a little big. <tt>geom_tiplab</tt> has a bunch of arguments that you can play around with, including one for the text size. You can read more about the available arguments in [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf the <tt>ggtree</tt> manual]. Plot the tree again but with smaller labels:<br />
<br />
ggtree(tree)+geom_tiplab(size=3.5)<br />
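<br />
If you want to experiment with geoms without the moth tree, the following self-contained sketch builds a tiny made-up tree from a Newick string, adds tip labels, and saves the plot with <tt>ggsave</tt>. The tree, the file name <tt>tiny_tree.pdf</tt> and the page dimensions are arbitrary choices of mine, and the code assumes <tt>ape</tt>, <tt>ggtree</tt> and <tt>ggplot2</tt> are already installed.<br />

```r
library(ape)      # read.tree() parses Newick text
library(ggtree)   # ggtree(), geom_tiplab()
library(ggplot2)  # ggsave()

# Made-up four-taxon tree, used only so the example runs without moths.txt
tiny_tree <- read.tree(text = "((A:1,B:1):1,(C:1.5,D:0.5):0.5);")

# Start from the bare tree and add tip labels as a geom
p <- ggtree(tiny_tree) + geom_tiplab(size = 3.5)

# Width and height are in inches by default; these values are arbitrary
out_file <- "tiny_tree.pdf"
ggsave(out_file, plot = p, width = 6, height = 4)
```

Storing the plot in a variable (here <tt>p</tt>) makes it easy to keep adding geoms with <tt>+</tt> before saving.<br />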
<br />
=====Clade Labels=====<br />
<br />
=====Node Labels=====<br />
<br />
<br />
=====Clade Color=====<br />
<br />
<br />
=====Scale Bar=====<br />
<br />
====Export Plot to PDF====<br />
<br />
====Cite ggtree====<br />
<br />
citation("ggtree")<br />
<br />
== References ==<br />
<br />
Yu G, Smith D, Zhu H, Guan Y and Lam TT (2017). “ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.” Methods in Ecology and Evolution, 8, pp. 28-36. doi: 10.1111/2041-210X.12628, http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Ggtree&diff=38672Ggtree2018-03-07T22:02:02Z<p>Paul Lewis: /* Getting Started */</p>
<hr />
<div>{| border="0"<br />
|-<br />
|rowspan="2" valign="top"|[[Image:Adiantum.png|200px]]<br />
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span><br />
|-<br />
|<br />
|}<br />
by Kevin Keegan<br />
<br />
== Goals ==<br />
<br />
To introduce you to the R package [http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract ggtree] for plotting phylogenetic trees.<br />
<br />
== Introduction ==<br />
<br />
== Getting Started ==<br />
<br />
Download the tree file [http://hydrodictyon.eeb.uconn.edu/people/plewis/courses/phylogenetics/labs/moths.txt moths.txt] and save it in a convenient place on your hard drive.<br />
<br />
====Installing Packages====<br />
<br />
Open a terminal, start <tt>R</tt>, and install the packages we will be using:<br />
<br />
BiocInstaller<br />
ape<br />
Biostrings<br />
ggplot2<br />
ggtree<br />
phytools<br />
ggrepel<br />
stringr<br />
stringi<br />
abind<br />
treeio<br />
<br />
You can install a package like so:<br />
<br />
install.packages("BiocInstaller")<br />
<br />
Many of the above packages are part of the [https://bioconductor.org/packages/release/bioc/ Bioconductor project] (like ggtree and treeio). You can find extensive documentation on their website for packages associated with their project.<br />
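<br />
As a minimal sketch of one way to install everything, assuming a current <tt>R</tt> installation in which Bioconductor packages are installed through the <tt>BiocManager</tt> package (an assumption: <tt>BiocManager</tt> is not in the list above, and older versions of <tt>R</tt> used a different installer). The split between CRAN and Bioconductor packages below is also an assumption about the list: <tt>Biostrings</tt>, <tt>treeio</tt>, and <tt>ggtree</tt> are treated as Bioconductor packages and the rest as CRAN packages.<br />

```r
# CRAN packages from the list above
install.packages(c("ape", "ggplot2", "phytools", "ggrepel",
                   "stringr", "stringi", "abind"))

# BiocManager (assumed here) installs Bioconductor packages in current R versions
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install(c("Biostrings", "treeio", "ggtree"))

# Load the two packages used most in this lab
library(treeio)
library(ggtree)
```

If a package fails to install, the error message usually names the missing dependency, which can then be installed the same way.<br />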
<br />
====Read in the Tree File====<br />
<br />
We're dealing with a tree in the Newick file format, which the function <tt>read.newick</tt> from the package <tt>treeio</tt> can handle:<br />
<br />
tree <- read.newick("moths.txt")<br />
<br />
R can handle more than just Newick-formatted tree files. To see which other file formats from the various phylogenetic software packages can be read into R, check out [https://bioconductor.org/packages/release/bioc/html/treeio.html <tt>treeio</tt>]. Note: the functionality within <tt>treeio</tt> used to be part of the <tt>ggtree</tt> package itself, but the authors recently split <tt>ggtree</tt> in two, with one part (<tt>ggtree</tt>) handling mostly plotting and the other part (<tt>treeio</tt>) handling mostly file input/output operations.<br />
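<br />
To check that the reading step works independently of the lab's tree file, here is a small sketch that writes a made-up four-taxon Newick string to a temporary file and reads it back with <tt>read.newick</tt>. The tree, the taxon names, and the object name <tt>toy</tt> are all hypothetical and are not part of the moth data set.<br />

```r
library(treeio)

# A made-up Newick string standing in for a real tree file
newick <- "((A:1,B:1):1,(C:1.5,D:1.5):0.5);"

# Write it to a temporary file so read.newick can be called on a file path
tmp <- tempfile(fileext = ".txt")
writeLines(newick, tmp)

toy <- read.newick(tmp)

class(toy)       # the class of the tree object that was read in
toy$tip.label    # the tip names stored in the tree
```

If <tt>read.newick</tt> cannot find your own file, the usual cause is that <tt>R</tt>'s working directory is not the folder where the file was saved; <tt>getwd()</tt> shows the current working directory.<br />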
<br />
Let's quickly plot the tree to see what it looks like using the regular old <tt>plot</tt> function from the <tt>graphics</tt> package:<br />
<br />
plot(tree)<br />
<br />
Notice the tree has all of its tips labeled. It's also a little cramped. You can expand the plot window to try to get the tree to display more legibly. We'll eventually use the function <tt>ggsave</tt> from the package <tt>ggplot2</tt> to control the dimensions of the plot when we finally export it to a PDF file. But until then, expand the plot window to get the tree to display reasonably well.<br />
<br />
Now plot the tree using the <tt>ggtree</tt> package:<br />
<br />
ggtree(tree)<br />
<br />
What happened to our tree!? The <tt>plot</tt> function from the <tt>graphics</tt> package simply, but stubbornly, plots your tree without much ability to alter aesthetics. By default, <tt>ggtree</tt> plots almost nothing, assuming you will add what you want to your tree plot. You add elements to the plot using <tt>geoms</tt>, in the same way that you would add elements to plots using the package <tt>ggplot2</tt>. The use of <tt>geoms</tt> makes plotting easily extensible, but it is by no means normal <tt>R</tt> syntax. To see the <tt>geoms</tt> available to <tt>ggtree</tt>, check out its [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf reference manual on Bioconductor].<br />
<br />
====Adding/Altering Tree Elements with Geoms====<br />
<br />
=====Tip Labels=====<br />
<br />
OK, this tree would be more useful with tip labels. Let's add them using <tt>geom_tiplab</tt>:<br />
<br />
ggtree(tree)+geom_tiplab()<br />
<br />
Those tip labels are nice but a little big. <tt>geom_tiplab</tt> has a bunch of arguments that you can play around with, including one for the text size. You can read more about the available arguments in [https://www.bioconductor.org/packages/release/bioc/manuals/ggtree/man/ggtree.pdf the <tt>ggtree</tt> manual]. Plot the tree again, but with smaller labels:<br />
<br />
ggtree(tree)+geom_tiplab(size=3.5)<br />
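<br />
Because each geom is just another layer added with <tt>+</tt>, the same pattern extends to other arguments. The sketch below uses a made-up four-taxon tree (not the moth tree) so that it runs on its own; the italic font face and the <tt>xlim</tt> values are illustrative choices, with <tt>xlim</tt> from <tt>ggplot2</tt> used to leave room on the right so the labels are not cut off.<br />

```r
library(ape)
library(ggplot2)
library(ggtree)

# A made-up tree so the example does not depend on the lab's tree file
toy <- read.tree(text = "((A:1,B:1):1,(C:1.5,D:1.5):0.5);")

p <- ggtree(toy) +
    geom_tiplab(size = 3.5, fontface = "italic") +  # smaller, italic tip labels
    xlim(0, 2.5)                                    # extra space to the right of the tips

p  # printing the object draws the plot
```

Storing the plot in an object such as <tt>p</tt> makes it easy to keep adding layers one at a time and to see the effect of each.<br />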
<br />
=====Clade Labels=====<br />
<br />
=====Node Labels=====<br />
<br />
<br />
=====Clade Color=====<br />
<br />
<br />
=====Scale Bar=====<br />
<br />
====Export Plot to PDF====<br />
<br />
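A minimal sketch of exporting a tree plot with <tt>ggsave</tt> (a function in <tt>ggplot2</tt>), which lets you set the page dimensions explicitly. The toy tree, the output file name, and the 6 by 8 inch page size are assumptions chosen for illustration; for the moth tree you would likely want a taller page.<br />

```r
library(ape)
library(ggplot2)
library(ggtree)

# A made-up tree so the example does not depend on the lab's tree file
toy <- read.tree(text = "((A:1,B:1):1,(C:1.5,D:1.5):0.5);")
p <- ggtree(toy) + geom_tiplab(size = 3.5) + xlim(0, 2.5)

# Write the plot to a PDF in R's temporary directory; change the path as needed
out <- file.path(tempdir(), "toy_tree.pdf")
ggsave(out, plot = p, width = 6, height = 8, units = "in")
```

Increasing <tt>height</tt> is usually the simplest way to keep tip labels from overlapping in a tree with many tips.<br />
<br />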
====Cite ggtree====<br />
<br />
citation("ggtree")<br />
<br />
== References ==<br />
<br />
Yu G, Smith D, Zhu H, Guan Y and Lam TT (2017). “ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data.” Methods in Ecology and Evolution, 8, pp. 28-36. doi: 10.1111/2041-210X.12628, http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12628/abstract.</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38652Josh Justison Visit to Storrs2018-03-07T14:42:40Z<p>Paul Lewis: /* Thursday, March 8, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:30 || Diler Haji || TLS 164<br />
|-<br />
| 10:00 || David Wagner || Skype call<br />
|-<br />
| 10:30 || Kevin Keegan || TLS 461<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting || TLS 164<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 ||Pam Diggle || BioPharm 500A<br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
| 6:30 || Dinner || Willibrew<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || Mark Urban || BioPharm 200A<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38651Josh Justison Visit to Storrs2018-03-07T01:02:03Z<p>Paul Lewis: /* Thursday, March 8, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:30 || Diler Haji || TLS 164<br />
|-<br />
| 10:00 || David Wagner || Skype call<br />
|-<br />
| 10:30 || Kevin Keegan || TLS 461<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting || TLS 164<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 ||Pam Diggle || BioPharm 500A<br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner || Willibrew<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || Mark Urban || BioPharm 200A<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38650Josh Justison Visit to Storrs2018-03-07T00:56:46Z<p>Paul Lewis: /* Wednesday, March 7, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:30 || Diler Haji || Simon Lab<br />
|-<br />
| 10:00 || David Wagner || Skype call<br />
|-<br />
| 10:30 || Kevin Keegan || TLS 461<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting || TLS 164<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 ||Pam Diggle || BioPharm 500A<br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner || Willibrew<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || Mark Urban || BioPharm 200A<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38649Josh Justison Visit to Storrs2018-03-07T00:56:29Z<p>Paul Lewis: /* Friday, March 9, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Wednesday, March 7, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| || snow storm makes Wednesday appointments questionable ||<br />
|-<br />
|5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
|}<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:30 || Diler Haji || Simon Lab<br />
|-<br />
| 10:00 || David Wagner || Skype call<br />
|-<br />
| 10:30 || Kevin Keegan || TLS 461<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting || TLS 164<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 ||Pam Diggle || BioPharm 500A<br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner || Willibrew<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || Mark Urban || BioPharm 200A<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38648Josh Justison Visit to Storrs2018-03-07T00:56:20Z<p>Paul Lewis: /* Friday, March 9, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Wednesday, March 7, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| || snow storm makes Wednesday appointments questionable ||<br />
|-<br />
|5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
|}<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:30 || Diler Haji || Simon Lab<br />
|-<br />
| 10:00 || David Wagner || Skype call<br />
|-<br />
| 10:30 || Kevin Keegan || TLS 461<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting || TLS 164<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 ||Pam Diggle || BioPharm 500A<br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner || Willibrew<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || Mark Urban || BioPharm 200A<br />
|-<br />
|4:30 || ||<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38647Josh Justison Visit to Storrs2018-03-07T00:56:10Z<p>Paul Lewis: /* Thursday, March 8, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Wednesday, March 7, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| || snow storm makes Wednesday appointments questionable ||<br />
|-<br />
|5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
|}<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:30 || Diler Haji || Simon Lab<br />
|-<br />
| 10:00 || David Wagner || Skype call<br />
|-<br />
| 10:30 || Kevin Keegan || TLS 461<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting || TLS 164<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 ||Pam Diggle || BioPharm 500A<br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner || Willibrew<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || Mark Urban || BioPharm 200A<br />
|-<br />
|4:30 || ||<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38646Josh Justison Visit to Storrs2018-03-07T00:55:56Z<p>Paul Lewis: /* Thursday, March 8, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Wednesday, March 7, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| || snow storm makes Wednesday appointments questionable ||<br />
|-<br />
|5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
|}<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || ||<br />
|-<br />
| 9:30 || Diler Haji || Simon Lab<br />
|-<br />
| 10:00 || David Wagner || Skype call<br />
|-<br />
| 10:30 || Kevin Keegan || TLS 461<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting || TLS 164<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 ||Pam Diggle || BioPharm 500A<br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner || Willibrew<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || Mark Urban || BioPharm 200A<br />
|-<br />
|4:30 || ||<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38644Josh Justison Visit to Storrs2018-03-06T21:59:00Z<p>Paul Lewis: /* Friday, March 9, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Wednesday, March 7, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| || snow storm makes Wednesday appointments questionable ||<br />
|-<br />
|5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
|}<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || ||<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || ||<br />
|-<br />
| 10:30 || Kevin Keegan || TLS 461<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting || TLS 164<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 ||Pam Diggle || BioPharm 500A<br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner at Willibrew || let Paul Lewis know if you are interested in joining us<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || Mark Urban || BioPharm 200A<br />
|-<br />
|4:30 || ||<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38642Josh Justison Visit to Storrs2018-03-06T18:54:26Z<p>Paul Lewis: /* Thursday, March 8, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Wednesday, March 7, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| || snow storm makes Wednesday appointments questionable ||<br />
|-<br />
|5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
|}<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || ||<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || ||<br />
|-<br />
| 10:30 || Kevin Keegan || TLS 461<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting || TLS 164<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 || || <br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner at Willibrew || let Paul Lewis know if you are interested in joining us<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || || <br />
|-<br />
|4:30 || ||<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38640Josh Justison Visit to Storrs2018-03-06T15:04:49Z<p>Paul Lewis: /* Thursday, March 8, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Wednesday, March 7, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| || snow storm makes Wednesday appointments questionable ||<br />
|-<br />
|5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
|}<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || ||<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || ||<br />
|-<br />
| 10:30 || ||<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting ||<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || Jack Phillips || TLS 365<br />
|-<br />
|3:00 || || <br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner at Willibrew || let Paul Lewis know if you are interested in joining us<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || Tim Moore || TLS 363<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || || <br />
|-<br />
|4:30 || ||<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38638Josh Justison Visit to Storrs2018-03-06T14:20:34Z<p>Paul Lewis: /* Wednesday, March 7, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Wednesday, March 7, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| || snow storm makes Wednesday appointments questionable ||<br />
|-<br />
|5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
|}<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || ||<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || ||<br />
|-<br />
| 10:30 || ||<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting ||<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || ||<br />
|-<br />
|3:00 || || <br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner at Willibrew || let Paul Lewis know if you are interested in joining us<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || ||<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || || <br />
|-<br />
|4:30 || ||<br />
|-<br />
|}</div>Paul Lewishttp://hydrodictyon.eeb.uconn.edu/eebedia/index.php?title=Josh_Justison_Visit_to_Storrs&diff=38637Josh Justison Visit to Storrs2018-03-06T14:20:20Z<p>Paul Lewis: /* Wednesday, March 7, 2018 */</p>
<hr />
<div><br />
== '''Josh Justison''' ==<br />
<br />
Josh is taking time out from his final semester at the University of Minnesota, where he is majoring in Biology and minoring in Computer Science, to visit UConn as a prospective graduate student.<br />
<br />
==Wednesday, March 7, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
|  || snow storm makes Wednesday appointments questionable ||<br />
|-<br />
|5:00 || Jill Wegrzyn || ESB 306C<br />
|-<br />
|}<br />
<br />
==Thursday, March 8, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || ||<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || ||<br />
|-<br />
| 10:30 || ||<br />
|-<br />
|11:00 || Phylogenetics lecture || TLS 181<br />
|-<br />
|1:00 || Lewis Lab meeting ||<br />
|-<br />
|2:00 || Meet with Katie || TLS 479<br />
|-<br />
|2:30 || ||<br />
|-<br />
|3:00 || || <br />
|-<br />
| 4:00 || Teale || Dodd Center<br />
|-<br />
| 6:30 || Dinner at Willibrew || let Paul Lewis know if you are interested in joining us<br />
|}<br />
<br />
==Friday, March 9, 2018 ==<br />
{|border=1 cellpadding=8<br />
| '''Time''' || '''Name''' || '''Location'''<br />
|-<br />
| 9:00 || Collections Coffee || Collections Library<br />
|-<br />
| 9:30 || ||<br />
|-<br />
| 10:00 || ||<br />
|-<br />
| 10:30 || Kristen Nolting || BioPharm 302<br />
|-<br />
|11:00 || Systematics Seminar || TLS 171b (Bamford)<br />
|-<br />
|12:00 || Plant Lunch || BioPharm 303 <br />
|-<br />
|1:30 || Phylogenetics Lab || TLS 181<br />
|-<br />
|4:00 || || <br />
|-<br />
|4:30 || ||<br />
|-<br />
|}</div>Paul Lewis