EEBedia - User contributions [en]

Classic Works in Evolutionary Biology S2014

2022-01-23T15:06:59Z

Paul Lewis:

== This web page is obsolete ==

Please see [https://uconneeb.github.io/classicworks/ the new Classic Works course web page] for the latest version of this course.

==Course Information==
'''Instructors''': Kurt Schwenk and Elizabeth Jockusch<br/>
'''Meeting time''': Tuesdays, 4-5 pm<br/>
'''Meeting place''': Bamford Room<br/>
<br/>

==Resources==
'''Classic Works in Evolutionary Biology''' [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Classic_Works_in_Evolutionary_Biology—The_List_With_Links '''List''']<br/>
Ned Friedman's page on early Evolutionists [http://spot.colorado.edu/~friedmaw/Early_Evolution/Homepage.html link]<br/>
course pdf upload [http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/upload.html link]
<br/>
Class email list {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/EEB6485_ClassEmailList.pdf}}

==Discussion Schedule==
{| border="1" cellpadding="2"
!style="background:#00CC66;" width="60"|Date
!style="background:#00CC99;"|Discussion Leader(s)
!style="background:#00CCCC;"|Readings
!style="background:#007AA5;"|Additional Resources
|-
|Jan 21|| || <br>'''SNOW DAY'''<br><br> ||
|-
|Jan. 28||Kurt and Elizabeth||Excerpts from William Paley's "Natural Theology"<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/PaleyNaturalTheology1824%20EXCERPT.pdf}} Design Argument<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Paley2.pdf}} Paley on Selection<br>
Excerpts from William Whewell's 1845 "Indications of the Creator" <br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/WhewellIndicationsOfTheCreator1845EXCERPT.pdf}} Final Causes<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Whewell_Transmutation.pdf}} Transmutation of Species
||
|-
|Feb. 4||Bill ||Lamarck's Zoological Philosophy (1809)<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Lamarck_1809_Ch7.pdf}} Ch. 7-Influence of the Environment<br>
Excerpts from Owen on Homology and Types<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Owen1846.pdf}} 1846 {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Owen1848.pdf}} 1848 {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Owen1849.pdf}} 1849
||
|-
|Feb. 11||Veronica||Darwin's and Wallace's 1858 Linnean Society Texts <br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Darwin_Wallace_1858.pdf}} Text with Ghiselin commentary <br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/DarwinWallace_ProcLinSoc1858.pdf}} Original
|| [http://benfry.com/traces/ Visualization: The Preservation of Favored Traces] <br> -- a graphic representation of the changes to ''On the Origin of Species'' through its six editions. <br>
Owen's (anonymous) review of the Origin [http://www.victorianweb.org/science/science_texts/owen_review_of_origin.html link]<br>
Wallace Appreciations
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Evolution%E2%80%99s%20red-hot%20radical.pdf}} Red-hot radical {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Knapp_WallaceBiogeographyFounder_Science2013.pdf}} Biogeographer<br>
Wallace-Darwin similarities: more than a coincidence?
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Davies_DarwinWallaceNonidenpendence_BiolJLinnSoc2013.pdf}} Davies {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Smith_Wallace-DarwinMail_BiolJLinnSoc2013.pdf}} Smith
|-
|Feb. 18|| || <br>'''SNOW DAY'''<br><br>||
|-
|Feb. 25||Cera ||Heredity and Variation
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Weismann_1893_GermPlasm_Introduction.pdf}} Weismann (1893) Introduction to the Germ-Plasm--Read part B, Descriptive part (p. 20-end)
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Bateson_1899_Crossbreeding.pdf}} Bateson (1899)
Also worth a look, to see how rapidly genetics progressed
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Morgan_1915_HereditaryMaterial.pdf}} Morgan (1915)
||Classic Literature in Genetics [http://www.esp.org/foundations/genetics/classical/browse/date.html link]<br>
Morgan, T.H. (1916) ''A Critique of the Theory of Evolution''—Morgan reviews evolution and theories of heredity to that time: [http://books.google.com/books?id=N3RIAAAAMAAJ&printsec=frontcover&dq=morgan+a+critique+of+the+theory+of+evolution&hl=en&sa=X&ei=yB4NU6fXCYfV0QGl9oHgAg&ved=0CCsQ6AEwAA#v=onepage&q=morgan%20a%20critique%20of%20the%20theory%20of%20evolution&f=false Morgan (1916) link]<br>
{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Grant1900_HeredityReviewForPublic.pdf}} Grant (1900)-Overview of the heredity debate for general public
|-
|March 4|| Brigette || Population Genetic Groundwork for the Modern Synthesis
Focus of Discussion
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Wright_1932_GeneticCongress.pdf}} Wright (1932)-includes adaptive landscape metaphor
Also have a look at the beginning and end of these (but skip the math in between)
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Haldane_1924_SelectionPart1.pdf}} Haldane (1924)-mathematical theory of selection, pp. 19, 37 (end)-39
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Fisher_1918_MendelianCorrBetweenRelsANOVA.pdf}} Fisher (1918)-Mendelian inheritance, pp. 399-401, 432 (end)-433
||
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Wright_20thCenturyGeneticsReviewMayrCritique_AJHG1960.pdf}} Wright's (1960) perspective on the history of genetics
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Wright_Obituary_Crow1988.pdf}} James Crow's obituary of Wright
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Fisher_Retrospective1963.pdf}} Fisher Retrospective
|-
|March 11|| Suman || Excerpt from Dobzhansky (1937) Genetics and the Origin of Species
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Dobzhansky_1937_Ch8_IsolatingMechanisms.pdf}} Chapter 8-Isolating Mechanisms
Also have a look at Dobzhansky's work on isolating mechanisms in Drosophila
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Dobzhansky_HybridSterilityII_Genetics1936.pdf}} Dobzhansky (1936); read "The Problem" and "Discussion"
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Dobzhansky_GeneticNatureSpeciesDiffs_AmNat1937.pdf}} Dobzhansky (1937); read "Introduction" and "Summary"
||[http://www.stephenjaygould.org/library/modern-science/chapter05.html Bateson (1909)] includes statement of Dobzhansky-Muller incompatibility model of speciation
{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Muller_IsolatingMechanisms_BiolSymp1942.pdf}} H. J. Muller (1942) on the genetics of isolating mechanisms<br>
{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Dobzhansky1937Review_Emerson1938.pdf}} Emerson's (1938) review of Genetics and the Origin of Species<br>
{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/DobzhanskyMemoir_Ayala1985.pdf}} Ayala's (1985) Dobzhansky Memoir<br>
{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Orr_Perspective_DobzhanskyMullerSpeciation_Genetics1996.pdf}} Orr (1996) reflections on Dobzhansky and the genetics of speciation<br>
|-
|March 18|| || SPRING BREAK ||
|-
|March 25|| Velotta, Michael||<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/SimpsonMajorFeatures1953WithNoteOPT.pdf}} Excerpt from George Gaylord Simpson's (1953) ''Major Features of Evolution''
|| {{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Simpson_FRSMemoir1986.pdf}} Whittington's (1986) Simpson memoir<br>
|-
|April 1|| Katie, Sara ||Speciation and hybridization
Zoological perspective--Mayr on genetic revolutions; also includes his historical perspective and a bit on hybridization<br>
'''if short on time''', skip pp. 535-546
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Mayr1963_GeneticRevolutions.pdf}} Excerpts from Mayr (1963) Animal Species and Evolution
Botanical perspective--focused on hybridization
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Stebbins1950Excerpts.pdf}} Excerpts from Stebbins (1950) Variation and Evolution in Plants
|| Some biographical information on Stebbins:<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/StebbinsBioNAS04.pdf}} By the National Academy of Sciences<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/StebbinBiobyPHRaven00.pdf}} By Peter Raven<br>
<br>
Although not 'officially' accepted into the Synthesis fold (i.e., approved by Ernst Mayr), another important botanist contributor is Verne Grant, member of the National Academy of Sciences, who wrote a significant, early book called, ''The Origin of Adaptations'' (Columbia Univ. Press, 1963). You can still buy used copies inexpensively.
|-
|April 8|| Ellen, Bill || Evo-Devo: outside the synthesis:<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Goldschmidt1940EXCERPTSb.pdf}} Richard Goldschmidt's (1940) ''Material Basis of Evolution'' ('hopeful monsters')<br><br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Waddington1957EXCERPTS.pdf}} '''Waddington (1957) Excerpts'''<br>
||'''You can get a free (probably illegal) full-text copy of Goldschmidt's book''' [http://www.evolocus.com/Textbooks/Goldschmidt1960.pdf '''HERE''']<br>

'''Some interesting papers about Goldschmidt and his work:'''<br>

:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/GoldschmidtMemoir_Stern1967.pdf}} Goldschmidt Biographical Memoir (Stern 1967)
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/GoldschmidtIntegrEvoDevoGenet00.pdf}} Goldschmidt and Evo-Devo (2000)<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/GoldschmidtOverview03.pdf}} Overview of Goldschmidt and his work (2003)<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/GoldschmidtUseOfMetaphors08.pdf}} Goldschmidt's use of metaphors (2008)<br>

'''Waddington was an experimental biologist as well as a theorist. Here are a few of his papers, all pertinent to what is contained in his book:'''<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/oWaddingtonBaldwinEffectGenAssim1953.pdf}} oWaddingtonBaldwinEffectGenAssim1953.pdf<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/oWaddingtonCanalGenAssimAcquiredChar1959.pdf}} oWaddingtonCanalGenAssimAcquiredChar1959.pdf<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/oWaddingtonGenAssimBithorax1956.pdf}} oWaddingtonGenAssimBithorax1956.pdf<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/oWaddingtonGenetAssimAcquiredChar53.pdf}} oWaddingtonGenetAssimAcquiredChar53.pdf<br>

'''Here are some papers ABOUT Waddington and his work:'''<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Waddington%27sLegacyHall92.pdf}} Waddington'sLegacyHall92.pdf<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/WaddingtonOverviewGilbert00.pdf}} WaddingtonOverviewGilbert00.pdf<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/WaddingtonBioOverview02.pdf}} WaddingtonBioOverview02.pdf<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/WaddingtonCanalizationRevisited02.pdf}} WaddingtonCanalizationRevisited02.pdf<br>
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/GeneticAssimBaldwinEffectPlasticity07.pdf}} GeneticAssimBaldwinEffectPlasticity07.pdf<br>
|-
|April 15|| Johana, Manette ||
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/GouldLewontinSpandrels79.pdf}} '''Gould and Lewontin (1979), Spandrels''' <br>
[Whatever you think of this paper, there is no denying that it was absolutely seminal; probably the most critiqued paper ever published in terms of its cross-disciplinary appeal - including an entire edited volume analyzing it from a rhetorical perspective! It marks the beginning of the modern era of constraint theory. Also, if you've wondered about the reference to the "Panglossian paradigm", this refers to Dr. Pangloss, a character in Voltaire's book 'Candide' (1759) who, even in the worst of circumstances, continues to explain why everything is just as it must be and that this is the most perfect of all worlds. It's a hilarious story and biting social commentary, with a great deal of relevance to biology and especially academics, generally (e.g., "what a great genius this Pococurante must be! Nothing can please him" and "but still, there must certainly be a pleasure in criticising everything, and in seeing faults where others think they see beauties." And for grad students regarding their advisors: "...but when I realized that he had doubts about everything, I figured I knew as much as himself, and had no need of a guide to learn ignorance." Finally, who can beat, "I have grown old in misery and disgrace, living with only one buttock..."?). But I digress... (KS)]<br/>
||
'''One of many critiques published about G & L – an indication of its seminal influence'''
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/QuellerSpanielsSt.Marx95.pdf}} Queller, D. C. 1995.
'''Example of extreme adaptationist thinking'''
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/CainPerfectionOfAnimals89.pdf}} Cain (1964)<br/>
|-
|April 22|| Jimmy, Jessie ||
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/VanValen_RedQueen_EvolTheory1973.pdf}} '''Van Valen, Red Queen'''
||'''Comment on the Red Queen'''
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/MaynardSmith_VanValenComment_AmNat1976.pdf}} Maynard Smith (1976)
Think we make you read too much? You could take Van Valen's class instead!
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/VanValen_EvolutionaryProcessesSyllabus.pdf}} Van Valen's Reading List

|-
|April 29||Nora, Tim ||'''Classic Demonstrations of Natural Selection in the Wild'''
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Endler_GuppyNaturalSelection_Evolution1980.pdf}} Endler (1980) on Guppies
:{{pdf|http://hydrodictyon.eeb.uconn.edu/courses/EEB6485/restricted/Grant_DarwinFinchMicroevolution_Science2002.pdf}} Grant and Grant (2002) on Darwin's Finches

||
|-
|}

[[Category:EEB Seminars]]

Phylogenetics: Xanadu Cluster

2022-01-13T16:37:48Z

Paul Lewis: /* Starting a PAUP* analysis */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to show you how to log into the [https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/xanadu/ Xanadu] computer cluster and perform a basic PAUP* analysis.
|}

= Using the UConn Xanadu cluster =
The UConn [http://bioinformatics.uconn.edu Computational Biology Core] is part of the [http://cgi.uconn.edu/ Center for Genome Innovation (CGI)]. We will use the Xanadu computing cluster located at the UConn Health Center for most of the data crunching we will do in this course. By now, you should have an account on the cluster, and today you will learn how to start analyses remotely (i.e. from your laptop), check on their status, and download the results when your analysis is finished.

== Obtaining the necessary communications software ==
You will be using a couple of simple (and free) programs to communicate with the head node of the cluster.

=== If you use Windows...please scroll down to the Windows section===
=== If you use MacOS 10.x... ===
==== SSH ====
The program '''ssh''' will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use ssh to send commands to the cluster and see the output generated.

Start by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive. Using the Terminal program, you can connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where username should be replaced by your username on the cluster.

You may wish to install [http://www.iterm2.com/ iTerm2], which is a terminal program that makes some things easier than Terminal, but the built-in Terminal will work just fine.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Directories with names beginning with a period are not usually shown, but you can open this directory in Finder by typing
cd
open .ssh
in your terminal (the initial cd command changes the directory to your default, a.k.a. home, directory).

If you do not see a file named "config" in your ".ssh" directory, create an empty config file there using the command
touch .ssh/config
Open the config file in a text editor such as [https://www.barebones.com BBEdit] (NOT a word processor such as Microsoft Word or Pages!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/Users/plewis'' with your home directory path on your Mac, which you can get by typing <tt>pwd</tt> at the command line):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
Once you save the config file, you should be able to just type
ssh xanadu
to login to the xanadu cluster. (The second entry (xfer) will be used when transferring files to and from the cluster.)
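
If you would rather stay in the terminal, the config file can also be written without a graphical editor. The following is only a minimal sketch, not a required step: it writes a trimmed-down ''xanadu'' entry into a scratch directory (the <tt>scratch</tt> and <tt>sshdir</tt> names are placeholders made up for this example) so that nothing in your real ''.ssh'' directory is touched. The <tt>chmod</tt> lines reflect the usual OpenSSH expectation that this directory and file are not writable by other users.

```shell
# Work in a throwaway directory; set sshdir="$HOME/.ssh" to do this for real.
scratch="$(mktemp -d)"
sshdir="$scratch/.ssh"

mkdir -p "$sshdir"
chmod 700 "$sshdir"              # only you may enter the directory

# The quoted 'EOF' keeps the shell from expanding anything inside the block.
# Replace plewis with your own cluster username.
# Note: > overwrites an existing file; use >> to append to a config you already have.
cat > "$sshdir/config" <<'EOF'
Host xanadu
    HostName xanadu-submit-ext.cam.uchc.edu
    User plewis
EOF
chmod 600 "$sshdir/config"       # only you may read or write the file

cat "$sshdir/config"
```

The indentation under <tt>Host</tt> is optional and is there only to make each entry easier to read.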

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). I will show you how to transfer files using both methods, but for now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, but you might find that the command line clients let you get your work done faster once you get used to using them.

=== If you use Windows... ===
==== SSH ====
The program '''Git for Windows''' provides a terminal that will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). We will not actually use the "git" part of "Git for Windows", although it is there in case you need it later. Instead, you will use the Git for Windows bash shell to send commands to the cluster and see the output generated.

Visit the [https://gitforwindows.org Git for Windows] web site, press the Download button, and install Git for Windows on your system. Once installed, open Git BASH from the All Programs section of the start menu. This will open a terminal running the bash shell (a shell is a program that interprets operating system control commands) on your desktop.

Connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where ''username'' should be replaced by your username on the cluster.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Open (or create, if it does not yet exist) the file named ''config'' in a text editor such as [https://notepad-plus-plus.org NotePad++] (NOT a word processor such as Microsoft Word!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/c/Users/Paul Lewis'' with your home directory path):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
Use the <tt>pwd</tt> command to find out what your home directory path is, and use double quotes if your home directory path contains embedded spaces (note that I had to use quotes for mine).

Once you save the config file, you should be able to just type
ssh xanadu
to login to the xanadu cluster. (The second entry (xfer) will be used for transferring files to and from Xanadu.)

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). For now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, which makes moving files back and forth easy.

=== Learning enough UNIX to get around ===
I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.

==== ls command: finding out what is in the present working directory ====
The '''ls''' command lists the files in the ''present working directory''. Try typing just
ls
If you need more details about files than you see here, type
ls -la
instead. This version provides information about file permissions, ownership, size, and last modification date.
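
The <tt>a</tt> in <tt>-la</tt> also reveals files whose names begin with a period, which plain <tt>ls</tt> hides. The sketch below shows the difference in a scratch directory; the directory and the file names <tt>notes.txt</tt> and <tt>.hidden</tt> are invented for illustration.

```shell
# Create a scratch directory holding one ordinary file and one "dot" file.
demo="$(mktemp -d)"
cd "$demo"
touch notes.txt .hidden

plain="$(ls)"        # ordinary listing: dot files are left out
all="$(ls -a)"       # -a includes names that begin with a period

echo "ls    : $plain"
echo "ls -a : $(echo $all)"   # unquoted on purpose, to print the names on one line

# -l adds permissions, owner, size, and modification date for every entry.
ls -la
```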

==== pwd command: finding out what directory you are in ====
Typing
pwd
shows you the full path of the present working directory. The path shown should end with your username, indicating that you are currently in your home directory.

==== mkdir command: creating a new directory ====
Typing the following command will create a new directory named <tt>pauprun</tt> in your home directory:
mkdir pauprun
Use the <tt>ls</tt> command now to make sure a directory of that name was indeed created.

==== cd command: leaving the nest and returning home again ====
The cd command lets you change the present working directory. To move into the newly-created <tt>pauprun</tt> directory, type
cd pauprun
You can always go back to your home directory (no matter how lost you get!) by typing just cd by itself
cd
If you want to go up one directory level (say, from pauprun back up to your home directory), you can specify the parent directory using two dots:
cd ..
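
To see how <tt>pwd</tt>, <tt>mkdir</tt>, and <tt>cd</tt> fit together, here is a short round trip you can try. It is a sketch that runs in a scratch directory (standing in for your home directory) so that it cannot disturb the <tt>pauprun</tt> directory you just made; the variable names are placeholders.

```shell
# A scratch directory plays the role of your home directory.
top="$(mktemp -d)"
cd "$top"
top="$(pwd)"         # record the path exactly as pwd reports it

mkdir pauprun        # create the subdirectory
cd pauprun           # move into it
inside="$(pwd)"

cd ..                # two dots: move to the parent directory
back="$(pwd)"

echo "started in : $top"
echo "moved into : $inside"
echo "returned to: $back"
```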

==== Creating run.nex using the nano editor ====

One way to create a new file, or edit one that already exists, is to use the nano editor. You will now use nano to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.

First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type
nano run.nex
This will open the nano editor, and it should say <nowiki>[ New File ]</nowiki> at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit nano.

For now, type the following into the editor:
#nexus

begin paup;
log file=algae.output.txt start replace flush;
execute algae.nex;
set criterion=likelihood autoclose;
lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
hsearch swap=none start=stepwise addseq=random nrep=1;
lscores 1;
lset basefreq=previous tratio=previous shape=previous;
hsearch swap=tbr start=1;
savetrees file=algae.ml.tre brlens;
log stop;
quit;
end;
Once you have entered everything, use ^X to exit. Nano will ask if you want to save the modified '''buffer''' (a ''buffer'' is a predefined amount of computer memory used to store the text you type; the text stored in the buffer will be lost once you exit the program unless you save it to a file on the hard drive), at which point you should press the Y key to answer yes. Nano will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Nano should now exit and you can use cat to look at the contents of the file you just created:
cat run.nex
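
If typing into nano proves error-prone, the same file can be produced non-interactively with a shell "here document". This is offered as an alternative sketch, not a replacement for learning nano. It writes into a scratch directory (an assumption made so the example cannot overwrite your real run.nex); on the cluster you would run the <tt>cat</tt> command from inside <tt>pauprun</tt> instead.

```shell
demo="$(mktemp -d)"
cd "$demo"

# Everything between the two EOF markers is written to run.nex verbatim.
cat > run.nex <<'EOF'
#nexus

begin paup;
log file=algae.output.txt start replace flush;
execute algae.nex;
set criterion=likelihood autoclose;
lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
hsearch swap=none start=stepwise addseq=random nrep=1;
lscores 1;
lset basefreq=previous tratio=previous shape=previous;
hsearch swap=tbr start=1;
savetrees file=algae.ml.tre brlens;
log stop;
quit;
end;
EOF

# Confirm what was written.
cat run.nex
```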

==== Create the gopaup file ====
Now use nano to create a second file named <tt>gopaup</tt> in your <tt>pauprun</tt> directory. To do this, type <tt>pwd</tt> to make sure you are in the <tt>pauprun</tt> directory, then type <tt>nano gopaup</tt>. This file should contain this text:
#!/bin/bash
#SBATCH --partition=mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex
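
Before moving on, it can be worth checking the script for typos. The sketch below recreates gopaup in a scratch directory (an assumption, so that your real file is left alone) and then applies two quick checks: <tt>bash -n</tt>, which parses a script without running any of its commands, and <tt>grep -c</tt>, which counts the #SBATCH lines. On the cluster you could run just those two checks on the file you made with nano.

```shell
demo="$(mktemp -d)"
cd "$demo"

cat > gopaup <<'EOF'
#!/bin/bash
#SBATCH --partition=mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex
EOF

# bash -n reports syntax errors only; nothing in the script is executed.
if bash -n gopaup; then
    syntax="ok"
else
    syntax="error"
fi

# Count the lines that begin with #SBATCH; this script should have three.
directives="$(grep -c '^#SBATCH' gopaup)"
echo "syntax: $syntax, #SBATCH lines: $directives"
```

A clean syntax check does not prove that the partition or module names are valid on the cluster; it only catches shell-level mistakes such as an unclosed quote.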

=== Using Cyberduck to upload the algae.nex data file ===
[[File:Cyberduck-bookmark-xanadu.png|right]]
Download the file <tt>algae.nex</tt> from [http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex here] and save it on your hard drive.

Open Cyberduck, choose Bookmark > New Bookmark from the main menu, then fill out the resulting dialog box as shown on the right (except substitute your own user name, of course). Be sure to change the protocol to SFTP (not the default FTP). Click the button to close the dialog box and you should see your bookmark appear at the bottom of the main window. Double click the bookmark to open a connection. You will then be warned that the host key is unknown - choose Allow (and go ahead and check the Always button so you do not need to do this every time).

Once you are in, you will see a listing of the files in your home directory on the cluster. To copy the <tt>algae.nex</tt> file to the cluster, you need only drag-and-drop it onto the Cyberduck window.

=== Using scp to upload the algae.nex data file ===

While you will probably want to do your file transfers with Cyberduck as described above, it is also possible to transfer files using the command line scp client. Read on if you are interested in this option, but feel free to skip this section if you are happy using Cyberduck.

The following should be carried out in a terminal that '''is not''' already logged into the cluster.

In your terminal, navigate to where you saved the file (on your local Mac or Windows computer). If you saved it on the desktop, you can go there by typing <tt>cd Desktop</tt>.

If you've made a shortcut in your ''.ssh/config'' file, you can use the following command to upload the ''algae.nex'' file to the cluster:
scp algae.nex xfer:

If you have not made a shortcut, use this command instead:
scp algae.nex username@transfer.cam.uchc.edu:
where <tt>username</tt> should be replaced by your own user name on the cluster. (Don't overlook the colon on the very end of the line!)

=== Using curl to download the algae.nex data file directly ===

One of my favorite methods to transfer files that are stored on a web site involves the program curl (the name stands for "client URL"). The following command should be carried out in a terminal that '''is''' logged into the cluster.

In your terminal, navigate to the directory where you want to save the file (on the cluster). Use this command to download the algae.nex file directly into the present working directory on the cluster:
curl -O http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex
The <tt>-O</tt> tells curl to save it under the same file name (algae.nex) that it has on the remote server. If you forget the <tt>-O</tt>, curl will just spit out the entire contents of the file to your terminal, which is almost never what you want!

=== A few more UNIX commands ===

You have now transferred a data file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file is currently in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains a line containing the command <tt>execute algae.nex</tt>, which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:
cd $HOME
ls algae.*
Note the use of a wildcard character (*) in the ls command. This will show you only files whose names begin with the letters <tt>algae</tt> followed by a period and then any sequence of characters. The <tt>$HOME</tt> is a predefined shell variable that will be replaced with your home directory. It is not necessary in this case - typing <tt>cd</tt> all by itself would take you to your home directory - but the <tt>$HOME</tt> variable is good to know about (especially for use in scripts).
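
Here is a small sketch of the wildcard at work. It runs in a scratch directory with made-up, empty files; the name <tt>algaebloom.txt</tt> is invented to show a near miss that the pattern <tt>algae.*</tt> does not match, because it has no period directly after <tt>algae</tt>.

```shell
demo="$(mktemp -d)"
cd "$demo"

# Four empty files: two match algae.*, two do not.
touch algae.nex algae.output.txt algaebloom.txt run.nex

# The shell expands algae.* to the matching names before ls ever runs.
matches="$(ls algae.*)"
echo "$matches"
```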

==== mv command: moving or renaming a file ====
Now use the mv command to move algae.nex to the directory pauprun:
mv algae.nex pauprun
The mv command takes two arguments. The first argument is the name of the directory or file you want to move, whereas the second argument is the destination. The destination could be either a directory (which is true in this case) or a file name. If the directory pauprun did not already exist, mv would have interpreted this as a request to rename algae.nex to the file name pauprun! So, be aware that mv can rename files as well as move them.
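
The two behaviors of mv are easy to compare side by side. The following sketch runs in a scratch directory; the directory name <tt>other</tt> is invented for the second case, in which no <tt>pauprun</tt> directory exists.

```shell
demo="$(mktemp -d)"
cd "$demo"

# Case 1: pauprun exists as a directory, so algae.nex is moved into it.
mkdir pauprun
touch algae.nex
mv algae.nex pauprun
ls -l pauprun

# Case 2: no pauprun directory here, so the very same command renames the file.
mkdir other
cd other
touch algae.nex
mv algae.nex pauprun
ls -l
```

In the second listing, <tt>pauprun</tt> appears as an ordinary file, which is why it pays to confirm with <tt>ls</tt> that the destination directory exists before moving anything into it.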

==== cp command: copying a file ====
The cp command copies files. It leaves the original file in place and makes a copy elsewhere. You could have used this command to get a copy of algae.nex into the directory pauprun:
cp algae.nex pauprun
This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.

==== rm command: cleaning up ====
The rm command removes files. If you had used the cp command to copy algae.nex into the pauprun directory, you could remove the original file using these commands:
cd
rm algae.nex
The first cd command just ensures that the copy you are removing will be the one in your home directory (typing <tt>cd</tt> by itself acts the same as typing <tt>cd $HOME</tt>).
To delete an entire directory (don't try this now!), you can add the -rf flags. The r flag tells rm to recursively apply the remove command to everything in every subdirectory, while the f flag means force (don't ask whether each file should be deleted in each subdirectory, just do it!):
rm -rf pauprun
The above command would remove everything in the pauprun directory (without asking!), and then remove the pauprun directory itself. I want to stress that this is a particularly dangerous command, so make sure you are not distracted or sleep-deprived when you use it! Unlike files deleted through the Windows or Mac graphical user interface, files deleted using rm are '''not''' moved to the Recycle Bin or Trash; they are just gone. There is '''no undo''' for the rm command.
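
If you want to try cp and rm (including <tt>rm -rf</tt>) without risking your real files, a scratch directory is a safe place to do it. The sketch below is one way to practice; the file contents are a placeholder, and the <tt>pauprun</tt> directory it deletes is a stand-in created inside the scratch directory, not the one in your home directory.

```shell
demo="$(mktemp -d)"
cd "$demo"
mkdir pauprun
echo "placeholder data" > algae.nex

cp algae.nex pauprun                  # original stays; a duplicate appears in pauprun
rm algae.nex                          # remove the original; the copy is untouched
copied="$(cat pauprun/algae.nex)"
echo "copy contains: $copied"

# Look before you delete: list exactly what rm -rf is about to remove.
ls -R pauprun
rm -rf pauprun

if [ -e pauprun ]; then gone="no"; else gone="yes"; fi
echo "pauprun removed: $gone"
```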

=== Starting a PAUP* analysis ===

If you've been following the directions in sequence, you now have three files (algae.nex, run.nex, and gopaup) in your <tt>$HOME/pauprun</tt> directory on the cluster. Use the cd command to make sure you are in the <tt>pauprun</tt> directory, then use the cat command to look at the contents of the gopaup file you created earlier. You should see this:
#!/bin/bash
#SBATCH --partition=mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex

This file will be used by software called SLURM to start your run. SLURM provides a command called <tt>sbatch</tt> that you will use to submit your analysis. The SLURM <tt>sbatch</tt> command will look for a core (i.e. processor) on a node (i.e. machine) in the cluster that is currently not being used and will start your analysis on that node. This saves you the effort of looking amongst all nodes in the cluster for a core that is not busy.

Here is an explanation of each of the lines in gopaup:
* The 1st line specifies the command interpreter to use (just include this in your scripts verbatim).
* The 2nd, 3rd, and 4th lines begin with #SBATCH and are interpreted as commands by SLURM itself. In this case, the first and second #SBATCH commands tell SLURM to use the mcbstudent partition (--partition=mcbstudent) and the mcbstudent quality of service (--qos=mcbstudent). You should always include these two lines verbatim. The last #SBATCH line gives a name to your job (--job-name=pauprun). You could change "pauprun" here to something else, but keep your job names short and without embedded spaces or punctuation. The job name will help you identify your run when checking status.
* The 5th line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
* The 6th line informs the system that you want to use a particular version of paup. If you left this line out, the command on the last line might not work at all, or might run an older version of paup. You can get a list of all available modules using the command "module avail".
* The 7th and last line starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.

==== Submitting a job using sbatch ====
Now you are ready to start the analysis. Type these commands to start your run:
cd
cd pauprun
sbatch gopaup
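
When sbatch accepts a job, it prints a line of the form "Submitted batch job" followed by the job ID, and that ID is what the squeue listing shows and what scancel needs. The sketch below shows one way to pull the number out of that line with awk. The name extract_job_id is a made-up helper, and the echo line stands in for real sbatch output so that the example can be tried anywhere.

```shell
# Hypothetical helper: read sbatch's confirmation message on standard input
# and print only the job ID (the fourth word on the line).
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Stand-in for the output of "sbatch gopaup" (645170 is the job ID used on this page).
jobid=$(echo "Submitted batch job 645170" | extract_job_id)
echo "job ID is $jobid"

# On the cluster, the equivalent would be:
#   jobid=$(sbatch gopaup | extract_job_id)
```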

==== Checking status using squeue ====
You can see if your run is still going using the squeue command:
squeue
If it is running, you will see an entry named pauprun. Here is what it looked like for me:
hpc-ext-2 pauprun $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)
The PD under ST (state) means that my job is pending (not yet running). This job goes so fast that you will be lucky to find it in the running state. If you see no jobs listed when you run squeue, it means your job has finished.

==== Killing a job using scancel ====

Sometimes it is clear that an analysis is not going to do what you wanted it to. Suppose that just after you press the Enter key to start an analysis, you realize that you forgot to put in a savetrees command in your paup block (so in the end you will not be able to see the results of the search). In such situations, you really want to just kill the job, fix the problem, and then start it up again. Use the scancel command for this. Note that in the output of the squeue command above, my run had a job-ID equal to 645170. I could kill the job like this:
scancel 645170
Be sure to delete any output files that have already been created before starting your run over again.

==== While PAUP* is running ====

While PAUP* is running, you can use cat to look at the log file:
cd pauprun
cat algae.output.txt
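
For a long log, cat prints everything and the part you care about scrolls past. A minimal sketch using the standard tail command follows; the log contents are fabricated in a temporary directory purely for demonstration.

```shell
scratch=$(mktemp -d)

# Fabricated stand-in for a long PAUP* log file: 100 numbered lines.
i=1
while [ "$i" -le 100 ]; do
    echo "log line $i"
    i=$((i + 1))
done > "$scratch/algae.output.txt"

# Show only the last 5 lines instead of the whole file.
last_lines=$(tail -n 5 "$scratch/algae.output.txt")
echo "$last_lines"

# To keep watching a log as it grows, use: tail -f algae.output.txt
# (press Ctrl-C to stop watching; this does not affect the running job).
```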

==== Using Cyberduck to download the log file and the tree file ====

When PAUP* finishes, squeue will no longer list your job. At this point, you need to use Cyberduck to copy the saved log and tree files back to your local computer. Assuming you left Cyberduck open and connected to the cluster, double-click on the pauprun directory and locate the files <tt>algae.ml.tre</tt> and <tt>algae.output.txt</tt>. Select these two files and drag them out of the Cyberduck window and drop them on your desktop. After a flurry of activity, Cyberduck should report that the two files were downloaded successfully, at which point you can close the download status window.

==== Using scp to download the log file and the tree file ====

You can also use scp to copy the saved log and tree files back to your local computer, but, again, if you are happy with Cyberduck you can skip this section. In the Terminal app on your Mac (or the Git for Windows BASH session on your Windows PC), type the following (being careful to separate the final dot character from everything else by a blank space):
scp xfer:pauprun/algae.output.txt .
scp xfer:pauprun/algae.ml.tre .
This assumes you have set up a shortcut: if not, you will need to use the longer version below (being sure to replace <tt>username</tt> with your own user name on the cluster):
scp username@transfer.cam.uchc.edu:pauprun/algae.output.txt .
scp username@transfer.cam.uchc.edu:pauprun/algae.ml.tre .

These scp commands copy the files <tt>algae.output.txt</tt> and <tt>algae.ml.tre</tt> to your current directory (this is what the single dot at the end of each line stands for).

=== Using FigTree to view tree files ===
[[File:AlgaeMLtree.png|right|400px|thumb|<tt>algae.ml.tre</tt> file viewed with FigTree]]
If you do not already have it, download and install the [http://tree.bio.ed.ac.uk/software/figtree/ FigTree] application on your laptop. FigTree is a Java application, so you will also need to install a Java Runtime Environment (JRE) if you don't already have one (just start FigTree and it will tell you if it cannot find a JRE). Once FigTree is running, choose File > Open from the menu to open the <tt>algae.ml.tre</tt> file.

==== Adjusting taxon label font ====

The first thing you will probably want to do is make the taxon labels larger or change the font. Expand the Tip Labels section on the left and play with the Font Size up/down control. You can also set font details in the preferences, which will save you a lot of time in the future.

==== Line thickness ====

You can modify the thickness of the lines used by FigTree to draw the edges of the tree by expanding the Appearance tab.

==== Ladderization ====

You can ladderize the tree (make it appear to flow one way or the other) by playing with the Order Nodes option in the Trees tab.

==== Export tree as PDF ====

There are many other options that you can discover in FigTree, but one more thing to try today is to save the tree as a PDF file. Once you have the tree looking just the way you want, choose File > Export PDF...

==== Why have PAUP* create the log file algae.output.txt? ====

In your pauprun directory, SLURM saved the output that PAUP* normally sends to the console to a file named slurm-645170.out (your file will have a different number, however). You will not need this file after the run: the log command in your paup block ends up saving the same output in the file algae.output.txt. Why did we tell PAUP* to start a log file if SLURM was going to save the output anyway? The main reason is that you can view the log file during the run, but you cannot view slurm-645170.out until the run is finished. There will come a day when you have a PAUP* run that has been going for several days and want to know whether it is 10% or 90% finished. At this point you will appreciate being able to view the output file!

==== Delete the slurm-xxxx.out file using the rm command ====

Because you do not need the slurm-xxxx.out file (where the xxxx is a placeholder for the job number), delete it using the rm command (the -f stands for force; i.e. don't ask if it is ok, just do it!):
cd
cd pauprun
rm -f slurm-*.out
You also no longer need the log and tree files because you downloaded them to your local computer using Cyberduck (or scp):
rm -f algae.ml.tre
rm -f algae.output.txt
It is a good idea to delete files you no longer need for two reasons:
* you will later wonder whether you downloaded those files to your local machine and will have to spend time making sure you actually have saved the results locally
* our cluster only has so much disk space, and thus it is just not possible for everyone to keep every file they ever created
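
If you are unsure whether your local copy is complete, comparing checksums is one way to settle it before deleting anything on the cluster. The sketch below uses the POSIX cksum command on two files in a temporary directory; remote_copy.tre and local_copy.tre are made-up stand-ins for the file on the cluster and the downloaded file, which in practice live on different machines.

```shell
scratch=$(mktemp -d)

# Stand-ins for the cluster copy and the downloaded copy of a tree file.
printf 'tree content\n' > "$scratch/remote_copy.tre"
cp "$scratch/remote_copy.tre" "$scratch/local_copy.tre"

# cksum prints a checksum and a byte count; reading from standard input
# leaves the file name out of the output, so the two results can be compared.
remote_sum=$(cksum < "$scratch/remote_copy.tre")
local_sum=$(cksum < "$scratch/local_copy.tre")

if [ "$remote_sum" = "$local_sum" ]; then
    echo "copies match; safe to delete the cluster copy"
else
    echo "copies differ; download again before deleting"
fi
```

In practice you would run cksum on algae.ml.tre once on the cluster and once on your laptop, then compare the two numbers by eye.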

=== Tips and tricks ===

Here are some miscellaneous tips and tricks to make your life easier when communicating with the cluster.

==== Command completion using the tab key ====

You can often get away with only typing the first few letters of a filename; try pressing the Tab key after the first few letters and the command interpreter will try to complete the thought. For example, cd into the pauprun directory, then type
cat alg<TAB>
If algae.nex is the only file in the directory in which the first three letters are alg, then the command interpreter will type in the rest of the file name for you.

==== Wildcards ====

I've already mentioned this tip, but it bears repeating. When using most UNIX commands that accept filenames (e.g. ls, rm, mv, cp), you can place an asterisk inside the filename to stand in for any number of characters. So
ls algae*
will produce output like this
algae.ml.tre algae.nex algae.output.txt
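
The asterisk is not the only wildcard. In the sketch below (run in a temporary directory with empty stand-in files, so it is safe to try), * matches any run of characters and ? matches exactly one character. The files run1.log and run2.log are invented for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch algae.nex algae.ml.tre algae.output.txt run.nex run1.log run2.log

# * matches any number of characters (including none).
all_algae=$(ls algae*)
echo "$all_algae"

# A wildcard can appear anywhere in the name: every file ending in .nex.
nex_files=$(ls *.nex)
echo "$nex_files"

# ? matches exactly one character: run1.log and run2.log, but not run.nex.
numbered=$(ls run?.log)
echo "$numbered"
```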

==== Man pages ====

If you want to learn more options for any of the UNIX commands, you can use the man command to see the manual for that command. For example, here's how to see the manual describing the ls command:
man ls
It is important to know how to escape from a man page! The way to get out is to type the letter q. You can page down using Ctrl-f, page up through a man page using Ctrl-b, go to the end using Shift-g and return to the very beginning using 1,Shift-g (that is, type a 1, release it, then type Shift-g). You can also move line by line in a man page using the down and up arrows, and page by page using the PgUp and PgDn keys.

==== Xanadu information ====

You can find a lot of great information about the Xanadu cluster at the [https://bioinformatics.uconn.edu/ Computational Biology Core] web site. In particular, take time to look through the first tutorial named "Understanding the UConn Health Cluster (Xanadu)".

[[Category: Phylogenetics]]
__NOTOC__

Phylogenetics: Xanadu Cluster

2022-01-13T16:37:03Z

Paul Lewis: /* Create the gopaup file */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to show you how to log into the [https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/xanadu/ Xanadu] computer cluster and perform a basic PAUP* analysis.
|}

= Using the UConn Xanadu cluster =
The UConn [http://bioinformatics.uconn.edu Computational Biology Core] is part of the [http://cgi.uconn.edu/ Center for Genome Innovation (CGI)]. We will use the Xanadu computing cluster located at the UConn Health Center for most of the data crunching we will do in this course. By now, you should have an account on the cluster, and today you will learn how to start analyses remotely (i.e. from your laptop), check on their status, and download the results when your analysis is finished.

== Obtaining the necessary communications software ==
You will be using a couple of simple (and free) programs to communicate with the head node of the cluster.

=== If you use Windows... please scroll down to the Windows section ===
=== If you use MacOS 10.x... ===
==== SSH ====
The program '''ssh''' will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use ssh to send commands to the cluster and see the output generated.

Start by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive. Using the Terminal program, you can connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where username should be replaced by your username on the cluster.

You may wish to install [http://www.iterm2.com/ iTerm2], which is a terminal program that makes some things easier than Terminal, but the built-in Terminal will work just fine.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Directories with names beginning with a period are not usually shown, but you can open this directory in Finder by typing
cd
open .ssh
in your terminal (the initial cd command changes the directory to your default, a.k.a. home, directory).

If you do not see a file named "config" in your ".ssh" directory, create an empty config file using the command
touch ~/.ssh/config
Open the config file in a text editor such as [https://www.barebones.com BBEdit] (NOT a word processor such as Microsoft Word or Pages!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/Users/plewis'' with your home directory path on your Mac, which you can get by typing <tt>pwd</tt> at the command line):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
Once you save the config file, you should be able to just type
ssh xanadu
to log in to the xanadu cluster. (The second entry, xfer, will be used when transferring files to and from the cluster.)
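
OpenSSH can refuse to use a config file that other users are able to write to, so it is worth keeping the .ssh directory and its config file private. The sketch below demonstrates the chmod commands on a stand-in directory made with mktemp (fake_home is a made-up name); on your own machine the equivalent targets would be ~/.ssh and ~/.ssh/config.

```shell
# Stand-in for your home directory so the example cannot touch real settings.
fake_home=$(mktemp -d)
mkdir -p "$fake_home/.ssh"
touch "$fake_home/.ssh/config"

# Only the owner may enter the directory or read and write the config file.
chmod 700 "$fake_home/.ssh"
chmod 600 "$fake_home/.ssh/config"

# The first ten characters of each ls -l line show the resulting permissions.
dir_mode=$(ls -ld "$fake_home/.ssh" | cut -c1-10)
file_mode=$(ls -l "$fake_home/.ssh/config" | cut -c1-10)
echo "$dir_mode $file_mode"
```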

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). I will show you how to transfer files using both methods, but for now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, but you might find that the command line clients let you get your work done faster once you get used to using them.

=== If you use Windows... ===
==== SSH ====
The program '''Git for Windows''' provides a terminal that will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). We will not actually use the "git" part of "Git for Windows", although it is there in case you need it later. Instead, you will use the Git for Windows bash shell to send commands to the cluster and see the output generated.

Visit the [https://gitforwindows.org Git for Windows] web site, press the Download button, and install Git for Windows on your system. Once installed, open Git BASH from the All Programs section of the start menu. This will open a terminal running the bash shell (a shell is a program that interprets operating system control commands) on your desktop.

Connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where ''username'' should be replaced by your username on the cluster.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Open (or create, if it does not yet exist) the file named ''config'' in a text editor such as [https://notepad-plus-plus.org NotePad++] (NOT a word processor such as Microsoft Word!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/c/Users/Paul Lewis'' with your home directory path):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
Use the <tt>pwd</tt> command to find out what your home directory path is, and use double quotes if your home directory path contains embedded spaces (note that I had to use quotes for mine).

Once you save the config file, you should be able to just type
ssh xanadu
to log in to the xanadu cluster. (The second entry, xfer, will be used for transferring files to and from Xanadu.)

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). For now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, which makes moving files back and forth easy.

=== Learning enough UNIX to get around ===
I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.

==== ls command: finding out what is in the present working directory ====
The '''ls''' command lists the files in the ''present working directory''. Try typing just
ls
If you need more details about files than you see here, type
ls -la
instead. This version provides information about file permissions, ownership, size, and last modification date.

==== pwd command: finding out what directory you are in ====
Typing
pwd
shows you the full path of the present working directory. The path shown should end with your username, indicating that you are currently in your home directory.

==== mkdir command: creating a new directory ====
Typing the following command will create a new directory named <tt>pauprun</tt> in your home directory:
mkdir pauprun
Use the <tt>ls</tt> command now to make sure a directory of that name was indeed created.

==== cd command: leaving the nest and returning home again ====
The cd command lets you change the present working directory. To move into the newly-created <tt>pauprun</tt> directory, type
cd pauprun
You can always go back to your home directory (no matter how lost you get!) by typing cd by itself:
cd
If you want to go up one directory level (say from pauprun back up to your home directory), you can specify the parent directory using two dots:
cd ..
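
The forms of cd described above can be tried safely in a throwaway directory. In this sketch, mktemp -d creates the scratch directory (so nothing in your real home directory is touched), and cd - is one extra form, not covered above, that returns to the previous directory.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
mkdir pauprun

cd pauprun                       # move into pauprun
inside=$(basename "$(pwd)")
echo "now in: $inside"

cd ..                            # move up to the parent directory
parent=$(basename "$(pwd)")
echo "now in: $parent"

cd - > /dev/null                 # jump back to the previous directory (pauprun)
back=$(basename "$(pwd)")
echo "now in: $back"
```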

==== Creating run.nex using the nano editor ====

One way to create a new file, or edit one that already exists, is to use the nano editor. You will now use nano to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.

First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type
nano run.nex
This will open the nano editor, and it should say <nowiki>[ New File ]</nowiki> at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit nano.

For now, type the following into the editor:
#nexus

begin paup;
log file=algae.output.txt start replace flush;
execute algae.nex;
set criterion=likelihood autoclose;
lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
hsearch swap=none start=stepwise addseq=random nrep=1;
lscores 1;
lset basefreq=previous tratio=previous shape=previous;
hsearch swap=tbr start=1;
savetrees file=algae.ml.tre brlens;
log stop;
quit;
end;
Once you have entered everything, use ^X to exit. Nano will ask if you want to save the modified '''buffer''' (a ''buffer'' is a predefined amount of computer memory used to store the text you type; the text stored in the buffer will be lost once you exit the program unless you save it to a file on the hard drive), at which point you should press the Y key to answer yes. Nano will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Nano should now exit and you can use cat to look at the contents of the file you just created:
cat run.nex

==== Create the gopaup file ====
Now use nano to create a second file named <tt>gopaup</tt> in your <tt>pauprun</tt> directory. To do this, type <tt>pwd</tt> to make sure you are in the <tt>pauprun</tt> directory, then type <tt>nano gopaup</tt>. This file should contain this text:
#!/bin/bash
#SBATCH --partition=mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex
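
If you prefer not to type the script into nano, a here-document can write the same seven lines in one step. This is a sketch of an alternative, not part of the original exercise: it writes gopaup into a temporary directory (workdir is a stand-in; on the cluster you would use your pauprun directory instead) and then asks bash to check the syntax without running anything.

```shell
workdir=$(mktemp -d)   # stand-in for $HOME/pauprun

# The quoted 'EOF' keeps $HOME from being expanded now;
# it is written literally and expanded later, when the job runs.
cat > "$workdir/gopaup" <<'EOF'
#!/bin/bash
#SBATCH --partition=mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex
EOF

# bash -n parses the script and reports syntax errors without executing it.
bash -n "$workdir/gopaup" && echo "gopaup: syntax OK"

# Count the #SBATCH directive lines (there should be three).
directives=$(grep -c '^#SBATCH' "$workdir/gopaup")
echo "$directives"
```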

=== Using Cyberduck to upload the algae.nex data file ===
[[File:Cyberduck-bookmark-xanadu.png|right]]
Download the file <tt>algae.nex</tt> from [http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex here] and save it on your hard drive.

Open Cyberduck, choose Bookmark > New Bookmark from the main menu, then fill out the resulting dialog box as shown on the right (except substitute your own user name, of course). Be sure to change the protocol to SFTP (not the default FTP). Click the button to close the dialog box and you should see your bookmark appear at the bottom of the main window. Double click the bookmark to open a connection. You will then be warned that the host key is unknown - choose Allow (and go ahead and check the Always button so you do not need to do this every time).

Once you are in, you will see a listing of the files in your home directory on the cluster. To copy the <tt>algae.nex</tt> file to the cluster, you need only drag-and-drop it onto the Cyberduck window.

=== Using scp to upload the algae.nex data file ===

While you will probably want to do your file transfers with Cyberduck as described above, it is also possible to transfer files using the command line scp client. Read on if you are interested in this option, but feel free to skip this section if you are happy using Cyberduck.

The following should be carried out in a terminal that '''is not''' already logged into the cluster.

In your terminal, navigate to where you saved the file (on your local Mac or Windows computer). If you saved it on the desktop, you can go there by typing <tt>cd Desktop</tt>.

If you've made a shortcut in your ''.ssh/config'' file, you can use the following command to upload the ''algae.nex'' file to the cluster:
scp algae.nex xfer:

If you have not made a shortcut, use this command instead:
scp algae.nex username@transfer.cam.uchc.edu:
where <tt>username</tt> should be replaced by your own user name on the cluster. (Don't overlook the colon on the very end of the line!)

=== Using curl to download the algae.nex data file directly ===

One of my favorite methods to transfer files that are stored on a web site involves the program curl (short for "client URL"). The following command should be carried out in a terminal that '''is''' logged into the cluster.

In your terminal, navigate to the directory where you want to save the file (on the cluster). Use this command to download the algae.nex file directly into the present working directory on the cluster:
curl -O http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex
The <tt>-O</tt> tells curl to save it under the same file name (algae.nex) that it has on the remote server. If you forget the <tt>-O</tt>, curl will just spit out the entire contents of the file to your terminal, which is almost never what you want!
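
A failed download can leave you with an error page saved under the name algae.nex. One quick sanity check, sketched below, relies on the convention that NEXUS files begin with the token #NEXUS (in any mix of upper and lower case). The name looks_like_nexus is a made-up helper, and the two test files are fabricated stand-ins rather than real data.

```shell
# Hypothetical helper: succeed if the first line of the file
# starts with #NEXUS (compared case-insensitively).
looks_like_nexus() {
    head -n 1 "$1" | grep -qi '^#nexus'
}

scratch=$(mktemp -d)
printf '#NEXUS\nbegin data;\nend;\n' > "$scratch/good.nex"
printf '<html><body>404 Not Found</body></html>\n' > "$scratch/bad.nex"

for f in "$scratch/good.nex" "$scratch/bad.nex"; do
    if looks_like_nexus "$f"; then
        echo "$(basename "$f"): looks like a NEXUS file"
    else
        echo "$(basename "$f"): does not look like a NEXUS file"
    fi
done
```

On the cluster you would simply run the check on the downloaded file, for example <tt>looks_like_nexus algae.nex</tt>.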

=== A few more UNIX commands ===

You have now transferred a data file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file is currently in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains a line containing the command <tt>execute algae.nex</tt>, which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:
cd $HOME
ls algae.*
Note the use of a wildcard character (*) in the ls command. This will show you only files that begin with the letters <tt>algae</tt> followed by a period and any number of other non-whitespace characters. The <tt>$HOME</tt> is a predefined shell variable that will be replaced with your home directory. It is not necessary in this case - typing <tt>cd</tt> all by itself would take you to your home directory - but the <tt>$HOME</tt> variable is good to know about (especially for use in scripts).

==== mv command: moving or renaming a file ====
Now use the mv command to move algae.nex to the directory pauprun:
mv algae.nex pauprun
The mv command takes two arguments. The first argument is the name of the directory or file you want to move, whereas the second argument is the destination. The destination could be either a directory (which is true in this case) or a file name. If the directory pauprun did not already exist, mv would have interpreted this as a request to rename algae.nex to the file name pauprun! So, be aware that mv can rename files as well as move them.
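
The difference between moving and renaming is easy to see in a throwaway directory. The sketch below is only an illustration: notes.txt, extra.txt, and nosuchdir are invented names, and everything happens inside a directory created by mktemp.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
mkdir pauprun
touch algae.nex notes.txt extra.txt

# Destination is an existing directory: the file is moved into it.
mv algae.nex pauprun
ls pauprun

# Destination is not an existing directory: the file is renamed instead.
mv notes.txt oldnotes.txt
ls

# A trailing slash states that a directory is intended, so mv fails
# rather than renaming when nosuchdir does not exist.
mv extra.txt nosuchdir/ 2>/dev/null || echo "mv refused: nosuchdir/ is not a directory"
```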

==== cp command: copying a file ====
The cp command copies files. It leaves the original file in place and makes a copy elsewhere. You could have used this command to get a copy of algae.nex into the directory pauprun:
cp algae.nex pauprun
This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.

==== rm command: cleaning up ====
The rm command removes files. If you had used the cp command to copy algae.nex into the pauprun directory, you could remove the original file using these commands:
cd
rm algae.nex
The first cd command just ensures that the copy you are removing will be the one in your home directory (typing <tt>cd</tt> by itself acts the same as typing <tt>cd $HOME</tt>).
To delete an entire directory (don't try this now!), you can add the -rf flags. The r flag tells rm to recursively apply the remove command to everything in every subdirectory, while the f flag means force (don't ask whether each file should be deleted in each subdirectory, just do it!):
rm -rf pauprun
The above command would remove everything in the pauprun directory (without asking!), and then remove the pauprun directory itself. I want to stress that this is a particularly dangerous command, so make sure you are not distracted or sleep-deprived when you use it! Unlike files deleted through the Windows or Mac graphical user interface, files deleted using rm are '''not''' moved to the Recycle Bin or Trash; they are just gone. There is '''no undo''' for the rm command.

=== Starting a PAUP* analysis ===

If you've been following the directions in sequence, you now have three files (algae.nex, run.nex, and gopaup) in your <tt>$HOME/pauprun</tt> directory on the cluster. Use the cd command to make sure you are in the pauprun directory, then the cat command to look at the contents of the gopaup file you created earlier. You should see this:
#!/bin/bash
#SBATCH --partition=mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex

This file will be used by software called SLURM to start your run. SLURM provides a command called <tt>sbatch</tt> that you will use to submit your analysis. The SLURM <tt>sbatch</tt> command will look for a core (i.e. processor) on a node (i.e. machine) in the cluster that is currently not being used and will start your analysis on that node. This saves you the effort of looking amongst all nodes in the cluster for a core that is not busy.

Here is an explanation of each of the lines in gopaup:
* The 1st line specifies the command interpreter to use (just include this in your scripts verbatim).
* The 2nd, 3rd, and 4th lines begin with #SBATCH and are interpreted as commands by SLURM itself. In this case, the first and second #SBATCH commands tell SLURM to use the mcbstudent partition (--partition=mcbstudent) and the mcbstudent quality of service (--qos=mcbstudent). You should always include these two lines verbatim. The last #SBATCH line gives a name to your job (--job-name=pauprun). You could change "pauprun" here to something else, but keep your job names short and without embedded spaces or punctuation. The job name will help you identify your run when checking status.
* The 5th line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
* The 6th line informs the system that you want to use a particular version of paup. If you left this line out, the command on the last line might not work at all, or might run an older version of paup. You can get a list of all available modules using the command "module avail".
* The 7th and last line starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.

==== Submitting a job using sbatch ====
Now you are ready to start the analysis. Type these commands to start your run:
cd
cd pauprun
sbatch gopaup

==== Checking status using squeue ====
You can see if your run is still going using the squeue command:
squeue
If it is running, you will see an entry named pauprun. Here is what it looked like for me:
hpc-ext-2 pauprun $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)
The PD under ST (state) means that my job is pending (not yet running). This job goes so fast that you will be lucky to find it in the running state. If you see no jobs listed when you run squeue, it means your job has finished.
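
Rather than retyping squeue, you can let the shell poll for you. The following is a minimal sketch, not an official SLURM recipe: wait_for_job is a made-up helper name, the 30-second default interval is arbitrary, and the function assumes squeue is available, which is why the last lines skip the call anywhere other than on the cluster.

```shell
# Hypothetical helper: return once no job with the given name is listed
# for the current user. Assumes squeue is on the PATH (true on the cluster).
wait_for_job() {
    name=$1
    interval=${2:-30}   # seconds between checks
    # -h suppresses the header line, -u limits the listing to one user, and
    # -n to one job name, so the output is empty once the job has left the queue.
    while [ -n "$(squeue -h -u "$(id -un)" -n "$name")" ]; do
        sleep "$interval"
    done
    echo "no job named $name in the queue"
}

if command -v squeue > /dev/null 2>&1; then
    wait_for_job pauprun
else
    echo "squeue not found; run this on the cluster"
fi
```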

==== Killing a job using scancel ====

Sometimes it is clear that an analysis is not going to do what you wanted it to. Suppose that just after you press the Enter key to start an analysis, you realize that you forgot to put in a savetrees command in your paup block (so in the end you will not be able to see the results of the search). In such situations, you really want to just kill the job, fix the problem, and then start it up again. Use the scancel command for this. Note that in the output of the squeue command above, my run had a job-ID equal to 645170. I could kill the job like this:
scancel 645170
Be sure to delete any output files that have already been created before starting your run over again.
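
The cleanup can be done with a single rm command that names only the files the run itself produces. The sketch below demonstrates this on empty stand-in files in a temporary directory, so that it is safe to try; on the cluster you would run just the rm line from inside your pauprun directory.

```shell
scratch=$(mktemp -d)    # stand-in for the pauprun directory
cd "$scratch" || exit 1

# Inputs that must survive, plus leftover outputs from a cancelled run.
touch algae.nex run.nex gopaup
touch algae.output.txt algae.ml.tre slurm-645170.out

# Remove only the output files; -f keeps rm quiet if one of them
# was never created.
rm -f algae.output.txt algae.ml.tre slurm-*.out

remaining=$(ls)
echo "$remaining"
```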

==== While PAUP* is running ====

While PAUP* is running, you can use cat to look at the log file:
cd pauprun
cat algae.output.txt

==== Using Cyberduck to download the log file and the tree file ====

When PAUP* finishes, squeue will no longer list your job. At this point, you need to use Cyberduck to copy the saved log and tree files back to your local computer. Assuming you left Cyberduck open and connected to the cluster, double-click on the pauprun directory and locate the files <tt>algae.ml.tre</tt> and <tt>algae.output.txt</tt>. Select these two files and drag them out of the Cyberduck window and drop them on your desktop. After a flurry of activity, Cyberduck should report that the two files were downloaded successfully, at which point you can close the download status window.

==== Using scp to download the log file and the tree file ====

You can also use scp to copy the saved log and tree files back to your local computer, but, again, if you are happy with Cyberduck you can skip this section. In the Terminal app on your Mac (or the Git for Windows BASH session on your Windows PC), type the following (being careful to separate the final dot character from everything else by a blank space):
scp xfer:pauprun/algae.output.txt .
scp xfer:pauprun/algae.ml.tre .
This assumes you have set up a shortcut: if not, you will need to use the longer version below (being sure to replace <tt>username</tt> with your own user name on the cluster):
scp username@transfer.cam.uchc.edu:pauprun/algae.output.txt .
scp username@transfer.cam.uchc.edu:pauprun/algae.ml.tre .

These scp commands copy the files <tt>algae.output.txt</tt> and <tt>algae.ml.tre</tt> to your current directory (this is what the single dot at the end of each line stands for).
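
When there are more than a couple of result files, a short loop saves typing. The sketch below is a dry run: it only prints the scp commands it would issue, so it can be tried without a connection to the cluster. It assumes the xfer shortcut described earlier; the variable names remote and remote_dir are made up for this example.

```shell
remote=xfer            # the shortcut host from .ssh/config (an assumption here)
remote_dir=pauprun

# Build one scp command per result file. The leading "echo" makes this a
# dry run that only prints the commands; remove it to actually copy files.
commands=$(
    for f in algae.output.txt algae.ml.tre; do
        echo scp "$remote:$remote_dir/$f" .
    done
)
echo "$commands"
```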

=== Using FigTree to view tree files ===
[[File:AlgaeMLtree.png|right|400px|thumb|<tt>algae.ml.tre</tt> file viewed with FigTree]]
If you do not already have it, download and install the [http://tree.bio.ed.ac.uk/software/figtree/ FigTree] application on your laptop. FigTree is a Java application, so you will also need to install a Java Runtime Environment (JRE) if you don't already have one (just start FigTree and it will tell you if it cannot find a JRE). Once FigTree is running, choose File > Open from the menu to open the <tt>algae.ml.tre</tt> file.

==== Adjusting taxon label font ====

The first thing you will probably want to do is make the taxon labels larger or change the font. Expand the Tip Labels section on the left and play with the Font Size up/down control. You can also set font details in the preferences, which will save you a lot of time in the future.

==== Line thickness ====

You can modify the thickness of the lines used by FigTree to draw the edges of the tree by expanding the Appearance tab.

==== Ladderization ====

You can ladderize the tree (make it appear to flow one way or the other) by playing with the Order Nodes option in the Trees tab.

==== Export tree as PDF ====

There are many other options that you can discover in FigTree, but one more thing to try today is to save the tree as a PDF file. Once you have the tree looking just the way you want, choose File > Export PDF...

==== Why have PAUP* create the log file algae.output.txt? ====

In your pauprun directory, SLURM saved the output that PAUP* normally sends to the console to a file named slurm-645170.out (your file will have a different number, however). You will not need this file after the run: the log command in your paup block ends up saving the same output in the file algae.output.txt. Why did we tell PAUP* to start a log file if SLURM was going to save the output anyway? The main reason is that you can view the log file during the run, but you cannot view slurm-645170.out until the run is finished. There will come a day when you have a PAUP* run that has been going for several days and want to know whether it is 10% or 90% finished. At this point you will appreciate being able to view the output file!

==== Delete the slurm-xxxx.out file using the rm command ====

Because you do not need the slurm-xxxx.out file (where xxxx is a placeholder for the job number), delete it using the rm command (the -f stands for force; i.e. don't ask if it is ok, just do it!):
cd
cd pauprun
rm -f slurm-*.out
You also no longer need the log and tree files because you downloaded them to your local computer using Cyberduck or scp:
rm -f algae.ml.tre
rm -f algae.output.txt
It is a good idea to delete files you no longer need for two reasons:
* you will later wonder whether you downloaded those files to your local machine and will have to spend time making sure you actually have saved the results locally
* our cluster only has so much disk space, and thus it is just not possible for everyone to keep every file they ever created

=== Tips and tricks ===

Here are some miscellaneous tips and tricks to make your life easier when communicating with the cluster.

==== Command completion using the tab key ====

You can often get away with only typing the first few letters of a filename; try pressing the Tab key after the first few letters and the command interpreter will try to complete the thought. For example, cd into the pauprun directory, then type
cat alg<TAB>
If algae.nex is the only file in the directory in which the first three letters are alg, then the command interpreter will type in the rest of the file name for you.

==== Wildcards ====

I've already mentioned this tip, but it bears repeating. When using most UNIX commands that accept filenames (e.g. ls, rm, mv, cp), you can place an asterisk inside the filename to stand in for any number of characters. So
ls algae*
will produce output like this
algae.ml.tre algae.nex algae.output.txt

==== Man pages ====

If you want to learn more options for any of the UNIX commands, you can use the man command to see the manual for that command. For example, here's how to see the manual describing the ls command:
man ls
It is important to know how to escape from a man page! The way to get out is to type the letter q. You can page down using Ctrl-f, page up through a man page using Ctrl-b, go to the end using Shift-g and return to the very beginning using 1,Shift-g (that is, type a 1, release it, then type Shift-g). You can also move line by line in a man page using the down and up arrows, and page by page using the PgUp and PgDn keys.

==== Xanadu information ====

You can find a lot of great information about the Xanadu cluster at the [https://bioinformatics.uconn.edu/ Computational Biology Core] web site. In particular, take time to look through the first tutorial named "Understanding the UConn Health Cluster (Xanadu)".

[[Category: Phylogenetics]]
__NOTOC__

Phylogenetics: Xanadu Cluster

2022-01-13T16:34:18Z

Paul Lewis: /* Using scp to upload the algae.nex data file */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to show you how to log into the [https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/xanadu/ Xanadu] computer cluster and perform a basic PAUP* analysis.
|}

= Using the UConn Xanadu cluster =
The UConn [http://bioinformatics.uconn.edu Computational Biology Core] is part of the [http://cgi.uconn.edu/ Center for Genome Innovation (CGI)]. We will use the Xanadu computing cluster located at the UConn Health Center for most of the data crunching we will do in this course. By now, you should have an account on the cluster, and today you will learn how to start analyses remotely (i.e. from your laptop), check on their status, and download the results when your analysis is finished.

== Obtaining the necessary communications software ==
You will be using a couple of simple (and free) programs to communicate with the head node of the cluster.

=== If you use Windows...please scroll down to the Windows section===
=== If you use MacOS 10.x... ===
==== SSH ====
The program '''ssh''' will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use ssh to send commands to the cluster and see the output generated.

Start by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive. Using the Terminal program, you can connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where username should be replaced by your username on the cluster.

You may wish to install [http://www.iterm2.com/ iTerm2], which is a terminal program that makes some things easier than Terminal, but the built-in Terminal will work just fine.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Directories with names beginning with a period are not usually shown, but you can open this directory in Finder by typing
cd
open .ssh
in your terminal (the initial cd command changes the directory to your default, a.k.a. home, directory).

If you do not see a file named "config" in your ".ssh" directory, create an empty config file there using the command
touch .ssh/config
Open the config file in a text editor such as [https://www.barebones.com BBEdit] (NOT a word processor such as Microsoft Word or Pages!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/Users/plewis'' with your home directory path on your Mac, which you can get by typing <tt>pwd</tt> at the command line):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
Once you save the config file, you should be able to just type
ssh xanadu
to login to the xanadu cluster. (The second entry (xfer) will be used when transferring files to and from the cluster.)
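
If you would rather add the shortcut from the command line than in a text editor, the steps above can be sketched as a small shell function. This is only a sketch: the function name <tt>add_host_entry</tt> is made up for this example, it writes just the HostName and User lines (add the other lines shown above by hand if you need them), and the demonstration below works on a temporary file rather than your real <tt>.ssh/config</tt>.

```shell
#!/bin/sh
# add_host_entry is a hypothetical helper, not part of ssh.
# Arguments: config file, shortcut name, full host name, user name.
add_host_entry() {
    cfg=$1
    nick=$2
    fullhost=$3
    user=$4
    touch "$cfg"
    chmod 600 "$cfg"    # keep the config file private to you
    # Do nothing if a "host <shortcut>" line is already present
    if grep -qi "^host[[:space:]]\{1,\}$nick\$" "$cfg"; then
        return 0
    fi
    printf 'host %s\n    HostName %s\n    User %s\n' "$nick" "$fullhost" "$user" >> "$cfg"
}

# Demonstration on a temporary file (replace "username" with your own)
cfg=$(mktemp)
add_host_entry "$cfg" xanadu xanadu-submit-ext.cam.uchc.edu username
add_host_entry "$cfg" xanadu xanadu-submit-ext.cam.uchc.edu username
add_host_entry "$cfg" xfer transfer.cam.uchc.edu username
cat "$cfg"
count=$(grep -c '^host xanadu$' "$cfg")
```

Calling the function twice with the same shortcut name leaves only one entry in the file, so it is safe to rerun.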

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). I will show you how to transfer files using both methods, but for now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, but you might find that the command line clients let you get your work done faster once you get used to using them.

=== If you use Windows... ===
==== SSH ====
The program '''Git for Windows''' provides a terminal that will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). We will not actually use the "git" part of "Git for Windows", although it is there in case you need it later. Instead, you will use the Git for Windows bash shell to send commands to the cluster and see the output generated.

Visit the [https://gitforwindows.org Git for Windows] web site, press the Download button, and install Git for Windows on your system. Once installed, open Git BASH from the All Programs section of the start menu. This will open a terminal running the bash shell (a shell is a program that interprets operating system control commands) on your desktop.

Connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where ''username'' should be replaced by your username on the cluster.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Open (or create, if it does not yet exist) the file named ''config'' in a text editor such as [https://notepad-plus-plus.org NotePad++] (NOT a word processor such as Microsoft Word!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/c/Users/Paul Lewis'' with your home directory path):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
Use the <tt>pwd</tt> command to find out what your home directory path is, and use double quotes if your home directory path contains embedded spaces (note that I had to use quotes for mine).

Once you save the config file, you should be able to just type
ssh xanadu
to login to the xanadu cluster. (The second entry (xfer) will be used for transferring files to and from Xanadu.)

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). For now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, which makes moving files back and forth easy.

=== Learning enough UNIX to get around ===
I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.

==== ls command: finding out what is in the present working directory ====
The '''ls''' command lists the files in the ''present working directory''. Try typing just
ls
If you need more details about files than you see here, type
ls -la
instead. This version provides information about file permissions, ownership, size, and last modification date.

==== pwd command: finding out what directory you are in ====
Typing
pwd
shows you the full path of the present working directory. The path shown should end with your username, indicating that you are currently in your home directory.

==== mkdir command: creating a new directory ====
Typing the following command will create a new directory named <tt>pauprun</tt> in your home directory:
mkdir pauprun
Use the <tt>ls</tt> command now to make sure a directory of that name was indeed created.

==== cd command: leaving the nest and returning home again ====
The cd command lets you change the present working directory. To move into the newly-created <tt>pauprun</tt> directory, type
cd pauprun
You can always go back to your home directory (no matter how lost you get!) by typing just cd by itself
cd
If you want to go up one directory level (say from pauprun back up to your home directory), you can specify the parent directory using two dots:
cd ..

==== Creating run.nex using the nano editor ====

One way to create a new file, or edit one that already exists, is to use the nano editor. You will now use nano to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.

First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type
nano run.nex
This will open the nano editor, and it should say <nowiki>[ New File ]</nowiki> at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit nano.

For now, type the following into the editor:
#nexus

begin paup;
log file=algae.output.txt start replace flush;
execute algae.nex;
set criterion=likelihood autoclose;
lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
hsearch swap=none start=stepwise addseq=random nrep=1;
lscores 1;
lset basefreq=previous tratio=previous shape=previous;
hsearch swap=tbr start=1;
savetrees file=algae.ml.tre brlens;
log stop;
quit;
end;
Once you have entered everything, use ^X to exit. Nano will ask if you want to save the modified '''buffer''' (a ''buffer'' is a predefined amount of computer memory used to store the text you type; the text stored in the buffer will be lost once you exit the program unless you save it to a file on the hard drive), at which point you should press the Y key to answer yes. Nano will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Nano should now exit and you can use cat to look at the contents of the file you just created:
cat run.nex

==== Create the gopaup file ====
Now use nano to create a second file named <tt>gopaup</tt> in your <tt>pauprun</tt> directory. To do this, type <tt>pwd</tt> to make sure you are in the <tt>pauprun</tt> directory, then type <tt>nano gopaup</tt>. This file should contain this text:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex

=== Using Cyberduck to upload the algae.nex data file ===
[[File:Cyberduck-bookmark-xanadu.png|right]]
Download the file <tt>algae.nex</tt> from [http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex here] and save it on your hard drive.

Open Cyberduck, choose Bookmark > New Bookmark from the main menu, then fill out the resulting dialog box as shown on the right (except substitute your own user name, of course). Be sure to change the protocol to SFTP (not the default FTP). Click the button to close the dialog box and you should see your bookmark appear at the bottom of the main window. Double click the bookmark to open a connection. You will then be warned that the host key is unknown - choose Allow (and go ahead and check the Always button so you do not need to do this every time).

Once you are in, you will see a listing of the files in your home directory on the cluster. To copy the <tt>algae.nex</tt> file to the cluster, you need only drag-and-drop it onto the Cyberduck window.

=== Using scp to upload the algae.nex data file ===

While you will probably want to do your file transfers with Cyberduck as described above, it is also possible to transfer files using the command line scp client. Read on if you are interested in this option, but feel free to skip this section if you are happy using Cyberduck.

The following should be carried out in a terminal that '''is not''' already logged into the cluster.

In your terminal, navigate to where you saved the file (on your local Mac or Windows computer). If you saved it on the desktop, you can go there by typing <tt>cd Desktop</tt>.

If you've made a shortcut in your ''.ssh/config'' file, you can use the following command to upload the ''algae.nex'' file to the cluster:
scp algae.nex xfer:

If you have not made a shortcut, use this command instead:
scp algae.nex username@transfer.cam.uchc.edu:
where <tt>username</tt> should be replaced by your own user name on the cluster. (Don't overlook the colon on the very end of the line!)

=== Using curl to download the algae.nex data file directly ===

One of my favorite methods to transfer files that are stored on a web site involves the program curl (the name is a play on "client for URLs"). The following command should be carried out in a terminal that '''is''' logged into the cluster.

In your terminal, navigate to the directory where you want to save the file (on the cluster). Use this command to download the algae.nex file directly into the present working directory on the cluster:
curl -O http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex
The <tt>-O</tt> tells curl to save it under the same file name (algae.nex) that it has on the remote server. If you forget the <tt>-O</tt>, curl will just spit out the entire contents of the file to your terminal, which is almost never what you want!

=== A few more UNIX commands ===

You have now transferred a data file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file is currently in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains a line containing the command <tt>execute algae.nex</tt>, which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:
cd $HOME
ls algae.*
Note the use of a wildcard character (*) in the ls command. This will show you only files that begin with the letters <tt>algae</tt> followed by a period and any number of other non-whitespace characters. The <tt>$HOME</tt> is a predefined shell variable that will be replaced with your home directory. It is not necessary in this case - typing <tt>cd</tt> all by itself would take you to your home directory - but the <tt>$HOME</tt> variable is good to know about (especially for use in scripts).

==== mv command: moving or renaming a file ====
Now use the mv command to move algae.nex to the directory pauprun:
mv algae.nex pauprun
The mv command takes two arguments. The first argument is the name of the directory or file you want to move, whereas the second argument is the destination. The destination could be either a directory (which is true in this case) or a file name. If the directory pauprun did not already exist, mv would have interpreted this as a request to rename algae.nex to the file name pauprun! So, be aware that mv can rename files as well as move them.

==== cp command: copying a file ====
The cp command copies files. It leaves the original file in place and makes a copy elsewhere. You could have used this command to get a copy of algae.nex into the directory pauprun:
cp algae.nex pauprun
This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.

==== rm command: cleaning up ====
The rm command removes files. If you had used the cp command to copy algae.nex into the pauprun directory, you could remove the original file using these commands:
cd
rm algae.nex
The first cd command just ensures that the copy you are removing will be the one in your home directory (typing <tt>cd</tt> by itself acts the same as typing <tt>cd $HOME</tt>).
To delete an entire directory (don't try this now!), you can add the -rf flags. The r flag tells rm to recursively apply the remove command to everything in every subdirectory, while the f flag means force (don't ask whether each file should be deleted in each subdirectory, just do it!):
rm -rf pauprun
The above command would remove everything in the pauprun directory (without asking!), and then remove the pauprun directory itself. I want to stress that this is a particularly dangerous command, so make sure you are not distracted or sleep-deprived when you use it! Unlike the Windows or Mac graphical user interface, files deleted using rm are '''not''' moved to the Recycle Bin or Trash, they are just gone. There is '''no undo''' for the rm command.
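
If you would like to practice these commands without risking real files, here is a minimal sketch that does everything inside a throwaway directory created by <tt>mktemp</tt>. The file <tt>algae.nex</tt> here is just an empty stand-in and the name <tt>renamed.nex</tt> is made up for the example.

```shell
#!/bin/sh
# Practice mv, cp and rm in a scratch directory so no real files are at risk.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

touch algae.nex             # empty stand-in for the real data file
mkdir pauprun

cp algae.nex pauprun        # copy: the original stays, a duplicate appears in pauprun
mv algae.nex renamed.nex    # renamed.nex is not a directory, so this renames the file
mv renamed.nex pauprun      # pauprun is a directory, so this moves the file into it

ls pauprun                  # lists algae.nex and renamed.nex
rm -rf pauprun              # removes pauprun and everything in it, without asking
```

Note that the same <tt>mv</tt> command either renames or moves depending only on whether the second argument is an existing directory.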

=== Starting a PAUP* analysis ===

If you've been following the directions in sequence, you now have three files (algae.nex, run.nex, and gopaup) in your <tt>$HOME/pauprun</tt> directory on the cluster. Use the cd command to make sure you are in the <tt>pauprun</tt> directory, then use the cat command to look at the contents of the gopaup file you created earlier. You should see this:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex

This file will be used by software called SLURM to start your run. SLURM provides a command called <tt>sbatch</tt> that you will use to submit your analysis. The SLURM <tt>sbatch</tt> command will look for a core (i.e. processor) on a node (i.e. machine) in the cluster that is currently not being used and will start your analysis on that node. This saves you the effort of looking amongst all nodes in the cluster for a core that is not busy.

Here is an explanation of each of the lines in gopaup:
* The 1st line specifies the command interpreter to use (just include this in your scripts verbatim).
* The 2nd, 3rd, and 4th lines begin with #SBATCH and are interpreted as commands by SLURM itself. In this case, the first and second #SBATCH commands tell SLURM to use the general partition (--partition=general) and the general quality of service (--qos=general). You should always include these two lines verbatim. The last #SBATCH line gives a name to your job (--job-name=pauprun). You could change "pauprun" here to something else, but keep your job names short and without embedded spaces or punctuation. The job name will help you identify your run when checking status.
* The 5th line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
* The 6th line informs the system that you want to use a particular version of paup. If you left this line out, the command on the last line might not work at all, or might run an older version of paup. You can get a list of all available modules using the command "module avail".
* The 7th and last line starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.

==== Submitting a job using sbatch ====
Now you are ready to start the analysis. Type these commands to start your run:
cd
cd pauprun
sbatch gopaup
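
Before submitting, it can be worth confirming that all three files are where the job expects them. The following is a sketch of such a check; the function name <tt>check_pauprun</tt> is made up for this example, and the demonstration uses a scratch directory with empty stand-in files instead of your real pauprun directory.

```shell
#!/bin/sh
# check_pauprun is a hypothetical helper: it returns 0 only if the given
# directory contains run.nex, algae.nex and gopaup.
check_pauprun() {
    dir=$1
    missing=0
    for f in run.nex algae.nex gopaup; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration in a scratch directory
demo=$(mktemp -d)
touch "$demo/run.nex" "$demo/gopaup"
check_pauprun "$demo"       # reports that algae.nex is missing
first=$?
touch "$demo/algae.nex"
check_pauprun "$demo"       # prints nothing: all three files are present
second=$?
```

On the cluster you could then type <tt>cd $HOME/pauprun && check_pauprun . && sbatch gopaup</tt>, so that the job is submitted only if nothing is missing.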

==== Checking status using squeue ====
You can see if your run is still going using the squeue command:
squeue
If it is running, you will see an entry named pauprun. Here is what it looked like for me:
hpc-ext-2 pauprun $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)
The PD under ST (state) means that my job is pending (not yet running). This job goes so fast that you will be lucky to find it in the running state. If you see no jobs listed when you run squeue, it means your job has finished.
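
If you only care about the state of one job, you can pick it out of the squeue listing with awk. This sketch assumes the default column layout shown above (job name in the third column, state in the fifth); the function name <tt>job_state</tt> is made up, and the example feeds it the sample listing rather than live squeue output so that it can be tried anywhere.

```shell
#!/bin/sh
# job_state is a hypothetical helper: it reads squeue-style text on standard
# input and prints the ST column for the job whose NAME matches $1.
job_state() {
    awk -v name="$1" 'NR > 1 && $3 == name { print $5 }'
}

# Sample listing copied from the output shown above
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)'

state=$(printf '%s\n' "$sample" | job_state pauprun)
echo "pauprun state: ${state:-finished or not found}"
```

On the cluster you would use <tt>squeue | job_state pauprun</tt> instead of the sample text. An empty result means no job with that name is listed.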

==== Killing a job using scancel ====

Sometimes it is clear that an analysis is not going to do what you wanted it to. Suppose that just after you press the Enter key to start an analysis, you realize that you forgot to put in a savetrees command in your paup block (so in the end you will not be able to see the results of the search). In such situations, you really want to just kill the job, fix the problem, and then start it up again. Use the scancel command for this. Note that in the output of the squeue command above, my run had a job-ID equal to 645170. I could kill the job like this:
scancel 645170
Be sure to delete any output files that have already been created before starting your run over again.

==== While PAUP* is running ====

While PAUP* is running, you can use cat to look at the log file:
cd pauprun
cat algae.output.txt
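
For a long run, cat will scroll the entire log past you, and usually you only want the most recent lines. The tail command shows just the end of a file. Here is a small sketch that uses a made-up 100-line stand-in for the log, created in a scratch directory:

```shell
#!/bin/sh
# Build a 100-line stand-in for algae.output.txt in a scratch directory
sandbox=$(mktemp -d)
log="$sandbox/algae.output.txt"
i=1
while [ "$i" -le 100 ]; do
    echo "line $i" >> "$log"
    i=$((i + 1))
done

tail -n 3 "$log"            # show only the last 3 lines of the file
last=$(tail -n 1 "$log")    # capture just the final line
```

On the cluster, <tt>tail -n 20 algae.output.txt</tt> would show the last 20 lines of the real log, and <tt>tail -f algae.output.txt</tt> keeps printing new lines as they are written (press Ctrl-c to stop watching).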

==== Using Cyberduck to download the log file and the tree file ====

When PAUP* finishes, squeue will no longer list your job. At this point, use Cyberduck to copy the log and tree files that PAUP* saved back to your local computer. Assuming you left Cyberduck open and connected to the cluster, double-click on the pauprun directory and locate the files <tt>algae.ml.tre</tt> and <tt>algae.output.txt</tt>. Select these two files and drag them out of the Cyberduck window and drop them on your desktop. After a flurry of activity, Cyberduck should report that the two files were downloaded successfully, at which point you can close the download status window.

==== Using scp to download the log file and the tree file ====

You can also use scp to copy the log and tree files back to your local computer, but, again, if you are happy with Cyberduck you can skip this section. In the Terminal app on your Mac (or the Git for Windows BASH session on your Windows PC), type the following (being careful to separate the final dot character from everything else by a blank space):
scp xfer:pauprun/algae.output.txt .
scp xfer:pauprun/algae.ml.tre .
This assumes you have set up a shortcut: if not, you will need to use the longer version below (being sure to replace <tt>username</tt> with your own user name on the cluster):
scp username@transfer.cam.uchc.edu:pauprun/algae.output.txt .
scp username@transfer.cam.uchc.edu:pauprun/algae.ml.tre .

These scp commands copy the files <tt>algae.output.txt</tt> and <tt>algae.ml.tre</tt> to your current directory (this is what the single dot at the end of each line stands for).

=== Using FigTree to view tree files ===
[[File:AlgaeMLtree.png|right|400px|thumb|<tt>algae.ml.tre</tt> file viewed with FigTree]]
If you do not already have it, download and install the [http://tree.bio.ed.ac.uk/software/figtree/ FigTree] application on your laptop. FigTree is a Java application, so you will also need to install a Java Runtime Environment (JRE) if you don't already have one (just start FigTree and it will tell you if it cannot find a JRE). Once FigTree is running, choose File > Open from the menu to open the <tt>algae.ml.tre</tt> file.

==== Adjusting taxon label font ====

The first thing you will probably want to do is make the taxon labels larger or change the font. Expand the Tip Labels section on the left and play with the Font Size up/down control. You can also set font details in the preferences, which will save you a lot of time in the future.

==== Line thickness ====

You can modify the thickness of the lines used by FigTree to draw the edges of the tree by expanding the Appearance tab.

==== Ladderization ====

You can ladderize the tree (make it appear to flow one way or the other) by playing with the Order Nodes option in the Trees tab.

==== Export tree as PDF ====

There are many other options that you can discover in FigTree, but one more thing to try today is to save the tree as a PDF file. Once you have the tree looking just the way you want, choose File > Export PDF...

==== Why have PAUP* create the log file algae.output.txt? ====

In your pauprun directory, SLURM saved the output that PAUP* normally sends to the console to a file named slurm-645170.out (your file will have a different number, however). You will not need this file after the run: the log command in your paup block ends up saving the same output in the file algae.output.txt. Why did we tell PAUP* to start a log file if SLURM was going to save the output anyway? The main reason is that you can view the log file during the run, but you cannot view slurm-645170.out until the run is finished. There will come a day when you have a PAUP* run that has been going for several days and want to know whether it is 10% or 90% finished. At this point you will appreciate being able to view the output file!

==== Delete the slurm-xxxx.out file using the rm command ====

Because you do not need the slurm-xxxx.out file (where xxxx is a placeholder for the job number), delete it using the rm command (the -f stands for force; i.e. don't ask if it is ok, just do it!):
cd
cd pauprun
rm -f slurm-*.out
You also no longer need the log and tree files because you downloaded them to your local computer using Cyberduck or scp:
rm -f algae.ml.tre
rm -f algae.output.txt
It is a good idea to delete files you no longer need for two reasons:
* you will later wonder whether you downloaded those files to your local machine and will have to spend time making sure you actually have saved the results locally
* our cluster only has so much disk space, and thus it is just not possible for everyone to keep every file they ever created

=== Tips and tricks ===

Here are some miscellaneous tips and tricks to make your life easier when communicating with the cluster.

==== Command completion using the tab key ====

You can often get away with only typing the first few letters of a filename; try pressing the Tab key after the first few letters and the command interpreter will try to complete the thought. For example, cd into the pauprun directory, then type
cat alg<TAB>
If algae.nex is the only file in the directory in which the first three letters are alg, then the command interpreter will type in the rest of the file name for you.

==== Wildcards ====

I've already mentioned this tip, but it bears repeating. When using most UNIX commands that accept filenames (e.g. ls, rm, mv, cp), you can place an asterisk inside the filename to stand in for any number of characters. So
ls algae*
will produce output like this
algae.ml.tre algae.nex algae.output.txt
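
Because the shell expands a wildcard before the command ever sees it, you can preview what a pattern matches by putting echo in front of the command. The sketch below tries this in a scratch directory filled with empty stand-in files (the file names mirror the ones used in this exercise):

```shell
#!/bin/sh
# Try wildcards in a scratch directory filled with empty stand-in files
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
touch algae.nex algae.ml.tre algae.output.txt run.nex slurm-645170.out

ls algae*                   # lists the three files whose names start with algae
echo rm -f slurm-*.out      # prints: rm -f slurm-645170.out (nothing is deleted)

# Count the files matching algae* (tr strips padding that some wc versions add)
matches=$(ls algae* | wc -l | tr -d ' ')
```

Previewing with echo is a cheap habit that is especially worthwhile before any rm command that uses a wildcard.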

==== Man pages ====

If you want to learn more options for any of the UNIX commands, you can use the man command to see the manual for that command. For example, here's how to see the manual describing the ls command:
man ls
It is important to know how to escape from a man page! The way to get out is to type the letter q. You can page down using Ctrl-f, page up through a man page using Ctrl-b, go to the end using Shift-g and return to the very beginning using 1,Shift-g (that is, type a 1, release it, then type Shift-g). You can also move line by line in a man page using the down and up arrows, and page by page using the PgUp and PgDn keys.

==== Xanadu information ====

You can find a lot of great information about the Xanadu cluster at the [https://bioinformatics.uconn.edu/ Computational Biology Core] web site. In particular, take time to look through the first tutorial named "Understanding the UConn Health Cluster (Xanadu)".

[[Category: Phylogenetics]]
__NOTOC__

Phylogenetics: Xanadu Cluster

2022-01-13T16:26:44Z

Paul Lewis: /* Using Cyberduck to upload the algae.nex data file */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to show you how to log into the [https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/xanadu/ Xanadu] computer cluster and perform a basic PAUP* analysis.
|}

= Using the UConn Xanadu cluster =
The UConn [http://bioinformatics.uconn.edu Computational Biology Core] is part of the [http://cgi.uconn.edu/ Center for Genome Innovation (CGI)]. We will use the Xanadu computing cluster located at the UConn Health Center for most of the data crunching we will do in this course. By now, you should have an account on the cluster, and today you will learn how to start analyses remotely (i.e. from your laptop), check on their status, and download the results when your analysis is finished.

== Obtaining the necessary communications software ==
You will be using a couple of simple (and free) programs to communicate with the head node of the cluster.

=== If you use Windows...please scroll down to the Windows section===
=== If you use MacOS 10.x... ===
==== SSH ====
The program '''ssh''' will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use ssh to send commands to the cluster and see the output generated.

Start by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive. Using the Terminal program, you can connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where username should be replaced by your username on the cluster.

You may wish to install [http://www.iterm2.com/ iTerm2], which is a terminal program that makes some things easier than Terminal, but the built-in Terminal will work just fine.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Directories with names beginning with a period are not usually shown, but you can open this directory in Finder by typing
cd
open .ssh
in your terminal (the initial cd command changes the directory to your default, a.k.a. home, directory).

If you do not see a file named "config" in your ".ssh" directory, create an empty config file by typing the following command from your home directory:
touch .ssh/config
Open the config file in a text editor such as [https://www.barebones.com BBEdit] (NOT a word processor such as Microsoft Word or Pages!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/Users/plewis'' with your home directory path on your Mac, which you can get by typing <tt>pwd</tt> at the command line):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
Once you save the config file, you should be able to just type
ssh xanadu
to log in to the xanadu cluster. (The second entry (xfer) will be used when transferring files to and from the cluster.)
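If you would rather add the shortcut from the command line, the following is a minimal sketch of one way to do it. It assumes a scratch directory (<tt>SSH_DIR</tt>, a hypothetical stand-in for your real ''.ssh'' directory) so that nothing on your machine is modified, and it writes only the <tt>HostName</tt> and <tt>User</tt> lines; the other lines from the listing above would be added the same way. The function name <tt>add_shortcut</tt> is my own invention for illustration.

```shell
#!/bin/bash
# Sketch: add the "xanadu" shortcut to an ssh config file only if it is missing.
# SSH_DIR is a temporary stand-in for $HOME/.ssh so nothing real is modified.
SSH_DIR=$(mktemp -d)
CONFIG="$SSH_DIR/config"
CLUSTER_USER="plewis"   # replace with your own username on the cluster

add_shortcut() {
    # Create the file if needed, then append the entry unless it already exists.
    touch "$CONFIG"
    if ! grep -q '^host xanadu$' "$CONFIG"; then
        {
            echo "host xanadu"
            echo "    HostName xanadu-submit-ext.cam.uchc.edu"
            echo "    User $CLUSTER_USER"
        } >> "$CONFIG"
    fi
    # ssh refuses to use a config file that other users can write to.
    chmod 600 "$CONFIG"
}

add_shortcut
add_shortcut   # calling it twice must not duplicate the entry
entries=$(grep -c '^host xanadu$' "$CONFIG")
echo "xanadu entries: $entries"
cat "$CONFIG"
```

To use this for real, you would point <tt>SSH_DIR</tt> at <tt>$HOME/.ssh</tt> instead of a scratch directory.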

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). I will show you how to transfer files using both methods, but for now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, but you might find that the command line clients let you get your work done faster once you get used to using them.

=== If you use Windows... ===
==== SSH ====
The program '''Git for Windows''' provides a terminal that will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). We will not actually use the "git" part of "Git for Windows", although it is there in case you need it later. Instead, you will use the Git for Windows bash shell to send commands to the cluster and see the output generated.

Visit the [https://gitforwindows.org Git for Windows] web site, press the Download button, and install Git for Windows on your system. Once installed, open Git BASH from the All Programs section of the start menu. This will open a terminal running the bash shell (a shell is a program that interprets operating system control commands) on your desktop.

Connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where ''username'' should be replaced by your username on the cluster.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Open (or create, if it does not yet exist) the file named ''config'' in a text editor such as [https://notepad-plus-plus.org NotePad++] (NOT a word processor such as Microsoft Word!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/c/Users/Paul Lewis'' with your home directory path):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
Use the <tt>pwd</tt> command to find out what your home directory path is, and use double quotes if your home directory path contains embedded spaces (note that I had to use quotes for mine).

Once you save the config file, you should be able to just type
ssh xanadu
to log in to the xanadu cluster. (The second entry (xfer) will be used for transferring files to and from Xanadu.)
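The need for quotes around paths with embedded spaces also applies to commands you type in the bash shell itself. Here is a small sketch showing what goes wrong without them. It uses a scratch directory (<tt>WORK</tt>, a hypothetical stand-in for <tt>/c/Users</tt>) and an empty placeholder file, so it is safe to run anywhere.

```shell
#!/bin/bash
# Sketch: why paths containing spaces need quotes.
# WORK is a temporary stand-in for /c/Users so nothing real is touched.
WORK=$(mktemp -d)
mkdir -p "$WORK/Paul Lewis/.ssh"
touch "$WORK/Paul Lewis/.ssh/id_rsa"   # empty placeholder, not a real key

# Quoted: the shell passes the whole path as one argument.
if [ -f "$WORK/Paul Lewis/.ssh/id_rsa" ]; then
    quoted="found"
else
    quoted="missing"
fi

# Unquoted: the path is split at the space into two arguments,
# so ls looks for two things that do not exist.
unquoted_path="$WORK/Paul Lewis/.ssh/id_rsa"
if ls $unquoted_path > /dev/null 2>&1; then
    unquoted="found"
else
    unquoted="missing"
fi

echo "quoted: $quoted, unquoted: $unquoted"
rm -rf "$WORK"
```

The quoted test finds the file, whereas the unquoted version fails because the shell treats the text before and after the space as two separate names.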

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). For now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, which makes moving files back and forth easy.

=== Learning enough UNIX to get around ===
I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.

==== ls command: finding out what is in the present working directory ====
The '''ls''' command lists the files in the ''present working directory''. Try typing just
ls
If you need more details about files than you see here, type
ls -la
instead. This version provides information about file permissions, ownership, size, and last modification date.

==== pwd command: finding out what directory you are in ====
Typing
pwd
shows you the full path of the present working directory. The path shown should end with your username, indicating that you are currently in your home directory.

==== mkdir command: creating a new directory ====
Typing the following command will create a new directory named <tt>pauprun</tt> in your home directory:
mkdir pauprun
Use the <tt>ls</tt> command now to make sure a directory of that name was indeed created.

==== cd command: leaving the nest and returning home again ====
The cd command lets you change the present working directory. To move into the newly-created <tt>pauprun</tt> directory, type
cd pauprun
You can always go back to your home directory (no matter how lost you get!) by typing just cd by itself
cd
If you want to go up one directory level (say from pauprun back up to your home directory), you can specify the parent directory using two dots:
cd ..
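If you would like to practice these commands somewhere harmless first, here is a minimal sketch that strings together mkdir, cd, and pwd inside a scratch directory (<tt>SCRATCH</tt>, a hypothetical stand-in for your home directory on the cluster).

```shell
#!/bin/bash
# Sketch: mkdir, cd, pwd and "cd .." practised inside a scratch directory.
# SCRATCH stands in for your home directory on the cluster.
SCRATCH=$(mktemp -d)
cd "$SCRATCH" || exit 1

mkdir pauprun                 # create the directory
cd pauprun || exit 1          # move into it
inside=$(basename "$(pwd)")   # last component of the present working directory

cd ..                         # two dots = parent directory, one level up
parent=$(basename "$(pwd)")
expected=$(basename "$SCRATCH")

echo "inside: $inside"
echo "after cd ..: $parent (expected $expected)"
```

The <tt>basename</tt> command simply strips everything but the last component of the path reported by pwd, which makes it easy to see which directory you ended up in.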

==== Creating run.nex using the nano editor ====

One way to create a new file, or edit one that already exists, is to use the nano editor. You will now use nano to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.

First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type
nano run.nex
This will open the nano editor, and it should say <nowiki>[ New File ]</nowiki> at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit nano.

For now, type the following into the editor:
#nexus

begin paup;
log file=algae.output.txt start replace flush;
execute algae.nex;
set criterion=likelihood autoclose;
lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
hsearch swap=none start=stepwise addseq=random nrep=1;
lscores 1;
lset basefreq=previous tratio=previous shape=previous;
hsearch swap=tbr start=1;
savetrees file=algae.ml.tre brlens;
log stop;
quit;
end;
Once you have entered everything, use ^X to exit. Nano will ask if you want to save the modified '''buffer''' (a ''buffer'' is a predefined amount of computer memory used to store the text you type; the text stored in the buffer will be lost once you exit the program unless you save it to a file on the hard drive), at which point you should press the Y key to answer yes. Nano will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Nano should now exit and you can use cat to look at the contents of the file you just created:
cat run.nex
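Nano is not the only way to create a small text file. As a sketch of an alternative, a bash "here document" writes everything between the two <tt>EOF</tt> markers into a file in one step. The file name <tt>demo.nex</tt> and the shortened paup block below are placeholders of my own (so that your real run.nex is not overwritten), and <tt>WORK</tt> is a scratch directory.

```shell
#!/bin/bash
# Sketch: create a small NEXUS file without opening an editor.
# WORK is a scratch directory; on the cluster you would cd into pauprun instead.
WORK=$(mktemp -d)
cd "$WORK" || exit 1

# The quoted 'EOF' delimiter stops the shell from expanding anything in the text.
cat > demo.nex <<'EOF'
#nexus

begin paup;
  log file=algae.output.txt start replace flush;
  execute algae.nex;
  quit;
end;
EOF

lines=$(wc -l < demo.nex | tr -d ' ')
first=$(head -n 1 demo.nex)
echo "demo.nex has $lines lines; first line: $first"
cat demo.nex
```

This approach is handy inside scripts, where no one is available to type into an editor.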

==== Create the gopaup file ====
Now use nano to create a second file named <tt>gopaup</tt> in your <tt>pauprun</tt> directory. To do this, type <tt>pwd</tt> to make sure you are in the <tt>pauprun</tt> directory, then type <tt>nano gopaup</tt>. This file should contain this text:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex

=== Using Cyberduck to upload the algae.nex data file ===
[[File:Cyberduck-bookmark-xanadu.png|right]]
Download the file <tt>algae.nex</tt> from [http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex here] and save it on your hard drive.

Open Cyberduck, choose Bookmark > New Bookmark from the main menu, then fill out the resulting dialog box as shown on the right (except substitute your own user name, of course). Be sure to change the protocol to SFTP (not the default FTP). Click the button to close the dialog box and you should see your bookmark appear at the bottom of the main window. Double click the bookmark to open a connection. You will then be warned that the host key is unknown - choose Allow (and go ahead and check the Always button so you do not need to do this every time).

Once you are in, you will see a listing of the files in your home directory on the cluster. To copy the <tt>algae.nex</tt> file to the cluster, you need only drag-and-drop it onto the Cyberduck window.

=== Using scp to upload the algae.nex data file ===

While you will probably want to do your file transfers with Cyberduck as described above, it is also possible to transfer files using the command line scp client. Read on if you are interested in this option, but feel free to skip this section if you are happy using Cyberduck.

In your terminal, navigate to where you saved the file. If you saved it on the desktop, you can go there by typing <tt>cd Desktop</tt>.

If you've made a shortcut in your ''.ssh/config'' file, you can use the following command to upload the ''algae.nex'' file:
scp algae.nex xfer:

If you have not made a shortcut, use this command instead:
scp algae.nex username@transfer.cam.uchc.edu:
where <tt>username</tt> should be replaced by your own user name on the cluster. (Don't overlook the colon on the very end of the line!)

=== A few more UNIX commands ===

You have now transferred a data file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file is currently in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains a line containing the command <tt>execute algae.nex</tt>, which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:
cd $HOME
ls algae.*
Note the use of a wildcard character (*) in the ls command. This will show you only files that begin with the letters <tt>algae</tt> followed by a period and any number of other non-whitespace characters. The <tt>$HOME</tt> is a predefined shell variable that will be replaced with your home directory. It is not necessary in this case - typing <tt>cd</tt> all by itself would take you to your home directory - but the <tt>$HOME</tt> variable is good to know about (especially for use in scripts).

==== mv command: moving or renaming a file ====
Now use the mv command to move algae.nex to the directory pauprun:
mv algae.nex pauprun
The mv command takes two arguments. The first argument is the name of the directory or file you want to move, whereas the second argument is the destination. The destination could be either a directory (which is true in this case) or a file name. If the directory pauprun did not already exist, mv would have interpreted this as a request to rename algae.nex to the file name pauprun! So, be aware that mv can rename files as well as move them.
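The two behaviors of mv are easy to confuse, so here is a minimal sketch that demonstrates both in a scratch directory (<tt>WORK</tt>). The names <tt>copy.nex</tt> and <tt>results</tt> are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Sketch: mv moves a file into an existing directory, but renames it otherwise.
WORK=$(mktemp -d)
cd "$WORK" || exit 1

# Case 1: the destination directory exists, so the file is moved into it.
mkdir pauprun
touch algae.nex
mv algae.nex pauprun
[ -f pauprun/algae.nex ] && case1="moved into directory"

# Case 2: no directory named "results" exists, so the file is renamed.
touch copy.nex
mv copy.nex results
if [ -f results ] && [ ! -d results ]; then
    case2="renamed to a file"
fi

echo "case 1: $case1"
echo "case 2: $case2"
```

If you are ever unsure which of the two will happen, run <tt>ls -la</tt> first to check whether the destination already exists as a directory.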

==== cp command: copying a file ====
The cp command copies files. It leaves the original file in place and makes a copy elsewhere. You could have used this command to get a copy of algae.nex into the directory pauprun:
cp algae.nex pauprun
This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.

==== rm command: cleaning up ====
The rm command removes files. If you had used the cp command to copy algae.nex into the pauprun directory, you could remove the original file using these commands:
cd
rm algae.nex
The first cd command just ensures that the copy you are removing will be the one in your home directory (typing <tt>cd</tt> by itself acts the same as typing <tt>cd $HOME</tt>).
To delete an entire directory (don't try this now!), you can add the -rf flags. The r flag tells rm to recursively apply the remove command to everything in every subdirectory, while the f flag means force (don't ask whether each file should be deleted in each subdirectory, just do it!):
rm -rf pauprun
The above command would remove everything in the pauprun directory (without asking!), and then remove the pauprun directory itself. I want to stress that this is a particularly dangerous command, so make sure you are not distracted or sleep-deprived when you use it! Unlike the Windows or Mac graphical user interface, files deleted using rm are '''not''' moved to the Recycle Bin or Trash, they are just gone. There is '''no undo''' for the rm command.
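One habit that can save you from disaster is to list what is about to be deleted before you delete it. The sketch below does this in a scratch directory (<tt>WORK</tt>) containing a made-up pauprun directory, so it is safe to run; the file <tt>notes.txt</tt> and subdirectory <tt>old</tt> are hypothetical.

```shell
#!/bin/bash
# Sketch: preview the contents of a directory tree before removing it.
WORK=$(mktemp -d)
cd "$WORK" || exit 1
mkdir -p pauprun/old
touch pauprun/run.nex pauprun/old/notes.txt

# Step 1: list everything that rm -rf would delete.
doomed=$(find pauprun | sort)
count=$(printf '%s\n' "$doomed" | wc -l | tr -d ' ')
echo "About to delete $count items:"
printf '%s\n' "$doomed"

# Step 2: only now remove the directory tree.
rm -rf pauprun
if [ -e pauprun ]; then
    status="still there"
else
    status="gone"
fi
echo "pauprun is $status"
```

The <tt>find</tt> command lists the directory itself plus everything beneath it, which is exactly the set of things that <tt>rm -rf</tt> will remove.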

=== Starting a PAUP* analysis ===

If you've been following the directions in sequence, you now have three files (algae.nex, run.nex, and gopaup) in your <tt>$HOME/pauprun</tt> directory on the cluster. Use the cd command to make sure you are in the pauprun directory, then the cat command to look at the contents of the gopaup file you created earlier. You should see this:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex

This file will be used by software called SLURM to start your run. SLURM provides a command called <tt>sbatch</tt> that you will use to submit your analysis. The SLURM <tt>sbatch</tt> command will look for a core (i.e. processor) on a node (i.e. machine) in the cluster that is currently not being used and will start your analysis on that node. This saves you the effort of looking amongst all nodes in the cluster for a core that is not busy.

Here is an explanation of each of the lines in gopaup:
* The 1st line specifies the command interpreter to use (just include this in your scripts verbatim).
* The 2nd, 3rd, and 4th lines begin with #SBATCH and are interpreted as commands by SLURM itself. In this case, the first and second #SBATCH commands tell SLURM to use the general partition (--partition=general) and the general quality of service (--qos=general). You should always include these two lines verbatim. The last #SBATCH line gives a name to your job (--job-name=pauprun). You could change "pauprun" here to something else, but keep your job names short and without embedded spaces or punctuation. The job name will help you identify your run when checking status.
* The 5th line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
* The 6th line informs the system that you want to use a particular version of paup. If you left this line out, the command on the last line might not work at all, or might run an older version of paup. You can get a list of all available modules using the command "module avail".
* The 7th and last line starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.
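Before submitting a script like gopaup, it can be worth checking it for typos. The following is a rough sketch of one way to do that: it recreates the gopaup file in a scratch directory (<tt>WORK</tt>), asks bash to parse it without running anything, and then counts the #SBATCH lines. This only catches shell syntax errors; it cannot tell you whether SLURM will accept the partition or qos names.

```shell
#!/bin/bash
# Sketch: sanity-check a SLURM script before submitting it.
WORK=$(mktemp -d)
cd "$WORK" || exit 1

cat > gopaup <<'EOF'
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex
EOF

# bash -n parses the script without running any of its commands.
if bash -n gopaup; then
    syntax="ok"
else
    syntax="error"
fi

# Count the directives SLURM will read (lines starting with #SBATCH).
directives=$(grep -c '^#SBATCH' gopaup)
jobname=$(sed -n 's/^#SBATCH --job-name=//p' gopaup)

echo "syntax: $syntax, directives: $directives, job name: $jobname"
```

On the cluster you would skip the part that creates the file and just run <tt>bash -n gopaup</tt> in your pauprun directory.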

==== Submitting a job using sbatch ====
Now you are ready to start the analysis. Type these commands to start your run:
cd
cd pauprun
sbatch gopaup

==== Checking status using squeue ====
You can see if your run is still going using the squeue command:
squeue
If it is running, you will see an entry named pauprun. Here is what it looked like for me:
hpc-ext-2 pauprun $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)
The PD under ST (state) means that my job is pending (not yet running). This job goes so fast that you will be lucky to find it in the running state. If you see no jobs listed when you run squeue, it means your job has finished.
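If you ever want a script to check on a job for you, the state can be pulled out of the squeue listing with awk. Here is a minimal sketch that works on a copy of the listing shown above (stored in the variable <tt>sample_output</tt>, since squeue is only available on the cluster). It assumes the default column layout, in which the job name is column 3 and the state is column 5.

```shell
#!/bin/bash
# Sketch: pick out the state of one job from squeue-style output.
# sample_output copies the listing above; on the cluster you would
# pipe the real squeue output into awk instead.
sample_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)'

# Column 3 is the job name and column 5 is the state (ST).
state=$(printf '%s\n' "$sample_output" | awk '$3 == "pauprun" { print $5 }')

case "$state" in
    PD) meaning="pending" ;;
    R)  meaning="running" ;;
    "") meaning="not listed (finished or never submitted)" ;;
    *)  meaning="other state: $state" ;;
esac
echo "pauprun is $meaning"
```

On a busy cluster you can also limit the listing to your own jobs with <tt>squeue -u username</tt>, replacing <tt>username</tt> with your user name on the cluster.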

==== Killing a job using scancel ====

Sometimes it is clear that an analysis is not going to do what you wanted it to. Suppose that just after you press the Enter key to start an analysis, you realize that you forgot to put in a savetrees command in your paup block (so in the end you will not be able to see the results of the search). In such situations, you really want to just kill the job, fix the problem, and then start it up again. Use the scancel command for this. Note that in the output of the squeue command above, my run had a job-ID equal to 645170. I could kill the job like this:
scancel 645170
Be sure to delete any output files that have already been created before starting your run over again.
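You do not have to run squeue just to find the job-ID. When a submission succeeds, sbatch normally prints a line of the form "Submitted batch job" followed by the job number. The sketch below assumes that message format and uses a made-up copy of it (<tt>sbatch_message</tt>) to show how the number can be extracted in bash.

```shell
#!/bin/bash
# Sketch: extract the job-ID from the message printed by sbatch.
# sbatch_message mimics the line sbatch normally prints after a
# successful submission (an assumption; check what your cluster prints).
sbatch_message="Submitted batch job 645170"

# Strip everything up to and including the last space, leaving the number.
jobid=${sbatch_message##* }

echo "To cancel this job you would type: scancel $jobid"
```

On the cluster, you could capture the real message using <tt>sbatch_message=$(sbatch gopaup)</tt>.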

==== While PAUP* is running ====

While PAUP* is running, you can use cat to look at the log file:
cd pauprun
cat algae.output.txt
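For a long run, the log file can become large, and cat will scroll the whole thing past you. The tail command shows just the end of a file. Here is a small sketch using a stand-in log file with ten made-up lines, created in a scratch directory (<tt>WORK</tt>).

```shell
#!/bin/bash
# Sketch: look at only the end of a log file using tail.
WORK=$(mktemp -d)
cd "$WORK" || exit 1

# Build a stand-in log file with ten numbered lines.
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "log line $i"
done > algae.output.txt

# tail -n 3 prints only the last three lines of the file.
latest=$(tail -n 3 algae.output.txt)
last=$(tail -n 1 algae.output.txt)
echo "$latest"
```

Typing <tt>tail -f algae.output.txt</tt> instead keeps the file open and prints new lines as they are written; press Ctrl-c to stop watching.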

==== Using Cyberduck to download the log file and the tree file ====

When PAUP* finishes, squeue will no longer list your process. At this point, you need to use Cyberduck to get the log and tree files that were saved back to your local computer. Assuming you left Cyberduck open and connected to the cluster, double-click on the pauprun directory and locate the files <tt>algae.ml.tre</tt> and <tt>algae.output.txt</tt>. Select these two files and drag them out of the Cyberduck window and drop them on your desktop. After a flurry of activity, Cyberduck should report that the two files were downloaded successfully, at which point you can close the download status window.

==== Using scp to download the log file and the tree file ====

You can also use scp to get the log and tree files that were saved back to your local computer, but, again, if you are happy with Cyberduck you can skip this section. In the Terminal app on your Mac (or the Git for Windows BASH session on your Windows PC), type the following (being careful to separate the final dot character from everything else by a blank space):
scp xfer:pauprun/algae.output.txt .
scp xfer:pauprun/algae.ml.tre .
This assumes you have set up a shortcut: if not, you will need to use the longer version below (being sure to replace <tt>username</tt> with your own user name on the cluster):
scp username@transfer.cam.uchc.edu:pauprun/algae.output.txt .
scp username@transfer.cam.uchc.edu:pauprun/algae.ml.tre .

These scp commands copy the files <tt>algae.output.txt</tt> and <tt>algae.ml.tre</tt> to your current directory (this is what the single dot at the end of each line stands for).

=== Using FigTree to view tree files ===
[[File:AlgaeMLtree.png|right|400px|thumb|<tt>algae.ml.tre</tt> file viewed with FigTree]]
If you do not already have it, download and install the [http://tree.bio.ed.ac.uk/software/figtree/ FigTree] application on your laptop. FigTree is a Java application, so you will also need to install a Java Runtime Environment (JRE) if you don't already have one (just start FigTree and it will tell you if it cannot find a JRE). Once FigTree is running, choose File > Open from the menu to open the <tt>algae.ml.tre</tt> file.

==== Adjusting taxon label font ====

The first thing you will probably want to do is make the taxon labels larger or change the font. Expand the Tip Labels section on the left and play with the Font Size up/down control. You can also set font details in the preferences, which will save you a lot of time in the future.

==== Line thickness ====

You can modify the thickness of the lines used by FigTree to draw the edges of the tree by expanding the Appearance tab.

==== Ladderization ====

You can ladderize the tree (make it appear to flow one way or the other) by playing with the Order Nodes option in the Trees tab.

==== Export tree as PDF ====

There are many other options that you can discover in FigTree, but one more thing to try today is to save the tree as a PDF file. Once you have the tree looking just the way you want, choose File > Export PDF...

==== Why have PAUP* create the log file algae.output.txt? ====

In your pauprun directory, SLURM saved the output that PAUP* normally sends to the console to a file named slurm-645170.out (your file will have a different number, however). You will not need this file after the run: the log command in your paup block ends up saving the same output in the file algae.output.txt. Why did we tell PAUP* to start a log file if SLURM was going to save the output anyway? The main reason is that you can view the log file during the run, but you cannot view slurm-645170.out until the run is finished. There will come a day when you have a PAUP* run that has been going for several days and want to know whether it is 10% or 90% finished. At this point you will appreciate being able to view the output file!

==== Delete the slurm-xxxx.out file using the rm command ====

Because you do not need the slurm-xxxx.out file (where xxxx is a placeholder for the job number), delete it using the rm command (the -f stands for force; i.e. don't ask if it is ok, just do it!):
cd
cd pauprun
rm -f slurm-*.out
You also no longer need the log and tree files because you downloaded them to your local computer using Cyberduck (or scp):
rm -f algae.ml.tre
rm -f algae.output.txt
It is a good idea to delete files you no longer need for two reasons:
* you will later wonder whether you downloaded those files to your local machine and will have to spend time making sure you actually have saved the results locally
* our cluster only has so much disk space, and thus it is just not possible for everyone to keep every file they ever created

=== Tips and tricks ===

Here are some miscellaneous tips and tricks to make your life easier when communicating with the cluster.

==== Command completion using the tab key ====

You can often get away with only typing the first few letters of a filename; try pressing the Tab key after the first few letters and the command interpreter will try to complete the thought. For example, cd into the pauprun directory, then type
cat alg<TAB>
If algae.nex is the only file in the directory in which the first three letters are alg, then the command interpreter will type in the rest of the file name for you.

==== Wildcards ====

I've already mentioned this tip, but it bears repeating. When using most UNIX commands that accept filenames (e.g. ls, rm, mv, cp), you can place an asterisk inside the filename to stand in for any number of characters. So
ls algae*
will produce output like this
algae.ml.tre algae.nex algae.output.txt
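The asterisk can go anywhere in the name, not just at the end. As a quick sketch, the commands below create five empty files in a scratch directory (<tt>WORK</tt>) and then use echo to show which names each pattern matches.

```shell
#!/bin/bash
# Sketch: see which file names a wildcard pattern matches.
WORK=$(mktemp -d)
cd "$WORK" || exit 1
touch algae.nex algae.ml.tre algae.output.txt run.nex gopaup

# algae* matches every name that starts with "algae".
all_algae=$(echo algae*)

# *.nex matches every name that ends with ".nex".
all_nex=$(echo *.nex)

echo "algae*: $all_algae"
echo "*.nex:  $all_nex"
```

Using echo with a pattern is a safe way to preview what a wildcard will match before handing the same pattern to a command like rm.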

==== Man pages ====

If you want to learn more options for any of the UNIX commands, you can use the man command to see the manual for that command. For example, here's how to see the manual describing the ls command:
man ls
It is important to know how to escape from a man page! The way to get out is to type the letter q. You can page down using Ctrl-f, page up through a man page using Ctrl-b, go to the end using Shift-g and return to the very beginning using 1,Shift-g (that is, type a 1, release it, then type Shift-g). You can also move line by line in a man page using the down and up arrows, and page by page using the PgUp and PgDn keys.

==== Xanadu information ====

You can find a lot of great information about the Xanadu cluster at the [https://bioinformatics.uconn.edu/ Computational Biology Core] web site. In particular, take time to look through the first tutorial named "Understanding the UConn Health Cluster (Xanadu)".

[[Category: Phylogenetics]]
__NOTOC__

Phylogenetics: Xanadu Cluster

2022-01-13T16:06:32Z

Paul Lewis:

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to show you how to log into the [https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/xanadu/ Xanadu] computer cluster and perform a basic PAUP* analysis.
|}

= Using the UConn Xanadu cluster =
The UConn [http://bioinformatics.uconn.edu Computational Biology Core] is part of the [http://cgi.uconn.edu/ Center for Genome Innovation (CGI)]. We will use the Xanadu computing cluster located at the UConn Health Center for most of the data crunching we will do in this course. By now, you should have an account on the cluster, and today you will learn how to start analyses remotely (i.e. from your laptop), check on their status, and download the results when your analysis is finished.

== Obtaining the necessary communications software ==
You will be using a couple of simple (and free) programs to communicate with the head node of the cluster.

=== If you use Windows...please scroll down to the Windows section===
=== If you use MacOS 10.x... ===
==== SSH ====
The program '''ssh''' will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use ssh to send commands to the cluster and see the output generated.

Start by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive. Using the Terminal program, you can connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where username should be replaced by your username on the cluster.

You may wish to install [http://www.iterm2.com/ iTerm2], which is a terminal program that makes some things easier than Terminal, but the built-in Terminal will work just fine.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Directories with names beginning with a period are not usually shown, but you can open this directory in Finder by typing
cd
open .ssh
in your terminal (the initial cd command changes the directory to your default, a.k.a. home, directory).

If you do not see a file named "config" in your ".ssh" directory, create an empty config file by typing the following command from your home directory:
touch .ssh/config
Open the config file in a text editor such as [https://www.barebones.com BBEdit] (NOT a word processor such as Microsoft Word or Pages!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/Users/plewis'' with your home directory path on your Mac, which you can get by typing <tt>pwd</tt> at the command line):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
Once you save the config file, you should be able to just type
ssh xanadu
to log in to the xanadu cluster. (The second entry (xfer) will be used when transferring files to and from the cluster.)

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). I will show you how to transfer files using both methods, but for now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, but you might find that the command line clients let you get your work done faster once you get used to using them.

=== If you use Windows... ===
==== SSH ====
The program '''Git for Windows''' provides a terminal that will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). We will not actually use the "git" part of "Git for Windows", although it is there in case you need it later. Instead, you will use the Git for Windows bash shell to send commands to the cluster and see the output generated.

Visit the [https://gitforwindows.org Git for Windows] web site, press the Download button, and install Git for Windows on your system. Once installed, open Git BASH from the All Programs section of the start menu. This will open a terminal running the bash shell (a shell is a program that interprets operating system control commands) on your desktop.

Connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where ''username'' should be replaced by your username on the cluster.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Open (or create, if it does not yet exist) the file named ''config'' in a text editor such as [https://notepad-plus-plus.org NotePad++] (NOT a word processor such as Microsoft Word!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/c/Users/Paul Lewis'' with your home directory path):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
Use the <tt>pwd</tt> command to find out what your home directory path is, and use double quotes if your home directory path contains embedded spaces (note that I had to use quotes for mine).

Once you save the config file, you should be able to just type
ssh xanadu
to log in to the xanadu cluster. (The second entry (xfer) will be used for transferring files to and from Xanadu.)

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). For now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, which makes moving files back and forth easy.

=== Learning enough UNIX to get around ===
I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.

==== ls command: finding out what is in the present working directory ====
The '''ls''' command lists the files in the ''present working directory''. Try typing just
ls
If you need more details about files than you see here, type
ls -la
instead. This version provides information about file permissions, ownership, size, and last modification date.

==== pwd command: finding out what directory you are in ====
Typing
pwd
shows you the full path of the present working directory. The path shown should end with your username, indicating that you are currently in your home directory.

==== mkdir command: creating a new directory ====
Typing the following command will create a new directory named <tt>pauprun</tt> in your home directory:
mkdir pauprun
Use the <tt>ls</tt> command now to make sure a directory of that name was indeed created.

==== cd command: leaving the nest and returning home again ====
The cd command lets you change the present working directory. To move into the newly-created <tt>pauprun</tt> directory, type
cd pauprun
You can always go back to your home directory (no matter how lost you get!) by typing cd by itself:
cd
If you want to go up one directory level (say from pauprun back up to your home directory), you can specify the parent directory using two dots:
cd ..
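
If you would like to rehearse these navigation commands without touching your real home directory, the following sketch does so in a throwaway directory. The <tt>sandbox</tt> variable and the temporary directory made by <tt>mktemp</tt> are my own assumptions for illustration; they are not part of the cluster setup.

```shell
# Practice mkdir, cd, and pwd inside a throwaway directory.
# "sandbox" is a hypothetical stand-in for your home directory.
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir pauprun                  # create the directory
cd pauprun                     # move into it
inside=$(basename "$(pwd)")    # name of the directory we are now in

cd ..                          # two dots = parent directory, so this moves back up
parent=$(basename "$(pwd)")

echo "was in: $inside, now in: $parent"

# Leave the sandbox and remove it.
cd /
rm -rf "$sandbox"
```

The first name printed should be pauprun, and the second should be the name of the temporary directory, showing that <tt>cd ..</tt> took you back to where you started.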

==== Creating run.nex using the nano editor ====

One way to create a new file, or edit one that already exists, is to use the nano editor. You will now use nano to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.

First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type
nano run.nex
This will open the nano editor, and it should say <nowiki>[ New File ]</nowiki> at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit nano.

For now, type the following into the editor:
#nexus

begin paup;
log file=algae.output.txt start replace flush;
execute algae.nex;
set criterion=likelihood autoclose;
lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
hsearch swap=none start=stepwise addseq=random nrep=1;
lscores 1;
lset basefreq=previous tratio=previous shape=previous;
hsearch swap=tbr start=1;
savetrees file=algae.ml.tre brlens;
log stop;
quit;
end;
Once you have entered everything, use ^X to exit. Nano will ask if you want to save the modified '''buffer''' (a ''buffer'' is a predefined amount of computer memory used to store the text you type; the text stored in the buffer will be lost once you exit the program unless you save it to a file on the hard drive), at which point you should press the Y key to answer yes. Nano will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Nano should now exit and you can use cat to look at the contents of the file you just created:
cat run.nex

==== Create the gopaup file ====
Now use nano to create a second file named <tt>gopaup</tt> in your <tt>pauprun</tt> directory. To do this, type <tt>pwd</tt> to make sure you are in the <tt>pauprun</tt> directory, then type <tt>nano gopaup</tt>. This file should contain this text:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex
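
If you would rather not type the script into nano, a here-document is another way to create the same file from the command line. This is only a sketch of an alternative, not a required step. The quoted <tt>'EOF'</tt> delimiter keeps <tt>$HOME</tt> from being expanded while the file is written, and the <tt>workdir</tt> variable is a hypothetical temporary directory used so that the example cannot overwrite a gopaup file you have already made.

```shell
# Work in a temporary directory (hypothetical; on the cluster you would be in pauprun).
workdir=$(mktemp -d)
cd "$workdir"

# Everything between the two EOF markers is written to the file verbatim.
cat > gopaup <<'EOF'
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex
EOF

# Display the file and count its lines as a quick check.
cat gopaup
nlines=$(wc -l < gopaup | tr -d ' ')
echo "gopaup has $nlines lines"
```

Whichever way you create the file, use <tt>cat gopaup</tt> afterwards to confirm that it contains exactly the seven lines shown.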

=== Using Cyberduck to upload the algae.nex data file ===
[[File:Cyberduck-bookmark-xanadu.png|right]]
Download the file <tt>algae.nex</tt> from [http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex here] and save it on your hard drive.

Open Cyberduck, choose Bookmark > New Bookmark from the main menu, then fill out the resulting dialog box as shown on the right (except substitute your own user name, of course). Be sure to change the protocol to SFTP (not the default FTP). Click the button to close the dialog box and you should see your bookmark appear at the bottom of the main window. Double click the bookmark to open a connection. You will then be warned that the host key is unknown: choose Allow (and go ahead and check the Always button so you do not need to do this every time).

Once you are in, you will see a listing of the files in your home directory (if you have any). To copy the <tt>algae.nex</tt> file to the cluster, you need only drag-and-drop it onto the Cyberduck window.

=== Using scp to upload the algae.nex data file ===

While you will probably want to do your file transfers with Cyberduck as described above, it is also possible to transfer files using the command line scp client. Read on if you are interested in this option, but feel free to skip this section if you are happy using Cyberduck.

In your terminal, navigate to where you saved the file. If you saved it on the desktop, you can go there by typing <tt>cd Desktop</tt>.

If you've made a shortcut in your ''.ssh/config'' file, you can use the following command to upload the ''algae.nex'' file:
scp algae.nex xfer:

If you have not made a shortcut, use this command instead:
scp algae.nex username@transfer.cam.uchc.edu:
where <tt>username</tt> should be replaced by your own user name on the cluster. (Don't overlook the colon on the very end of the line!)

=== A few more UNIX commands ===

You have now transferred a data file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file is currently in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains a line containing the command <tt>execute algae.nex</tt>, which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:
cd $HOME
ls algae.*
Note the use of a wildcard character (*) in the ls command. This will show you only files that begin with the letters <tt>algae</tt> followed by a period and any number of other non-whitespace characters. The <tt>$HOME</tt> is a predefined shell variable that will be replaced with your home directory. It is not necessary in this case - typing <tt>cd</tt> all by itself would take you to your home directory - but the <tt>$HOME</tt> variable is good to know about (especially for use in scripts).

==== mv command: moving or renaming a file ====
Now use the mv command to move algae.nex to the directory pauprun:
mv algae.nex pauprun
The mv command takes two arguments. The first argument is the name of the directory or file you want to move, whereas the second argument is the destination. The destination could be either a directory (which is true in this case) or a file name. If the directory pauprun did not already exist, mv would have interpreted this as a request to rename algae.nex to the file name pauprun! So, be aware that mv can rename files as well as move them.
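
The two behaviors of mv described above can be seen side by side in a scratch directory. This is a minimal sketch: the <tt>scratch</tt> variable, the empty stand-in for algae.nex, and the name <tt>backup</tt> are all assumptions made for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch algae.nex            # empty stand-in for the real data file

# Case 1: pauprun exists as a directory, so mv moves the file into it.
mkdir pauprun
mv algae.nex pauprun
[ -f pauprun/algae.nex ] && moved=yes || moved=no

# Case 2: nothing named "backup" exists, so mv renames the file instead.
mv pauprun/algae.nex backup
[ -f backup ] && renamed=yes || renamed=no

echo "moved into directory: $moved; renamed to a file: $renamed"

cd /
rm -rf "$scratch"
```

If you are ever unsure which of the two will happen, run <tt>ls</tt> first to see whether the destination already exists as a directory.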

==== cp command: copying a file ====
The cp command copies files. It leaves the original file in place and makes a copy elsewhere. You could have used this command to get a copy of algae.nex into the directory pauprun:
cp algae.nex pauprun
This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.

==== rm command: cleaning up ====
The rm command removes files. If you had used the cp command to copy algae.nex into the pauprun directory, you could remove the original file using these commands:
cd
rm algae.nex
The first cd command just ensures that the copy you are removing will be the one in your home directory (typing <tt>cd</tt> by itself acts the same as typing <tt>cd $HOME</tt>).
To delete an entire directory (don't try this now!), you can add the -rf flags. The r flag tells rm to recursively apply the remove command to everything in every subdirectory, while the f flag means force (don't ask whether each file should be deleted in each subdirectory, just do it!):
rm -rf pauprun
The above command would remove everything in the pauprun directory (without asking!), and then remove the pauprun directory itself. I want to stress that this is a particularly dangerous command, so make sure you are not distracted or sleep-deprived when you use it! Unlike the Windows or Mac graphical user interface, files deleted using rm are '''not''' moved to the Recycle Bin or Trash, they are just gone. There is '''no undo''' for the rm command.
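
Because there is no undo, one cautious habit is to list what a directory contains before deleting it. The sketch below does this in a temporary directory so that nothing real is at risk; the <tt>scratch</tt> variable and the empty files are placeholders I made up for the example.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/pauprun"
touch "$scratch/pauprun/run.nex" "$scratch/pauprun/algae.nex"

# Look before you leap: -R lists the directory and everything beneath it.
ls -R "$scratch/pauprun"

# Now remove the directory and all of its contents.
rm -rf "$scratch/pauprun"

# Confirm that it is gone.
if [ -e "$scratch/pauprun" ]; then still_there=yes; else still_there=no; fi
echo "pauprun still exists: $still_there"

rmdir "$scratch"           # the now-empty scratch directory can go too
```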

=== Starting a PAUP* analysis ===

If you've been following the directions in sequence, you now have three files (algae.nex, run.nex, and gopaup) in your <tt>$HOME/pauprun</tt> directory on the cluster. Use the cd command to make sure you are in the pauprun directory, then use the cat command to look at the contents of the gopaup file you created earlier. You should see this:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex

This file will be used by software called SLURM to start your run. SLURM provides a command called <tt>sbatch</tt> that you will use to submit your analysis. The SLURM <tt>sbatch</tt> command will look for a core (i.e. processor) on a node (i.e. machine) in the cluster that is currently not being used and will start your analysis on that node. This saves you the effort of looking amongst all nodes in the cluster for a core that is not busy.

Here is an explanation of each of the lines in gopaup:
* The 1st line specifies the command interpreter to use (just include this in your scripts verbatim).
* The 2nd, 3rd, and 4th lines begin with #SBATCH and are interpreted as commands by SLURM itself. In this case, the first and second #SBATCH commands tell SLURM to use the general partition (--partition=general) and the general quality of service (--qos=general). You should always include these two lines verbatim. The last #SBATCH line gives a name to your job (--job-name=pauprun). You could change "pauprun" here to something else, but keep your job names short and without embedded spaces or punctuation. The job name will help you identify your run when checking status.
* The 5th line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
* The 6th line informs the system that you want to use a particular version of paup. If you left this line out, the command on the last line might not work at all, or might run an older version of paup. You can get a list of all available modules using the command "module avail".
* The 7th and last line starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.
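
To see for yourself which lines SLURM reads as directives and which are ordinary commands, you can filter the script with grep. The sketch below recreates the script in a temporary directory (the <tt>workdir</tt> variable is hypothetical) so that it runs anywhere; on the cluster you would simply run the two grep commands on your own gopaup file.

```shell
workdir=$(mktemp -d)

# Recreate the script from the text so this sketch is self-contained.
printf '%s\n' \
    '#!/bin/bash' \
    '#SBATCH --partition=general' \
    '#SBATCH --qos=general' \
    '#SBATCH --job-name=pauprun' \
    'cd $HOME/pauprun' \
    'module load paup/4.0a-166' \
    'paup -n run.nex' > "$workdir/gopaup"

# Lines beginning with #SBATCH are the ones SLURM interprets.
directives=$(grep '^#SBATCH' "$workdir/gopaup")
echo "$directives"

# Lines that do not begin with # are the commands bash actually runs.
commands=$(grep -v '^#' "$workdir/gopaup")
echo "$commands"

rm -rf "$workdir"
```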

==== Submitting a job using sbatch ====
Now you are ready to start the analysis. Type these commands to start your run:
cd
cd pauprun
sbatch gopaup

==== Checking status using squeue ====
You can see if your run is still going using the squeue command:
squeue
If it is running, you will see an entry named pauprun. Here is what it looked like for me:
hpc-ext-2 pauprun $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)
The PD under ST (state) means that my job is pending (not yet running). This job goes so fast that you will be lucky to find it in the running state. If you see no jobs listed when you run squeue, it means your job has finished.
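
On a busy cluster, plain squeue lists everyone's jobs; <tt>squeue -u username</tt> restricts the list to one user. Since squeue exists only on the cluster, the sketch below instead pastes the sample output shown above into a variable and uses awk to pull out the job ID and state of the job named pauprun. The awk filter is my own illustration, not a SLURM feature.

```shell
# Sample squeue output copied from the text above.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)'

# Column 1 is JOBID, column 3 is NAME, and column 5 is ST (state).
summary=$(printf '%s\n' "$sample" | awk '$3 == "pauprun" { print $1, $5 }')
echo "$summary"
```

On the cluster itself, the same awk filter could be fed directly by piping the output of squeue into it.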

==== Killing a job using scancel ====

Sometimes it is clear that an analysis is not going to do what you wanted it to. Suppose that just after you press the Enter key to start an analysis, you realize that you forgot to put in a savetrees command in your paup block (so in the end you will not be able to see the results of the search). In such situations, you really want to just kill the job, fix the problem, and then start it up again. Use the scancel command for this. Note that in the output of the squeue command above, my run had a job-ID equal to 645170. I could kill the job like this:
scancel 645170
Be sure to delete any output files that have already been created before starting your run over again.

==== While PAUP* is running ====

While PAUP* is running, you can use cat to look at the log file:
cd pauprun
cat algae.output.txt
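
For a long run the log file can grow large, and cat prints all of it. The standard <tt>tail</tt> command shows only the last few lines, which is usually what you want when checking on progress. The sketch below uses a made-up 100-line placeholder file in a temporary directory (not real PAUP* output) so that it can be run anywhere.

```shell
logdir=$(mktemp -d)

# Fabricate a 100-line stand-in for algae.output.txt (placeholder text only).
i=1
while [ "$i" -le 100 ]; do
    echo "placeholder log line $i"
    i=$((i + 1))
done > "$logdir/algae.output.txt"

# Show only the last 3 lines instead of the whole file.
last=$(tail -n 3 "$logdir/algae.output.txt")
echo "$last"

rm -rf "$logdir"
```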

==== Using Cyberduck to download the log file and the tree file ====

When PAUP* finishes, squeue will no longer list your job. At this point, you need to use Cyberduck to copy the log and tree files that PAUP* saved back to your local computer. Assuming you left Cyberduck open and connected to the cluster, double-click on the pauprun directory and locate the files <tt>algae.ml.tre</tt> and <tt>algae.output.txt</tt>. Select these two files and drag them out of the Cyberduck window and drop them on your desktop. After a flurry of activity, Cyberduck should report that the two files were downloaded successfully, at which point you can close the download status window.

==== Using scp to download the log file and the tree file ====

You can also use scp to copy the log and tree files that PAUP* saved back to your local computer, but, again, if you are happy with Cyberduck you can skip this section. In the Terminal app on your Mac (or the Git for Windows BASH session on your Windows PC), type the following (being careful to separate the final dot character from everything else by a blank space):
scp xfer:pauprun/algae.output.txt .
scp xfer:pauprun/algae.ml.tre .
This assumes you have set up a shortcut: if not, you will need to use the longer version below (being sure to replace <tt>username</tt> with your own user name on the cluster):
scp username@transfer.cam.uchc.edu:pauprun/algae.output.txt .
scp username@transfer.cam.uchc.edu:pauprun/algae.ml.tre .

These scp commands copy the files <tt>algae.output.txt</tt> and <tt>algae.ml.tre</tt> to your current directory (this is what the single dot at the end of each line stands for).

=== Using FigTree to view tree files ===
[[File:AlgaeMLtree.png|right|400px|thumb|<tt>algae.ml.tre</tt> file viewed with FigTree]]
If you do not already have it, download and install the [http://tree.bio.ed.ac.uk/software/figtree/ FigTree] application on your laptop. FigTree is a Java application, so you will also need to install a Java Runtime Environment (JRE) if you don't already have one (just start FigTree and it will tell you if it cannot find a JRE). Once FigTree is running, choose File > Open from the menu to open the <tt>algae.ml.tre</tt> file.

==== Adjusting taxon label font ====

The first thing you will probably want to do is make the taxon labels larger or change the font. Expand the Tip Labels section on the left and play with the Font Size up/down control. You can also set font details in the preferences, which will save you a lot of time in the future.

==== Line thickness ====

You can modify the thickness of the lines used by FigTree to draw the edges of the tree by expanding the Appearance tab.

==== Ladderization ====

You can ladderize the tree (make it appear to flow one way or the other) by playing with the Order Nodes option in the Trees tab.

==== Export tree as PDF ====

There are many other options that you can discover in FigTree, but one more thing to try today is to save the tree as a PDF file. Once you have the tree looking just the way you want, choose File > Export PDF...

==== Why have PAUP* create the log file algae.output.txt? ====

In your pauprun directory, SLURM saved the output that PAUP* normally sends to the console to a file named slurm-645170.out (your file will have a different number, however). You will not need this file after the run: the log command in your paup block ends up saving the same output in the file algae.output.txt. Why did we tell PAUP* to start a log file if SLURM was going to save the output anyway? The main reason is that you can view the log file during the run, but you cannot view slurm-645170.out until the run is finished. There will come a day when you have a PAUP* run that has been going for several days and want to know whether it is 10% or 90% finished. At this point you will appreciate being able to view the output file!

==== Delete the slurm-xxxx.out file using the rm command ====

Because you do not need the slurm-xxxx.out file (where the xxxx is a placeholder for the job number), delete it using the rm command (the -f stands for force; i.e. don't ask if it is ok, just do it!):
cd
cd pauprun
rm -f slurm-*.out
You also no longer need the log and tree files because you downloaded them to your local computer using Cyberduck (or scp):
rm -f algae.ml.tre
rm -f algae.output.txt
It is a good idea to delete files you no longer need, for two reasons:
* if old copies stay on the cluster, you will later wonder whether you ever downloaded them and will have to spend time making sure you actually have saved the results locally
* our cluster only has so much disk space, and thus it is just not possible for everyone to keep every file they ever created

=== Tips and tricks ===

Here are some miscellaneous tips and tricks to make your life easier when communicating with the cluster.

==== Command completion using the tab key ====

You can often get away with only typing the first few letters of a filename; try pressing the Tab key after the first few letters and the command interpreter will try to complete the thought. For example, cd into the pauprun directory, then type
cat alg<TAB>
If algae.nex is the only file in the directory in which the first three letters are alg, then the command interpreter will type in the rest of the file name for you.

==== Wildcards ====

I've already mentioned this tip, but it bears repeating. When using most UNIX commands that accept filenames (e.g. ls, rm, mv, cp), you can place an asterisk inside the filename to stand in for any number of characters. So
ls algae*
will produce output like this
algae.ml.tre algae.nex algae.output.txt
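
It helps to know that the shell, not ls, does the matching: the pattern is replaced by the list of matching names before the command ever runs. The sketch below shows this with echo in a scratch directory containing empty files named like those in this tutorial (the <tt>scratch</tt> variable and the empty files are assumptions for illustration).

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch algae.nex algae.ml.tre algae.output.txt run.nex gopaup

# The shell expands algae* to every matching name before echo runs.
matches=$(echo algae*)
echo "$matches"

# The asterisk can go anywhere in the name: here, every file ending in .nex.
nexfiles=$(echo *.nex)
echo "$nexfiles"

cd /
rm -rf "$scratch"
```

Using echo with a pattern is also a safe way to preview what a command such as <tt>rm algae*</tt> would act on before you actually run it.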

==== Man pages ====

If you want to learn more options for any of the UNIX commands, you can use the man command to see the manual for that command. For example, here's how to see the manual describing the ls command:
man ls
It is important to know how to escape from a man page! The way to get out is to type the letter q. You can page down using Ctrl-f, page up using Ctrl-b, go to the end using Shift-g, and return to the very beginning using 1,Shift-g (that is, type a 1, release it, then type Shift-g). You can also move line by line using the down and up arrows, and page by page using the PgDn and PgUp keys.

==== Xanadu information ====

You can find a lot of great information about the Xanadu cluster at the [https://bioinformatics.uconn.edu/ Computational Biology Core] web site. In particular, take time to look through the first tutorial named "Understanding the UConn Health Cluster (Xanadu)".

[[Category: Phylogenetics]]
__NOTOC__

Phylogenetics: Xanadu Cluster

2022-01-13T16:04:22Z

Paul Lewis:

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://phylogeny.uconn.edu/courses/ EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to show you how to log into the [https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/xanadu/ Xanadu] computer cluster and perform a basic PAUP* analysis.
|}

= Using the UConn Xanadu cluster =
The UConn [http://bioinformatics.uconn.edu Computational Biology Core] is part of the [http://cgi.uconn.edu/ Center for Genome Innovation (CGI)]. We will use the Xanadu computing cluster located at the UConn Health Center for most of the data crunching we will do in this course. By now, you should have an account on the cluster, and today you will learn how to start analyses remotely (i.e. from your laptop), check on their status, and download the results when your analysis is finished.

== Obtaining the necessary communications software ==
You will be using a couple of simple (and free) programs to communicate with the head node of the cluster.

=== If you use Windows... please scroll down to the Windows section ===
=== If you use MacOS 10.x... ===
==== SSH ====
The program '''ssh''' will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). You will use ssh to send commands to the cluster and see the output generated.

Start by opening the Terminal application, which you can find in the Applications/Utilities folder on your hard drive. Using the Terminal program, you can connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where username should be replaced by your username on the cluster.

You may wish to install [http://www.iterm2.com/ iTerm2], which is a terminal program that makes some things easier than Terminal, but the built-in Terminal will work just fine.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Directories with names beginning with a period are not usually shown, but you can open this directory in Finder by typing
cd
open .ssh
in your terminal (the initial cd command changes the directory to your default, a.k.a. home, directory).

If you do not see a file named "config" in your ".ssh" directory, create an empty config file using the following command (typed from your home directory, which is where the cd command above left you):
 touch .ssh/config
Open the config file in a text editor such as [https://www.barebones.com BBEdit] (NOT a word processor such as Microsoft Word or Pages!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/Users/plewis'' with your home directory path on your Mac, which you can get by typing pwd at the command line):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile /Users/plewis/.ssh/id_rsa
Cipher blowfish
Once you save the config file, you should be able to just type
ssh xanadu
to login to the xanadu cluster. (The second entry (xfer) will be used when transferring files to and from the cluster.)

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). I will show you how to transfer files using both methods, but for now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, but you might find that the command line clients let you get your work done faster once you get used to using them.

=== If you use Windows... ===
==== SSH ====
The program '''Git for Windows''' provides a terminal that will allow you to communicate with the cluster using a protocol known as SSH (Secure Shell). We will not actually use the "git" part of "Git for Windows", although it is there in case you need it later. Instead, you will use the Git for Windows bash shell to send commands to the cluster and see the output generated.

Visit the [https://gitforwindows.org Git for Windows] web site, press the Download button, and install Git for Windows on your system. Once installed, open Git BASH from the All Programs section of the start menu. This will open a terminal running the bash shell (a shell is a program that interprets operating system control commands) on your desktop.

Connect to the cluster with the following command:
ssh username@xanadu-submit-ext.cam.uchc.edu
where ''username'' should be replaced by your username on the cluster.

'''Creating a shortcut'''

If you want to avoid having to type the long command above every time you want to connect to the cluster, it is possible to create a shortcut. You will need to edit the ''config'' file in your ''.ssh'' directory. Open (or create, if it does not yet exist) the file named ''config'' in a text editor such as [https://notepad-plus-plus.org Notepad++] (NOT a word processor such as Microsoft Word!) and add the following lines (replace ''plewis'' with your actual username on xanadu, and replace ''/c/Users/Paul Lewis'' with your home directory path):
host xanadu
HostName xanadu-submit-ext.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
host xfer
HostName transfer.cam.uchc.edu
User plewis
IdentityFile "/c/Users/Paul Lewis/.ssh/id_rsa"
Cipher blowfish
Use the <tt>pwd</tt> command to find out what your home directory path is, and use double quotes if your home directory path contains embedded spaces (note that I had to use quotes for mine).

Once you save the config file, you should be able to just type
ssh xanadu
to login to the xanadu cluster. (The second entry (xfer) will be used for transferring files to and from Xanadu.)

==== SCP/SFTP ====
An SCP or SFTP client is needed to transfer files back and forth using the Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP). For now you should go ahead and install [http://cyberduck.io Cyberduck]. Cyberduck provides a nice graphical user interface, which makes moving files back and forth easy.

=== Learning enough UNIX to get around ===
I'm presuming that you do not know a lot of UNIX commands, but even if you are already a UNIX guru, please complete this section anyway because otherwise you will fail to create some files you will need later.

==== ls command: finding out what is in the present working directory ====
The '''ls''' command lists the files in the ''present working directory''. Try typing just
ls
If you need more details about files than you see here, type
ls -la
instead. This version provides information about file permissions, ownership, size, and last modification date.
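
The a in <tt>-la</tt> is what reveals names beginning with a period (such as the .ssh directory), which plain ls leaves out. The sketch below demonstrates the difference in a scratch directory; the <tt>scratch</tt> variable and the two empty files are placeholders invented for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch visible.txt .hidden.txt

plain=$(ls)        # names beginning with a period are normally left out
all=$(ls -a)       # -a includes them (along with . and ..)
echo "$plain"
echo "$all"

ls -la             # -l adds permissions, owner, size, and modification date

cd /
rm -rf "$scratch"
```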

==== pwd command: finding out what directory you are in ====
Typing
pwd
shows you the full path of the present working directory. The path shown should end with your username, indicating that you are currently in your home directory.

==== mkdir command: creating a new directory ====
Typing the following command will create a new directory named <tt>pauprun</tt> in your home directory:
mkdir pauprun
Use the <tt>ls</tt> command now to make sure a directory of that name was indeed created.

==== cd command: leaving the nest and returning home again ====
The cd command lets you change the present working directory. To move into the newly-created <tt>pauprun</tt> directory, type
cd pauprun
You can always go back to your home directory (no matter how lost you get!) by typing cd by itself:
cd
If you want to go up one directory level (say from pauprun back up to your home directory), you can specify the parent directory using two dots:
cd ..

==== Creating run.nex using the nano editor ====

One way to create a new file, or edit one that already exists, is to use the nano editor. You will now use nano to create a run.nex file containing a paup block. You will later execute this file in PAUP* to perform an analysis.

First use the pwd command to see where you are, then use cd to go into the pauprun directory if you are not already there. Type
nano run.nex
This will open the nano editor, and it should say <nowiki>[ New File ]</nowiki> at the bottom of the window to indicate that the run.nex file does not already exist. Note the menu of the commands along the bottom two rows. Each of these commands is invoked using the Ctrl key with the letter indicated. Thus, ^X Exit indicates that you can use the Ctrl key in combination with the letter X to exit nano.

For now, type the following into the editor:
#nexus

begin paup;
log file=algae.output.txt start replace flush;
execute algae.nex;
set criterion=likelihood autoclose;
lset nst=2 basefreq=estimate tratio=estimate rates=gamma shape=estimate;
hsearch swap=none start=stepwise addseq=random nrep=1;
lscores 1;
lset basefreq=previous tratio=previous shape=previous;
hsearch swap=tbr start=1;
savetrees file=algae.ml.tre brlens;
log stop;
quit;
end;
Once you have entered everything, use ^X to exit. Nano will ask if you want to save the modified '''buffer''' (a ''buffer'' is a predefined amount of computer memory used to store the text you type; the text stored in the buffer will be lost once you exit the program unless you save it to a file on the hard drive), at which point you should press the Y key to answer yes. Nano will now ask you whether you want to use the file name run.nex; this time just press Enter to accept. Nano should now exit and you can use cat to look at the contents of the file you just created:
cat run.nex

==== Create the gopaup file ====
Now use nano to create a second file named <tt>gopaup</tt> in your <tt>pauprun</tt> directory. To do this, type <tt>pwd</tt> to make sure you are in the <tt>pauprun</tt> directory, then type <tt>nano gopaup</tt>. This file should contain this text:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex

=== Using Cyberduck to upload the algae.nex data file ===
[[File:Cyberduck-bookmark-xanadu.png|right]]
Download the file <tt>algae.nex</tt> from [http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/algae.nex here] and save it on your hard drive.

Open Cyberduck, choose Bookmark > New Bookmark from the main menu, then fill out the resulting dialog box as shown on the right (except substitute your own user name, of course). Be sure to change the protocol to SFTP (not the default FTP). Click the button to close the dialog box and you should see your bookmark appear at the bottom of the main window. Double click the bookmark to open a connection. You will then be warned that the host key is unknown: choose Allow (and go ahead and check the Always button so you do not need to do this every time).

Once you are in, you will see a listing of the files in your home directory (if you have any). To copy the <tt>algae.nex</tt> file to the cluster, you need only drag-and-drop it onto the Cyberduck window.

=== Using scp to upload the algae.nex data file ===

While you will probably want to do your file transfers with Cyberduck as described above, it is also possible to transfer files using the command line scp client. Read on if you are interested in this option, but feel free to skip this section if you are happy using Cyberduck.

In your terminal, navigate to where you saved the file. If you saved it on the desktop, you can go there by typing <tt>cd Desktop</tt>.

If you've made a shortcut in your ''.ssh/config'' file, you can use the following command to upload the ''algae.nex'' file:
scp algae.nex xfer:

If you have not made a shortcut, use this command instead:
scp algae.nex username@transfer.cam.uchc.edu:
where <tt>username</tt> should be replaced by your own user name on the cluster. (Don't overlook the colon on the very end of the line!)

=== A few more UNIX commands ===

You have now transferred a data file (algae.nex) to the cluster, but it is not in the right place. The algae.nex file is currently in your home directory, whereas the run.nex file is in the pauprun directory. The run.nex file contains a line containing the command <tt>execute algae.nex</tt>, which means that algae.nex should also be located in the pauprun directory. Use the following commands to ensure that (1) you are in your home directory, and (2) algae.nex is also in your home directory:
cd $HOME
ls algae.*
Note the use of a wildcard character (*) in the ls command. This will show you only files that begin with the letters <tt>algae</tt> followed by a period and any number of other non-whitespace characters. The <tt>$HOME</tt> is a predefined shell variable that will be replaced with your home directory. It is not necessary in this case - typing <tt>cd</tt> all by itself would take you to your home directory - but the <tt>$HOME</tt> variable is good to know about (especially for use in scripts).

==== mv command: moving or renaming a file ====
Now use the mv command to move algae.nex to the directory pauprun:
mv algae.nex pauprun
The mv command takes two arguments. The first argument is the name of the directory or file you want to move, whereas the second argument is the destination. The destination could be either a directory (which is true in this case) or a file name. If the directory pauprun did not already exist, mv would have interpreted this as a request to rename algae.nex to the file name pauprun! So, be aware that mv can rename files as well as move them.

==== cp command: copying a file ====
The cp command copies files. It leaves the original file in place and makes a copy elsewhere. You could have used this command to get a copy of algae.nex into the directory pauprun:
cp algae.nex pauprun
This would have left the original in your home directory, and made a duplicate of this file in the directory pauprun.

==== rm command: cleaning up ====
The rm command removes files. If you had used the cp command to copy algae.nex into the pauprun directory, you could remove the original file using these commands:
cd
rm algae.nex
The first cd command just ensures that the copy you are removing will be the one in your home directory (typing <tt>cd</tt> by itself acts the same as typing <tt>cd $HOME</tt>).
To delete an entire directory (don't try this now!), you can add the -rf flags. The r flag tells rm to recursively apply the remove command to everything in every subdirectory, while the f flag means force (don't ask whether each file should be deleted in each subdirectory, just do it!):
rm -rf pauprun
The above command would remove everything in the pauprun directory (without asking!), and then remove the pauprun directory itself. I want to stress that this is a particularly dangerous command, so make sure you are not distracted or sleep-deprived when you use it! Unlike the Windows or Mac graphical user interface, files deleted using rm are '''not''' moved to the Recycle Bin or Trash, they are just gone. There is '''no undo''' for the rm command.
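
One cautious habit is to list what a recursive remove would delete before running it. The sketch below does this in a scratch directory created just for practice (the directory layout is hypothetical), so it is safe to run. It uses <tt>-r</tt> without <tt>-f</tt>, and you could add the <tt>-i</tt> flag if you would rather be asked about each file.

```shell
# Build a small practice tree inside a throwaway directory.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/pauprun/old"
touch "$sandbox/pauprun/run.nex" "$sandbox/pauprun/old/algae.nex"

# Step 1: list, recursively, everything that would be deleted.
ls -R "$sandbox/pauprun"

# Step 2: only after reading that list, remove the directory.
rm -r "$sandbox/pauprun"

# Step 3: confirm that it is gone.
if [ -e "$sandbox/pauprun" ]; then remaining=yes; else remaining=no; fi
echo "pauprun still present? $remaining"
```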

=== Starting a PAUP* analysis ===

If you've been following the directions in sequence, you now have two files (algae.nex and run.nex) in your <tt>$HOME/pauprun</tt> directory on the cluster, whereas the gopaup file should be in <tt>$HOME</tt>. Use the cd command to make sure you are in your home directory, then the cat command to look at the contents of the gopaup file you created earlier. You should see this:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex

This file will be used by software called SLURM to start your run. SLURM provides a command called <tt>sbatch</tt> that you will use to submit your analysis. The SLURM <tt>sbatch</tt> command will look for a core (i.e. processor) on a node (i.e. machine) in the cluster that is currently not being used and will start your analysis on that node. This saves you the effort of looking amongst all nodes in the cluster for a core that is not busy.

Here is an explanation of each of the lines in gopaup:
* The 1st line specifies the command interpreter to use (just include this in your scripts verbatim).
* The 2nd, 3rd, and 4th lines begin with #SBATCH and are interpreted as commands by SLURM itself. In this case, the first and second #SBATCH commands tell SLURM to use the general partition (--partition=general) and the general quality of service (--qos=general). You should always include these two lines verbatim. The last #SBATCH line gives a name to your job (--job-name=pauprun). You could change "pauprun" here to something else, but keep your job names short and without embedded spaces or punctuation. The job name will help you identify your run when checking status.
* The 5th line is simply a cd command that changes the present working directory to the pauprun directory you created earlier. This will ensure that anything saved by PAUP* ends up in this directory rather than in your home directory. Note that $HOME is like a macro that will be expanded to the full path to your home directory.
* The 6th line informs the system that you want to use a particular version of paup. If you left this line out, the command on the last line might not work at all, or might run an older version of paup. You can get a list of all available modules using the command <tt>module avail</tt>.
* The 7th and last line starts up PAUP* and executes the run.nex file. The -n flag tells PAUP* that no human is going to be listening or answering questions, so it should just use default answers to any questions it needs to ask during the run.
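
If you retype the script, a quick check before submitting can catch typos. The sketch below writes the same seven lines into a scratch directory (the scratch location is an assumption made for the example; your real gopaup stays where you created it), asks bash to check the syntax without running anything, and counts the #SBATCH lines. Note that <tt>bash -n</tt> only checks shell syntax; it treats the #SBATCH lines as ordinary comments and does not validate them.

```shell
# Write a copy of the script into a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/gopaup" <<'EOF'
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --job-name=pauprun
cd $HOME/pauprun
module load paup/4.0a-166
paup -n run.nex
EOF

# -n reads the script and reports syntax errors without executing it.
if bash -n "$workdir/gopaup"; then syntax=ok; else syntax=bad; fi

# Count the lines that SLURM will interpret as options.
nsbatch=$(grep -c '^#SBATCH' "$workdir/gopaup")
echo "syntax: $syntax, SBATCH lines: $nsbatch"
```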

==== Submitting a job using sbatch ====
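Now you are ready to start the analysis. Because the gopaup file is in your home directory while you will be submitting from the pauprun directory, give sbatch the path to the script. Type these commands to start your run:
cd
cd pauprun
sbatch $HOME/gopaup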

==== Checking status using squeue ====
You can see if your run is still going using the squeue command:
squeue
If it is running, you will see an entry named pauprun. Here is what it looked like for me:
hpc-ext-2 pauprun $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)
The PD under ST (state) means that my job is pending (not yet running). This job goes so fast that you will be lucky to find it in the running state. If you see no jobs listed when you run squeue, it means your job has finished.
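
The columns in the squeue output can also be read by a script. The following sketch works on a copy of the sample output shown above rather than on a live query, so it runs anywhere. On the cluster you would replace the quoted text with the output of the real squeue command, and the state codes other than PD are only examples of what might appear.

```shell
# Sample squeue output copied from the example above (not a live query).
squeue_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
645170 general pauprun plewis PD 0:00 1 (Priority)'

# Column 1 is the job ID, column 3 the job name, column 5 the state.
state=$(printf '%s\n' "$squeue_output" | awk '$3 == "pauprun" { print $5 }')
jobid=$(printf '%s\n' "$squeue_output" | awk '$3 == "pauprun" { print $1 }')

case "$state" in
  PD) echo "Job $jobid is pending" ;;
  R)  echo "Job $jobid is running" ;;
  "") echo "No job named pauprun is listed, so it has probably finished" ;;
  *)  echo "Job $jobid is in state $state" ;;
esac
```

The job ID picked out here is the number you would need if you ever had to cancel the job.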

==== Killing a job using scancel ====

Sometimes it is clear that an analysis is not going to do what you wanted it to. Suppose that just after you press the Enter key to start an analysis, you realize that you forgot to put in a savetrees command in your paup block (so in the end you will not be able to see the results of the search). In such situations, you really want to just kill the job, fix the problem, and then start it up again. Use the scancel command for this. Note that in the output of the squeue command above, my run had a job-ID equal to 645170. I could kill the job like this:
scancel 645170
Be sure to delete any output files that have already been created before starting your run over again.

==== While PAUP* is running ====

While PAUP* is running, you can use cat to look at the log file:
cd pauprun
cat algae.output.txt
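
For a long run the log file can grow large, and cat prints all of it. The tail command prints only the last few lines, which is usually what you want when checking progress. The sketch below uses a made-up ten-line file as a stand-in for algae.output.txt so that it can be run anywhere.

```shell
# Stand-in for a long PAUP* log (the real one is pauprun/algae.output.txt).
log=$(mktemp)
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "hypothetical log line $i" >> "$log"
done

# cat would print all ten lines; tail -n 3 prints only the last three.
last=$(tail -n 3 "$log")
echo "$last"

# Clean up the stand-in file.
rm -f "$log"
```

On the cluster, <tt>tail -n 20 algae.output.txt</tt> would show the last 20 lines of the real log, and <tt>tail -f algae.output.txt</tt> keeps printing new lines as they are written (press Ctrl-c to stop following).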

==== Using Cyberduck to download the log file and the tree file ====

When PAUP* finishes, squeue will no longer list your process. At this point, you need to use Cyberduck to get the log and tree files that were saved back to your local computer. Assuming you left Cyberduck open and connected to the cluster, double-click on the pauprun directory and locate the files <tt>algae.ml.tre</tt> and <tt>algae.output.txt</tt>. Select these two files and drag them out of the Cyberduck window and drop them on your desktop. After a flurry of activity, Cyberduck should report that the two files were downloaded successfully, at which point you can close the download status window.

==== Using scp to download the log file and the tree file ====

You can also use scp to get the log and tree files that were saved back to your local computer, but, again, if you are happy with Cyberduck you can skip this section. In the Terminal app on your Mac (or the Git for Windows BASH session on your Windows PC), type the following (being careful to separate the final dot character from everything else by a blank space):
scp xfer:pauprun/algae.output.txt .
scp xfer:pauprun/algae.ml.tre .
This assumes you have set up a shortcut: if not, you will need to use the longer version below (being sure to replace <tt>username</tt> with your own user name on the cluster):
scp username@transfer.cam.uchc.edu:pauprun/algae.output.txt .
scp username@transfer.cam.uchc.edu:pauprun/algae.ml.tre .

These scp commands copy the files <tt>algae.output.txt</tt> and <tt>algae.ml.tre</tt> to your current directory (this is what the single dot at the end of each line stands for).

=== Using FigTree to view tree files ===
[[File:AlgaeMLtree.png|right|400px|thumb|<tt>algae.ml.tre</tt> file viewed with FigTree]]
If you do not already have it, download and install the [http://tree.bio.ed.ac.uk/software/figtree/ FigTree] application on your laptop. FigTree is a Java application, so you will also need to install a Java Runtime Environment (JRE) if you don't already have one (just start FigTree and it will tell you if it cannot find a JRE). Once FigTree is running, choose File > Open from the menu to open the <tt>algae.ml.tre</tt> file.

==== Adjusting taxon label font ====

The first thing you will probably want to do is make the taxon labels larger or change the font. Expand the Tip Labels section on the left and play with the Font Size up/down control. You can also set font details in the preferences, which will save you a lot of time in the future.

==== Line thickness ====

You can modify the thickness of the lines used by FigTree to draw the edges of the tree by expanding the Appearance tab.

==== Ladderization ====

You can ladderize the tree (make it appear to flow one way or the other) by playing with the Order Nodes option in the Trees tab.

==== Export tree as PDF ====

There are many other options that you can discover in FigTree, but one more thing to try today is to save the tree as a PDF file. Once you have the tree looking just the way you want, choose File > Export PDF...

==== Why have PAUP* create the log file algae.output.txt? ====

In your pauprun directory, SLURM saved the output that PAUP* normally sends to the console to a file named slurm-645170.out (your file will have a different number, however). You will not need this file after the run: the log command in your paup block ends up saving the same output in the file algae.output.txt. Why did we tell PAUP* to start a log file if SLURM was going to save the output anyway? The main reason is that you can view the log file during the run, but you cannot view slurm-645170.out until the run is finished. There will come a day when you have a PAUP* run that has been going for several days and want to know whether it is 10% or 90% finished. At this point you will appreciate being able to view the output file!

==== Delete the slurm-xxxx.out file using the rm command ====

Because you do not need the slurm-xxxx.out file (where xxxx is a placeholder for the job number), delete it using the rm command (the -f stands for force; i.e. don't ask if it is ok, just do it!):
cd
cd pauprun
rm -f slurm-*.out
You also no longer need the log and tree files on the cluster because you downloaded them to your local computer using Cyberduck or scp:
rm -f algae.ml.tre
rm -f algae.output.txt
It is a good idea to delete files you no longer need for two reasons:
* you will later wonder whether you downloaded those files to your local machine and will have to spend time making sure you actually have saved the results locally
* our cluster only has so much disk space, and thus it is just not possible for everyone to keep every file they ever created
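
Before running the rm commands above on the cluster, it may be worth confirming on your local computer that both downloads really arrived and are not empty. The sketch below shows one way to do that check. For the sake of a runnable example it uses a scratch directory with one good file and one deliberately empty file (both are placeholders, not real results); in practice you would point <tt>downloads</tt> at the directory you downloaded into and skip the two lines that create the placeholder files.

```shell
# A scratch directory stands in for your real download directory.
downloads=$(mktemp -d)
echo "tree placeholder" > "$downloads/algae.ml.tre"   # pretend this downloaded correctly
: > "$downloads/algae.output.txt"                     # pretend this one came back empty

missing=""
for f in algae.ml.tre algae.output.txt; do
  # -s is true only if the file exists and is not empty.
  if [ -s "$downloads/$f" ]; then
    echo "ok: $f"
  else
    echo "PROBLEM: $f is missing or empty"
    missing="$missing $f"
  fi
done
```

If anything is reported as a problem, download it again before deleting the copy on the cluster.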

=== Tips and tricks ===

Here are some miscellaneous tips and tricks to make your life easier when communicating with the cluster.

==== Command completion using the tab key ====

You can often get away with only typing the first few letters of a filename; try pressing the Tab key after the first few letters and the command interpreter will try to complete the thought. For example, cd into the pauprun directory, then type
cat alg<TAB>
If algae.nex is the only file in the directory in which the first three letters are alg, then the command interpreter will type in the rest of the file name for you.

==== Wildcards ====

I've already mentioned this tip, but it bears repeating. When using most UNIX commands that accept filenames (e.g. ls, rm, mv, cp), you can place an asterisk inside the filename to stand in for any number of characters. So
ls algae*
will produce output like this
algae.ml.tre algae.nex algae.output.txt
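
Wildcards are expanded by the command interpreter before the command runs, so you can preview what a pattern matches by handing it to echo first. The sketch below sets up empty files with the names used in this lab inside a scratch directory (so nothing real is touched) and tries a few patterns.

```shell
# Empty practice files in a throwaway directory.
sandbox=$(mktemp -d)
cd "$sandbox"
touch algae.ml.tre algae.nex algae.output.txt run.nex gopaup

nex_files=$(echo *.nex)      # every name ending in .nex
algae_files=$(echo algae*)   # every name beginning with algae
echo "$nex_files"
echo "$algae_files"

# Preview a destructive command: echo prints the command line that rm
# would receive, without deleting anything.
echo rm -f algae.*.t*
```

Putting echo in front of an rm command that uses a wildcard is a cheap way to see exactly which files would be removed.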

==== Man pages ====

If you want to learn more options for any of the UNIX commands, you can use the man command to see the manual for that command. For example, here's how to see the manual describing the ls command:
man ls
It is important to know how to escape from a man page! The way to get out is to type the letter q. You can page down using Ctrl-f, page up through a man page using Ctrl-b, go to the end using Shift-g and return to the very beginning using 1,Shift-g (that is, type a 1, release it, then type Shift-g). You can also move line by line in a man page using the down and up arrows, and page by page using the PgUp and PgDn keys.

==== Xanadu information ====

You can find a lot of great information about the Xanadu cluster at the [https://bioinformatics.uconn.edu/ Computational Biology Core] web site. In particular, take time to look through the first tutorial named "Understanding the UConn Health Cluster (Xanadu)".

[[Category: Phylogenetics]]
__NOTOC__

Kurt Schwenk

2022-01-05T20:24:21Z

Paul Lewis:

__NOEDITSECTION__
<span style="font-size: 300%">'''KURT SCHWENK''' </span><span style="font-size: 150%">(Ph.D., University of California, Berkeley) </span><br>

<span style="font-size: 150%">'''Professor of Ecology and Evolutionary Biology''' </span><br>

<span style="font-size: 150%">'''University of Connecticut'''</span>
= =
{|border=1 cellpadding=8

[[Image:SchwenkLabLogoRevised5_13smaller.jpg|center|]]
<br>
[[Image:ThamSirt7thFlickDownstrokeComposite.jpg|center|]]

<br><br>
<h2 style="margin:0;background-color:#FFFF00;font-size:150%;font-weight:bold;border:1px solid #a3bfb1;text-align:left;color:#000;padding:0.2em 0.4em;">CONTACT INFORMATION</h2>
<br>
'''Mailing Address:'''
<br>
[http://hydrodictyon.eeb.uconn.edu/eebwww/ Department of Ecology and Evolutionary Biology]<br>
[http://uconn.edu/ University of Connecticut]<br>
75 N. Eagleville Road <br>
Storrs, CT 06269-3043 <br><br>

'''Office:''' Torrey Life Sciences Building, Rm. 360<br>
'''Lab:''' Torrey Life Sciences Building, Rm. 365<br>
'''Office phone:''' (860) 486-0351<br>
'''Lab phone:''' (860) 486-4158 (use this number for students)<br>
'''Fax:''' (860) 486-6364<br>
'''Email:''' kurt.schwenk@uconn.edu<br><br>

{|border=1 cellpadding=8
| [[Image:Kurt2RacersAtFenton4_10sm.jpg|thumb|center|Kurt with two black racers, ''Coluber constrictor'' (photo by S. von Eicken)]] || [[Image:KurtWithRacerFenton4.15sm.jpg|thumb|center|Kurt with black racer (photo by S. von Eicken)]] || [[Image:KurtCopperhead03.jpg|thumb|center|Kurt with copperhead (photo by Chuck Smith)]] || [[Image:KurtCalifWesternRingneck4_11bSm.JPG|thumb|center|Kurt with western ringneck snake (''Diadophis'') in Calif. (photo by William Campbell)]]
|-
|}

<br><br>

<h2 style="margin:0;background-color:#FFFF00;font-size:150%;font-weight:bold;border:1px solid #a3bfb1;text-align:left;color:#000;padding:0.2em 0.4em;">EDUCATIONAL AND PROFESSIONAL HISTORY</h2>
<br>
[[Image:Diego and Buster close.jpg |frame|right|Former grad student Diego Sustaita (Rubega lab), now faculty at California State University San Marcos, and lab iguana, Buster, during weekly 'beermorph' discussion]]
[[Image:KurtBronxZoo77.jpg |thumb|left|Kurt in 1977 as Bronx Zoo mammal keeper, with juvenile guanaco (and hair!)]]
[[Image:HognoseTFsm.jpg |thumb|right|Eastern hognose snake (''Heterodon platyrhinos'') tongue-flicking (photo by KS)]]

*1976: ''Zookeeper (intern):'' [http://www.bronxzoo.com/ Bronx Zoo] (Herpetology)<br>
*1977: '''''B.A.''''' (high honors) [http://www.oberlin.edu/ Oberlin College], studied with Warren F. Walker, Jr., and [http://faculty.etsu.edu/stewarjr/ Jim Stewart]<br>
*1977-78: ''Zookeeper:'' [http://www.bronxzoo.com/ Bronx Zoo] (Mammalogy)<br>
*1984: '''''Ph.D.''''' Dept. of Zoology (now within [http://ib.berkeley.edu/ Integrative Biology]), [http://berkeley.edu/ University of California, Berkeley], with [http://ib.berkeley.edu/labs/mwake/ Marvalee H. Wake]<br>
*1984-87: ''Postdoctoral Research Associate, Instructor in Anatomy:'' Dept. of Oral Anatomy (now within [http://dentistry.uic.edu/departments/oral_biology/ Oral Biology]), [http://dentistry.uic.edu/ University of Illinois at Chicago College of Dentistry] with Karen Hiiemae (deceased, 2007)<br>
*1987-89: ''NIH Postdoctoral Fellow, Lecturer on Biology:'' [http://www.mcz.harvard.edu/ Museum of Comparative Zoology], [http://www.oeb.harvard.edu/ Dept. of Organismic Biology], [http://www.harvard.edu/ Harvard University] with [http://www.oeb.harvard.edu/faculty/crompton/ Fuzz Crompton]<br>
*1989-present: ''Assistant, Associate and Full Professor:'' [http://hydrodictyon.eeb.uconn.edu/eebwww/ Dept. of Ecology & Evolutionary Biology], [http://uconn.edu/ University of Connecticut]<br>
*2007-09: ''Chair'', [http://sicb.org/divisions/dvm.php3/ Division of Vertebrate Morphology] (DVM), [http://sicb.org/ Society for Integrative and Comparative Biology] (SICB)<br>
*2004-06: ''Associate Editor'', [http://www.wiley.com/bw/journal.asp?ref=0014-3820&site=1 ''Evolution'']<br>
*2006-10: ''Associate Editor'', [http://www3.interscience.wiley.com/journal/117928901/grouphome/home.html?CRETRY=1&SRETRY=0 ''Journal of Experimental Zoology Part A: Ecological Genetics and Physiology'']<br>
*2010-2021: ''Editorial Board'', ''Journal of Experimental Zoology Part A: Ecological Genetics and Physiology''<br>
*2006-present: ''Editorial Board'', [http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1469-7580 ''Journal of Anatomy'']<br><br><br><br>

<h2 style="margin:0;background-color:#FFFF00;font-size:150%;font-weight:bold;border:1px solid #a3bfb1;text-align:left;color:#000;padding:0.2em 0.4em;">LINKS AND DOWNLOADS</h2>
<br>
[[Image:StenocercusTRANSweb.jpg |thumb|left|Transverse section of the tongue in an iguanid lizard]]
[[Image:CatTongueTRANSweb.jpg |thumb|right|Sagittal section of a cat tongue close to the median septum]]

*My [http://hydrodictyon.eeb.uconn.edu/people/schwenk/CV.Schwenk12_11_14.pdf '''CV''']<br>
*My '''Classic Works in Evolutionary Biology''' pages: [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Classic_Works_in_Evolutionary_Biology '''Introduction'''] and [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Classic_Works_in_Evolutionary_Biology—The_List_With_Links '''Annotated List''']<br>
*My student guide to proper scientific citation: [http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkCitationPrimer3_12.pdf '''''Faculty appearance and faculty quality: is there a connection?''''']<br>
*My student guide to writing: '''''How to write real good''''' (in preparation)<br>
*Some of our work featured in a [http://snakesarelong.blogspot.com/2014/06/why-do-snakes-flick-their-tongues.html '''recent blog'''].
*Former PhD student [https://ryersonlab.wordpress.com '''Bill Ryerson's'''] page <br>
*Former PhD student [https://amphibianfoundation.org/index.php/tobias/ '''Dr. Tobias Landberg's'''] page<br>
*Former PhD student [http://sites.google.com/site/copperheaddata/ '''Dr. Chuck Smith's'''] page (Assoc. Prof., Wofford College, SC)<br>
*Newspaper article about Chuck's work while a grad student {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/HartCourantChuckSmith.pdf}}<br>
*Chuck's [http://www.copperheadinstitute.org/ '''''Copperhead Institute''''']<br>
*My brother [http://johnschwenk.info/ '''John Schwenk's'''] page (follow link to view some of my father's [George Schwenk] paintings)<br>
*[http://www.schwenktheworld.com/ '''Schwenk the World'''] (too weird to explain; you have to see for yourself...)
*[https://www.youtube.com/watch?v=IWHR4wfHFbs '''''Witnessing Evolution's Great Truths'''''] UConn PR video about me (don't ask)
*[https://www.youtube.com/watch?v=wjuVq9BRNAI '''''True Facts: Killer Tongues'''''] Hilarious Ze Frank video about our work
<br><br>
[[Image:TrueFactsKillerTongues.png|left|https://www.youtube.com/watch?v=wjuVq9BRNAI]]
[[Image:EvolutionGreatTruths.png|right|https://www.youtube.com/watch?v=IWHR4wfHFbs]]
<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

<h2 style="margin:0;background-color:#FFFF00;font-size:150%;font-weight:bold;border:1px solid #a3bfb1;text-align:left;color:#000;padding:0.2em 0.4em;">COURSE LINKS</h2>
<br>
[[Image:KurtStranglesSkeleton08.jpg|frame|right|Kurt demonstrates proper strangulation technique to comparative anatomy students (photo by T. Landberg)]]
[[Image:3254_ClassInField.jpg|thumb|left|Kurt with mammalogy class in the field, 2013 (photo by Sean Flynn)]]
[[Image:MammClassZoo11_11.jpg|left|thumb|Mammalogy class at the Bronx Zoo]]
[[Image:ErinTessDanLab14sm.jpg|left|thumb|Lab undergrads Erin Mounce and Dan O'Donnell (left and right) and high school intern, Tess Shaw (middle)]]
[[Image:SaraWithChumlee.jpg|thumb|left|Masters student, Sara Horwitz, with her bearded dragon, Chumlee (''Pogona vitticeps'')]]
<br>

*'''EEB 3254/5254''' [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Mammalogy '''Mammalogy''']
:''(next offered fall 2023)''<br/>
:[http://today.uconn.edu/blog/2013/11/classrooms-without-walls/ '''''UConn Today'' article about teaching outside the classroom, featuring Mammalogy''']<br/><br/>

*'''EEB 3273''' [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Comparative_Vertebrate_Anatomy '''Comparative Vertebrate Anatomy''']
:''(next offered fall 2022)''<br/><br/>

*'''EEB 3265 Herpetology'''
:''(next offered spring 2023)''<br/><br/>

<br><br><br><br><br><br><br><br>
<h2 style="margin:0;background-color:#FFFF00;font-size:150%;font-weight:bold;border:1px solid #a3bfb1;text-align:left;color:#000;padding:0.2em 0.4em;">PUBLIC INFORMATION PAGE SERIES (LINKS)</h2>
[[Image:ManWhoKnewTooMuch.jpg|frame|right]]
<br>
The list below represents a series of public information pages I will be creating to address commonly asked questions about biological issues—particularly issues and questions about vertebrate animals that I have direct or personal knowledge of. I have been motivated to create these pages because of running across web pages that purport to provide 'answers' to people's questions about animals, evolution and biology generally. While some of the information available on the web is reasonably accurate, I have found that most of it is misleading or downright erroneous. The 'answers' are usually written by people who, although well-intentioned, are mostly ignorant about the topics they address. In any case, the information is nearly always cobbled together from secondary and tertiary sources of information—or worse—rather than direct knowledge of the science or the primary literature on the topic. The aim of these pages, therefore, is to provide accurate, scientifically validated information on some topics in my areas of expertise that come to my attention as being of general interest. I was motivated to do this mostly because of the widespread misinformation being propagated on the web about the first question, below. Since the information used to 'answer' this question is almost always based on a distorted or misunderstood representation of my own research, it seems appropriate for me to set the record straight.<br><br>

*1. [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Why_do_snakes_have_forked_tongues%3F '''Why do snakes have forked tongues?''']<br><br>

*2. Why do snakes flick their tongues? [in preparation]<br><br>

*3. How do snakes eat? [in preparation]<br><br>

*4. How do lizards eat? [in preparation]<br><br>

*5. Can snakes hear? [in preparation]<br><br>

*6. Are Komodo dragons (and other lizards) venomous? [in preparation] Short answer: NO!<br><br>

<br><br><br><br><br><br><br><br>

<h2 style="margin:0;background-color:#FFFF00;font-size:150%;font-weight:bold;border:1px solid #a3bfb1;text-align:left;color:#000;padding:0.2em 0.4em;">MAJOR RESEARCH INTERESTS</h2>
<br>
[[Image:NirvanaWithCigar08.jpg|thumb|right|Former student, Dr. Nirvana Filoramo, celebrates getting the hell out of the lab]]
[[Image:Copperhead Tongue-Flick web.jpg |thumb|left|Copperhead (''Agkistrodon contortrix'') tongue-flicking. Photo by KS and C. Smith]]

* Phenotypic evolution<br>
* Evolutionary constraint<br>
* Evolutionary and functional morphology of vertebrates<br>
* Evolutionary and functional morphology of feeding in tetrapod vertebrates, especially lizards<br>
* Evolutionary and functional morphology of chemoreception in lizards and snakes<br>
* Evolutionary and functional morphology of the vertebrate tongue<br><br>

My research program is three-pronged: I pursue empirical studies related to the functional and evolutionary morphology of squamate '''feeding''' and '''chemoreception''', and theoretical work related to '''evolution''' and '''constraint'''. Feeding and chemoreception are functionally and evolutionarily related in squamates owing to their shared use of a single, complex organ, the tongue. From a biomechanical point of view, optimization of the tongue for feeding function makes it less effective in (vomeronasal) chemoreception and vice versa. Thus, there is a classic functional (and evolutionary) trade-off between the two principal functions of the tongue. Phylogenetic character analysis reveals how each major clade of squamates has found a unique 'solution' to the problem of this trade-off. The dynamic nature of the evolutionary tension created by competing sources of selection pressure has led to my theoretical work on ''internal selection'', ''functional integration'', ''phenotypic stability'' and ''evolutionary constraint''. Much of this work has been done in collaboration with Günter Wagner at Yale University. Although theoretical, the work is firmly grounded in my empirical work on squamate feeding and chemosensory systems, which have proven to be compelling model systems for approaching these broader issues.<br><br><br><br>

<h2 style="margin:0;background-color:#FFFF00;font-size:150%;font-weight:bold;border:1px solid #a3bfb1;text-align:left;color:#000;padding:0.2em 0.4em;">INFORMATION FOR PROSPECTIVE GRADUATE STUDENTS</h2> ['''note that I am no longer accepting new graduate students''']
<br>
[[Image:SaraLaurenBillSouthFenton6_10.jpg |thumb|left|Lab grad students Sara Horwitz, Lauren Jones and Bill Ryerson after a snake hunt.]]
[[Image:KurtSkullOffice08.JPG|thumb|right|Kurt ponders the fate of a grad student who refused to bend to his will...]]
[[Image:ChuckSmithBlackRacer4_09.jpg |thumb|left|Former student, Dr. Chuck Smith, is an Associate Professor at Wofford College, SC]]
[[Image:TobiasAndNerodiaFenton08.jpg |frame|right|Happy graduate student, Tobias Landberg, in the field (with copperhead, ''Agkistrodon contortrix'')]]
[[Image:SaraHorwitzWithAnaconda_2012.jpg|thumb|left|Sara in Amazonian Peru with green anaconda (''Eunectes murinus'')!]]

Students in my laboratory develop their own, independent research programs under my supervision. Although I expect there to be some overlap or mutual interest in student projects, I do not require students to work in my specific research areas. Ideally students will incorporate elements of morphology, evolution and/or function into their projects. Ecological or conservation-related projects are discouraged (because they lie outside my areas of expertise), although these can be elements of a research program centered on the former topics. Although I am best able to supervise work on squamate reptiles, I am open to projects dealing with any vertebrate group. I principally do laboratory-based work, but recent graduate students have included significant field components in their research. Applications from potential doctoral students are preferred, but doing a Masters is possible in some circumstances.<br>

The Department of Ecology and Evolutionary Biology at UConn is very integrative and interactive, and there is a great deal of cross-fertilization among labs. The department comprises 30 full-time faculty, all of whom work in the general area of organismal biology. There are an additional 60+ biologists in our sister departments of Physiology and Neurobiology, and Molecular and Cell Biology - and this is not to mention a variety of wildlife biologists in the School of Agriculture, biomedical researchers in the School of Medicine, etc. Thus, there is virtually no area of expertise unavailable to students when they assemble their research advisory committees.<br>

There are nine vertebrate biology faculty in the department (4 herpetology, 3 ornithology, 1 ichthyology, 1 mammalogy), and along with postdocs, graduate and undergraduate students, they constitute a very active and interactive research group. We have informal weekly meetings called [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Vertlunch '''Vertlunch'''] in which we read and critique recent papers (and laugh a lot) and every Friday at 4:00 the Schwenk and Rubega labs (and others) meet for [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Beermorph '''Beermorph'''] in which - well, it's pretty self-explanatory. For those morphologists with a developmental bent, we also have sporadic meetings of an [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/EvoDevo_Journal_Club '''Evo-Devo Journal Club'''] in which we read and discuss current literature. And this is not to mention, of course, the frequent graduate seminars on various topics offered by faculty in the department, as well as weekly departmental seminars and occasional 'Tuesday Evening Seminars' run by the EEB graduate students. All-in-all, a very active place where you can be intellectually challenged and exposed to a variety of viewpoints - often while drinking at the same time.<br>

Before applying directly to the department for admission into the graduate program, you should contact me by email and describe your research interests and goals so that we can determine if there is an appropriate match. You should also explore the [http://hydrodictyon.eeb.uconn.edu/eebwww/ '''departmental web page'''] to get as much information about EEB as you can. If you have any questions at all about the department or the University, don't hesitate to email me. I can also put you in touch with current graduate students if you would like to hear about the program from their perspectives.<br>

Students accepted into the doctoral program are guaranteed 5 years of support (mostly by means of Teaching Assistantships). Support beyond 5 years is usually possible for students making good progress, but is not guaranteed. Masters students are guaranteed 2 years of support. The support package includes a tuition waiver and full health benefits.<br><br><br><br>

[[Image:FeedingCover_sm.jpg |thumb|right]]
[[Image:ZoologyCover05_sm.jpg |thumb|right|Whole-issue copies available - email a request]]

<h2 style="margin:0;background-color:#FFFF00;font-size:150%;font-weight:bold;border:1px solid #a3bfb1;text-align:left;color:#000;padding:0.2em 0.4em;">PUBLICATIONS</h2>
<br>

'''Email for reprints not available here as pdfs:''' kurt.schwenk@uconn.edu [PDF LINKS TEMPORARILY BROKEN—EMAIL REQUESTS FOR PDFs]<br><br>

<h2 style=font-size:115%>'''BOOKS:'''</h2>

Schwenk, K. (editor) (2000) ''Feeding: Form, Function and Evolution in Tetrapod Vertebrates''. Academic Press, San Diego. xv + 537 pp.<br>

:REVIEWS OF ''FEEDING'':
:*''Quarterly Review of Biology'' by T. H. Frazzetta {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/QRBReviewFeeding02.pdf}}<br>
:*''Copeia'' by Al Savitzky {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/CopeiaReviewFeeding02.pdf}}<br>
:*''Ibis'' by Paul M. Barrett {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/IbisReviewFeeding02.pdf}}<br>
:*''Palaeontology Newsletter'' by Ian Jenkins {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/PalaeoNewsletterReviewFeeding01.pdf}}<br><br>

<h2 style=font-size:115%>'''EDITED COMPILATION:'''</h2>

[[Image:BillOnBikeCrop3.jpg|thumb|right|Bill on his bike (although my grad students are not ''required'' to ride a motorcycle, it helps…)]]
[[Image:KurtWideGlide.jpg|thumb|right|Biking to snake collecting sites!]]
[[Image:SaraWithSavannahMonitor5.14sm.jpg|thumb|right|Sara with monitor lizard.]]

Schwenk, K., and J. M. Starck (eds.) (2005) ''Integrative organismal biology: papers in honor of Professor Marvalee H. Wake.'' Zoology 108(4):261-356. [http://www.sciencedirect.com/science/journal/09442006 '''LINK'''] [email me for a copy of the entire issue]<br><br><br>

<h2 style=font-size:115%>'''PAPERS, BOOK CHAPTERS AND REVIEWS:'''</h2>
::(names in '''bold''' are current or former students)<br><br>

Schwenk, K., '''A. Les''', G. Mayor, M. Leal and L. Mahler (in prep.) The evolution of lingual displays in ''Anolis'' lizards.<br><br>

Schwenk, K. (in prep.) Evolution of the chameleon tongue, an 'organ of extreme perfection'.<br><br>

Schwenk, K., '''E. Mounce''', '''D. O'Donnell''' and '''T. Shaw''' (in prep.) Chameleon-like lingual prey prehension in generalized iguanian lizards.<br><br>

'''Horwitz, S.''', K. Schwenk and '''W.G. Ryerson''' (in prep.) The kinematics of tongue-flicking in Gila monsters (''Heloderma suspectum'').<br><br>

'''Filoramo, N.''', and K. Schwenk (in prep.) Tongue tips, tropotaxis and the mechanism of chemical delivery to the vomeronasal organs in fork-tongued squamates (Reptilia).<br><br>

'''Ryerson, W.G.''', and K. Schwenk (''manuscript'') Why snakes flick their tongues.<br><br>

[[Image:JEBCover2020.jpg|thumb|left|Phillips, Hewes and Schwenk (2020)]]

:SOME PRE-PUBLICATION PUBLICITY ON THIS TOPIC:<br>

:*UConn Today—[http://today.uconn.edu/blog/2011/04/snakes-lizards-and-tongues/''Snakes, Lizards, and Tongues'']
:*UConn Magazine—''Studying Snakes'' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/UConnMagBlurb11.pdf}}
:*'Super Slo-Mo Tuesdays', ''Daily Planet'', Discovery Channel, Canada
:*''New Scientist,'' on Bill's SICB presentation on aquatic tongue-flicking in ''Nerodia''<br><br>

'''Hewes, A. E.''', and K. Schwenk (2021) Functional morphology of lingual prey capture in a scincid lizard, ''Tiliqua scincoides'' (Reptilia, Squamata). Journal of Morphology 282:127-145. {{pdf|http://hydrodictyon.eeb.uconn.edu/html/2021.Hewes%26Schwenk.pdf}}<br><br>

'''Phillips, J. R., Hewes, A. E.''', and K. Schwenk (2020) The mechanics of air-breathing in gray tree frog tadpoles, ''Hyla versicolor'' LeConte 1825 (Anura: Hylidae). Journal of Experimental Biology 223(5). (doi:10.1242/jeb.219311)<br><br>

Schwenk, K., and '''J. R. Phillips''' (2020) Circumventing surface tension: tadpoles suck bubbles to breathe air. Proceedings of the Royal Society B 287:20192888 (doi:10.1098/rspb.2019.2704)<br><br>

:SOME PRESS ON SCHWENK & PHILLIPS (2020):<br>

:*UConn Today
:*UConn Daily Campus
:*New Scientist
:*Popular Science
:*The Scientist Magazine, Image of the Day
:*Seeker (seeker.com)
:*Bionieuws (The Netherlands)<br><br>

Schwenk, K. (2020) The snakes of East Haddam: foul and loathsome creatures? East Haddam (CT) Land Trust Nature Calendar.<br><br>

Schwenk, K. (2017) Ingestive behavior. Pp. 787-814. In: ''APA Handbook of Comparative Psychology: Vol. 1. Basic Concepts, Methods, Neural Substrate, and Behavior''. J. Call (ed.). American Psychological Association, Washington, DC. <font color="#CC0000">[email for a pdf copy of this chapter: kurt.schwenk@uconn.edu]</font> <br><br>

[[Image:RyersonSchwenk12Cover.jpg|thumb|left|Ryerson and Schwenk (2012)]]

'''Ryerson, W.G.''', and '''S. Horwitz''' (2014) ''EUNECTES MURINUS'' (Green Anaconda). BEHAVIOR / SIDEWINDING. Herpetological Review 45:337-338. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/AnacondaSidewindingRyersonHorwitz14.pdf}}<br><br>

'''Ryerson, W.G.''' (2013) Jumping in the salamander ''Desmognathus ocoee''. Copeia 2013:512-516. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/RyersonDesmogJumping13.pdf}}<br><br>

'''Ryerson, W.G.''', and K. Schwenk (2012) A simple, inexpensive system for digital particle image velocimetry (DPIV) in biomechanics. J. Exp. Zool. 317A:127-140. (JEZA featured paper)
{{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/RyersonSchwenkDPIVJEZA12.pdf}}<br><br>

Schwenk, K. (2011) Letter to the Editor, ''Oberlin Alumni Magazine'' (in response to an article suggesting that social media, e.g., 'tweeting', provide good training for writing). {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkOnWritingLetter11.pdf}} <br><br>

Schwenk, K. (2010) Implementing the organismal agenda. BioScience 60:673-674. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkGCOBBioSci10.pdf}} <br><br>

'''Ryerson, W.G.''', and S. Deban (2010) Buccal pumping mechanics of ''Xenopus laevis'' tadpoles: effects of biotic and abiotic factors. J. Exp. Biol. 213:2444-2452. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/RyersonDebanXenopusFeed10.pdf}}<br><br>

Schwenk, K., and G. P. Wagner (2010) Visualizing vertebrates: new methods in functional morphology (editorial). J. Exp. Zool. 313A:241-243. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkWagnerJEZAed10.pdf}}<br><br>

Flores-Villela, O., C. A. Ríos-Muñoz, K. Schwenk, G. Zamudio-Varela and G. Magaña-Cota (2010) An unpublished manuscript of Alfredo Dugès related to the classification of lizards according to tongue morphology, c. 1898-1899. Archives of Natural History 37:246-254. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/FloresVillelaEADuguesTongue10.pdf}}<br><br>

'''Smith, C. F.''', G. W. Schuett and K. Schwenk (2010) Plasma sex steroids and mating season in wild-living copperheads (''Agkistrodon contortrix'') at the northeastern extreme of their range. Journal of Zoology 280:362-370. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SmithEAcopperheadSteroidsJZ10.pdf}}<br><br>

'''Smith, C. F.''', G. W. Schuett, R. L. Earley, and K. Schwenk (2009) The spatial and reproductive ecology of copperheads, ''Agkistrodon contortrix'' (Serpentes: Viperidae), at the northeastern extreme of their range. Herpetological Monographs 23:43-73.
{{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SmithEAcopperheads09.pdf}}<br><br>

[[Image:GrandChallengesLogo.jpg|thumb|left|Schwenk et al. (2009)]]

Schwenk, K.*, D. Padilla*, G. Bakken* and R. Full* (2009) Grand challenges in organismal biology. Integrative and Comparative Biology 49:7-14. (*authorship equally shared)
{{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkEAGrandChallengesICB09.pdf}}<br><br>

:*''Editorial introduction to Grand Challenges'' by ICB editor, Harold Heatwole, with GC schematic figure {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/GrandChallengesIntroWithFig09.pdf}}<br><br>

'''Filoramo, N.''', and K. Schwenk (2009) The mechanism of chemical delivery to the vomeronasal organs in squamate reptiles: a comparative morphological approach. J. Exp. Zool. 311A:20-34. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/FiloramoSchwenkJEZA09.pdf}}<br><br>

Schwenk, K. (2008) Aristotle’s ghost. Wild River Review. October 2008. [Online reprint of Schwenk (2002)], ''Wild River Review'' [http://www.wildriverreview.com/ '''home page''']<br><br>

Sherbrooke, W. C.,* and K. Schwenk.* (2008) Horned lizards (''Phrynosoma'') incapacitate dangerous ant prey with mucus. J. Exp. Zool. 309A:447-459. (*authorship equally shared) (''JEZA featured paper'') {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SherbrookeSchwenkPhryno08.pdf}}<br><br>

:SOME PRESS ON SHERBROOKE & SCHWENK (2008):<br>

:*''Journal of Experimental Biology'', 'Lizards incapacitate ants with mucus', by Stefan Pulver (''see last page of pdf'') {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/JEBcommentarySherbrookeSchwenk08.pdf}}
:*''ScienceNOW'', 'How to eat a nasty ant', by Greg Miller [http://sciencenow.sciencemag.org/cgi/content/full/2008/929/2 '''LINK'''] or pdf {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/ScienceOnlinePhrynoCommet08}}
:*''Natural History'' (12/08-1/09), 'How to Harvest a Harvester', by Graciela Flores {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/NaturalHistoryBlurb08.pdf}}
:*[http://www.discoverychannel.ca/Showpage.aspx?sid=13287 '''The Daily Planet'''] television segment, ''Discovery Channel'' (Canada). First broadcast 25 March 2009, approx. 6 min. (working on getting a clip posted)<br><br>

'''Smith, C. F.''', K. Schwenk, R. L. Earley and G. W. Schuett (2008) Sexual size dimorphism of the tongue in a North American pitviper. Journal of Zoology 274:367-374. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SmithSchwenkTongueDimorph08.pdf}}<br><br>

Schwenk, K. (2008) Comparative anatomy and physiology of chemical senses in non-avian aquatic reptiles. In: ''Sensory Evolution on the Threshold. Adaptations in Secondarily Aquatic Vertebrates''. J. G. M. Thewissen and S. Nummela (eds.). Univ. of California Press, Berkeley. Pp. 65-81. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkAquatChemo08big.pdf}}<br><br>

Schwenk, K., and J. G. M. Thewissen (2008) Aquatic and semi-aquatic reptiles. In: ''Sensory Evolution on the Threshold. Adaptations in Secondarily Aquatic Vertebrates''. J. G. M. Thewissen and S. Nummela (eds.). Univ. of California Press, Berkeley. Pp. 7-23. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkThewissenAquatRept08.pdf}}<br><br>

Eisthen, H., and Schwenk, K. (2008) The chemical stimulus and its detection. In: ''Sensory Evolution on the Threshold. Adaptations in Secondarily Aquatic Vertebrates''. J. G. M. Thewissen and S. Nummela (eds.). Univ. of California Press, Berkeley. Pp. 35-41. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/EisthenSchwenkChemStim08.pdf}}<br><br>

Schwenk, K. (2006) Evolution illustrated (Letter to the Editor). The Hartford Courant, 4 March:A9. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkEvolLetterCourant06.pdf}}<br><br>

Schwenk, K., and M. Rubega (2005) Diversity of vertebrate feeding systems. Pp. 1-41. In: ''Physiological and Ecological Adaptations to Feeding in Vertebrates''. J. M. Starck and T. Wang (eds.). Science Publishers, Enfield, NH. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkRubegaVertFeeding05a.pdf}}<br><br>

Schulp, A.S., E.W.A. Mulder and K. Schwenk (2005) Did mosasaurs have forked tongues? Neth. J. Geosci. 84:359-371. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchulpEAMosasaurTongues05.pdf}}<br><br>

:SOME PRESS ON SCHULP ET AL. (2005):<br>

:*''WIRED'', 'Mosasaurs: Masters of the Bronx Cheer', by Brian Switek {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/MosasaurForkedTongueWeb.pdf}}<br><br>

Schwenk, K., W. Korff and J. M. Starck (2005) Preface. Integrative organismal biology: papers in honor of Professor Marvalee H. Wake. Zoology 108:261-267. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkEAMHWpreface05.pdf}}<br><br>

Schwenk, K., and G. P. Wagner (2004) The relativism of constraints on phenotypic evolution. Pp. 390-408. In: ''Phenotypic Integration: Studying the Ecology and Evolution of Complex Phenotypes''. M. Pigliucci & K. Preston (eds.). Oxford Univ. Press, Oxford. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkWagnerConstraintRel04.pdf}}<br><br>

Schwenk, K. (2004) REVIEWS OF: ''Introduction to Horned Lizards of North America'', by Wade C. Sherbrooke, and ''Horned Lizards: The Book of Horny Toads'', by Jane Manaster. Copeia 2004:190-192. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkREVIEWSherbrookeManaster04.pdf}}<br><br>

Schwenk, K. (2004) Leapin’ non-ophidian squamates! REVIEW OF: Lizards. Windows to the Evolution of Diversity, by E. R. Pianka and L. J. Vitt. Trends in Ecology and Evolution 19:357-358. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkREVIEWTREE04.pdf}}<br><br>

Vitt, L. J., E. R. Pianka, W. E. Cooper and K. Schwenk (2003) History and the global ecology of squamate reptiles. American Naturalist 162:44-60. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/VittEAAmNat03.pdf}}<br><br>

Schwenk, K., and G. P. Wagner. (2003) Constraint. Pp. 52-61. In: ''Key Words and Concepts in Evolutionary Developmental Biology''. B. K. Hall & W. M. Olson (eds.). Harvard University Press, Cambridge. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkWagnerConstraint03.pdf}}<br><br>

Schwenk, K. (2002) Constraint. Pp. 196-199. In: ''Encyclopedia of Evolution'', M. Pagel (ed.). Oxford Univ. Press, Oxford. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkConstraint02.pdf}}<br><br>

Schwenk, K. (2002) Aristotle’s ghost. Creative Nonfiction No.19:32-40 (Special Issue: “Diversity Dialogues”). {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkAristotle02.pdf}}<br><br>

:SOME PRESS ON SCHWENK (2002):<br>

:*''Chronicle of Higher Education'', 'Thoughts on Prejudice, Diversity, and Evolution' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/ChronHighEdAristotleGhost02.pdf}}<br><br>

Schwenk, K. (2001) Extrinsic vs. intrinsic lingual muscles: a false dichotomy? Bull. Mus. Comp. Zool. (Harvard) 156:219-235. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkTongueMuscles01.pdf}}<br><br>

Schwenk, K., and G. P. Wagner (2001) Function and the evolution of phenotypic stability: connecting pattern to process. American Zoologist 41:552-563. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkWagnerAmZool01.pdf}}<br><br>

Schwenk, K. (2001) Functional units and their evolution. Pp. 165-198. In: ''The Character Concept in Evolutionary Biology''. G. P. Wagner (ed.). Academic Press, San Diego. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkFuncUnits01.pdf}}<br><br>

Nishikawa, K. C., and K. Schwenk (2001) Ingestion in amphibians and reptiles. In: ''Encyclopedia of Life Sciences''. John Wiley & Sons, Ltd: Chichester [doi:10.1038/npg.els.0001835] (pdf = 7 pp) {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/NishiSchwenkELS01.pdf}} [http://www.els.net/'''LINK TO ELS SITE'''] <br><br>

Schwenk, K. (2000) The apian way: from beehives to burrows, animal building sheds new light on biology. REVIEW OF: ''The Extended Organism. The Physiology of Animal-Built Structures'', by J. Scott Turner. The New York Times Book Review, 10 Dec., p. 37. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkNYTimesBkRev00.pdf}} OR [http://partners.nytimes.com/books/00/12/10/reviews/001210.10schwent.html '''SEE IT ONLINE HERE''']<br><br>

Schwenk, K. (2000) Preface. Pp. xiii-xv. In: ''Feeding: Form, Function and Evolution in Tetrapod Vertebrates''. K. Schwenk (ed.). Academic Press, San Diego. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkFeedPreface00.pdf}}<br><br>

Schwenk, K. (2000) Tetrapod feeding in the context of vertebrate morphology. Pp. 3-20. In: ''Feeding: Form, Function and Evolution in Tetrapod Vertebrates''. K. Schwenk (ed.). Academic Press, San Diego. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkFeedTetMorph00.pdf}}<br><br>

Schwenk, K. (2000) An introduction to tetrapod feeding. Pp. 21-61. In: ''Feeding: Form, Function and Evolution in Tetrapod Vertebrates''. K. Schwenk (ed.). Academic Press, San Diego. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkFeedIntro00.pdf}}<br><br>

Schwenk, K. (2000) Feeding in lepidosaurs. Pp. 175-291. In: ''Feeding: Form, Function and Evolution in Tetrapod Vertebrates''. K. Schwenk (ed.). Academic Press, San Diego. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkFeedLepidosaurs00.pdf}}<br><br>

Schwenk, K. (2000) A bibliography of turtle feeding. Pp. 169-171. In: ''Feeding: Form, Function and Evolution in Tetrapod Vertebrates''. K. Schwenk (ed.). Academic Press, San Diego. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkFeedTurtleBiblio00.pdf}}<br><br>

Wagner, G. P.,* and K. Schwenk* (2000) Evolutionarily Stable Configurations: functional integration and the evolution of phenotypic stability. Pp. 155-217. In: ''Evolutionary Biology'', vol. 31. M. K. Hecht, R. J. MacIntyre & M. T. Clegg (eds.). Kluwer Academic/Plenum Press, New York. (*authorship equally shared).<font color="#CC0000"> '''''YOU CAN DOWNLOAD A PDF OF THIS PAPER''''' [http://uconn.academia.edu/KurtSchwenk/Papers/279293/Evolutionarily_Stable_Configurations_Functional_Integration_and_the_Evolution_of_Phenotypic_Stability '''HERE''']</font><br><br>

Schwenk, K. (1998) REVIEW OF: ''Lizards, Vols. 1 & 2''. By M. Rogner. Copeia 1998:1114-1116. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkREVIEWLizards98.pdf}}<br><br>

Schwenk, K. (1998) REVIEW OF: ''Comparative Osteological Examinations of Geckonids, Eublepharids and Uroplatids'', by V. Wellborn (translated by A. P. Russell, A. M. Bauer & A. Deufel). Herpetological Translations No. 1. Breck Bartholomew, Bibliomania, Logan, Utah. Copeia 1998:259-260. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkREVIEWGekkoOsteo98.pdf}}<br><br>

Schwenk, K. (1997) Snakes and the evolution of Harry Greene. REVIEW OF: ''Snakes. The Evolution of Mystery in Nature'', by H. W. Greene. Natural History 106:8-9 (July/August). {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkREVGreene97.pdf}}<br><br>

Dial, B. E., and K. Schwenk (1996) Olfaction and predator detection in ''Coleonyx brevis'' (Squamata: Eublepharidae) with comments on the functional significance of buccal pulsing in geckos. J. Exp. Zool. 276:415-424. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/DialSchwenk96.pdf}}<br><br>

Pigliucci, M., C. D. Schlichting, C. S. Jones and K. Schwenk (1996) Developmental reaction norms: the interactions among allometry, ontogeny and plasticity. Plant Species Biology 11:69-85. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/PigliuccieEA96.pdf}}<br><br>

[[Image:TREECover95_sm.jpg |thumb|right|Schwenk (1995)]]
[[Image:NaturalHistoryCover95_sm.jpg |thumb|right|Schwenk (1995)]]

Schwenk, K. (1996) REVIEW OF: ''Vertebrate Life'', 4th ed., by F. H. Pough et al., Quart. Rev. Biol. 71:581-582.<br><br>

Schwenk, K. (1995) REVIEW OF: ''The Lizard Man Speaks'', by E. R. Pianka. Quart. Rev. Biol. 70:328-329. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkREVIEWPiankaQRB95.pdf}}<br><br>

Schwenk, K. (1995) Of tongues and noses: chemoreception in lizards and snakes. Trends in Ecology & Evolution 10:7-12. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkChemoTREE95.pdf}}<br><br>

Schwenk, K. (1995) A utilitarian approach to evolutionary constraint. Zoology 98:251-262. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkConstraint95.pdf}}<br><br>

Schwenk, K., and H. W. Greene (1995) No electrostatic sense in snakes. Nature 373:26. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkGreeneNature95.pdf}}<br><br>

Schwenk, K. (1995) The serpent's tongue. Natural History 104:48-55 (April). {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkNatHist95.pdf}}<br><br>

:*Letter to the Editor re: ''The serpent's tongue'' and Schwenk response {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkNatHistLetterResponse95.pdf}}<br><br>

Schwenk, K. (1994) Why snakes have forked tongues. Science 263:1573-1577. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkForkedTongues94.pdf}}<br>

:'''SOME PRESS ON SCHWENK (1994):'''<br><br>

:RADIO INTERVIEWS
:*National Public Radio (All Things Considered) (see ''The NPR Interviews'', 1995. R. Siegel, ed.)
:*BBC World News Service
:*BBC-4
:*Voice of America
:*CBC (Canadian Broadcasting Corp.)
:*AAAS Science Update (Mutual Radio Network)
:*WFIU Radio (Indiana Univ., ''A Moment of Science'')<br>

:TELEVISION
:*ABC news, New Haven, with Geoff Fox (television)
:*TV Ontario (segment for children's show)<br>

:PRINT
:*Associated Press (newspapers throughout North America and Europe) Example: {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/APArticleForkedTongues94.pdf}}
:*''New Scientist'' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/NewScientistForkedTongues94.pdf}}
:*''Chronicle of Higher Education'' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/ChronicleHigherEd94.pdf}}
:*''Discover Magazine'' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/DiscoverAug94.pdf}}
:*''National Geographic Magazine''
:*''Australia Nature'' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/AustraliaNature95.pdf}}
:*''Reader's Digest''
:*''Omni Magazine''
:*''Weekly Reader Magazine''
:*''Scholastic Super Science'' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/ScholasticSuperScience07.pdf}}
:*''International Wildlife''
:*''Wilson Quarterly'' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/ForkedTongueCommentWilsonQuart95.pdf}}
:*''Washington Post''
:*''USA Today'' (front page: {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/USATodayForkTongue.pdf}})
:*''International Herald Tribune''
:*''Boston Globe'' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/ForkTongueBostonGlobe94.pdf}}
:*''Daily Telegraph'' (London)
:*''La Guardia'' (Spain)
:*''Hartford Courant'' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/ForkedTongueHartfordCourant94.pdf}}
:*''New Haven Register''
:*''Manchester Journal Inquirer''
:*''San Jose Mercury News''
:*''Willimantic Chronicle''
:*''College and University Dialogue'' (Adventist journal) {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/DialogueAdventistSnakeTongue96.pdf}}<br>

:BOOKS
:*''Encyclopaedia Britannica Yearbook of Science and the Future'' (1995)
:*''Blue Genes and Polyester Plants'', by S. McGrayne (1997)
:*''The NPR Interviews, 1995'', edited by Robert Siegel (1995)<br><br>

Schwenk, K. (1994) Craniology: getting a head. REVIEW OF: ''The Skull'', 3 vols. J. Hanken & B. K. Hall (eds.). Science 263:1779-1780. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkREVIEWSkullsScience94.pdf}}<br><br>

Schwenk, K. (1994) Comparative biology and the importance of cladistic classification: a case study from the sensory biology of squamate reptiles. Biological J. Linnean Soc. 52:69-82. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkCladClassBJLS94.pdf}}<br><br>

Schwenk, K. (1994) Systematics and subjectivity: the phylogeny and classification of iguanian lizards reconsidered. Herpetological Review 25:53-57. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkHerpRev94.pdf}}<br><br>

Schwenk, K., and D. B. Wake (1993) Prey processing in ''Leurognathus marmoratus'' and the evolution of form and function in desmognathine salamanders (Plethodontidae). Biol. J. Linn. Soc. 49:141-162. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkWakeBJLS93.pdf}}<br><br>

Schwenk, K. (1993) Are geckos olfactory specialists? J. Zool., Lond. 229:289-302. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkGeckoOlfaction93.pdf}}<br><br>

Schwenk, K. (1993) The evolution of chemoreception in squamate reptiles: a phylogenetic approach. Brain, Behavior and Evolution 41:124-137. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkBBE93.pdf}}<br><br>

Schwenk, K. and G. C. Mayer (1991) Tongue display in anoles and its evolutionary basis. 4th ''Anolis'' Newsletter. J. Losos & G. Mayer (eds.). National Museum of Natural History (Smithsonian), Division of Amphibians and Reptiles, Washington, DC. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkMayer91.pdf}}<br><br>

Schwenk, K. (1989) REVIEW OF: ''The Evolution of Vertebrate Design'', by L. B. Radinsky. American Scientist 77:84.<br><br>

Schwenk, K. and G. S. Throckmorton (1989) Functional and evolutionary morphology of lingual feeding in squamate reptiles: phylogenetics and kinematics. J. Zool., Lond. 219:153-175. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkThrockmorton89.pdf}}<br><br>

Schwenk, K. and D. A. Bell (1988) A cryptic intermediate in the evolution of chameleon tongue projection. Experientia 44:697-700. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkBellChamTongue88.pdf}}<br><br>

Schwenk, K. (1988) Comparative morphology of the lepidosaur tongue and its relevance to squamate phylogeny. In: R. Estes & G. Pregill (eds.). ''Phylogenetic Relationships of the Lizard Families''. Stanford Univ. Press, Stanford, 569-598. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/Schwenk1988lg.pdf}}<br><br>

Schwenk, K. and H. W. Greene (1987) Water collection and drinking in ''Phrynocephalus helioscopus'': a possible condensation mechanism. J. Herpetology 21:134-139. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkGreenePhrynocephalus87.pdf}}<br><br>

Schwenk, K. (1986) Morphology of the tongue in the tuatara, ''Sphenodon punctatus'' (Reptilia: Lepidosauria), with comments on function and phylogeny. J. Morphology 188:129-156. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkJMorph86.pdf}}<br><br>

Wake, M. H. and K. Schwenk (1986) A preliminary report on the morphology and distribution of taste buds in gymnophiones, with comparison to other amphibians. J. Herpetology 20:254-256. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/WakeSchwenkCaecilianTB86.pdf}}<br><br>

Schwenk, K. (1985) Occurrence, distribution and functional significance of taste buds in lizards. Copeia 1985:91-101. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/SchwenkTasteBuds85.pdf}}<br><br>

Good, D. A., and K. Schwenk (1985) A new species of ''Abronia'' (Lacertilia: Anguidae) from Oaxaca, Mexico. Copeia 1985:135-141. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/GoodSchwenkAbronia85.pdf}}<br><br>

Schwenk, K. (1984) ''Evolutionary Morphology of the Lepidosaur Tongue''. Ph.D. dissertation, University of California, Berkeley.<br><br>

Houck, L., and K. Schwenk (1984) The potential for long-term sperm competition in a plethodontid salamander. Herpetologica 40:410-415. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/HouckSchwenkSperm84.pdf}}<br><br>

Jaksic, F. M., and K. Schwenk (1983) Natural history observations on ''Liolaemus magellanicus'', the southernmost lizard in the world. Herpetologica 39:457-461. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/JaksicSchwenkLiolaemus83.pdf}}<br><br>

Bemis, W., K. Schwenk and M. H. Wake (1983) Morphology and function of the feeding apparatus in ''Dermophis mexicanus'' (Amphibia: Gymnophiona). Zool. J. Linn. Soc. 77:75-96. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/BemisSchwenkWakeDermophis83.pdf}}<br><br>

Jaksic, F. M., H. W. Greene, K. Schwenk and R. L. Seib (1982) Predation upon reptiles in Mediterranean habitats of Chile, California, and Spain: a comparative analysis. Oecologia 53:152-159. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/JaksicEApredation82.pdf}}<br><br>

Schwenk, K., S. K. Sessions and D. M. Peccinini-Seale (1982) Karyotypes of the basiliscine lizards ''Corytophanes cristatus'' and ''Corytophanes hernandesii'', with comments on the relationship between chromosomal and morphological evolution in lizards. Herpetologica 38:493-501. {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/1982SchwenkEAKaryotypes.pdf}}<br><br>

[[Image:SkeletonsDancingSequence.gif|none|frame|center|''memento mori'']]

[[Category:EEB Faculty|Schwenk]] [[Category:EEB People|Schwenk]]

Classic Works in Evolutionary Biology—The List With Links

2022-01-05T17:45:57Z

Paul Lewis: /* WHAT IS THIS PAGE?: */

== '''WHAT IS THIS PAGE?:''' ==
<br/>
'''NOTE 1: Most of the links on this page can only be used by members of the Dept. of Ecology & Evolutionary Biology at the University of Connecticut. Some links, however, are open and can be used by anyone, as can the information on the page, of course.'''<br><br>

'''NOTE 2: To see the latest incarnation of the graduate seminar out of which this page grew, go to this''' [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Classic_Works_in_Evolutionary_Biology_S2014 '''LINK''']<br><br>

For a complete explanation/introduction, see [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Classic_Works_in_Evolutionary_Biology '''EVOLUTIONARY CLASSICS MAIN PAGE''']<br>
(hydrodictyon.eeb.uconn.edu/eebedia/index.php/Classic_Works_in_Evolutionary_Biology)
<br/><br/>

This page maintained by [http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Kurt_Schwenk'''Kurt Schwenk''']<br/><br/>



'''TO DOWNLOAD A PDF COPY OF THIS LIST, CLICK ON THE ICON:''' {{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/ClassicWorksEvolBiolList1_1_13.pdf}}<br/><br/>

'''annotation initials key:'''<br/>

KS = Kurt Schwenk<br/>
CS = Carl Schlichting<br/><br/>

'''Note 1:''' a name in parentheses after a citation indicates a faculty member who has a hard copy of the listed work she/he is willing to loan<br/>
'''Note 2:''' PDFs of many of the listed papers are being uploaded on a regular basis (it takes a while) - keep checking back!<br/><br/><br/>

== '''BOOKS:''' ==
<br/>
'''Baldwin, J. M. 1902. ''Development and Evolution''. MacMillan, London.'''<br/>
{{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/BaldwinDevoEvo1902.pdf}}<br/><br/>

'''Bateson, W. 1902/1909. ''Mendel's Principles of Heredity. A Defense.'' Cambridge Univ. Press, Cambridge.''' [http://www.esp.org/books/bateson/mendel/facsimile/ ['''READ 1902 EDITION ONLINE HERE''']]<br>
[Mendel's work on plant breeding and inheritance (see below) was all but lost when Bateson resurrected it in this book (KS)] <br>
{{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/BatesonMendelPrinciples1902.pdf}}1902 (1st) edition<br>
{{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/BatesonMendelPrinciples1909.pdf}}1909 edition (greatly expanded)<br><br>

'''Bonner, J. T. 1958. ''The Evolution of Development''. Cambridge Univ. Press, Cambridge.'''<br/><br/>

'''Bonner, J. T. 1988. ''The Evolution of Complexity by Means of Natural Selection''. Princeton Univ. Press, Princeton.''' (Schwenk)<br/><br/>

'''Darwin, C. 1859. ''On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life''. 1st ed. John Murray, London.''' [http://darwin-online.org.uk/ ['''DARWIN'S WORKS ONLINE HERE''']]<br>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/DarwinOriginSelections.pdf}}EXCERPTS HERE<br/><br/>

'''Darwin, C. 1872. ''On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life''. 6th ed. with additions and corrections. John Murray, London.''' [http://darwin-online.org.uk/ ['''DARWIN'S WORKS ONLINE HERE''']]<br/><br/>

'''Darwin, C. 1871. ''The Descent of Man, and Selection in Relation to Sex''. Vols. 1 & 2. 1st ed. John Murray, London.''' [http://darwin-online.org.uk/ ['''DARWIN'S WORKS ONLINE HERE''']]<br/><br/>

'''Dawkins, R. 1976. ''The Selfish Gene''. Oxford Univ. Press, Oxford.''' (Schwenk)<br/>
[phenotypes as contrivances of genes to replicate themselves (KS)]<br/><br/>

'''de Beer, G. R. (editor). 1938. ''Evolution. Essays on Aspects of Evolutionary Biology Presented to Professor E. S. Goodrich on His Birthday''. Oxford/Clarendon Press, Oxford.''' (Schwenk)<br/><br/>

'''de Beer, G. R. 1940. ''Embryos and Ancestors''. 1st ed. Oxford/Clarendon Press, Oxford.'''<br/>
[Early ideas on evo-devo. 3rd edition published in 1958. Schwenk has copy of latter. pdf courtesy of Hao Wang (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/deBeerEmbryoAncest1940.pdf}}<br/><br/>

'''Dobzhansky, T. 1937. ''Genetics and the Origin of Species''. Columbia Univ. Press, New York.''' (Schwenk)<br/>
[+2nd and 3rd editions in 1941 and 1951, respectively (KS)]<br/><br/>

'''Dobzhansky, T. 1970. ''Genetics of the Evolutionary Process''. Columbia Univ. Press, New York.''' (Schwenk)<br/><br/>

'''Endler, J. A. 1986. ''Natural Selection in the Wild''. Monographs in Population Biology No. 21. Princeton Univ. Press, Princeton, NJ.''' (Schwenk)<br/>
[in addition to technical material on measuring selection, etc., Endler has an excellent and thoughtful general/philosophical discussion of natural selection - a very good introduction to the concept (KS)]<br/><br/>

'''Fisher, R. A. 1930. ''The Genetical Theory of Natural Selection''. Clarendon Press, Oxford.''' [http://openlibrary.org/details/geneticaltheoryo031631mbp '''FULL TEXT LINK HERE''']<br/>{{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/FisherGeneticalTheory1930.pdf}}<br/><br/>

'''Frazzetta, T. H. 1975. ''Complex Adaptations in Evolving Populations''. Sinauer Assoc., Sunderland, MA.''' (Schwenk)<br/>
[a quirky and often overlooked book on the evolution of 'complex' adaptations, character complexes, etc. (KS)]<br/><br/>

'''Goldschmidt, R. 1940. ''The Material Basis of Evolution''. Yale Univ. Press, New Haven.''' (Schwenk)<br/>
[ever wonder where the term "hopeful monster" came from? (KS)]<br/><br/>

'''Gould, S. J. 1977. ''Ontogeny and Phylogeny''. Belknap/Harvard Univ. Press, Cambridge.''' (Schwenk)<br/>
[read this book for the first half - a fantastic history of developmental morphology - Haeckel, von Baer - those guys (KS)]<br/><br/>

'''Grant, V. 1963. ''The Origin of Adaptations''. Columbia Univ. Press, New York.''' (Schwenk)<br/><br/>

'''Gregory, W. K. 1951. ''Evolution Emerging. A Survey of Changing Patterns From Primeval Life to Man''. Vols. 1 & 2. Macmillan, New York.''' (Schwenk)<br/>
[The first volume is a history of life and was probably a bit dated even when it was published; the second volume is an amazing collection of figures of organisms, skulls, fossils, etc. Together they represent an amazing post-war treatise on biodiversity and the physical evidence for evolution (KS)]<br/><br/>

'''Haldane, J. B. S. 1932. ''The Causes of Evolution''. Longmans, Green and Co., London.'''<br/>
[Birth of quantitative genetics; appendix has summary of his papers on selection intensity, etc. (CS); reprinted 1990 by Princeton Univ. Press, which you can buy or read about [http://press.princeton.edu/titles/4618.html '''HERE'''] (KS)]<br/><br/>

'''Hennig, W. 1966. ''Phylogenetic Systematics''. Univ. of Illinois Press, Urbana.''' (Schwenk)<br/>
[translated by D. Dwight Davis and R. Zangerl; this book represents a revised and expanded version of Hennig’s ''Grundzüge einer Theorie der phylogenetischen Systematik'' (1950) and is therefore a new book rather than a simple translation. This is the bible of cladistics that when introduced in this country caused a paradigm shift in systematics (KS)]<br/><br/>

'''Huxley, J. S. 1942. ''Evolution, the Modern Synthesis''. Harper, New York.''' (Schwenk)<br/>
[new edition in 1963, Allen and Unwin, London, with new Introduction by Huxley (KS)]<br/><br/>

'''Jepsen, G. L., G. G. Simpson and E. Mayr (editors). 1949. ''Genetics, Paleontology and Evolution.'' Princeton Univ. Press, Princeton, NJ.''' (Schwenk)<br/>
[the short Foreword by Jepsen is a nice capsule summary of the aims and scope of the Synthesis (KS)]<br/><br/>

'''Malthus, T. 1798. ''An Essay on the Principle of Population, as it Affects the Future Improvement of Society with Remarks on the Speculations of Mr. Godwin, M. Condorcet, and Other Writers.'' Printed for J. Johnson, in St. Paul's Church-yard, London.'''<br/>
[not a work on evolution, but this famous essay was instrumental to Darwin in formulating the notion of the 'struggle for existence', which played a critical part in his theory of natural selection (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/MalthusPopulation1798.pdf}}<br/><br/>

'''Mayr, E. 1942. ''Systematics and the Origin of Species''. Columbia Univ. Press, New York.''' (Schwenk)<br/><br/>

'''Mayr, E. 1963. ''Animal Species and Evolution''. Belknap/Harvard Univ. Press, Cambridge, MA.''' (Schwenk)<br/>
[synthesis of the Synthesis from the man who gave us the Synthesis; explicates Mayr’s view on geographic/allopatric speciation, among other things (KS)]<br/><br/>

'''Morgan, T. H. 1919. ''The Physical Basis of Heredity''. Monographs on Experimental Biology. J. B. Lippincott Co., Philadelphia.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/PhysicalBasisHeredityTHMorgan1919.pdf}}<br/><br/>

'''Rensch, B. 1959. ''Evolution Above the Species Level''. Columbia Univ. Press, New York.''' (Schwenk)<br/>
[originally published in 1954 in German; Rensch was a dual PhD in philosophy and biology; his philosophical bent is obvious in his writings. Interestingly, his student, Gerhard Roth - who works on evolutionary neuroanatomy of the brain and especially sensory systems in amphibians - is also a dual PhD in philosophy and biology, as is Schlichting's former student, Massimo Pigliucci... (KS)]<br/><br/>

'''Riedl, R. 1978. ''Order in Living Organisms''. John Wiley and Sons, New York.'''<br/>
[fascinating, but rather opaque; read his 1977 paper (below) for the essentials; Günter Wagner at Yale was a Riedl student (KS)]<br/><br/>

'''Schmalhausen, I. I. 1949. ''Factors of Evolution. The Theory of Stabilizing Selection.'' The Blakiston Co., Philadelphia.''' (Schwenk)<br/>
[reprinted 1986 by the Univ. of Chicago Press. Schmalhausen worked in isolation in Stalinist Russia and his accomplishments are all the more remarkable because of this (KS)]<br/><br/>

'''Simpson, G. G. 1944. ''Tempo and Mode in Evolution''. Columbia Univ. Press, New York.''' (Schwenk)<br/><br/>

'''Simpson, G. G. 1949. ''The Meaning of Evolution''. Yale Univ. Press, New Haven.''' (Schwenk)<br/>
[more of a popular book, but influential (KS)]<br/><br/>

'''Simpson, G. G. 1953. ''The Major Features of Evolution''. Columbia Univ. Press, New York.''' (Schwenk)<br/>
[a complete reworking of ''Tempo and Mode''; virtually a new, more synthetic book. If you can read only one Simpson book, make it this one. The job of relating the population/genetic/microevolutionary phenomena of concern to most of the ‘synthesists’ to macroevolutionary/deep time patterns evident in the fossil record fell to Simpson. He makes a heroic effort here and was way ahead of his time. Originates the notion of the 'adaptive zone' and discusses the relation between adaptive zones and adaptive radiations at length (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/SimpsonMajorFeatures1953.pdf}} EXCERPTS<br/><br/>

'''Stebbins, G. L., Jr. 1950. ''Variation and Evolution in Plants''. Columbia Univ. Press, New York.'''<br/>
[Stebbins was the one botanist ‘officially’ welcomed into the Synthesis fold (KS)]<br/>
[Interestingly, despite the emphasis of the synthesis on "population thinking", this book is largely focused at the macroevolutionary level. (CS)]<br/><br/>

'''Thompson, D. W. 1942. ''On Growth and Form: A New Edition''. Cambridge Univ. Press, Cambridge.''' (Schwenk)<br/>
[reprinted unabridged by Dover Press, 1992. An amazing book dealing with allometry, morphological transformation and quantification, among other things. Introduces the application of Cartesian coordinates to examine 2-D shape change (KS)]<br/><br/>

'''Waddington, C. H. 1957. ''The Strategy of the Genes. A Discussion of Some Aspects of Theoretical Biology''. Macmillan, New York.''' (Schwenk)<br/>
[the dawning of the modern evo-devo movement; a critically important, but often neglected book - a 'must read' for people interested in development and phenotypic evolution. Explicates the important concepts of 'canalization' and the 'epigenetic landscape', among others (KS)]<br/><br/>

'''White, M. J. D. 1978. ''Modes of Speciation''. W. H. Freeman, San Francisco.''' (Schwenk)<br/>
[the view from cytogenetics, which at the time was very big (cytogenetics has been largely supplanted by molecular-genetic approaches) (KS)]<br/><br/>

'''Whyte, L. L. 1965. ''Internal Factors in Evolution''. George Braziller, New York.''' (Schwenk)<br/>
[more philosophical than biological, this book, virtually ignored at the time, is becoming increasingly influential; deals with organismal ‘homeostasis’ and introduces the important concept of 'internal selection' (KS)]<br/><br/>

'''Williams, G. C. 1966. ''Adaptation and Natural Selection''. Princeton Univ. Press, Princeton, NJ.''' (Schwenk)<br/><br/>

'''Williams, G. C. 1992. ''Natural Selection. Domains, Levels, and Challenges''. Oxford Univ. Press, Oxford.''' (Schwenk)<br/><br/>

'''Wright, S. 1968. ''Evolution and the Genetics of Populations. Vol. 1. Genetic and Biometric Foundations.'' Univ. of Chicago Press, Chicago.'''<br/>

:Wright, S. 1969. ''Evolution and the Genetics of Populations. Vol. 2. The Theory of Gene Frequencies''. Univ. of Chicago Press, Chicago.<br/>
:Wright, S. 1977. ''Evolution and the Genetics of Populations. Vol. 3. Experimental Results and Evolutionary Deductions''. Univ. of Chicago Press, Chicago.<br/>
:Wright, S. 1978. ''Evolution and the Genetics of Populations. Vol. 4. Variability Within and Among Natural Populations''. Univ. of Chicago Press, Chicago. (Schwenk)<br/><br/>

== '''ARTICLES AND BOOK CHAPTERS:''' ==
<br/>
'''Alberch, P., S. J. Gould, G. F. Oster and D. B. Wake. 1979. Size and shape in ontogeny and phylogeny. Paleobiology 5:296-317.'''<br/>
[rejects Gould’s (1977) ‘clock model’ of heterochrony and formalizes the notion of ‘ontogenetic trajectories’; proposes a formal lexicon of heterochrony terms (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/AlberchEAHeterochrony79.pdf}}<br/><br/>

'''Arnold, S. J. 1983. Morphology, performance and fitness. American Zoologist 23:347-361.'''<br/>
[a critical paper for anyone interested in functional biology and evolution (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/ArnoldMorphPerfFitness83.pdf}}<br/><br/>

'''Antonovics, J. 1976. The nature of limits to natural selection. Annals of the Missouri Botanical Garden 63:224-247.'''<br/>
[insufficient genetic variability and the swamping effects of gene flow are inadequate explanations of limits to natural selection. Comparison of evolutionary responses in different populations subjected to similar selective forces, comparison of rare and widespread species, and comparison of marginal and central populations are all neglected research areas that bear on the nature of limits to natural selection. Plant populations provide us with well-defined, operationally viable systems for addressing these comparisons. Several possible constraints on range extension of ecologically marginal populations are considered in detail. Selection on fitness components that are themselves negatively correlated will be ineffective: such negative correlations are to be expected in natural populations. Small size of marginal populations will reduce severely the probability of obtaining appropriate character combinations; it will increase the swamping effects of gene flow; and it may lead to inbreeding depression effects. Gene flow will have different effects depending on whether the genes concerned are effectively neutral, advantageous, or deleterious in the population into which they migrate. Gene flow will spread beneficial genes rapidly, but may retard divergence if density of marginal populations is low and swamping effects are high. Finally a population entering a new habitat is likely to meet new competitors and predators: the coevolutionary responses of the latter may counteract adaptive responses by the species undergoing range extension. All these factors are likely to interact in important ways in marginal populations. The study of limits to natural selection is likely to be a fruitful future research area, and one in which the detailed documentation of the systematist will provide invaluable baseline information (CS)]<br/><br/>

'''Baldwin, J. M. 1896. A new factor in evolution. Amer. Nat. 30:441-451, 536-553.'''<br/>
[for some reason this has been called 'the Baldwin Effect'; see http://en.wikipedia.org/wiki/Baldwin_effect for more information. Also note that Baldwin's paper was divided into two portions in Am. Nat., hence the two pdfs. The 'Baldwin Effect' remains contentious. For some modern invocations, see the papers below. For an especially lucid historical and conceptual discussion about the Baldwin Effect and its relationship to 'genetic assimilation' and 'genetic accommodation', see M. J. West-Eberhard's [2003] book, ''Developmental Plasticity and Evolution'' [Oxford Univ. Press] (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/BaldwinNewFactorEvol1896a.pdf}} {{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/BaldwinNewFactorEvol1896b.pdf}}<br/>

:Ananth, M. 2005. Psychological altruism vs. biological altruism: narrowing the gap with the Baldwin Effect. Acta Biotheoretica 53:217-239. {{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/AltruismBaldwinEffect05.pdf}}<br/>
:Crispo, E. 2007. The Baldwin Effect and genetic assimilation: revisiting two mechanisms of evolutionary change mediated by phenotypic plasticity. Evolution 61:2469-2479. {{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/PlasticityGenAssimilationBaldwin07.pdf}}<br/><br/>

'''Boag, P. T., and P. R. Grant. 1981. Intense natural selection in a population of Darwin’s finches in the Galapagos. Science 214: 82-85.'''<br/>
[see Grant & Grant paper below (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/BoagGrantNatSelFinches81.pdf}}<br/><br/>

'''Bock, W. J., and G. von Wahlert. 1965. Adaptation and the form-function complex. Evolution 19:269-299.'''<br/>
[distinguishes the concepts of structure, function and biological role; extremely useful for those who think in terms of 'the target of selection', i.e., what is actually being selected for? (also relevant to ideas about 'levels' of selection) (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/BockvonWahlertForm-Function65.pdf}}<br/><br/>

'''Brown, W. L., Jr., and E. O. Wilson. 1956. Character displacement. Systematic Zoology 5:49-64.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/BrownWilsonCharDisplace56.pdf}}<br/><br/>

'''Clausen, J., and W. M. Hiesey. 1958. Experimental studies on the nature of species. IV. Genetic structure of ecological races. Publication 615, Carnegie Institution of Washington, Washington, DC.'''<br/><br/>

'''Coddington, J. A. 1988. Cladistic tests of adaptational hypotheses. Cladistics 4:3-22.'''<br/>
[Perhaps not yet old enough to be a true 'classic', but given how fast systematics has progressed over the last 25 years, I think it's fair to include this; a seminal paper demonstrating the necessity of taking phylogenetic branching pattern into account when drawing conclusions about adaptation—in this example, the fact that 'messy' spider webs are actually derived compared to the esthetically pleasing orb webs, which were thought to be the derived state largely because of intuition and bias about how evolution 'should' proceed, i.e., from disordered to ordered, not the other way around! (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/AdaptationCladisticsCoddington88.pdf}}<br/><br/>

'''Conway Morris, S. 1989. Burgess shale faunas and the Cambrian explosion. Science 246: 339-346.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/ConwayMorrisBurgessShaleScience89.pdf}}<br/><br/>

'''Coyne, J. A. and H. A. Orr. 1989. Patterns of speciation in ''Drosophila''. Evolution 43:362-381.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/CoyneOrrPatternsSpeciationDros89.pdf}}<br/>

:Coyne, J. A. and H. A. Orr. 1997. “Patterns of speciation in ''Drosophila''” revisited. Evolution 51:295-303. {{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/CoyneOrrSpeciationDrosRevisited97.pdf}}<br/><br/>

'''Dobzhansky, T. and O. Pavlovsky. 1957. An experimental study of interaction between genetic drift and natural selection. Evolution 11:311-319.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/DobzhanskyPavlovskyGenDrift57.pdf}}<br/><br/>

'''Ehrlich, P. R. and P. H. Raven. 1964. Butterflies and plants: a study in coevolution. Evolution 18:586-608.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/EhrlichRavenCoevolultion64.pdf}}<br/><br/>

'''Ehrlich, P. R. and P. H. Raven. 1969. Differentiation of populations. Science 165:1228-1232.'''<br/>
[seminal paper for our understanding of gene flow and species coherence (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/EhrlichRavenScience69.pdf}}<br/><br/>

'''Eldredge, N. and S. J. Gould. 1972. Punctuated equilibria: An alternative to phyletic gradualism. In T. J. M. Schopf (ed.), ''Models in Paleobiology'', pp. 82-115. Freeman, Cooper and Company, San Francisco.'''<br/>
[see Gould and Eldredge, below. This was the first paper to establish the 'PE' model of phenotypic evolution, but it is more or less an abstract. The model is much better developed in the second paper (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/EldredgeGouldPuncEquil72.pdf}}<br/><br/>

'''Epling, C. and T. Dobzhansky. 1942. Genetics of natural populations. VI. Microgeographic races in ''Linanthus parryae''. Genetics 27:317-332.'''<br/>
[populations seem to represent the action of genetic drift: fixation of one or the other of two alleles (CS)]<br/><br/>

'''Felsenstein, J. 1985. Phylogenies and the comparative method. Amer. Nat. 125:1-15.'''<br/>
[this is the paper that really started the whole emphasis on 'comparative methods' in the sense of statistically controlling for the effects of evolutionary history/phylogeny (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/PhylogCompMethodFelsenstein85.pdf}}<br/><br/>

'''Fisher, R. A. 1932. The evolutionary modification of genetic phenomena. Pp. 165-172. In: ''Proceedings of the Sixth International Congress of Genetics'', Vol. 1. D. F. Jones (ed.).''' <br>
[http://www.esp.org/books/6th-congress/facsimile/title3.html '''FULL TEXT''']<br/><br/>

'''Frazzetta, T. H. 1970. From hopeful monsters to bolyerine snakes? Amer. Nat. 104:55-72.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/FrazzettaHopefulMonsters70.pdf}}<br/><br/>

'''Gottlieb, L. D. 1984. Genetics and morphological evolution in plants. American Naturalist 123:681-709.'''<br/>
[The genetic basis of differences in morphology within and between flowering plant species is reviewed in order to elucidate how many genetic changes are responsible for the evolution of new characters. Two broad morphological categories are evident. Differences in structure, shape, orientation, and presence versus absence are frequently discrete and appear to be governed by one or two genes. Differences in dimensions, weight, and number usually exhibit continuous variation and are influenced by numerous genes, though many of them probably act only indirectly via general effects at the whole organ or whole plant levels. Although it is difficult to specify the relative contributions of the two morphological categories during evolutionary divergence, it is clear that discrete character differences are more common in plants than in animals. I propose that their prevalence in plants is a direct consequence of the open, less integrative, and plastic patterns of plant morphogenesis which permit large changes in morphology on the basis of relatively few genetic changes. Morphological divergence among genera or families of flowering plants may reflect many fewer genetic changes than is the case for similar taxonomic levels of higher animals. Accurate estimates of the number of genes responsible for character divergence require knowledge of the ontogenetic and anatomical details of character development and these must be coordinated with genetic analyses. Until this knowledge becomes available, general conclusions about the number of genetic changes responsible for morphological diversity are premature (CS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/GottliebGenetMorphEvolPlants84.pdf}}<br/><br/>

'''Gould, S. J. and N. Eldredge. 1977. Punctuated equilibria: the tempo and mode of evolution reconsidered. Paleobiology 3:115-151.'''<br/>
[A formalization and expansion of the concept introduced by Eldredge and Gould in 1972 as Gould gradually makes the concept his own. An attempt to find a rapprochement between microevolution/population-level phenomena and macroevolutionary patterns revealed in the fossil record. The concept remains contentious - for example, see Levinton and Simon critique below (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/GouldEldredge77.pdf}}<br/>

:Levinton, J. S., and C. M. Simon. 1980. A Critique of the punctuated equilibria model and implications for the detection of speciation in the fossil record. Systematic Zoology 29:130-142. {{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/LevintonSimonPuncEquilCritique80.pdf}}<br/>
:[our own Chris Simon critiqued the punctuated equilibria model on several levels, including its restriction of speciation models to peripheral isolates, the confounding of species identification in the fossil record with 'stasis' and the assertion that species selection is random with respect to phenotypic trends (KS)]<br/><br/>

'''Gould, S. J. and R. C. Lewontin. 1979. The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proceedings of the Royal Society of London. Series B. 205:581-598.'''<br/>
[Whatever you think of this paper, there is no denying that it was absolutely seminal; probably the most critiqued paper ever published in terms of its cross-disciplinary appeal - including an entire edited volume analyzing it from a rhetorical perspective! It marks the beginning of the modern era of constraint theory. Also, if you've wondered about the reference to the "Panglossian paradigm", this refers to Dr. Pangloss, a character in Voltaire's book 'Candide' (1759) who, even in the worst of circumstances, continues to explain why everything is just as it must be and that this is the most perfect of all worlds. It's a hilarious story and biting social commentary, with a great deal of relevance to biology and especially academics, generally (e.g., "what a great genius this Pococurante must be! Nothing can please him" and "but still, there must certainly be a pleasure in criticising everything, and in seeing faults where others think they see beauties." And for grad students regarding their advisors: "...but when I realized that he had doubts about everything, I figured I knew as much as himself, and had no need of a guide to learn ignorance." Finally, who can beat, "I have grown old in misery and disgrace, living with only one buttock..."?). But I digress... (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/GouldLewontinSpandrels79.pdf}}<br/>

:Queller, D. C. 1995. The spaniels of St. Marx and the Panglossian paradox: A critique of a rhetorical programme. Quart. Rev. Biology 70:485-489. {{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/QuellerSpanielsSt.Marx95.pdf}}<br/>
:[one of many critiques published about G & L – an indication of its seminal influence (KS)]<br/>
:Cain, A. J. 1964. The perfection of animals. Pp. 36-63. In: Viewpoints in Biology. Carthy, J. D. and C. L. Duddington (eds.). Butterworths, London. {{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/CainPerfectionOfAnimals89.pdf}}<br/>
:[reprinted in Biol. J. Linn. Soc. 36:3-29 (1989); G & L are often criticized for creating a ‘straw man’ in the ‘adaptationist programme’, but this paper exemplifies some of the extreme adaptationist thinking of the time (KS)]<br/>
:Du Brul, E. L. and H. Sicher. 1954. ''The Adaptive Chin''. American Lecture Series, Publication No. 180. Charles C Thomas, Springfield. (Schwenk)<br/>
:[G & L refer to the human chin as their “favorite example” of incorrect identification of an atomized character. This monograph provides the foundation for their views (the chin as an integrated field, not a ‘thing’). Gould cites it in his 1977 book, but G & L do not, despite referring to it implicitly (KS)]<br/><br/>

'''Gould, S. J. and E. S. Vrba. 1982. Exaptation—a missing term in the science of form. Paleobiology 8:4-15.'''<br/>
[turns out it really wasn’t missing – it was just called ‘preadaptation’, which, for reasons they never fully justify, they eschew. They build a new vocabulary around the term ‘aptation’ to distinguish past vs. present function and selection. Although their terminology is sometimes applied, it is often done so self-consciously, i.e., a term such as "exaptation" is used in a sentence, but it is followed by the citation, as such: (sensu Gould and Vrba, 1982) - a sign that the language hasn't really caught on and become part of our evolutionary vernacular (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/Gould&Vrba82.pdf}}<br/><br/>

'''Grant, P. R. and B. R. Grant. 2002. Unpredictable evolution in a 30-year study of Darwin’s Finches. Science 296:707-711.'''<br/>
[the one exception to the ‘too recent to qualify as a classic’ rule – the summation of 30 years' work on the action of natural selection is simply too incredible and important not to include. The Grants demonstrate remarkable phenotypic lability in the beak related to climate change and its effect on food availability. Can 30 years of data be generalized to macroevolutionary patterns? That is the big question! (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/GrantGrantFinchesScience02.pdf}}<br/><br/>

'''Haffer, J. 1969. Speciation in Amazonian forest birds. Science 165:131-137.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HafferSpeciationAmazonBirdsRefugia69.pdf}}<br/><br/>

'''Haldane, J. B. S. 1924. A mathematical theory of natural and artificial selection. Part I. Trans. Cambridge Phil. Soc. 23:19-41.'''<br/>
[In a series of 10 papers from 1924-1934, Haldane outlines the first mathematical models for many cases of evolution due to selection, an important concept in the modern evolutionary synthesis]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HaldaneMathTheorySelectionI24.pdf}}<br/>

:Haldane, J. B. S. 1924. A mathematical theory of natural and artificial selection. Part II. The influence of partial self-fertilisation, inbreeding, assortative mating and selective fertilisation on the composition of Mendelian populations and on natural selection. Proc. Cambridge Phil. Soc. 1:158-163.<br/>
:Haldane, J. B. S. 1926. A mathematical theory of natural and artificial selection. Part III. Proc. Cambridge Phil. Soc. 23:363-372.<br/>
:Haldane, J. B. S. 1927. A mathematical theory of natural and artificial selection. Part IV. Proc. Cambridge Phil. Soc. 23:607-615.<br/>
:Haldane, J. B. S. 1927. A mathematical theory of natural and artificial selection. Part V. Selection and mutation. Proc. Cambridge Phil. Soc. 23:838-844.<br/>
:Haldane, J. B. S. 1930. A mathematical theory of natural and artificial selection. Part VI. Isolation. Proc. Cambridge Phil. Soc. 26:220-230.<br/>
:Haldane, J. B. S. 1931. A mathematical theory of natural and artificial selection. Part VII. Selection intensity as a function of mortality rate. Proc. Cambridge Phil. Soc. 27:131-136.<br/>
:Haldane, J. B. S. 1931. A mathematical theory of natural and artificial selection. Part VIII. Metastable populations. Proc. Cambridge Phil. Soc. 27:137-142.<br/>
:Haldane, J. B. S. 1932. A mathematical theory of natural and artificial selection. Part IX. Rapid selection. Proc. Cambridge Phil. Soc. 28:244-248.<br/>
:Haldane, J. B. S. 1934. A mathematical theory of natural and artificial selection. Part X. Some theorems in artificial selection. Genetics 19:412-429. {{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HaldaneMathTheorySelectionX34.pdf}}<br/><br/>

'''Haldane, J. B. S. 1932. The time of action of genes, and its bearing on some evolutionary problems. American Naturalist 66:5-24.'''<br/>
[points out the importance of knowledge about the age/stage of gene expression - gametophytes and gametes to zygotes, embryos and immature and mature organisms (CS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HaldaneTimeOfActionOfGenes32.pdf}}<br/><br/>

'''Haldane, J. B. S. 1932. Can evolution be explained in terms of known genetical facts? Pp. 185-189. In: ''Proceedings of the Sixth International Congress of Genetics, Vol. 1.'' D. F. Jones (ed.).''' [http://www.esp.org/books/6th-congress/facsimile/title3.html '''FULL TEXT BOOK''']<br/><br/>

'''Haldane, J. B. S. 1957. The cost of natural selection. Journal of Genetics 55:511-524.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HaldaneCostOfNatSel1957.pdf}}<br/><br/>

'''Hamilton, W. D. 1964. The genetical evolution of social behavior. 1. J. Theor. Biol. 7:1-16.'''<br/><br/>

'''Hamilton, W. D. 1964. The genetical evolution of social behavior. 2. J. Theor. Biol. 7:17-52.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HamiltonSocialBehavII64.pdf}}<br/><br/>

'''Hamilton, W. D. and M. Zuk. 1982. Heritable true fitness and bright birds: a role for parasites? Science 218:384-387.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HamiltonZukScience82.pdf}}<br/><br/>

'''Hardy, G. H. 1908. Mendelian proportions in a mixed population. Science 28:49-50.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HardyMendelianPopulation1908.pdf}}<br/><br/>

'''Huxley, J. S. 1924. Constant differential growth-ratios and their significance. Nature 114:895-896.'''<br/>
[The first formalization of 'isometry' and 'allometry' (KS)]<br/><br/>

'''Jacob, F. 1977. Evolution and tinkering. Science 196:1161-1166.'''<br/>
[A wonderful 'perspective' essay that puts together many modern themes around the notion of "hierarchy" - ahead of its time (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/JacobEvolutionTinkering77.pdf}}<br/><br/>

'''Johannsen, W. 1911. The genotype conception of heredity. Amer. Nat. 45:129-159.'''<br/>
[This paper is the first to propose the concept of the phenotype - genotype dichotomy. Forceful argument for the "genotype-concept". Discusses Woltereck's work on reaction norms as consistent with "g-c", as the variations are phenotypic in nature. Describes pleiotropy. Dismisses idea of chromosomes being units of heredity (CS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/GenotypeConceptionHeredity11.pdf}}<br/><br/>

'''Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217:624-626.'''<br/><br/>

'''King, J. L. and T. H. Jukes. 1969. Non-Darwinian evolution. Science 164:788-798.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/KingJukesNonDarwinianEvol69.pdf}}<br/><br/>

'''King, M.-C. and A. C. Wilson. 1975. Evolution at two levels in humans and chimpanzees. Science 188:107-116.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/KingWilsonHumansChimps75.pdf}}<br/><br/>

'''Lande, R. 1982. A quantitative genetic theory of life history evolution. Ecology 63:607-615.'''<br/>
[Dynamic models of quantitative (polygenic) characters are more generally applicable in the analysis of life history evolution than are static optimization methods or one and two locus genetic models. A dynamic theory of life history evolution is derived by synthesizing population demography with quantitative genetics. In a population under weak selection with a nearly stable age distribution, the relative fitness of individuals with a particular life history phenotype can be approximated as an average of age-specific relative fecundity and mortality rates, weighted respectively by the present productivity and future reproductive value of each age-class. An adaptive topography is constructed showing that, with phenotype- and age-specific fecundity and mortality rates constant in time, evolution of the mean life history maximizes the intrinsic rate of increase of a population. However, the rate and direction of evolution in response to selection are strongly influenced by genetic correlations among characters. Negative genetic correlations among major components of fitness are often obscured phenotypically by positive environmental correlations, but commonly constitute the ultimate constraint on life history evolution, as illustrated by artificial selection experiments. Methods are suggested for measuring selective forces and evolutionary constraints that affect life history characters in natural populations (CS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/LandeQuantGenLifeHist1982.pdf}}<br/><br/>

'''Lande, R. and S. J. Arnold. 1983. The measurement of selection on correlated characters. Evolution 37:1210-1226.'''<br/>
[reanalysis of Bumpus' data (CS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/LandeArnoldCorrelatedChar83.pdf}}<br/><br/>

'''Lewontin, R. C. 1957. The adaptations of populations to varying environments. Cold Spring Harbor Symposium on Quantitative Biology 22:395-408.'''<br/>
[homeostasis of populations and individuals (CS)]<br/><br/>

'''Lewontin, R. C. 1978. Adaptation. Scientific American 239:212-228.'''<br/>
[strangely enough, this article published in the popular press Scientific American is one of the best single treatments of this critical, yet slippery, concept (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/LewontinAdaptation78.pdf}}<br/><br/>

'''Lewontin R. C. and L. C. Birch. 1966. Hybridization as a source of variation for adaptation to new environments. Evolution 20: 315-336.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/LewontinBirchHyrbridization66.pdf}}<br/><br/>

'''Lewontin, R. C. and J. L. Hubby. 1966. A molecular approach to the study of genic heterozygosity in natural populations. II. Amount of variation and degree of heterozygosity in natural populations of ''Drosophila pseudoobscura''. Genetics 54:595-609.'''<br/>
[a pretty boring and seemingly unremarkable paper to read now, but it marks the beginning of molecular population genetics; part of a series of papers by Lewontin and colleagues - see below (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/LewontinHubbyII66.pdf}}<br/>
:Hubby, J. L., and R. C. Lewontin. 1966. A molecular approach to the study of genic heterozygosity in natural populations. I. The number of alleles at different loci in ''Drosophila pseudoobscura''. Genetics 54:577-594.{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HubbyLewontinI66.pdf}}<br/>
:Prakash, S., and R. C. Lewontin. 1968. A molecular approach to the study of genic heterozygosity in natural populations, III. Direct evidence of coadaptation in gene arrangements of ''Drosophila.'' Proc. Natl. Acad. Sci. 59:398-405. {{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/PrakashLewontinIII68.pdf}}<br/>
:Prakash, S., R. C. Lewontin and J. L. Hubby. 1969. A molecular approach to the study of genic heterozygosity in natural populations IV. Patterns of genic variation in central, marginal and isolated populations of ''Drosophila pseudoobscura.'' Genetics 61:841-858.{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/PrakashEAIV69.pdf}}<br/><br/>

'''Maynard Smith, J. 1966. Sympatric speciation. American Naturalist 100:637-650.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/MaynardSmithSympSpeciation66.pdf}}<br/><br/>

'''Maynard Smith, J., R. Burian, S. Kauffman, P. Alberch, J. Campbell, B. Goodwin, R. Lande, D. Raup and L. Wolpert. 1985. Developmental constraints and evolution: a perspective from the Mountain Lake Conference on Development and Evolution. Quart. Rev. Biol. 60:265-287.'''<br/>
[an important and oft-cited work on evolutionary/developmental constraint marred by internal contradictions reflecting its having been ‘written by committee’ (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/MaynardSmithEADevoConstraint85.pdf}}<br/><br/>

'''Mayr, E. 1940. Speciation phenomena in birds. American Naturalist 74: 249-278.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/MayrSpeciationBirds40.pdf}}<br/><br/>

'''Mayr, E. 1949. Speciation and selection. Proc. Amer. Phil. Soc. 93:514-519.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/MayrSpeciationSelection49.pdf}}<br/><br/>

'''Mayr, E. 1949. Speciation and systematics. Pp. 281-298, In: Jepsen, G. L., G. G. Simpson and E. Mayr (eds.), ''Genetics, Paleontology and Evolution''. Princeton Univ. Press, Princeton, NJ.''' (Schwenk)<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/MayrSpeciationSystematics49.pdf}}<br/><br/>

'''Mayr, E. 1954. Change of genetic environment and evolution. Pp. 157-180. In: Evolution as a Process. J. Huxley, A. C. Hardy & E. B. Ford (eds.). Allen and Unwin, London.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/MayrGeneticEnvironEvol54.pdf}}<br/><br/>

'''Mayr, E. 1981. Biological classification: toward a synthesis of opposing methodologies. Science 214:510-516.'''<br/>
[written at the height of the 'classification wars', this paper makes the case for 'evolutionary taxonomy' - a hybrid approach that allows for cladistic methods in phylogeny reconstruction, but which emphasizes morphological disparity among taxa by allowing formal paraphyletic groups, e.g., a 'Reptilia' that excludes birds (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/MayrClassificationSynthesis81.pdf}}<br/><br/>

'''Northcutt, R. G., and C. Gans. 1983. The genesis of neural crest and epidermal placodes: a reinterpretation of vertebrate origins. Quarterly Review of Biology 58:1-28.'''<br/>
[okay, this one is really of most interest to vertebrate biologists, but it is a fantastic story about the origin of a novel tissue type/germ layer - the neural crest- and how it is almost single-handedly responsible for the vertebrate skull/head - a true novelty and complex structure if ever there was one (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/NorthcuttGansVertOriginsQRB83.pdf}}<br/><br/>

'''Orr, H. A., and J. A. Coyne. 1992. The genetics of adaptation: a reassessment. American Naturalist 140:725-742.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/OrrCoyneGenetAdaptation92.pdf}}<br/><br/>

'''Raup, D. M. 1961. The geometry of coiling in gastropods. Proceedings of the National Academy of Sciences USA 47:602-609.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/RaupCoilingGastropodsPNAS61.pdf}}<br/><br/>

'''Raup, D. M. 1966. Geometric analysis of shell coiling: general problems. Journal of Paleontology 40:1178-1190.'''<br/>
[critically important paper for constraint theory—although ironically, Raup does not invoke constraint himself; Raup formalizes the notion of potential morphospace as a way of assessing observed biodiversity (KS)]<br>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/RaupShellCoilingMorphospace66.pdf}}<br/><br/>

'''Riedl, R. 1977. A systems-analytical approach to macro-evolutionary phenomena. Quart. Rev. Biol. 52:351-370.'''<br/>
[a must read for complex systems/phenotype/constraint freaks (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/RiedlSystemsMacroEvolQRB77.pdf}}<br/><br/>

'''Roth, V. L. 1984. On homology. Biol. J. Linn. Soc. 22:13-29.'''<br/>
[the best synthesis of the homology concept to that time; the starting point for many subsequent treatments (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/HomologyRoth84.pdf}}<br/><br/>

'''Stebbins, G. L., Jr. 1949. Reality and efficacy of selection in plants. Proc. Amer. Phil. Soc. 93:501-513.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/StebbinsSelectPlants49.pdf}}<br/><br/>

'''Stern, C. 1943. The Hardy-Weinberg law. Science 97:137-138.'''<br/>
[have you ever wondered why it's called the 'Hardy-Weinberg' principle? (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/SternHardyWeinbergLaw1943.pdf}}<br/><br/>

'''Trivers, R. L. 1971. The evolution of reciprocal altruism. Quarterly Review of Biology 46:35-57.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/TriversReciprocalAltruism71.pdf}}<br/><br/>

'''Trivers, R. L. 1974. Parent-offspring conflict. American Zoologist 14:249-264.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/TriversParentOffspringConflict74.pdf}}<br/><br/>

'''Turesson, G. 1922. The genotypical response of the plant species to the habitat. Hereditas 3:211-350.'''<br/>
[full text [http://www.archive.org/details/genotypicalrespo00turerich '''HERE''']]<br/><br/>

'''Van Valen, L. 1973. A new evolutionary law. Evolutionary Theory 1:1-30.'''<br/>
[the ‘red queen hypothesis’- the idea that you must evolve just to stay in place [http://pespmc1.vub.ac.be/REDQUEEN.html ("for an evolutionary system, continuing development is needed just in order to maintain its fitness relative to the systems it is co-evolving with")]; predator-prey arms races are a specific example of the red queen phenomenon. Also notable for being published in the journal Van Valen, himself, started - "dedicated to content over form" - basically a typed sheaf of papers Van Valen produced and distributed. It lasted many years and had a number of important papers published in it, including this one (KS)]<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/VanValenNewEvolutionaryLaw73.pdf}}<br/><br/>

'''Vermeij, G. 1973. Adaptation, versatility, and evolution. Systematic Zoology 22:466-477.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/VermeijAdaptVersatilityEvol73.pdf}}<br/><br/>

'''Weldon, W. F. R. 1901/1902. A first study of natural selection in ''Clausilia laminata'' (Montagu). Biometrika 1:109-115.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/WeldonSelectionClausilia1901.pdf}}<br/><br/>

'''Williams, G. C. 1957. Pleiotropy, natural selection and the evolution of senescence. Evolution 11:398-411.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/WilliamsSenescence57.pdf}}<br/><br/>

'''Wright, S. 1931. Evolution in Mendelian populations. Genetics 16:97-159.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/WrightEvolMendelianPop31.pdf}}<br/><br/>

'''Wright, S. 1932. The roles of mutation, inbreeding, crossbreeding and selection in evolution. Pp. 356-366. In: ''Proceedings of the Sixth International Congress of Genetics'', Vol. 1. D. F. Jones (ed.).'''<br/>
{{pdf|http://hydrodictyon.eeb.uconn.edu/people/schwenk/Wright1932.pdf}}<br/>
[A terrific paper that introduces several fundamental concepts including Wright's famous 'adaptive landscape' metaphor; you can get open access to the whole volume [http://www.esp.org/books/6th-congress/facsimile/title3.html '''HERE'''] (KS)]<br/><br/>

'''Wright, S. 1951. The genetical structure of populations. Annals of Eugenics 15: 323-354.'''<br/>
{{pdf|https://hydrodictyon.eeb.uconn.edu/projects/classics/Wright1951.pdf}}<br/><br/>

'''SOME USEFUL, NON-TRADITIONAL HISTORICAL TREATMENTS'''<br/><br/>

'''Comment:''' The list above is reserved for actual scientific contributions in the history of post-Darwinian evolutionary biology; it does not contain secondary sources, i.e., books and papers primarily about the history and philosophy of evolution. There are many such contributions, particularly histories of Darwinism and neo-Darwinism (the Synthesis). Some of the latter have been written by historians and philosophers of biology, some by biologists, notably Ernst Mayr’s various historical treatments of the Synthesis. William Provine is another important author in this area who is much less dogmatic. However, we are presently in the midst of what might eventually be interpreted as a Kuhnian ‘paradigm shift’ in evolutionary biology, with the roles of development, ontogeny and organismal phenotype playing an increasingly important part in our views about evolutionary mechanisms and patterns. The modern evo-devo movement, though often thought of as ‘new’, is actually based on a very old tradition originating in the early 19th century, primarily from the German school of evolutionary and developmental morphology ('transcendental morphology', e.g., Haeckel, von Baer), carried into the 20th century by Baldwin, de Beer and Waddington, to name a few. There is ongoing debate about the importance of evo-devo in a general theory of evolution. One extreme views it as virtually irrelevant and holds that the genetic-population-selection principles of the neo-Darwinian Synthesis are both necessary and sufficient to account for phenotypic evolution: 'macroevolution' is simply the extrapolation of microevolutionary processes over deep time. The opposite extreme views selection as a minor player in phenotypic evolution and posits that developmental ‘rules of form’ have primacy. It is likely, as usual, that the truth lies somewhere in between.
The particular books listed below are notable because they deal with the history of modern evolutionary theory explicitly from the vantage point of evo-devo, and as such, provide novel perspectives and analyses. They also discuss historical figures usually ignored or even denigrated in ‘traditional’ histories. Beware historical revisionism in Science, as well as politics; it is up to us to sift through the various views to see which is consonant with the primary literature. Of course, this is not always easy to do: for example, not everyone can read through von Baer’s (1828) 300+ page ''Über Entwickelungsgeschichte der Thiere: Beobachtung und Reflexion'' (“On the Developmental History of Animals: Observations and Reflection”).<br/><br/>

'''Amundson, R. 2005. ''The Changing Role of the Embryo in Evolutionary Thought. Roots of Evo-Devo.'' Cambridge Studies in Philosophy and Biology. Cambridge Univ. Press, Cambridge''' (Schwenk)<br/>
[Ron Amundson is an extraordinarily thoughtful and insightful historian/philosopher of biology. His essays on adaptation and constraint are top notch. I expect this book is the same (KS)].<br/><br/>

'''Richards, R. J. 1992. ''The Meaning of Evolution. The Morphological Construction and Ideological Reconstruction of Darwin’s Theory.'' Univ. of Chicago Press, Chicago.''' (Schwenk)<br/>
[Richards provides an excellent short history of the role and importance of the (mostly) German school of morphology and evolutionary morphology/development and its role in the formation of evolutionary theory—a perspective that is not easy to come by and which speaks to the modern evo-devo movement. An alternative to the Mayrian neoDarwinian, revisionist view of evolutionary history.]<br/><br/>

'''Richards, R. J. 2002. ''The Romantic Conception of Life. Science and Philosophy in the Age of Goethe.'' Univ. of Chicago Press, Chicago.'''<br/><br/>

[[Category:EEB Seminars]]

Systematics Seminar

2021-08-26T18:03:49Z

Paul Lewis:

<span style="color:red">The home page of the Systematics Seminar has moved to [https://uconneeb.github.io/systseminar/ https://uconneeb.github.io/systseminar/]. This EEBedia page is no longer maintained or updated.</span>

This is the home page of the UConn EEB department's Systematics Seminar (EEB 6486). This is a graduate seminar devoted to issues of interest to graduate students and faculty who make up the systematics program at the University of Connecticut.

[[Systematics Listserv|Click here for information about joining and using the Systematics email list]]

== Meeting time and place ==

We meet on Fridays at 2 PM in the Bamford Room (TLS 171b).

== Theme and Schedule for Fall 2019 ==

[https://lukejharmon.github.io/pcm/ We will be reading Luke J. Harmon's book on comparative phylogenetic methods]

Students registered for the course shall pick one chapter of the book to elaborate on, either by choosing and assigning a paper relevant to the chapter, or by bringing in their own project/data to present.

==August 30==
Discussion of chapter 1 - A Macroevolutionary Research Program, an organizational meeting

==September 6==
Discussion of chapter 2 - Fitting Statistical Models to Data, [http://phytools.org/mexico2018/ex/2/Intro-to-phylogenies.html Introduction to phylogenies in R]

==September 13==
Discussion of chapter 3 - Introduction to Brownian Motion

==September 20==
Discussion of chapter 4 - Fitting Brownian Motion

==September 27==
Discussion of chapter 5 - Multivariate Brownian Motion

==October 4==
Discussion of chapter 6 - Beyond Brownian Motion<br>[https://github.com/kevinliam/Miscellaneous/blob/master/add_tree_info.zip Kevin shows us how to add images to plotted trees in R]

==October 11==
Discussion of chapter 7 - Models of discrete character evolution — Lisa Terlova

==October 18==
Discussion of chapter 8 - Fitting models of discrete character evolution — Lisa Terlova

==October 25==
Discussion of chapter 9 - Beyond the Mk model - Kevin Keegan

==November 1==
Discussion of chapter 10 - Introduction to birth-death models — Zach Muscavitch

==November 8==
Discussion of chapter 11 - Fitting birth-death models — Tanner Matson

==November 15==
Discussion of chapter 12 - Beyond birth-death models - Katie Taylor

==November 22==
Discussion of chapter 13 - Characters and diversification rates - Amanda Hewes

==December 6==
Discussion of chapter 14 - Summary

== Information for discussion leaders ==
'''Seminar Format:''' Registered students should be prepared to lead discussions, perhaps more than once depending on the number of participants.

The leader(s) will be responsible for (1) selecting the readings, (2) announcing the selection, (3) giving an introductory presentation, (4) driving discussion and (5) setting up and putting away the projector.

'''Readings:''' In consultation with the instructors, each leader should assign one primary paper for discussion and up to two other ancillary papers or resources. The readings should be posted to EEBedia at least 5 days in advance.

'''Announcing the reading:''' The leader should add an entry to the schedule (see below) by editing this page. There are two ways to create a link to the paper:

1. If the paper is available online through our library, it is sufficient to create a link to the DOI:
<nowiki>:[http://dx.doi.org/10.1093/sysbio/syv041 Doyle et al. 2015. Syst. Biol. 64:824-837.]</nowiki>
In this case, you need not give all the citation details because the DOI should always be sufficient to find the paper. The colon (:) at the beginning of the link causes the link to be indented and placed on a separate line. Note that the DOI is in the form of a URL, starting with <code><nowiki>http://dx.doi.org/</nowiki></code>. Here is how the above link looks embedded in this EEBedia page:
:[http://dx.doi.org/10.1093/sysbio/syv041 Doyle et al. 2015. Syst. Biol. 64:824-837.]

2. If the paper is not available through the library, upload a PDF of the paper to [http://dropbox.uconn.edu the UConn dropbox], being sure to use the secure version so that it can be password protected. Copy the URL provided by dropbox, and create a link to it as follows (see the [[Dropbox Test]] page for other examples):
<nowiki>:[https://dropbox.uconn.edu/dropbox?n=SystBiol-2015-Doyle-824-37.pdf&p=ELPFIc5NtO3c4V44Ls Doyle et al. 2015.]</nowiki>
In this case, you should provide a full citation to the paper for the benefit of those that visit the site long after the dropbox link has expired; however, the full details need not be part of the link text. Here is what this kind of link looks like embedded in this EEBedia page:

:[https://dropbox.uconn.edu/dropbox?n=SystBiol-2015-Doyle-824-37.pdf&p=ELPFIc5NtO3c4V44Ls Doyle et al. 2015.] Full citation: Vinson P. Doyle, Randee E. Young, Gavin J. P. Naylor, and Jeremy M. Brown. 2015. Can We Identify Genes with Increased Phylogenetic Reliability? Systematic Biology 64 (5): 824-837. doi:10.1093/sysbio/syv041

If you have ancillary papers, upload those to the dropbox individually and create separate links.

Finally, send a note to the [[Systematics Listserv]] letting everyone know that a paper is available.

'''Introductory PowerPoint/KeyNote Presentation:''' Introduce your topic with a 10- to 15-minute PowerPoint or KeyNote presentation. Dedicate at least 2/3 of that time to placing the subject into the broader context of the subject areas/themes and at most 1/3 of it to introducing the paper, special definitions, taxa, methods, etc. Never exceed 15 minutes. (For example, for a reading on figs and fig-wasps, broaden the scope to plant-herbivore co-evolution.) Add images, include short movie clips, visit web resources, etc. to keep the presentation engaging. Although your presentation should not be a review of the primary reading, showing key figures from the readings may be helpful (and appreciated). You may also want to provide more detail and background about ancillary readings, which likely have not been read by all.

'''Discussion:''' You are responsible for driving the discussion. Assume everyone in attendance has read the main paper. There are excellent suggestions for generating class discussions on Chris Elphick’s Current Topics in Conservation Biology course site. See section under expectations.

Prepare 3-5 questions that you expect will spur discussion. Ideally, you would distribute questions a day or two before our class meeting.

'''Projector:'''
The Bamford room has joined the modern world--you should just need to plug in your computer or USB key to project.

== Past Seminars ==
* [[Systematics Seminar Spring 2019|Spring 2019]]
* [[Systematics Seminar Fall 2018|Fall 2018]]
* [[Systematics Seminar Spring 2018|Spring 2018]]
* [[Systematics Seminar Fall 2017|Fall 2017]]
* [[Systematics Seminar Fall 2014|Fall 2014]]
* [[Systematics Seminar Fall 2013|Fall 2013]]
* [[Systematics Seminar Spring 2012|Spring 2012]]
* [[Systematics Seminar Fall 2011|Fall 2011]]
* [http://darwin.eeb.uconn.edu/wiki/index.php/Statistical_phylogeography Spring 2011] (we joined Kent Holsinger's seminar on Statistical Phylogeography this semester)
* [[Systematics Seminar Fall 2010|Fall 2010]]
* [[Systematics Seminar Spring 2010|Spring 2010]]
* [[Systematics Seminar Fall 2009|Fall 2009]]
* [[Systematics Seminar Fall 2008|Fall 2008]]
* [[Systematics Seminar Spring 2008|Spring 2008]]
* [[Systematics Seminar Fall 2007|Fall 2007]]
* [[Systematics Seminar Spring 2007|Spring 2007]]
* [http://hydrodictyon.eeb.uconn.edu/courses/systematicsseminar/SystSemFall2006.html Fall 2006]
* [http://hydrodictyon.eeb.uconn.edu/courses/systematicsseminar/SystSemSpring2005.html Spring 2005]
* [http://hydrodictyon.eeb.uconn.edu/courses/systematicsseminar/SystSemFall2004.html Fall 2004]
* [http://hydrodictyon.eeb.uconn.edu/courses/phylomath/ Spring 2004]

[[Category:EEB Seminars]]

Phylogenetics: RevBayes Lab

2020-05-12T23:38:12Z

Paul Lewis:

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Log in to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and RevBayes modules:
module load paml/4.9
module load RevBayes/1.0.13

There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
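
The confounding of rate and time can be illustrated with a minimal Python sketch. It assumes the simple JC69 model (not the HKY model used later in this lab) purely because the formula is short; the function name <tt>jc69_p_diff</tt> is made up for this illustration. Under JC69, the probability that a site differs across a branch depends only on the product of rate and time, so two very different scenarios with the same product are indistinguishable from the data alone:

```python
import math

def jc69_p_diff(rate, time):
    """Probability that a site differs across a branch under JC69 (assumed model)."""
    d = rate * time  # expected number of substitutions per site
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Two different scenarios that share the same product rate * time = 1.0
fast_short = jc69_p_diff(rate=2.0, time=0.5)
slow_long = jc69_p_diff(rate=0.5, time=2.0)

print(fast_short, slow_long)  # the two values are identical
```

Because the data cannot tell these scenarios apart, any preference between them has to come from the priors placed on rates and times.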

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (the tree height of 1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
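
One way to see where a birth rate near 2.6 might come from is the following Python sketch. It rests on an assumption that is not stated in this lab: that the tree height is the sum of the waiting times while there are 2, 3, ..., 20 lineages, each waiting time being exponential with rate (number of lineages) × (birth rate). Other ways of conditioning the process give somewhat different expectations. The function names are hypothetical.

```python
import random

def expected_yule_height(n_tips, birth_rate):
    """Expected height: sum of mean waiting times with k = 2..n_tips lineages."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

def simulate_yule_height(n_tips, birth_rate, rng):
    """One random height under the same assumption (exponential waiting times)."""
    return sum(rng.expovariate(k * birth_rate) for k in range(2, n_tips + 1))

rng = random.Random(1)  # arbitrary seed for this illustration
heights = [simulate_yule_height(20, 2.6, rng) for _ in range(5)]

print(round(expected_yule_height(20, 2.6), 3))  # about 0.999
print([round(h, 2) for h in heights])           # individual trees vary around 1
```

The simulated heights show the stochastic variation that evolver removes by rescaling every tree to the specified height.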

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
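
If you would rather not paste the seed and tree description by hand, the nine lines described above can be assembled programmatically. This is only a sketch: the helper <tt>build_control</tt> is hypothetical, and the 3-taxon tree in the example is a placeholder used only to show the layout (the real control file needs your 20-taxon tree description).

```python
def build_control(seed, newick):
    """Return the text of an evolver control file laid out as described above."""
    lines = [
        "2",                # line 1: nexus output
        str(seed),          # line 2: random number seed
        "20 10000 1",       # line 3: taxa, sites, data sets
        "-1",               # line 4: use branch lengths from the tree description
        newick.strip(),     # line 5: tree description
        "4",                # line 6: HKY model
        "5",                # line 7: kappa
        "0 0",              # line 8: gamma shape, number of rate categories
        "0.1 0.2 0.3 0.4",  # line 9: frequencies in the order T, C, A, G
    ]
    return "\n".join(lines) + "\n"

# Placeholder tree, NOT a valid 20-taxon input
example = build_control(12345, "((A:0.5,B:0.5):0.5,C:1.0);")
print(example)
```

In practice you would pass in the tree description from ''tree.txt'' and write the returned text to ''control.dat''.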

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.
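
To get a feel for what an Exponential prior with rate 0.01 implies, here is a small Python sketch using the standard formulas for the exponential distribution (mean = 1/rate, CDF = 1 - exp(-rate × x)). The function names are invented for this illustration.

```python
import math

def exponential_mean(rate):
    """Mean of an Exponential(rate) distribution."""
    return 1.0 / rate

def exponential_cdf(x, rate):
    """Probability that an Exponential(rate) variable is less than x."""
    return 1.0 - math.exp(-rate * x)

prior_rate = 0.01
prior_mean = exponential_mean(prior_rate)
mass_below_true = exponential_cdf(2.6, prior_rate)  # 2.6 = simulated birth rate

print(prior_mean)                 # prior mean is 100
print(round(mass_below_true, 4))  # only a few percent of prior mass lies below 2.6
```

The prior is therefore very spread out relative to the true birth rate of 2.6, so the data, not the prior, should dominate the estimate.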

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)
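
The three node types can be mimicked with a toy Python sketch. This is emphatically not how RevBayes is implemented; the classes <tt>Constant</tt>, <tt>Stochastic</tt>, and <tt>Deterministic</tt> are invented here only to show that a deterministic node is recomputed from its parents whenever it is queried.

```python
class Constant:
    """Node whose value is fixed (like death_rate <- 0.0)."""
    def __init__(self, value):
        self.value = value

    def get(self):
        return self.value

class Stochastic:
    """Node with a current value that MCMC moves are allowed to change."""
    def __init__(self, start):
        self.value = start

    def get(self):
        return self.value

class Deterministic:
    """Node computed from parent nodes (like diversification := ...)."""
    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents

    def get(self):
        return self.fn(*(p.get() for p in self.parents))

death_rate = Constant(0.0)
birth_rate = Stochastic(1.0)
diversification = Deterministic(lambda b, d: b - d, birth_rate, death_rate)

before = diversification.get()
birth_rate.value = 2.6          # pretend an MCMC move changed birth_rate
after = diversification.get()   # the deterministic node follows automatically
print(before, after)
```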

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
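The weight-to-probability rule described above is easy to check by hand. Here is a minimal Python sketch; the move labels are just descriptive strings and <tt>move_probabilities</tt> is a hypothetical helper, not part of RevBayes.

```python
def move_probabilities(weights):
    """Convert move weights into selection probabilities (weight / sum of weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# The four moves defined by the end of strict.Rev, each with weight 1.0
weights = {
    "mvSlide(birth_rate)": 1.0,
    "mvSlide(clock_rate)": 1.0,
    "mvDirichletSimplex(exchangeabilities)": 1.0,
    "mvDirichletSimplex(state_freqs)": 1.0,
}

probs = move_probabilities(weights)
for name, p in probs.items():
    print(f"{name}: {p:.2f}")
```

With equal weights every move is equally likely to be chosen next; giving one move a larger weight raises its share proportionally.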

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions. The <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution (all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing <tt>mymodel</tt>, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].
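If you are curious what a 95% HPD interval actually is, the usual definition is the shortest interval that contains 95% of the sampled values. The Python sketch below implements that definition for a plain list of numbers; it is a minimal illustration (the function name <tt>hpd_interval</tt> is hypothetical), and Tracer's own implementation may differ in details such as burnin handling.

```python
def hpd_interval(samples, mass=0.95):
    """Return the shortest interval containing a fraction `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, int(mass * n + 0.5))  # number of samples the interval must contain
    # Slide a window of k consecutive sorted samples and keep the narrowest one
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


# Toy sample: 95 tightly packed values plus 5 outliers in the right tail
toy = list(range(95)) + [200, 300, 400, 500, 600]
print(hpd_interval(toy))  # the outliers are excluded from the interval
```

To apply this to real output you would read one column of ''strict.log'' into a list (after discarding any burnin samples) and pass it to the function.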

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
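For the exchangeabilities question, it helps to know what values to expect. The data were simulated under HKY with kappa = 5, which is a special case of GTR in which the 2 transition rates are kappa times the 4 transversion rates. The short Python sketch below (the function name is hypothetical) normalizes those six relative rates to sum to 1, which is the scale on which RevBayes reports the exchangeabilities simplex.

```python
def hky_exchangeabilities(kappa):
    """Return (transition, transversion) relative rates, normalized so all 6 sum to 1."""
    total = 2.0 * kappa + 4.0 * 1.0  # 2 transitions at rate kappa, 4 transversions at rate 1
    return kappa / total, 1.0 / total


ts, tv = hky_exchangeabilities(5.0)
print(round(ts, 3), round(tv, 3))  # prints: 0.357 0.071
```

These expected values line up with the marginal densities you should see in Tracer.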

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
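To see why a larger alpha means smaller proposed changes, here is a rough Python sketch of a Dirichlet simplex proposal. It assumes the proposal is drawn from a Dirichlet distribution whose parameters are alpha times the current values; that is my reading of how this kind of move behaves, not a transcription of the RevBayes source, and the function names are hypothetical.

```python
import random


def propose_simplex(current, alpha, rng):
    """Draw a new simplex from Dirichlet(alpha * current) using gamma variates."""
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]  # normalizing keeps the sum-to-1 constraint


def mean_abs_change(current, alpha, n=2000, seed=1):
    """Average total absolute change between the current and proposed values."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        proposal = propose_simplex(current, alpha, rng)
        total += sum(abs(p - c) for p, c in zip(proposal, current))
    return total / n


freqs = [0.1, 0.2, 0.3, 0.4]
small_alpha = mean_abs_change(freqs, alpha=10.0)
large_alpha = mean_abs_change(freqs, alpha=1000.0)
print(small_alpha, large_alpha)  # the alpha=1000 proposals stay much closer to freqs
```

When the posterior is very sharp, only these small steps have a reasonable chance of being accepted.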

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10 minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
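The conversion from the log scale to the linear scale uses the standard lognormal formulas: the mean is exp(mu + sigma^2/2) and the standard deviation is the mean times sqrt(exp(sigma^2) - 1). Here is a small Python sketch of that conversion; the function name is hypothetical and the input values are purely illustrative.

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Convert lognormal parameters (log scale) to mean and sd on the linear scale."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd


# Illustrative values: mu = 0 and sigma = 1 on the log scale
mean, sd = lognormal_mean_sd(0.0, 1.0)
print(mean, sd)  # the mean is exp(0.5), not exp(0) = 1
```

Note that even with mu = 0 the mean on the linear scale is greater than 1 whenever sigma is greater than 0.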

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

[[Category:Phylogenetics]]

Phylogenetics: RevBayes Lab

2020-04-23T16:32:54Z

Paul Lewis: /* Divergence times */

Phylogenetics: RevBayes Lab

2020-04-23T16:32:38Z

Paul Lewis: /* Obtaining credible intervals under the prior */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and RevBayes modules:
module load paml/4.9
module load RevBayes/1.0.13

There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
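A tiny numerical example makes the rate-times-time problem concrete. The Python sketch below uses the Jukes-Cantor (JC69) formula for the probability that two sequences differ at a site; JC69 is used here only because its formula is short (the lab itself simulates under HKY), and the function name is hypothetical.

```python
import math


def jc69_prob_different(rate, time):
    """Probability that a site differs between two sequences under JC69.

    Only the product rate * time (expected substitutions per site) enters the formula.
    """
    d = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


slow_and_old = jc69_prob_different(rate=0.5, time=2.0)
fast_and_young = jc69_prob_different(rate=2.0, time=0.5)
print(slow_and_old, fast_and_young)  # identical: the data cannot tell these apart
```

Because both scenarios predict exactly the same data, only prior information can favor one over the other.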

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
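Here is one plausible back-of-envelope check of the value 2.6 (an assumption about the reasoning, not necessarily how the value was originally chosen). In a pure birth process with birth rate lambda, the waiting time while there are k lineages is exponential with rate k*lambda, so its expected length is 1/(k*lambda). Summing these expected waiting times for k = 2 up to 20 lineages gives an expected tree height. The function name below is hypothetical.

```python
def expected_yule_height(birth_rate, n_taxa):
    """Sum of expected waiting times 1/(k * birth_rate) for k = 2, ..., n_taxa lineages."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))


height = expected_yule_height(birth_rate=2.6, n_taxa=20)
print(height)  # very close to 1.0
```

Under these assumptions, a birth rate of about 2.6 does indeed give an expected height of about 1 for a 20-taxon tree.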

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, <tt>diversification</tt> will show up as a column in the output even though it is not a parameter of the model itself. (The <tt>diversification</tt> node was only added here to illustrate deterministic nodes; its value will always equal <tt>birth_rate</tt> because <tt>death_rate</tt> is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
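The sliding window proposal described above can be sketched in a few lines of Python. This is only an illustration of the idea, not RevBayes' actual implementation: the function name <tt>slide_proposal</tt> is made up for this example, and reflecting negative proposals back across zero is an assumption that suits a parameter (like a rate) that must stay positive.

```python
import random

def slide_proposal(current, delta, rng):
    """Propose a value uniformly from a window of width delta centered on current.

    Hypothetical helper for illustration only. Proposals that fall below zero
    are reflected back across zero (an assumption made for this sketch).
    """
    proposed = current + delta * (rng.random() - 0.5)
    return abs(proposed)

rng = random.Random(12345)   # fixed seed so the sketch is repeatable
current = 1.0                # starting value, as set by setValue(1.0)
delta = 1.0                  # window width, as in mvSlide(..., delta=1.0)

proposals = [slide_proposal(current, delta, rng) for _ in range(1000)]
print("smallest proposal:", min(proposals))
print("largest proposal: ", max(proposals))
```

With <tt>delta=1.0</tt> and a current value of 1.0, every proposal lands between 0.5 and 1.5. Tuning amounts to changing <tt>delta</tt>: a narrower window proposes smaller changes, which are accepted more often.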

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing <tt>mymodel</tt>, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and how often it succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
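To see why a larger tuning parameter means smaller proposed changes, here is a rough Python sketch. It assumes the proposal is drawn from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current values; that is my simplified reading of how a move like mvDirichletSimplex works, and the function names are made up for this example. The state frequencies used are the ones we specified when simulating the data.

```python
import random

def dirichlet_draw(params, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def mean_abs_change(current, alpha, rng, n=2000):
    """Average absolute change per element over n proposals centered on current.

    Assumes (for illustration) that proposals come from Dirichlet(alpha * current).
    """
    change = 0.0
    for _ in range(n):
        proposed = dirichlet_draw([alpha * x for x in current], rng)
        change += sum(abs(p - c) for p, c in zip(proposed, current)) / len(current)
    return change / n

rng = random.Random(2020)            # fixed seed so the sketch is repeatable
state_freqs = [0.1, 0.2, 0.3, 0.4]   # values used when simulating the data

for alpha in (10.0, 100.0, 1000.0, 10000.0):
    print("alpha =", alpha, " mean change per element =",
          round(mean_abs_change(state_freqs, alpha, rng), 5))
```

Every proposed vector still sums to 1, and the average size of the proposed change shrinks as <tt>alpha</tt> grows, which is why a sharply peaked posterior calls for a large <tt>alpha</tt>.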

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This distinguishes the approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for the hyperparameters; the analysis seems to behave without starting every parameter off at a reasonable value, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence because they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>
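
If you want to check the conversion from the log scale to the linear scale yourself, the standard lognormal formulas are mean = exp(mu + sigma<sup>2</sup>/2) and standard deviation = mean × sqrt(exp(sigma<sup>2</sup>) − 1). The short Python sketch below applies them to the values of ucln_mu and ucln_sigma quoted in my answers above (your own values will differ slightly); the function name is made up for this example.

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Convert lognormal parameters (mean and sd of the log) to the
    mean and sd of the variable itself on the linear scale."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Posterior means from the example answers above (yours will differ)
ucln_mu = 0.0116
ucln_sigma = 0.01828

mean, sd = lognormal_mean_sd(ucln_mu, ucln_sigma)
print("mean on linear scale:", round(mean, 4))
print("sd on linear scale:  ", round(sd, 4))
```

This should reproduce, to about four decimal places, the linear-scale mean and standard deviation given in the answers.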

== Divergence times ==

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
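
To make the effect of the weights concrete, here is a small Python sketch that tallies the weights of all moves defined so far in ''divtime.Rev'' and converts them to selection probabilities (weight divided by the sum of all weights). The move labels are just names chosen for this example, and the count of 38 branch rate moves assumes the 20-taxon tree used in this lab.

```python
# Weights of the moves defined in divtime.Rev (labels are for this example only)
weights = {
    "birth_rate slide": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
}

# One slide move per branch rate: 2*20 - 2 = 38 branches for 20 taxa
n_taxa = 20
n_branches = 2 * n_taxa - 2
for i in range(1, n_branches + 1):
    weights["branch_rates[%d] slide" % i] = 1.0

total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

print("sum of weights:", total)
for name in ("node time slide", "NNI", "birth_rate slide"):
    print(name, round(probs[name], 4))
```

Note that although the node time slider is tried 10 times as often as any single weight-1 move, the 38 branch rate moves together still account for more than half of all attempts.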

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the end of the ''divtime.Rev'' file (but before the <tt>quit()</tt> line, otherwise RevBayes will exit before reaching it). This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.
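
The mean and variance quoted above follow directly from the Exponential(0.01) prior: an exponential distribution with rate r has mean 1/r and variance 1/r<sup>2</sup>. Here is a quick Python check, which also computes how much prior probability falls below 5.0 (a cutoff picked arbitrarily for this example) to show how vague this prior is relative to the true birth rate of 2.6.

```python
import math

rate = 0.01  # rate parameter of the Exponential prior on birth_rate

prior_mean = 1.0 / rate          # mean of an exponential is 1/rate
prior_variance = 1.0 / rate**2   # variance of an exponential is 1/rate^2

# Prior probability that birth_rate is below 5.0 (arbitrary cutoff for
# illustration), from the exponential CDF: 1 - exp(-rate * x)
prob_below_5 = 1.0 - math.exp(-rate * 5.0)

print("prior mean:    ", prior_mean)
print("prior variance:", prior_variance)
print("P(birth_rate < 5):", round(prob_below_5, 4))
```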

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T16:32:08Z

Paul Lewis: /* Relaxed clocks */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and RevBayes modules:
module load paml/4.9
module load RevBayes/1.0.13

There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
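
Here is a small Python sketch of this confounding. It uses the Jukes-Cantor (JC69) model, chosen only because its formula is simple (it is not the model used in this lab), to compute the probability that a site differs between the two ends of an edge. The rate and time values are invented for the example; what matters is that they all have the same product.

```python
import math

def jc69_prob_different(rate, time):
    """Probability that a site differs at the two ends of an edge under JC69.

    Depends on rate and time only through their product, the expected
    number of substitutions per site.
    """
    d = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Three very different (rate, time) scenarios with the same product
scenarios = [(1.0, 0.2), (2.0, 0.1), (0.5, 0.4)]

for rate, time in scenarios:
    print("rate =", rate, " time =", time,
          " expected substitutions =", round(rate * time, 6),
          " P(different) =", round(jc69_prob_different(rate, time), 6))
```

All three scenarios give the same probability, so the sequence data alone cannot tell them apart; only the priors on rates and times can.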

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
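
One way to see where 2.6 comes from is the following Python sketch. It rests on an assumption about how the tree is generated: starting with the 2 lineages at the root, the waiting time while there are k lineages is exponential with mean 1/(k × birth rate), and the tree is stopped at the end of the interval with 20 lineages. Other stopping rules give somewhat different expected heights, so treat this as a back-of-the-envelope check rather than a description of what evolver does.

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure birth tree with n_tips tips.

    Assumes (for illustration) that height is the sum of the expected
    waiting times 1/(k * birth_rate) for k = 2, 3, ..., n_tips lineages.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(n_tips=20, birth_rate=2.6)
print("expected height with birth rate 2.6:", round(height, 4))

# Birth rate that gives expected height exactly 1 under the same assumption
rate_for_unit_height = expected_yule_height(n_tips=20, birth_rate=1.0)
print("birth rate giving expected height 1:", round(rate_for_unit_height, 4))
```

Under this assumption the expected height for 20 tips and birth rate 2.6 comes out very close to 1.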

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda (<tt>birth_rate</tt>).

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect here: <tt>nmoves</tt> starts at 1 and vector indices in the Rev Language, like R, start at 1 (not 0), so the first move would be stored at index 2 and index 1 would be left empty.
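
The difference between the two increment forms can be mimicked in Python, which has no <tt>++</tt> operator. This is only an illustrative sketch: the two helper functions and the dictionary standing in for a 1-based Rev vector are hypothetical and are not part of RevBayes.

```python
# Sketch: emulate Rev's moves[nmoves++] versus moves[++nmoves].
# A dict keyed by 1-based index stands in for a Rev vector (an assumption
# made for illustration; this is not how RevBayes stores moves).


def add_post_increment(vec, n, item):
    """Mimic vec[n++] = item: store at index n, then increment n."""
    vec[n] = item
    return n + 1


def add_pre_increment(vec, n, item):
    """Mimic vec[++n] = item: increment n first, then store."""
    n += 1
    vec[n] = item
    return n


# Post-increment, starting the counter at 1 as in the revscript
post, n = {}, 1
n = add_post_increment(post, n, "slide birth_rate")
n = add_post_increment(post, n, "slide clock_rate")

# Pre-increment, also starting the counter at 1
pre, m = {}, 1
m = add_pre_increment(pre, m, "slide birth_rate")
m = add_pre_increment(pre, m, "slide clock_rate")

print(sorted(post))  # indices filled with post-increment
print(sorted(pre))   # indices filled with pre-increment (index 1 is skipped)
```

With the counter starting at 1, only the post-increment form fills the vector from its first element.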

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
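
A sliding window proposal of the kind described above can be sketched in a few lines of Python. The function name and the uniform draw are assumptions made for illustration; this is not RevBayes's own code, and it shows only the proposal step, not the accept/reject step.

```python
import random


def propose_slide(current, delta, rng):
    """Draw a proposal uniformly from a window of width delta centered on current."""
    return current + delta * (rng.random() - 0.5)


rng = random.Random(12345)  # fixed seed so the sketch is reproducible
clock_rate = 1.0
proposals = [propose_slide(clock_rate, delta=1.0, rng=rng) for _ in range(5)]
print(proposals)  # every value lies between 0.5 and 1.5
```

A larger <tt>delta</tt> gives bolder proposals that are accepted less often. If the current value is near 0, part of the window is negative, and such proposals would always be rejected because the Exponential prior gives negative rates zero density.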

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.
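
Because the data were simulated under HKY but are analyzed under GTR, it helps to know what an HKY model looks like when expressed as GTR exchangeabilities that sum to 1. The sketch below is an illustration only: the function name is hypothetical, and the ordering of the six exchangeabilities (AC, AG, AT, CG, CT, GT) is an assumption you should check against the RevBayes documentation.

```python
def hky_as_gtr_exchangeabilities(kappa):
    """Normalized GTR exchangeabilities implied by an HKY model.

    Order assumed here: A<->C, A<->G, A<->T, C<->G, C<->T, G<->T.
    Transitions (A<->G and C<->T) get relative rate kappa; transversions get 1.
    """
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]


rates = hky_as_gtr_exchangeabilities(kappa=2.0)
print(rates)
print(rates[1] / rates[0])  # ratio of a transition to a transversion rate
```

Calling the function with the kappa value used in the simulation gives the values around which the six exchangeability estimates should be centered.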

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
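
The tuning that happens during burnin can be pictured with a small Python sketch. The rule below is one plausible way to adjust a sliding window width toward a target acceptance rate; it is an assumption made for illustration, and RevBayes may use a different rule. The function name <tt>tune_delta</tt> is hypothetical.

```python
def tune_delta(delta, n_accepted, n_tried, target=0.4):
    """Return an adjusted window width based on the observed acceptance rate.

    If proposals are accepted too often, the window is widened (bolder moves);
    if they are accepted too rarely, it is narrowed (more conservative moves).
    """
    if n_tried == 0:
        return delta  # nothing was tried, so there is nothing to learn from
    rate = n_accepted / n_tried
    if rate > target:
        return delta * (1.0 + (rate - target) / (1.0 - target))
    return delta / (2.0 - rate / target)


# One tuning interval of 100 attempts, starting from delta = 1.0
print(tune_delta(1.0, n_accepted=90, n_tried=100))  # accepted too often: wider
print(tune_delta(1.0, n_accepted=5, n_tried=100))   # accepted too rarely: narrower
print(tune_delta(1.0, n_accepted=40, n_tried=100))  # on target: unchanged
```

Under this rule the window width at most doubles or halves at each tuning step, so several tuning intervals may be needed before a move settles near its target.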

== Run RevBayes ==

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
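
The effect of the tuning parameter alpha can be seen in a small Python sketch. It assumes that the simplex move proposes new values from a Dirichlet distribution whose parameters are alpha times the current values; that is my reading of how such a move works, not something taken from the RevBayes source. The function names and the starting frequencies are illustrative.

```python
import random


def propose_simplex(current, alpha, rng):
    """Propose a new simplex from Dirichlet(alpha * current).

    A Dirichlet draw is obtained by normalizing independent Gamma draws.
    """
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]


def mean_change(current, alpha, rng, n=500):
    """Average Euclidean distance between the current and proposed simplex."""
    total = 0.0
    for _ in range(n):
        proposed = propose_simplex(current, alpha, rng)
        total += sum((p - c) ** 2 for p, c in zip(proposed, current)) ** 0.5
    return total / n


rng = random.Random(2020)  # fixed seed so the sketch is reproducible
state_freqs = [0.3, 0.2, 0.4, 0.1]
for alpha in (10.0, 100.0, 1000.0):
    print(alpha, mean_change(state_freqs, alpha, rng))
```

The average size of the proposed change shrinks as alpha grows, which is why a larger alpha would raise the acceptance rate for sharply peaked posteriors.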

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 edge rate parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model. Because the 38 edge rates replace the single clock_rate, we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (1 birth_rate, 3 state_freqs, and 5 exchangeabilities, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
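
The conversion between the two scales uses the standard lognormal formulas: if log(X) is Normal(mu, sigma), then X has mean exp(mu + sigma^2/2) and variance mean^2 * (exp(sigma^2) - 1). Here is a minimal Python sketch; the function names are hypothetical and the values of mu and sigma are arbitrary.

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of X when log(X) ~ Normal(mu, sigma)."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


def lognormal_mu_sigma(mean, sd):
    """Inverse: the mu and sigma that give a lognormal the requested mean and sd."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)


mean, sd = lognormal_mean_sd(mu=0.0, sigma=0.5)
print(mean, sd)                      # mean and sd on the linear scale
print(lognormal_mu_sigma(mean, sd))  # recovers (0.0, 0.5) up to rounding
```

Note that mu = 0 does not give a mean of exactly 1 on the linear scale unless sigma is 0, because the mean also depends on sigma.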

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
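
To see what these weights mean in practice, the Python sketch below tabulates the moves defined so far in ''divtime.Rev'' and converts each weight into a selection probability (weight divided by the sum of all weights, as described earlier). The move labels are my own shorthand, not RevBayes names.

```python
def move_probabilities(weights):
    """Map each move name to weight / (sum of all weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


weights = {
    "birth_rate slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}
# One slide move of weight 1.0 for each of the 38 branch rates
for i in range(1, 39):
    weights[f"branch_rates[{i}] slide"] = 1.0

probs = move_probabilities(weights)
print(sum(weights.values()))     # total weight over all moves
print(probs["node time slide"])  # 10 / total
print(probs["NNI"])              # 5 / total
```

Although each tree move is weighted more heavily than any single parameter move, the 38 branch rate moves together still account for more than half of the total weight.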

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained from an MCMC analysis performed on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T16:31:47Z

Paul Lewis: /* Run RevBayes */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and RevBayes modules:
module load paml/4.9
module load RevBayes/1.0.13

There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
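
The confounding of rate and time can be demonstrated with a toy likelihood in Python. The sketch swaps in the simple JC69 model for a single edge (not the HKY model used later in this lab), and the data (120 differing sites out of 1000) are made up for illustration. The function names are hypothetical.

```python
import math


def jc69_p_diff(rate, time):
    """Probability that a site differs between the two ends of an edge under JC69.

    It depends on rate and time only through their product, the expected
    number of substitutions per site.
    """
    d = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def log_likelihood(n_diff, n_sites, rate, time):
    """Binomial log-likelihood (up to a constant) of n_diff differing sites."""
    p = jc69_p_diff(rate, time)
    return n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)


# Hypothetical data: 120 differences among 1000 sites
for rate, time in [(1.0, 0.1), (0.1, 1.0), (2.0, 0.05)]:
    print(rate, time, log_likelihood(120, 1000, rate, time))
```

All three rate and time combinations have the same product and therefore the same likelihood, so the data alone cannot choose among them.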

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
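
The text does not say how the value 2.6 was obtained. One calculation that gives a value close to it, shown below as a hedged Python sketch, assumes the height is measured from the root split until the moment a 21st lineage would arise, and uses the fact that the waiting time while k lineages exist is exponential with rate k times the birth rate. The function name is hypothetical.

```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected time from the root split until the tree would gain taxon n_taxa + 1.

    While k lineages exist, the wait until the next speciation is exponential
    with rate k * birth_rate, so its expectation is 1 / (k * birth_rate).
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))


print(expected_yule_height(20, 1.0))  # sum of 1/k for k = 2..20
print(expected_yule_height(20, 2.6))  # close to 1
```

Under these assumptions, the birth rate that gives an expected height of exactly 1 for 20 taxa is the first number printed, which rounds to 2.6.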

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
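
If you would rather not paste by hand, the nine lines described above can be assembled with a few lines of Python. This is only a sketch: the helper name <tt>make_control</tt>, the seed 12345, and the 3-taxon placeholder tree are all made up for illustration, and you would substitute your own seed and the tree description from ''tree.txt''.

```python
def make_control(seed, newick, n_taxa=20, n_sites=10000, kappa=5.0,
                 freqs=(0.1, 0.2, 0.3, 0.4)):
    """Return the text of an evolver control file laid out as described above."""
    lines = [
        "2",                                # line 1: nexus output
        str(seed),                          # line 2: random number seed
        f"{n_taxa} {n_sites} 1",            # line 3: taxa, sites, data sets
        "-1",                               # line 4: use branch lengths in tree
        newick.strip(),                     # line 5: tree description
        "4",                                # line 6: HKY model
        f"{kappa:g}",                       # line 7: kappa
        "0 0",                              # line 8: no rate heterogeneity
        " ".join(f"{f:g}" for f in freqs),  # line 9: T C A G frequencies
    ]
    return "\n".join(lines) + "\n"


# Placeholder 3-taxon tree, only to show the layout (hypothetical values)
example_tree = "((A:0.5,B:0.5):0.5,C:1.0);"
control_text = make_control(12345, example_tree, n_taxa=3)
print(control_text)
```

Writing <tt>control_text</tt> to a file named ''control.dat'' (for example with Python's built-in <tt>open</tt>) gives the same result as creating the file in a text editor.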

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
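
As a reminder of what a sliding window proposal does, here is a minimal Python sketch (this is not RevBayes code, and the function name <tt>slide_proposal</tt> is invented for illustration). It assumes the proposed value is drawn uniformly from a window of width <tt>delta</tt> centered on the current value.

```python
import random


def slide_proposal(current, delta, rng):
    """Propose a value uniformly from a window of width delta centered on current."""
    return current + (rng.random() - 0.5) * delta


rng = random.Random(1)  # fixed seed so the sketch is repeatable
proposals = [slide_proposal(1.0, 1.0, rng) for _ in range(1000)]
print(min(proposals), max(proposals))  # all values fall between 0.5 and 1.5
```

A proposal that lands below 0 would have zero prior density under the Exponential prior and so would simply be rejected; a wider window makes bolder proposals but lowers the acceptance rate.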

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.
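
To get a feel for this kind of proposal, here is a small Python sketch. It is my own illustration, not RevBayes source code: I am assuming the new simplex is drawn from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current values, and the function names are invented.

```python
import random


def dirichlet_draw(params, rng):
    """Draw one point from a Dirichlet distribution using normalized gammas."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]


def simplex_proposal(current, alpha, rng):
    """Propose a new simplex centered on the current one (assumed scheme)."""
    return dirichlet_draw([alpha * x for x in current], rng)


def mean_step(current, alpha, rng, n=2000):
    """Average total absolute change over n proposals from the same start."""
    total = 0.0
    for _ in range(n):
        new = simplex_proposal(current, alpha, rng)
        total += sum(abs(a - b) for a, b in zip(new, current))
    return total / n


rng = random.Random(42)
freqs = [0.1, 0.2, 0.3, 0.4]
proposal = simplex_proposal(freqs, 10.0, rng)
step_small_alpha = mean_step(freqs, 10.0, rng)
step_large_alpha = mean_step(freqs, 1000.0, rng)
print(sum(proposal), step_small_alpha, step_large_alpha)
```

Every proposed vector still sums to 1, and under this scheme a larger <tt>alpha</tt> keeps the proposal closer to the current values.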

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
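
To make the idea of tuning concrete, here is a hypothetical tuning rule written in Python. I am not claiming this is exactly what RevBayes does internally; it is just one plausible rule of the kind described above, where the window is widened when moves are accepted too often and narrowed when they are accepted too rarely. The function name <tt>tune_window</tt> is invented.

```python
def tune_window(delta, accept_rate, target=0.4):
    """Return an adjusted window width given the observed acceptance rate.

    Hypothetical rule: widen the window (bolder moves) if acceptance exceeds
    the target, narrow it (more conservative moves) if acceptance falls short.
    """
    if accept_rate > target:
        return delta * (1.0 + (accept_rate - target) / (1.0 - target))
    return delta / (2.0 - accept_rate / target)


bolder = tune_window(1.0, 0.8)      # accepted too often, so window grows
cautious = tune_window(1.0, 0.1)    # accepted too rarely, so window shrinks
unchanged = tune_window(1.0, 0.4)   # exactly on target, so no change
print(bolder, cautious, unchanged)
```

With <tt>tuningInterval=100</tt> and 1000 burnin generations, a rule like this would be applied 10 times before sampling begins.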

== Run RevBayes ==

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
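
Once you have answered the questions, you can check the exchangeabilities against what the simulation model implies. The Python sketch below assumes that HKY with kappa = 5 corresponds to a GTR model in which the 2 transition exchangeabilities are 5 times the 4 transversion exchangeabilities, normalized to sum to 1 to match the Dirichlet prior; the ordering AC, AG, AT, CG, CT, GT is my assumption for illustration.

```python
def hky_exchangeabilities(kappa):
    """Normalized GTR exchangeabilities implied by HKY with the given kappa.

    Assumed order: AC, AG, AT, CG, CT, GT (AG and CT are the transitions).
    """
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]


rates = hky_exchangeabilities(5.0)
transition = rates[1]      # AG
transversion = rates[0]    # AC
print(transition, transversion, transition / transversion)
```

The transition value is 5/14 and the transversion value is 1/14, which you can compare with the centers of the marginal densities shown in Tracer.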

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10 minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
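
The conversion from mu and sigma (log scale) to the mean and standard deviation of the lognormally-distributed variable itself (linear scale) uses the standard lognormal formulas, sketched here in Python. The function name <tt>lognormal_mean_sd</tt> and the example values mu = 0 and sigma = 0.5 are arbitrary choices for illustration.

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of X when log(X) ~ Normal(mu, sigma)."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


mean, sd = lognormal_mean_sd(0.0, 0.5)
print(mean, sd)
```

Note that even with mu = 0 the mean on the linear scale is greater than 1 whenever sigma is greater than 0, and it approaches 1 only as sigma shrinks toward 0.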

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
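
Here is a quick Python sketch of how those weights translate into selection probabilities, following the description of weights given earlier (weight divided by the sum of all weights). The move labels are just names I made up for this illustration; the weights are the ones used in ''divtime.Rev''.

```python
def move_probabilities(weights):
    """Convert move weights to selection probabilities (weight / sum of weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Weights as defined in divtime.Rev; the dictionary keys are illustrative labels
weights = {
    "birth_rate": 1.0,
    "exchangeabilities": 1.0,
    "state_freqs": 1.0,
    "ucln_mu": 1.0,
    "ucln_sigma": 1.0,
    "mvNodeTimeSlideUniform": 10.0,
    "mvNNI": 5.0,
    "mvNarrow": 5.0,
    "mvFNPR": 5.0,
}
for i in range(1, 39):  # one slide move per branch rate (38 branches)
    weights[f"branch_rates[{i}]"] = 1.0

probs = move_probabilities(weights)
print(sum(weights.values()), probs["mvNodeTimeSlideUniform"], probs["mvNNI"])
```

The weights sum to 68, so the node time slider is chosen with probability 10/68 and each topology move with probability 5/68, compared to 1/68 for each of the other moves.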

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size if we had performed the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T16:31:19Z

Paul Lewis: /* Login to Xanadu */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and RevBayes modules:
module load paml/4.9
module load RevBayes/1.0.13

There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
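
The following Python sketch illustrates the confounding of rate and time. It uses the Jukes-Cantor (JC69) model purely as a convenient stand-in (the lab itself simulates under HKY), and the function name <tt>jc69_prob_differ</tt> is invented. It computes the probability that two sequences differ at a site when each has evolved for a given time since their common ancestor.

```python
import math


def jc69_prob_differ(rate, time):
    """P(two tips differ at a site) under JC69; each lineage evolves for `time`."""
    d = 2.0 * rate * time  # expected substitutions per site separating the tips
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


slow_and_old = jc69_prob_differ(rate=1.0, time=0.1)
fast_and_young = jc69_prob_differ(rate=2.0, time=0.05)
print(slow_and_old, fast_and_young)
```

The two scenarios give the same probability because only the product of rate and time enters the calculation, so the sequence data alone cannot tell them apart.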

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
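
Here is a quick check of that birth rate, as a sketch under one common convention: the tree starts with the 2 lineages descending from the root, the waiting time while there are k lineages is exponential with mean 1/(k × birth rate), and the tree is sampled just before the next speciation after the last tip has arisen. The function name <tt>expected_yule_height</tt> is hypothetical, and it is an assumption (not something checked against the evolver source) that this is the relevant definition of tree height:

```shell
# Expected height of a pure-birth tree with n tips ($1) and birth rate lambda ($2),
# computed as the sum over k = 2..n of 1/(k * lambda).
# Hypothetical helper for illustration only.
expected_yule_height() {
    awk -v n="$1" -v lambda="$2" 'BEGIN {
        h = 0
        for (k = 2; k <= n; k++) h += 1 / (k * lambda)
        printf "%.4f\n", h
    }'
}

expected_yule_height 20 2.6
```

With 20 tips and a birth rate of 2.6 this prints 0.9991, i.e. very nearly 1.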

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
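
If you prefer not to paste the seed and tree description in by hand, the control file can also be assembled from the command line. The following is only a sketch: the function name <tt>make_control</tt> and the seed 12345 are hypothetical, and it assumes that your tree file contains nothing but the tree description.

```shell
# Write an evolver control file with the 9 lines described above.
# $1 = random number seed, $2 = file holding the tree description,
# $3 = name of the control file to create.
# make_control is a hypothetical helper, not part of PAML.
make_control() {
    {
        echo "2"                 # line 1: nexus output
        echo "$1"                # line 2: random number seed
        echo "20 10000 1"        # line 3: taxa, sites, data sets
        echo "-1"                # line 4: use branch lengths from the tree
        awk 'NF' "$2"            # line 5: tree description (blank lines skipped)
        echo "4"                 # line 6: HKY model
        echo "5"                 # line 7: kappa
        echo "0 0"               # line 8: no rate heterogeneity
        echo "0.1 0.2 0.3 0.4"   # line 9: frequencies of T, C, A, G
    } > "$3"
}

# Example call (assumes tree.txt exists in the current directory):
# make_control 12345 tree.txt control.dat
```

It is worth opening the resulting ''control.dat'' in a text editor to confirm that it has exactly 9 lines before running evolver.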

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

Download the singularity image for RevBayes as follows:

cd ~/rblab # just to make sure you are in the right place
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.
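
For reference, such a slurm script might look like the sketch below. The partition and qos names are the ones used in the <tt>srun</tt> command for this lab; the job name, memory request, output file name, and the script name ''strict.slurm'' are assumptions that you would adapt to your own analysis. The <tt>cat</tt> command with a here-document simply writes the script to a file:

```shell
# Write a slurm batch script to strict.slurm (all resource values are placeholders)
cat > strict.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=strict
#SBATCH --partition=mcbstudent
#SBATCH --qos=mcbstudent
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=2G
#SBATCH --output=strict-%j.out

cd ~/rblab
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev
EOF

# Submit the job with: sbatch strict.slurm
```

The <tt>%j</tt> in the output file name is replaced by the job number, so repeated submissions do not overwrite each other's screen output.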

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
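
The expected values of the exchangeabilities can be worked out by hand. Under HKY, the 2 transition exchangeabilities are kappa times larger than the 4 transversion exchangeabilities, so after normalizing all 6 to sum to 1 (as the Dirichlet prior requires), each transition gets kappa/(2 kappa + 4) and each transversion gets 1/(2 kappa + 4). A small sketch (the function name <tt>hky_exchangeabilities</tt> is hypothetical):

```shell
# Normalized GTR exchangeabilities implied by an HKY model with the given kappa ($1).
# Hypothetical helper for illustration only.
hky_exchangeabilities() {
    awk -v kappa="$1" 'BEGIN {
        total = 2 * kappa + 4            # 2 transitions + 4 transversions
        printf "transition %.4f\n", kappa / total
        printf "transversion %.4f\n", 1 / total
    }'
}

hky_exchangeabilities 5
```

With kappa = 5 this gives 5/14 for each transition and 1/14 for each transversion, which is what the posterior densities in Tracer should be centered near.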

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
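
To get a feel for the effect of alpha, here is a rough sketch. It assumes that the move proposes a new simplex from a Dirichlet distribution whose parameters are alpha times the current values (this is an assumption about <tt>mvDirichletSimplex</tt>; check the RevBayes documentation), in which case a component currently equal to p has proposal standard deviation sqrt(p(1-p)/(alpha+1)). The function name <tt>proposal_sd</tt> is hypothetical:

```shell
# Standard deviation of one component of a Dirichlet(alpha * current) proposal,
# where $1 is the current value p of that component and $2 is alpha.
# Hypothetical helper; assumes the proposal form described in the text above.
proposal_sd() {
    awk -v p="$1" -v alpha="$2" 'BEGIN { printf "%.4f\n", sqrt(p * (1 - p) / (alpha + 1)) }'
}

# How far a frequency currently at 0.1 is typically moved, for several values of alpha
for alpha in 10 100 1000 10000; do
    printf "alpha=%s sd=" "$alpha"
    proposal_sd 0.1 "$alpha"
done
```

Each tenfold increase in alpha shrinks the typical proposed change by roughly a factor of 3, so a posterior with a very small standard deviation needs a very large alpha before a reasonable fraction of proposals is accepted.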

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so the single clock_rate has been replaced by 2*20-2=38 edge rate parameters. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (1 birth_rate, 3 state_freqs, 5 exchangeabilities, 38 edge rates, and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
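
The conversion from (mu, sigma) to the mean and standard deviation on the linear scale uses the standard lognormal formulas: mean = exp(mu + sigma^2/2) and standard deviation = mean × sqrt(exp(sigma^2) − 1). Here is a small sketch (the function name <tt>lognormal_mean_sd</tt> and the example values are made up for illustration):

```shell
# Mean and standard deviation of a lognormally-distributed variable,
# given mu ($1) and sigma ($2), the mean and sd of its log.
# Hypothetical helper for illustration only.
lognormal_mean_sd() {
    awk -v mu="$1" -v sigma="$2" 'BEGIN {
        m = exp(mu + sigma * sigma / 2)
        s = m * sqrt(exp(sigma * sigma) - 1)
        printf "%.4f %.4f\n", m, s
    }'
}

lognormal_mean_sd 0 0.5   # prints the mean followed by the standard deviation
```

Note that mu = 0 does not give a mean of 1 unless sigma is tiny: the mean is always pulled above exp(mu) by the long right tail.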

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")
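
If you would rather not make these file-name edits by hand, <tt>sed</tt> can do it. This is only a sketch: the function name <tt>rename_outputs</tt> is hypothetical, and it assumes your script refers to its output files exactly as shown in this lab (<tt>output/strict.log</tt>, <tt>output/strict.trees</tt>, and <tt>"strict.dot"</tt>). It writes the modified script to standard output, so the file it reads is never changed:

```shell
# Print a copy of a Rev script with its output file prefix changed.
# $1 = old prefix, $2 = new prefix, $3 = Rev script to read.
# rename_outputs is a hypothetical helper, not part of RevBayes.
rename_outputs() {
    sed -e "s#output/$1\\.#output/$2.#g" \
        -e "s#\"$1\\.dot\"#\"$2.dot\"#g" "$3"
}

# Example call (assumes strict.Rev exists in the current directory):
# rename_outputs strict relaxed strict.Rev > relaxed.Rev
```

Only the file names are touched; the clock section itself still has to be replaced by hand.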

Now run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
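
To put numbers on this, the sketch below applies the rule described earlier (probability of a move = its weight divided by the sum of all weights) to the moves in ''divtime.Rev''. It assumes your script contains exactly the moves defined in this lab: 1 for birth_rate, 1 node time slider, 3 topology moves, 2 for the hyperparameters, 38 for the branch rates, and 2 for the GTR parameters. The function name <tt>move_prob</tt> is hypothetical:

```shell
# Probability of selecting a move with weight $1 when all weights sum to $2.
# Hypothetical helper for illustration only.
move_prob() {
    awk -v w="$1" -v total="$2" 'BEGIN { printf "%.4f\n", w / total }'
}

# birth_rate + node times + NNI + Narrow + FNPR + ucln_mu + ucln_sigma
#   + 38 branch rates + exchangeabilities + state_freqs
total=$((1 + 10 + 5 + 5 + 5 + 1 + 1 + 38 + 1 + 1))
echo "total weight: $total"

move_prob 10 "$total"   # node time slider
move_prob 5 "$total"    # each topology move
move_prob 1 "$total"    # every other move
```

So even with a weight of 10, the node time slider accounts for only about 15% of all proposals, because the 38 branch rate moves collectively carry most of the weight.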

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.
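
The mean and variance quoted for the birth_rate prior follow from the fact that an Exponential distribution with rate r has mean 1/r and variance 1/r^2. A tiny sketch to confirm the arithmetic (the function name <tt>exp_mean_var</tt> is hypothetical):

```shell
# Mean and variance of an Exponential distribution with the given rate ($1).
# Hypothetical helper for illustration only.
exp_mean_var() {
    awk -v r="$1" 'BEGIN { printf "%g %g\n", 1 / r, 1 / (r * r) }'
}

exp_mean_var 0.01   # the rate used in dnExponential(0.01)
```

A prior this vague puts most of its mass on birth rates far larger than the 2.6 used in the simulation, which is why fixing birth_rate makes the comparison easier.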

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T16:31:08Z

Paul Lewis: /* Login to Xanadu */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and RevBayes modules:
module load paml/4.9
module load RevBayes/1.0.13



There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (this is the tree height, 1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
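
The text above does not say how the value 2.6 was obtained. One back-of-the-envelope calculation that gives a number close to it is sketched below in Python, under the assumption that, in a pure birth process with k lineages, the waiting time to the next speciation is exponential with rate k times the birth rate, and that the tree runs from the root (2 lineages) until the 20 lineages would next split. The function name <tt>expected_yule_height</tt> is made up for this illustration.

```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected time from the root (2 lineages) through the interval with n_taxa lineages.

    Assumption: with k lineages, the waiting time to the next speciation
    event is exponential with rate k * birth_rate, so its mean is 1 / (k * birth_rate).
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))


height = expected_yule_height(20, 2.6)
print(round(height, 3))  # close to 1
```

Other ways of conditioning the process give slightly different expectations, so treat this as a plausibility check rather than a derivation.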

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is supplied by <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
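
As a rough illustration of the sliding window idea described above, here is a minimal Python sketch (not RevBayes code, and not Rev). It assumes the proposal is drawn uniformly from a window of width <tt>delta</tt> centered on the current value; the function name <tt>slide_proposal</tt> is hypothetical, and the tuning step that adjusts <tt>delta</tt> during burnin is not shown.

```python
import random


def slide_proposal(current, delta, rng):
    """Draw a proposal uniformly from a window of width delta centered on current."""
    return current + delta * (rng.random() - 0.5)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
current_clock_rate = 1.0
proposals = [slide_proposal(current_clock_rate, 1.0, rng) for _ in range(5)]

# Every proposal lies within delta/2 = 0.5 of the current value
print(all(abs(p - current_clock_rate) <= 0.5 for p in proposals))
```

A larger <tt>delta</tt> makes bolder proposals that are rejected more often, and a smaller <tt>delta</tt> makes timid proposals that are accepted more often, which is why tuning toward a target acceptance rate is useful.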

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

Download the singularity image for RevBayes as follows:

cd ~/rblab # just to make sure you are in the right place
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
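
For checking your Tracer results on the exchangeabilities, the following Python sketch works out what an HKY model with kappa = 5 (the model used in ''control.dat'') implies for GTR exchangeabilities that are constrained to sum to 1. The ordering of the six rates (AC, AG, AT, CG, CT, GT) is an assumption made for this illustration; check the column labels in your own log file.

```python
kappa = 5.0

# Assumed order: AC, AG, AT, CG, CT, GT (AG and CT are the transitions)
relative_rates = [1.0, kappa, 1.0, 1.0, kappa, 1.0]

# Normalize so the six values form a simplex (they sum to 1)
total = sum(relative_rates)
exchangeabilities = [r / total for r in relative_rates]

transition = exchangeabilities[1]    # 5/14
transversion = exchangeabilities[0]  # 1/14
print(round(transition, 3), round(transversion, 3))
print(round(transition / transversion, 2))  # recovers kappa
```

Values near 5/14 for the two transitions and near 1/14 for the four transversions are what the simulation model leads you to expect.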

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model. Because the 38 edge rates replace clock_rate, we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (9 of the original 10, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values, and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).
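
To get a feel for how extreme such automatically drawn starting values could be, here is a small Python simulation (an illustration only, not what RevBayes does internally). It assumes the second argument of <tt>dnNormal</tt> is a standard deviation, and it reports each starting rate as a power of 10 to avoid numerical overflow. The function name <tt>draw_start_log10_rate</tt> is hypothetical.

```python
import math
import random


def draw_start_log10_rate(rng):
    """Draw hyperparameters from their priors, then one branch rate (as log10)."""
    ucln_mu = rng.gauss(0.0, 100.0)      # assumes dnNormal(0.0, 100) means sd = 100
    ucln_sigma = rng.expovariate(0.01)   # Exponential with rate 0.01 (mean 100)
    log_rate = rng.gauss(ucln_mu, ucln_sigma)  # natural log of a lognormal draw
    return log_rate / math.log(10.0)     # convert to log base 10


rng = random.Random(42)  # fixed seed so the sketch is repeatable
for _ in range(5):
    print("starting rate is about 10^%.0f" % draw_start_log10_rate(rng))
```

Most draws are many orders of magnitude away from the true rate of 1.0 (which is 10^0), which is why explicitly setting the starting value helps.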

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>
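
If you want to check the log-scale to linear-scale conversion yourself, the standard lognormal formulas are mean = exp(mu + sigma^2/2) and sd = mean * sqrt(exp(sigma^2) - 1). The Python sketch below applies them to the example values quoted in the answers above; your own estimates will differ slightly. The function name <tt>lognormal_mean_sd</tt> is made up for this illustration.

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Convert log-scale mu and sigma to the mean and sd of the lognormal variable."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


# Example posterior means of ucln_mu and ucln_sigma quoted in the answers above
mean, sd = lognormal_mean_sd(0.0116, 0.01828)
print(round(mean, 4), round(sd, 4))
```

When sigma is this small, the linear-scale standard deviation is nearly equal to sigma times the mean, which is why the two scales give such similar numbers here.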

== Divergence times ==

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
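
To make the effect of the weights concrete, the Python sketch below tallies the moves defined in ''divtime.Rev'' and converts weights to selection probabilities using the rule described earlier (weight divided by the sum of all weights). The dictionary keys are just labels chosen for this illustration.

```python
# Weights of the moves defined in divtime.Rev (labels are informal)
weights = {
    "birth_rate slide": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
}
# One slide move for each of the 2*20 - 2 = 38 branch rates
for i in range(1, 39):
    weights["branch_rates[%d] slide" % i] = 1.0

total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

print(total)
print(round(probs["node time slide"], 3), round(probs["NNI"], 3))
```

Even with a weight of 10, the node time move is chosen only about 15% of the time because the 38 branch rate moves together account for more than half of the total weight.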

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>
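
In case you are wondering what a 95% HPD interval is computationally, one common way to estimate it from MCMC samples is to find the narrowest interval that contains 95% of the sampled values. The Python sketch below shows that idea on toy data; it is not necessarily the exact algorithm used by Tracer or RevBayes, and the function name <tt>hpd_interval</tt> is hypothetical.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Return the narrowest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples the interval must cover
    # Slide a window of k consecutive sorted samples and keep the narrowest one
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


# Toy samples: 100 evenly spaced values plus one extreme outlier
toy_samples = list(range(100)) + [1000]
print(hpd_interval(toy_samples))  # the outlier is left outside the interval
```

Because the interval is chosen to be as narrow as possible, it need not have equal tail probabilities on either side, unlike an interval built from the 2.5% and 97.5% quantiles.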

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate to their true values but keep the priors for ucln_mu, ucln_sigma, and the branch_rates as they were. This means that we will only be looking at the prior on rates, not node times.
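
The mean and variance quoted above follow from the fact that an Exponential distribution with rate r has mean 1/r and variance 1/r^2. The short Python check below also computes how little prior probability falls below 10, a cutoff chosen arbitrarily for this illustration.

```python
import math

rate = 0.01                   # the rate used in dnExponential(0.01)
prior_mean = 1.0 / rate       # 100
prior_variance = 1.0 / rate ** 2  # 10000

# Prior probability that birth_rate is less than 10 (Exponential CDF)
prob_below_10 = 1.0 - math.exp(-rate * 10.0)

print(prior_mean, round(prior_variance), round(prob_below_10, 3))
```

Under this prior, most of the probability lies at birth rates far larger than the true value of 2.6, which is why trees sampled from the prior would look so different from the simulated tree.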

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T16:30:46Z

Paul Lewis: /* Login to Xanadu */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and RevBayes modules:
module load paml/4.9
module load RevBayes/1.0.13



There are some issues with filesystems on Xanadu currently, so you may get some warning messages ("Skipping mount...doesn't exist in container") that you can ignore.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
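The confounding of rate and time can be illustrated with a small numerical sketch. It uses the Jukes-Cantor (JC69) model rather than the HKY model simulated later in this lab, purely because its transition probability has a simple closed form; the function name is made up for this illustration.

```python
import math

def jc69_prob_same(rate, time):
    """JC69 probability that a site has the same state at both ends of a branch."""
    d = rate * time  # expected number of substitutions per site
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

# A fast rate over a short time and a slow rate over a long time
# give the same product (1.0), and hence the same probability.
fast_short = jc69_prob_same(rate=2.0, time=0.5)
slow_long = jc69_prob_same(rate=0.5, time=2.0)
print(fast_short, slow_long)
```

Because the likelihood depends on rate and time only through their product, the data alone cannot favor one of these two scenarios over the other; only the priors can.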

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
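Here is one way to check that a birth rate of 2.6 gives an expected height near 1 for 20 taxa. This is a sketch under a stated assumption, not a description of what evolver does internally: the tree is assumed to start with 2 lineages at the root, and while there are k lineages the waiting time is assumed to be exponential with rate k times the birth rate, summed over k = 2, ..., 20.

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure birth tree (assumption: sum of
    exponential waiting times with rate k*birth_rate for k = 2..n_tips)."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(n_tips=20, birth_rate=2.6)
print(height)  # close to 1
```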

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
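As a reminder of how a sliding window proposal works, here is a minimal sketch. It illustrates the general idea only and is not RevBayes's own implementation of <tt>mvSlide</tt>; the function name is made up for this example.

```python
import random

def slide_proposal(current, delta, rng):
    """Propose a value drawn uniformly from a window of width delta centered on current."""
    return current + delta * (rng.random() - 0.5)

rng = random.Random(1)  # fixed seed so the example is repeatable
current = 1.0
proposals = [slide_proposal(current, delta=1.0, rng=rng) for _ in range(1000)]

# Every proposal lies within delta/2 of the current value
print(min(proposals), max(proposals))
```

A larger delta makes the move bolder (bigger jumps, lower acceptance rate); a smaller delta makes it more conservative. Tuning adjusts delta until the acceptance rate is near the target.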

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

Download the singularity image for RevBayes as follows:

cd ~/rblab # just to make sure you are in the right place
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].
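If you are curious what an HPD interval is computationally, here is a minimal sketch: it finds the shortest interval that contains 95% of the sorted samples. This is a common way to approximate an HPD interval from MCMC output; whether Tracer uses exactly this procedure is an assumption, and the function name and simulated samples are made up for this illustration.

```python
import math
import random

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing the requested fraction of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples the interval must contain
    # Slide a window of k consecutive sorted samples and keep the narrowest one
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

# Made-up samples standing in for one column of a log file
rng = random.Random(42)
draws = [rng.gauss(1.0, 0.01) for _ in range(1000)]
print(hpd_interval(draws))
```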

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
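To see what values the exchangeabilities should take when the data were simulated under HKY, note that HKY is a special case of GTR in which the two transitions share one relative rate (kappa) and the four transversions share another (1). The sketch below normalizes these to sum to 1, as the Dirichlet prior requires. The ordering AC, AG, AT, CG, CT, GT is an assumption made for this illustration, and the function name is made up.

```python
def hky_exchangeabilities(kappa):
    """GTR exchangeabilities implied by HKY, normalized to sum to 1.
    Assumed order: AC, AG, AT, CG, CT, GT (AG and CT are the transitions)."""
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]

expected = hky_exchangeabilities(kappa=5.0)
print(expected)
```

With kappa equal to 5, each transition gets 5/14 and each transversion gets 1/14 of the total.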

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 edge rate parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, less the clock_rate that the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
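The conversion from the log scale to the linear scale uses the standard lognormal formulas: the mean is exp(mu + sigma^2/2) and the standard deviation is the mean times sqrt(exp(sigma^2) - 1). A minimal sketch (the function name is made up for this illustration):

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation (linear scale) of a Lognormal(mu, sigma) variable,
    where mu and sigma are the mean and standard deviation on the log scale."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

mean, sd = lognormal_mean_sd(mu=0.0, sigma=1.0)
print(mean, sd)
```

Note that even with mu equal to 0 the linear-scale mean is greater than 1, because the lognormal distribution has a long right tail.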

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
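To make the effect of the weights concrete, the sketch below tallies the moves defined in ''divtime.Rev'' and converts each weight to a selection probability using the rule described earlier (weight divided by the sum of all weights). The dictionary labels are just descriptive strings made up for this illustration, and the calculation assumes RevBayes schedules moves exactly as described above.

```python
# Weights of the moves defined in divtime.Rev (labels are descriptive only)
weights = {
    "mvSlide(birth_rate)": 1.0,
    "mvNodeTimeSlideUniform(timetree)": 10.0,
    "mvNNI(timetree)": 5.0,
    "mvNarrow(timetree)": 5.0,
    "mvFNPR(timetree)": 5.0,
    "mvSlide(ucln_mu)": 1.0,
    "mvSlide(ucln_sigma)": 1.0,
    "mvDirichletSimplex(exchangeabilities)": 1.0,
    "mvDirichletSimplex(state_freqs)": 1.0,
}

# One slide move per branch rate: 2*20 - 2 = 38 of them
n_branches = 2 * 20 - 2
for i in range(1, n_branches + 1):
    weights[f"mvSlide(branch_rates[{i}])"] = 1.0

total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

print(total)
print(probs["mvNodeTimeSlideUniform(timetree)"])
print(probs["mvNNI(timetree)"])
```

Even with a weight of 10, the node time move is chosen only about 15% of the time, because the 38 branch rate moves together account for more than half of the total weight.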

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size if we had performed the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate at their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.
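
The mean and variance quoted for the birth_rate prior follow directly from the Exponential(0.01) distribution: an exponential with rate r has mean 1/r and variance 1/r^2. Here is a minimal Python sketch of that arithmetic (Python is used only for illustration and is not part of the RevBayes workflow; the variable names are made up for this example):

```python
# Moments of the Exponential(rate) prior placed on birth_rate
rate = 0.01                  # the rate used in dnExponential(0.01)
prior_mean = 1.0 / rate      # mean of an exponential is 1/rate
prior_var = 1.0 / rate**2    # variance is 1/rate^2
true_birth_rate = 2.6        # value used when simulating the tree with evolver

print(prior_mean, prior_var)
# How far the prior mean sits from the value used in the simulation
print(prior_mean / true_birth_rate)
```

The prior mean is well over an order of magnitude larger than the birth rate used in the simulation, which is why trees drawn from this prior would look so unlike the simulated tree.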

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T03:18:04Z

Paul Lewis: /* Obtaining credible intervals under the prior */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and singularity modules:
module load paml/4.9
module load singularity/3.5.2

The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
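
To see why rate and time cannot be separated by the likelihood alone, consider this small Python sketch. It uses the JC69 model as a stand-in (simpler than the HKY and GTR models used later in this lab) and a hypothetical helper function <tt>jc69_p_diff</tt>; the point is only that the expected proportion of differing sites depends on rate and time solely through their product:

```python
import math

def jc69_p_diff(rate, time):
    """Expected proportion of differing sites under JC69 for a path of length rate*time."""
    d = rate * time  # expected number of substitutions per site
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Three different (rate, time) scenarios, all with rate*time = 0.5
scenarios = [(1.0, 0.5), (0.5, 1.0), (2.0, 0.25)]
for rate, time in scenarios:
    print(rate, time, jc69_p_diff(rate, time))
```

All three scenarios give the same value, so data generated under any one of them would look the same. Only the prior can favor one scenario over another.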

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
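
The choice of 2.6 can be checked with a short calculation. The Python sketch below rests on an assumption about how tree height is defined: the process starts at the root with 2 lineages and is stopped just before a 21st lineage would arise. While k lineages exist, the waiting time to the next speciation is exponential with mean 1/(k × birth rate), so the expected height is the sum of those means. The function name is made up for this example:

```python
def expected_yule_height(n_tips, birth_rate):
    """Sum of expected waiting times while 2, 3, ..., n_tips lineages exist."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(20, 2.6)
print(round(height, 4))
```

Under that assumption the expected height for 20 taxa comes out very close to 1.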

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
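
If you would rather generate ''control.dat'' than edit it by hand, the nine lines described above can be assembled with a few lines of Python. This is only a convenience sketch: the function <tt>build_control</tt>, the seed, and the 3-taxon tree description are hypothetical placeholders, and you would pass in your own seed and the tree description from ''tree.txt'':

```python
def build_control(seed, newick, n_taxa=20, n_sites=10000):
    """Return the text of an evolver control file laid out as described above."""
    lines = [
        "2",                      # line 1: nexus output
        str(seed),                # line 2: random number seed
        f"{n_taxa} {n_sites} 1",  # line 3: taxa, sites, number of data sets
        "-1",                     # line 4: use branch lengths from the tree description
        newick.strip(),           # line 5: tree description
        "4",                      # line 6: HKY model
        "5",                      # line 7: kappa
        "0 0",                    # line 8: gamma shape and number of rate categories
        "0.1 0.2 0.3 0.4",        # line 9: frequencies of T, C, A, G
    ]
    return "\n".join(lines) + "\n"

# Placeholder seed and toy tree, for illustration only
example = build_control(12345, "((A:0.5,B:0.5):0.5,C:1.0);", n_taxa=3)
print(example)
```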

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is supplied by <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.
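
The difference between <tt>nmoves++</tt> and <tt>++nmoves</tt> can be mimicked in Python, which has no increment operator. In this sketch a dictionary stands in for Rev's 1-based moves vector, and the strings are placeholders for real move objects:

```python
names = ["mvSlide(birth_rate)", "mvSlide(clock_rate)"]

# Emulate moves[nmoves++] = ...: use the current index, then increment
post = {}
nmoves = 1
for name in names:
    post[nmoves] = name
    nmoves += 1

# Emulate moves[++nmoves] = ...: increment first, then use the index
pre = {}
n = 1
for name in names:
    n += 1
    pre[n] = name

print(sorted(post))  # indices filled with post-increment
print(sorted(pre))   # indices filled with pre-increment
```

With the post-increment form the moves land in slots 1 and 2. With the pre-increment form slot 1 is skipped, which is the problem described above.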

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).
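
A loose Python analogy may help keep the three node types straight. This is not how RevBayes is implemented; it only mirrors the behavior described in the list above, with a function playing the role of the deterministic node so that its value always tracks the nodes it depends on:

```python
import random

random.seed(1)

death_rate = 0.0                       # constant node (<-): never changes
birth_rate = random.expovariate(0.01)  # stochastic node (~): a draw from Exponential(0.01)

def diversification():
    """Deterministic node (:=): recomputed from its parent nodes whenever asked."""
    return birth_rate - death_rate

before = diversification()
birth_rate = 1.0                       # analogous to birth_rate.setValue(1.0)
after = diversification()

print(before, after)
```

Because <tt>death_rate</tt> is fixed at 0.0, the deterministic value simply follows <tt>birth_rate</tt> as it changes.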

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
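
As a reminder of what a sliding window proposal does, here is a generic Python sketch. It is not RevBayes's own code (the internals of mvSlide may differ in detail), and the function name is made up for this example. Each proposed value is drawn uniformly from a window of width <tt>delta</tt> centered on the current value:

```python
import random

def slide_proposal(current, delta, rng):
    """Propose a value uniformly from a window of width delta centered on current."""
    return current + delta * (rng.random() - 0.5)

rng = random.Random(42)
current = 1.0   # starting value, as in clock_rate.setValue(1.0)
delta = 1.0     # window width, as in delta=1.0
proposals = [slide_proposal(current, delta, rng) for _ in range(1000)]

print(min(proposals), max(proposals))
```

A larger <tt>delta</tt> gives bolder proposals that are accepted less often, and a smaller <tt>delta</tt> gives timid proposals that are accepted more often, which is why tuning toward a target acceptance rate is useful.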

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
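
The tuning performed during burnin can be pictured with the following Python sketch. The update rule and the acceptance counts here are illustrative assumptions, not a confirmed description of what RevBayes does internally; the sketch only shows the general idea of widening the window when moves are accepted more often than the 40% target and narrowing it otherwise:

```python
def tune_delta(delta, accepted, tried, target=0.4):
    """Return an adjusted window width given the acceptance rate in one tuning interval."""
    rate = accepted / tried
    if rate > target:
        # Accepting too often: make proposals bolder
        return delta * (1.0 + (rate - target) / (1.0 - target))
    # Accepting too rarely (or exactly on target): make proposals more conservative
    return delta / (2.0 - rate / target)

delta = 1.0
# Hypothetical acceptance counts in five tuning intervals of 100 iterations each
for accepted in (90, 70, 55, 45, 40):
    delta = tune_delta(delta, accepted, 100)
    print(round(delta, 3))
```

When the acceptance rate equals the target, the window width is left unchanged.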

== Run RevBayes ==

Download the singularity image for RevBayes as follows:

cd ~/rblab # just to make sure you are in the right place
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
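
For checking the exchangeabilities question, it helps to work out what the GTR exchangeabilities should look like when the data were simulated under HKY with kappa = 5. In this Python sketch (variable names are made up for the example), the two transitions (A-G and C-T) get relative rate kappa, the four transversions get relative rate 1, and the six values are normalized to sum to 1 as they are under the Dirichlet prior:

```python
kappa = 5.0

# Relative rates under HKY: transitions (AG, CT) get kappa, transversions get 1
raw = {"AC": 1.0, "AG": kappa, "AT": 1.0, "CG": 1.0, "CT": kappa, "GT": 1.0}
total = sum(raw.values())
expected = {pair: r / total for pair, r in raw.items()}

for pair, value in expected.items():
    print(pair, round(value, 4))
```

The transition values come out to 5/14 and the transversion values to 1/14, which you can compare with the marginal densities shown in Tracer.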

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
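
The effect of alpha on step size can be illustrated with a Python sketch. It assumes that a Dirichlet simplex move proposes a new simplex from a Dirichlet distribution whose parameters are alpha times the current values; that is a common way to build such a move, but treat it as an assumption about mvDirichletSimplex rather than a statement of its exact implementation. The function names are made up for this example:

```python
import random

def dirichlet_simplex_proposal(current, alpha, rng):
    """Draw a new simplex from Dirichlet(alpha * current) using normalized gamma draws."""
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]

def mean_step(current, alpha, rng, n=2000):
    """Average absolute change in the first coordinate over n proposals."""
    changes = (abs(dirichlet_simplex_proposal(current, alpha, rng)[0] - current[0])
               for _ in range(n))
    return sum(changes) / n

rng = random.Random(7)
freqs = [0.1, 0.2, 0.3, 0.4]   # the state frequencies used in the simulation
steps = {alpha: mean_step(freqs, alpha, rng) for alpha in (10.0, 100.0, 1000.0)}

for alpha, step in steps.items():
    print(alpha, round(step, 4))
```

Every proposal still sums to 1, and the typical change shrinks as alpha grows, which is why a larger alpha would raise the acceptance rate for sharply peaked posteriors.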

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 edge rate parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus the clock_rate that the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
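
The conversion from the log scale to the linear scale uses the standard lognormal formulas: the mean is exp(mu + sigma^2/2) and the standard deviation is the mean times sqrt(exp(sigma^2) - 1). A minimal Python sketch, with a made-up function name and example values mu = 0 and sigma = 1 chosen only for illustration:

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of a Lognormal(mu, sigma) variable on the linear scale."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

mean, sd = lognormal_mean_sd(0.0, 1.0)
print(round(mean, 3), round(sd, 3))  # approximately 1.649 and 2.161
```

Note that even with mu = 0 the mean on the linear scale is larger than exp(0) = 1, which is exactly the trap described above.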

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).
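
To get a feel for how wild the default starting values could be, the Python sketch below mimics the two-stage draw described above. It assumes the second argument of <tt>dnNormal(0.0, 100)</tt> is a standard deviation, and it works with the log of each branch rate because the rates themselves can be too large or too small to represent as ordinary numbers:

```python
import math
import random

rng = random.Random(2020)
n_draws = 1000
extreme = 0
for _ in range(n_draws):
    ucln_mu = rng.normalvariate(0.0, 100.0)            # hyperprior on ucln_mu (100 assumed to be the sd)
    ucln_sigma = rng.expovariate(0.01)                 # hyperprior on ucln_sigma
    log_rate = rng.normalvariate(ucln_mu, ucln_sigma)  # log of a Lognormal(ucln_mu, ucln_sigma) draw
    # Count draws more than 10-fold away from the true rate of 1.0
    if abs(log_rate) > math.log(10.0):
        extreme += 1

fraction_extreme = extreme / n_draws
print(fraction_extreme)
```

Nearly every draw lands more than an order of magnitude away from 1.0, which is why setting the starting values explicitly is worthwhile.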

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
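
Using the rule described earlier (a move is chosen with probability equal to its weight divided by the sum of all weights), we can tabulate how often each kind of move in ''divtime.Rev'' should be tried. This Python sketch simply lists the moves defined in this lab; the dictionary labels are made up for the example:

```python
weights = {
    "birth_rate slide": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
}
# One slide move per branch rate (2*20 - 2 = 38 branches)
for i in range(1, 39):
    weights[f"branch_rates[{i}] slide"] = 1.0

total = sum(weights.values())
prob = {name: w / total for name, w in weights.items()}

print(total)
print(round(prob["node time slide"], 3), round(prob["NNI"], 3), round(prob["birth_rate slide"], 3))
```

Even with a weight of 10, the node time move is selected only a modest fraction of the time because so many branch rate moves compete with it.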

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size if we had performed the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.
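
The mean and variance quoted above for the birth_rate prior follow directly from the rate of the Exponential distribution. Here is a small Python check (Python is used only for illustration and is not part of the RevBayes analysis; the function name is my own):

```python
def exponential_mean_var(rate):
    """Mean and variance of an Exponential distribution with the given rate."""
    mean = 1.0 / rate
    variance = 1.0 / rate ** 2
    return mean, variance

# birth_rate ~ dnExponential(0.01) in the Rev scripts
prior_mean, prior_var = exponential_mean_var(0.01)
print(f"prior mean = {prior_mean:.1f}, prior variance = {prior_var:.1f}")
```

A prior this diffuse puts most of its mass on birth rates far larger than the 2.6 used in the simulation, which is why sampling birth_rate from the prior would give trees that are hard to compare with the true tree.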

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T03:17:36Z

Paul Lewis: /* warning: this section is a work in progress */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and singularity modules:
module load paml/4.9
module load singularity/3.5.2

The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
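
To make the confounding of rate and time concrete, here is a minimal Python sketch (for illustration only; the scenarios are made-up numbers, not values from this lab). Three quite different rate and time combinations give the same expected number of substitutions per site, so the sequences alone cannot tell them apart:

```python
def expected_substitutions(rate, time):
    """Expected substitutions per site along an edge: the product of rate and time."""
    return rate * time

# Hypothetical (rate, time) scenarios chosen to have the same product
scenarios = [(1.0, 0.5), (0.5, 1.0), (5.0, 0.1)]
products = [expected_substitutions(rate, time) for rate, time in scenarios]

for (rate, time), product in zip(scenarios, products):
    print(f"rate={rate:4.1f}  time={time:4.1f}  expected substitutions/site={product:.2f}")
```

The priors on rates and node times are what allow a Bayesian analysis to favor some of these scenarios over others.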

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
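
If you are curious where a value like 2.6 could come from, here is one way to reason about it in Python. This is only a sketch under my own assumptions, not something taken from PAML: I assume the process starts at the root with 2 lineages, that the waiting time while there are k lineages is exponential with mean 1/(k × birth rate), and that the tree is stopped just before a 21st lineage would arise.

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure birth tree under the assumptions above.

    Sums the expected waiting times while there are k = 2, 3, ..., n_tips lineages.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(20, 2.6)
print(f"expected height for 20 tips and birth rate 2.6: {height:.3f}")
```

Under these assumptions the expected height comes out very close to 1, which is consistent with the choice of 2.6 above.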

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
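
If you would like to see a sliding window move outside of RevBayes, here is a minimal Python sketch. It is not RevBayes code and does no tuning; the target density (a normal centered at 1.0 with standard deviation 0.1, restricted to positive values) is a toy stand-in that I made up, and the function names are my own.

```python
import math
import random

def log_target(x):
    """Toy log-density: normal centered at 1.0 with sd 0.1, restricted to x > 0."""
    if x <= 0.0:
        return -math.inf
    return -0.5 * ((x - 1.0) / 0.1) ** 2

def slide_move(x, delta, rng):
    """One sliding window Metropolis update; returns (new_x, accepted)."""
    # Propose uniformly within a window of width delta centered on the current value
    proposed = x + (rng.random() - 0.5) * delta
    log_ratio = log_target(proposed) - log_target(x)
    # The proposal is symmetric, so no Hastings correction is needed
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposed, True
    return x, False

rng = random.Random(12345)
x, delta = 1.0, 1.0
samples, n_accepted = [], 0
for _ in range(5000):
    x, accepted = slide_move(x, delta, rng)
    n_accepted += accepted
    samples.append(x)

acceptance_rate = n_accepted / len(samples)
sample_mean = sum(samples) / len(samples)
print(f"acceptance rate = {acceptance_rate:.2f}, sample mean = {sample_mean:.3f}")
```

Try changing <tt>delta</tt>: a wider window makes bolder proposals that are rejected more often, while a narrower window is accepted more often but moves more slowly. Tuning adjusts this trade-off automatically.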

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

Download the singularity image for RevBayes as follows:

cd ~/rblab # just to make sure you are in the right place
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
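
For the exchangeabilities question above, it helps to work out what values to expect. We simulated under HKY with kappa = 5, so the 2 transitions (A↔G and C↔T) have relative rate 5 and the 4 transversions have relative rate 1; because the exchangeabilities are constrained to sum to 1, we just normalize. Here is a short Python sketch (for illustration only; the function name and the ordering of the rates are my own choices):

```python
def hky_exchangeabilities(kappa):
    """Normalized exchangeabilities implied by HKY, in the order AC, AG, AT, CG, CT, GT."""
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]  # AG and CT are the transitions
    total = sum(raw)
    return [r / total for r in raw]

rates = hky_exchangeabilities(5.0)
for name, rate in zip(["AC", "AG", "AT", "CG", "CT", "GT"], rates):
    print(f"{name}: {rate:.4f}")
```

Compare the printed values to the marginal densities you see in Tracer.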

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
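
To see why a larger alpha means smaller proposed changes, here is a Python sketch. It rests on an assumption on my part about how the move works: that the proposed simplex is drawn from a Dirichlet distribution whose parameters are alpha times the current values. Under that assumption, the standard deviation of the proposed value for a component currently equal to x is sqrt(x(1-x)/(alpha+1)).

```python
import math

def proposal_sd(x_i, alpha):
    """Sd of one component of a Dirichlet(alpha * x) draw (assumed proposal form)."""
    return math.sqrt(x_i * (1.0 - x_i) / (alpha + 1.0))

current_value = 0.3  # e.g. a state frequency near its true value
sds = {alpha: proposal_sd(current_value, alpha) for alpha in (10.0, 100.0, 1000.0, 10000.0)}
for alpha, sd in sds.items():
    print(f"alpha={alpha:8.0f}  proposal sd={sd:.4f}")
```

Each tenfold increase in alpha shrinks the typical proposal by roughly a factor of 3, so a very sharp posterior needs a very large alpha before proposals land close enough to be accepted reasonably often.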

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
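
If you prefer code to figures, the conversion from the log scale to the linear scale uses the standard lognormal formulas: the mean is exp(mu + sigma²/2) and the standard deviation is the mean times sqrt(exp(sigma²) - 1). Here is a small Python helper (for illustration only; the function name and the example values are my own):

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and sd (linear scale) of a lognormal whose log has mean mu and sd sigma."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

# Example with made-up values: mu = 0 does NOT give a mean of 1 unless sigma is tiny
mean, sd = lognormal_mean_sd(0.0, 0.5)
print(f"mu=0.0, sigma=0.5 -> mean={mean:.4f}, sd={sd:.4f}")
```

You can use this function later to translate your estimated ucln_mu and ucln_sigma to the linear scale.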

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
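To make these weights concrete, recall that RevBayes picks the next move with probability equal to that move's weight divided by the sum of all weights. The Python sketch below (not RevBayes code; the labels are made up for illustration) tallies the weights, assuming ''divtime.Rev'' contains exactly the moves defined in this lab and no others:

```python
# Weights of the moves defined in divtime.Rev (labels are illustrative only)
n_taxa = 20
n_branches = 2 * n_taxa - 2  # 38 branch rate parameters, one mvSlide each

weights = {
    "birth_rate slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "branch_rates slides (all)": 1.0 * n_branches,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}

# Probability of choosing a move = its weight / sum of all weights
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

for name, p in probs.items():
    print(f"{name:28s} {p:.3f}")
```

Under these assumptions the weights sum to 68, so the node time slider is chosen about 15% of the time (10/68), each topology move about 7% (5/68), and any single branch rate move only about 1.5% (1/68).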

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained from an MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate at their true values but keep the ucln_mu, ucln_sigma, and branch_rates priors as they were. This means that we will only be looking at the prior on rates, not node times.

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T03:16:57Z

Paul Lewis: /* Run RevBayes */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and singularity modules:
module load paml/4.9
module load singularity/3.5.2

The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
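Here is a tiny numerical illustration of this confounding, written in Python. It uses the Jukes-Cantor (JC69) model only because its formula is short; JC69 is not the model used later in this lab, and the function name is just for illustration. The expected proportion of sites that differ between two sequences depends on rate and time only through their product:

```python
import math

def jc69_p_diff(rate, time):
    """Expected proportion of differing sites between two sequences
    separated by a path of duration `time` evolving at substitution
    rate `rate` under the Jukes-Cantor (JC69) model."""
    d = rate * time  # expected number of substitutions per site
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Three very different rate/time scenarios with the same product (0.2)
for rate, time in [(1.0, 0.2), (0.1, 2.0), (10.0, 0.02)]:
    print(rate, time, round(jc69_p_diff(rate, time), 6))
```

All three scenarios print the same proportion, so the data alone cannot tell a fast rate over a short time from a slow rate over a long time.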

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
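For the curious, here is one way to arrive at a birth rate near 2.6. This Python sketch is a back-of-the-envelope calculation under stated assumptions, not something taken from evolver and not necessarily the calculation used to prepare this lab: it assumes the process starts with 2 lineages at the root, that the waiting time while k lineages exist is exponential with rate k times the birth rate, and that the tree is sampled just before a 21st lineage would arise.

```python
def yule_birth_rate_for_height(n_tips, height=1.0):
    """Birth rate giving expected root-to-tip height `height` for a pure
    birth tree with n_tips tips, assuming the tree starts with 2 lineages
    and is stopped just before lineage n_tips + 1 would appear."""
    # Expected waiting time with k lineages is 1/(k * birth_rate), so the
    # expected height is (1/2 + 1/3 + ... + 1/n_tips) / birth_rate
    harmonic_part = sum(1.0 / k for k in range(2, n_tips + 1))
    return harmonic_part / height

print(round(yule_birth_rate_for_height(20), 4))
```

For 20 tips and height 1 this gives a value just under 2.6.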

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which we have set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
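As a reminder of how a sliding window proposal works, here is a minimal Python sketch. It is not RevBayes code and makes no claim about how <tt>mvSlide</tt> is implemented internally; the function name and the use of Python's random module are just for illustration.

```python
import random

def slide_proposal(current, delta, rng=random):
    """Propose a new value uniformly from a window of width `delta`
    centered on the current value."""
    return current + delta * (rng.random() - 0.5)

rng = random.Random(12345)  # fixed seed so the example is repeatable
x = 1.0
proposals = [slide_proposal(x, delta=1.0, rng=rng) for _ in range(5)]
print(proposals)  # every value lies between 0.5 and 1.5
```

A larger <tt>delta</tt> makes bolder proposals that are accepted less often; a smaller <tt>delta</tt> makes timid proposals that are accepted more often but explore slowly. Tuning searches for a compromise that gives the target acceptance rate.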

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

Download the singularity image for RevBayes as follows:

cd ~/rblab # just to make sure you are in the right place
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
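The exchangeability values quoted in the answer above can be checked by hand. Under the HKY model used for simulation, the 2 transitions have relative rate kappa and the 4 transversions have relative rate 1. The Python sketch below normalizes these to sum to 1, assuming that the exchangeabilities reported by RevBayes are constrained to sum to 1 (as their Dirichlet prior implies):

```python
kappa = 5.0  # transition/transversion rate ratio used in control.dat

# Relative rates for the 6 exchangeabilities: 2 transitions, 4 transversions
relative = [kappa, kappa, 1.0, 1.0, 1.0, 1.0]
total = sum(relative)

transition = kappa / total    # expected value of each transition rate
transversion = 1.0 / total    # expected value of each transversion rate

print(round(transition, 3), round(transversion, 3))
```

The expected values are 5/14 (about 0.357) and 1/14 (about 0.071), close to the posterior means quoted in the answer.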

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
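To see why a larger alpha means smaller proposed changes, here is a Python sketch of a Dirichlet proposal centered on the current simplex. The sketch assumes the proposal is drawn from a Dirichlet distribution whose parameters are alpha times the current values; that is an assumption made for illustration, so check the RevBayes documentation for the details of <tt>mvDirichletSimplex</tt> before relying on it. The function names are hypothetical.

```python
import random

def dirichlet_proposal(current, alpha, rng):
    """Draw a new simplex from Dirichlet(alpha * current) using
    normalized gamma variates."""
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]

def mean_abs_change(current, alpha, n=2000, seed=1):
    """Average absolute change in the first coordinate over n proposals."""
    rng = random.Random(seed)
    changes = [abs(dirichlet_proposal(current, alpha, rng)[0] - current[0])
               for _ in range(n)]
    return sum(changes) / n

freqs = [0.1, 0.2, 0.3, 0.4]  # state frequencies used in the simulation
for alpha in (10.0, 100.0, 1000.0):
    print(alpha, mean_abs_change(freqs, alpha))
```

The average size of the proposed change shrinks as alpha grows, which is why a sharply peaked posterior calls for a large alpha.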

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model. Because the edge rates replace clock_rate, we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
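The conversion between the two scales uses the standard formulas for the lognormal distribution: the mean is exp(mu + sigma^2/2) and the standard deviation is the mean times sqrt(exp(sigma^2) - 1). Here is a short Python sketch (the function name is just for illustration):

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation (linear scale) of a lognormal variable
    whose log has mean mu and standard deviation sigma."""
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Example: a log-scale mean of 0 does NOT imply a linear-scale mean of 1
mean, sd = lognormal_mean_sd(mu=0.0, sigma=0.5)
print(round(mean, 4), round(sd, 4))
```

You can use the same function to translate the posterior means of ucln_mu and ucln_sigma to the linear scale when reviewing your results.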

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>
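
If you'd rather check the linear-scale numbers than read them off the figure, here is a minimal Python sketch (not part of the lab itself). It assumes the standard lognormal formulas, mean = exp(mu + sigma^2/2) and sd = mean * sqrt(exp(sigma^2) - 1), and the function name <tt>lognormal_mean_sd</tt> is just something I made up for illustration:

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Return (mean, sd) on the linear scale for X ~ Lognormal(mu, sigma).

    mu and sigma are the mean and standard deviation of log(X).
    """
    mean = math.exp(mu + sigma**2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Posterior means of ucln_mu and ucln_sigma quoted in the answer above
mean, sd = lognormal_mean_sd(0.0116, 0.01828)
print(f"mean = {mean:.4f}, sd = {sd:.5f}")
```

Plugging in your own posterior means for ucln_mu and ucln_sigma should give values close to those in the answer (a mean near 1 and a standard deviation near 0).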

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate to their true values but keep the priors on ucln_mu, ucln_sigma, and the branch_rates as they were. This means that we will only be looking at the prior on rates, not node times.

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T03:16:39Z

Paul Lewis: /* Relaxed clocks */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
=== Login to Xanadu ===

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and singularity modules:
module load paml/4.9
module load singularity/3.5.2

The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.

=== Create a directory ===
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab
cd rblab # do all of today's work inside this directory

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
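
To see why the likelihood alone cannot pull rate and time apart, here is a minimal Python sketch. It swaps in the Jukes-Cantor (JC69) model rather than the HKY/GTR models used later in this lab, simply because its transition probability has a one-line formula; the function name <tt>jc69_prob_same</tt> is made up for this illustration:

```python
import math

def jc69_prob_same(rate, time):
    """Probability that a site has the same base at both ends of a branch
    under JC69, where rate*time is the expected number of substitutions
    per site along the branch."""
    v = rate * time
    return 0.25 + 0.75 * math.exp(-4.0 * v / 3.0)

# Three different rate/time scenarios with the same product (1.0)
p_true = jc69_prob_same(rate=1.0, time=1.0)
p_fast_short = jc69_prob_same(rate=2.0, time=0.5)
p_slow_long = jc69_prob_same(rate=0.5, time=2.0)
print(p_true, p_fast_short, p_slow_long)
```

All three scenarios give the same probability because only the product of rate and time enters the formula, so the data cannot distinguish among them; that job falls to the priors.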

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently the only way to get that from evolver is to edit the source code and make your own ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
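
Here is a minimal Python sketch of one way to check that a birth rate of 2.6 gives an expected height near 1 for 20 taxa. It assumes that, while k lineages are present, the waiting time to the next speciation event is exponential with rate k times the birth rate, and it counts the final interval (with all 20 lineages present) as a full waiting time. That convention is my assumption for this check, not necessarily how the value 2.6 was originally derived:

```python
def expected_yule_height(n_tips, birth_rate):
    """Sum of expected waiting times 1/(k*birth_rate) for k = 2..n_tips
    lineages, i.e. from the root split until the moment just before an
    (n_tips+1)th lineage would arise."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(n_tips=20, birth_rate=2.6)
print(f"expected height = {height:.4f}")
```

Under these assumptions the expected height comes out just under 1 (about 0.999).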

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated in the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect: we initialized <tt>nmoves</tt> to 1 because vector indices in the Rev language, like R, start at 1, not 0, so the first move would land at index 2 and leave index 1 empty.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
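
The weight-to-probability calculation described above can be sketched in a few lines of Python. This simply follows the description in the previous paragraph (weight divided by the sum of all weights); it is not RevBayes code, and the function name <tt>move_probabilities</tt> and the dictionary labels are made up for illustration. The two mvDirichletSimplex moves listed here are the ones added in the next section:

```python
def move_probabilities(weights):
    """Map each move label to weight / (sum of all weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# The four moves that strict.Rev ends up with, each given weight 1.0
strict_moves = {
    "mvSlide(birth_rate)": 1.0,
    "mvSlide(clock_rate)": 1.0,
    "mvDirichletSimplex(exchangeabilities)": 1.0,
    "mvDirichletSimplex(state_freqs)": 1.0,
}
probs = move_probabilities(strict_moves)
for name, p in probs.items():
    print(f"{name}: {p:.2f}")
```

With equal weights every move is equally likely to be chosen; raising one move's weight raises its share at the expense of all the others.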

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and how often it succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
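
If the burnin-then-run logic above feels abstract, here is a toy Python sketch of a sliding window move that is tuned during burnin and then frozen. It samples a standard normal rather than a phylogenetic posterior, and the tuning rule (scale delta by 1 plus the difference between the observed and target acceptance rates) is purely hypothetical; I don't claim it is the rule RevBayes uses. All function names are made up for this illustration:

```python
import math
import random

def log_target(x):
    """Log density (up to a constant) of a toy target: standard normal."""
    return -0.5 * x * x

def slide_step(x, delta, rng):
    """One sliding-window Metropolis step; returns (new_x, accepted)."""
    proposal = x + (rng.random() - 0.5) * delta  # window of width delta centered on x
    log_ratio = log_target(proposal) - log_target(x)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return x, False

def run_chain(n_burnin=1000, tuning_interval=100, n_iter=10000,
              delta=1.0, target_rate=0.4, seed=1):
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    # Burnin: every tuning_interval iterations, widen the window if too many
    # proposals were accepted and narrow it if too few were accepted
    for i in range(1, n_burnin + 1):
        x, ok = slide_step(x, delta, rng)
        accepted += ok
        if i % tuning_interval == 0:
            rate = accepted / tuning_interval
            delta *= 1.0 + (rate - target_rate)  # hypothetical tuning rule
            accepted = 0
    # Sampling: delta stays fixed from here on
    samples = []
    for _ in range(n_iter):
        x, _accepted = slide_step(x, delta, rng)
        samples.append(x)
    return delta, samples

tuned_delta, samples = run_chain()
sample_mean = sum(samples) / len(samples)
print(f"tuned delta = {tuned_delta:.2f}, sample mean = {sample_mean:.3f}")
```

The defaults mirror the numbers in our revscript (1000 burnin iterations, tuning every 100, then 10000 sampled iterations). The starting window (delta=1.0) is too timid for this target, so the tuning should widen it during burnin.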

== Run RevBayes ==

Download the singularity image for RevBayes as follows:

cd ~/rblab # just to make sure you are in the right place
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
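
To see where the exchangeability values in the answer above come from, here is a minimal Python sketch. It treats HKY as the special case of GTR in which both transitions have relative rate kappa and all four transversions have relative rate 1, then normalizes so that the six values sum to 1 (as they must for a Dirichlet-distributed vector). The ordering AC, AG, AT, CG, CT, GT is my assumption about how the exchangeabilities are listed, and the function name is made up for illustration:

```python
def hky_exchangeabilities(kappa):
    """Exchangeabilities of the GTR model equivalent to HKY with
    transition/transversion rate ratio kappa, normalized to sum to 1.

    Assumed order: AC, AG, AT, CG, CT, GT (AG and CT are the transitions).
    """
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]

rates = hky_exchangeabilities(kappa=5.0)
print([round(r, 3) for r in rates])
```

With kappa = 5 (the value we gave evolver), the transitions come out to 5/14 and the transversions to 1/14, which is what the posterior means should be hovering around.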

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 edge rate parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the 38 edge rates replace the single clock_rate, leaving 9 of the original parameters, plus the 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
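
The relationship shown in the figure can also be checked numerically. The following is a minimal Python sketch (Python is not otherwise used in this lab, and the function name <tt>lognormal_mean_sd</tt> is made up for illustration). It uses the standard formulas: the mean on the linear scale is exp(mu + sigma^2/2), and the standard deviation is that mean times sqrt(exp(sigma^2) - 1).

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Return (mean, sd) on the linear scale for a Lognormal(mu, sigma) variable,
    where mu and sigma are the mean and standard deviation of the log."""
    mean = math.exp(mu + 0.5 * sigma**2)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Example values chosen only for illustration
mean, sd = lognormal_mean_sd(mu=0.0, sigma=0.5)
print(f"mean = {mean:.4f}, sd = {sd:.4f}")
```

Note that mu = 0 gives a linear-scale mean of exactly 1 only in the limit where sigma goes to 0; any larger sigma pushes the mean above 1.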

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
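
To see what these weights mean in practice, here is a rough Python sketch (not part of the lab scripts) that assumes a move is chosen with probability equal to its weight divided by the sum of all weights. The move labels are informal names invented for this example, and the list of moves is my tally of what ''divtime.Rev'' contains at this point.

```python
# Weights of the moves defined in divtime.Rev (labels are informal, for illustration)
weights = {
    "birth_rate slide": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}
# One slide move of weight 1 for each of the 2*20 - 2 = 38 branch rates
for i in range(1, 39):
    weights[f"branch_rates[{i}] slide"] = 1.0

total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}
for name in ("node time slide", "NNI", "branch_rates[1] slide"):
    print(f"{name}: {probs[name]:.3f}")
```

Because there are so many weight-1 branch rate moves, the tree moves together still account for well under half of all proposals under this assumption.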

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained from an MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T03:15:58Z

Paul Lewis: /* Run RevBayes */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Log in to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and singularity modules:
module load paml/4.9
module load singularity/3.5.2

The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
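
The confounding of rate and time is easy to demonstrate. The Python sketch below (not part of the lab; the function name is made up) uses the Jukes-Cantor model purely because its formula is short, not because it is the model used in this lab. The probability that a site differs between the two ends of an edge depends on rate and time only through their product.

```python
import math

def jc69_prob_different(rate, time):
    """Probability that a site differs between the two ends of an edge under JC69.
    Rate and time enter only through their product (expected substitutions per site)."""
    d = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# A fast rate over a short time versus a slow rate over a long time
fast_and_short = jc69_prob_different(rate=2.0, time=0.5)
slow_and_long = jc69_prob_different(rate=0.5, time=2.0)
print(fast_and_short, slow_and_long)
```

The two scenarios give the same probability, so the sequence data alone cannot tell them apart.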

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above, listed there as the tree height). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
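
One way to see why 2.6 is a sensible choice is the following back-of-the-envelope calculation in Python. This is an assumed derivation, not something evolver reports: in a pure birth process the expected waiting time while k lineages exist is 1/(k * birth rate), and summing these from k = 2 (just after the root split) to k = 20 gives an expected tree height. The function name is invented for this sketch.

```python
def expected_yule_height(n_tips, birth_rate):
    """Sum of expected waiting times 1/(k * birth_rate) while k = 2..n_tips lineages exist."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(n_tips=20, birth_rate=2.6)
print(f"expected height = {height:.3f}")
```

Under these assumptions the result comes out very close to 1.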

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
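
For reference, the proposal step of a sliding window move as described above can be sketched in a few lines of Python. This is only an illustration of the idea, not RevBayes code, and <tt>slide_proposal</tt> is a made-up name; the accept/reject step that follows every proposal is left out.

```python
import random

def slide_proposal(current, delta, rng):
    """Propose a value uniformly from a window of width delta centered on current."""
    return current + delta * (rng.random() - 0.5)

# Five proposals around a current value of 1.0 with window width delta = 1.0
rng = random.Random(1)
proposals = [slide_proposal(1.0, delta=1.0, rng=rng) for _ in range(5)]
print(proposals)
```

A larger <tt>delta</tt> makes bolder proposals, which tend to be accepted less often, while a smaller <tt>delta</tt> makes timid proposals that are accepted more often. Tuning adjusts <tt>delta</tt> to trade these off.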

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.
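
To give a feel for how a proposal can change every element of a simplex while keeping the sum at 1, here is a rough Python sketch. It assumes the new value is drawn from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current values; that is a simplified reading of this kind of move, so check the RevBayes documentation for the details of mvDirichletSimplex. The function name is invented for this example.

```python
import random

def dirichlet_simplex_proposal(current, alpha, rng):
    """Draw a new simplex from Dirichlet(alpha * current) via normalized gamma draws."""
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)
freqs = [0.1, 0.2, 0.3, 0.4]
proposed = dirichlet_simplex_proposal(freqs, alpha=10.0, rng=rng)
print(proposed, sum(proposed))
```

Under this assumption, a larger <tt>alpha</tt> concentrates the proposal distribution around the current value, so proposed changes are smaller.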

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

Download the singularity image for RevBayes as follows:

cd ~/rblab # just to make sure you are in the right place
curl -LO https://github.com/revbayes/revbayes.archive/releases/download/v1.0.13/RevBayes_Singularity_v1.0.13.simg

To run RevBayes, enter the following at the command prompt with the name of your revscript file last:
singularity run --app rb RevBayes_Singularity_v1.0.13.simg strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This distinguishes the approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
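
If you would rather compute the conversion than read it off the figure, here is a minimal Python sketch using the standard lognormal formulas. The function name <tt>lognormal_mean_sd</tt> and the example values are made up for illustration; they are not part of the lab scripts.

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Convert log-scale (mu, sigma) to the mean and standard deviation
    of the lognormally-distributed variable itself (linear scale)."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

# Illustrative values only (not taken from any analysis)
mean, sd = lognormal_mean_sd(mu=0.0, sigma=0.5)
print(f"linear-scale mean = {mean:.4f}, sd = {sd:.4f}")
```

Note that mu=0 does not give a mean of exactly 1 on the linear scale unless sigma is also 0, because the mean depends on both parameters.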

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence because they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.
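
The mean and variance quoted above follow directly from the Exponential(0.01) prior placed on birth_rate. Here is a small Python sketch of that arithmetic; the function name <tt>exponential_summary</tt> is invented for this example, and the 95% central interval is added only to show how vague the prior is.

```python
import math

def exponential_summary(rate):
    """Mean, variance, and central 95% interval of an Exponential(rate) prior."""
    mean = 1.0 / rate
    variance = 1.0 / rate ** 2
    # Quantile function of the exponential distribution: -log(1 - p) / rate
    lower = -math.log(1.0 - 0.025) / rate
    upper = -math.log(1.0 - 0.975) / rate
    return mean, variance, (lower, upper)

mean, variance, interval = exponential_summary(0.01)
print(f"mean = {mean:g}, variance = {variance:g}")
print(f"central 95% prior interval = ({interval[0]:.2f}, {interval[1]:.2f})")
```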

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-23T03:11:14Z

Paul Lewis: /* Login to Xanadu */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml and singularity modules:
module load paml/4.9
module load singularity/3.5.2

The sys-admins had difficulty installing RevBayes directly, so we will use singularity to run RevBayes inside a singularity container. Don't worry, it's easy.

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
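
Here is a small Python sketch of the rate-times-time confounding. For simplicity it uses the Jukes-Cantor (JC69) model rather than the model simulated in this lab, and the function name <tt>jc69_prob_diff</tt> is made up for the example. The probability that a site differs between ancestor and descendant depends only on the product of rate and time, so all three scenarios give the same value.

```python
import math

def jc69_prob_diff(rate, time):
    """Probability that a site differs after `time` at substitution `rate` (JC69)."""
    d = rate * time  # expected number of substitutions per site
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Three very different (rate, time) scenarios with the same product
scenarios = [(1.0, 1.0), (2.0, 0.5), (0.5, 2.0)]
for rate, time in scenarios:
    print(f"rate={rate}, time={time}: P(differ) = {jc69_prob_diff(rate, time):.6f}")
```

Because the likelihood sees only the product, the data alone cannot tell these scenarios apart.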

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
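
As a rough check on the value 2.6, here is a Python sketch of the expected height of a pure birth tree. It rests on an assumption that is not stated above: the height is measured from the root split (2 lineages) until the moment the 21st lineage would have arisen, so it sums the expected waiting times 1/(k*lambda) for k = 2, ..., 20. The function name <tt>expected_yule_height</tt> is invented for this example.

```python
def expected_yule_height(birth_rate, n_tips):
    """Expected time from the root split until the (n_tips + 1)th lineage would arise.

    With k lineages present, the waiting time to the next speciation is
    exponential with rate k * birth_rate, so its expectation is 1 / (k * birth_rate).
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(birth_rate=2.6, n_tips=20)
print(f"expected height = {height:.4f}")
```

Under that assumption the expected height comes out very close to 1, which is consistent with the choice of 2.6.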

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
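
If you prefer not to edit ''control.dat'' by hand, the nine lines described above can be assembled with a few lines of Python. This is only a sketch: the function name <tt>make_evolver_control</tt>, the seed, and the toy 3-taxon tree are placeholders, and in the lab you would pass your own seed and the tree description from ''tree.txt'' with the default 20 taxa.

```python
def make_evolver_control(seed, newick, n_taxa=20, n_sites=10000,
                         kappa=5.0, freqs=(0.1, 0.2, 0.3, 0.4)):
    """Return the text of an evolver control file laid out as in the list above."""
    lines = [
        "2",                                 # line 1: nexus output
        str(seed),                           # line 2: random number seed
        f"{n_taxa} {n_sites} 1",             # line 3: taxa, sites, data sets
        "-1",                                # line 4: use branch lengths in the tree
        newick.strip(),                      # line 5: tree description
        "4",                                 # line 6: HKY model
        f"{kappa:g}",                        # line 7: kappa
        "0 0",                               # line 8: no rate heterogeneity
        " ".join(f"{f:g}" for f in freqs),   # line 9: T C A G frequencies
    ]
    return "\n".join(lines) + "\n"

# Placeholder seed and toy tree, for illustration only
text = make_evolver_control(seed=12345, newick="((A:0.5,B:0.5):0.5,C:1.0);", n_taxa=3)
print(text, end="")
```

To use it for real, write the returned text to ''control.dat'' (for example with <tt>open("control.dat", "w").write(text)</tt>).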

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
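
The weight arithmetic just described is easy to reproduce. Here is a minimal Python sketch (the function name <tt>move_probabilities</tt> is made up, and this is not RevBayes code) using the four moves that the finished strict clock script ends up with, each with weight 1.0.

```python
def move_probabilities(weights):
    """Probability of choosing each move: its weight divided by the sum of all weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# The four moves in the finished strict clock script, all with weight 1.0
strict_moves = {
    "birth_rate": 1.0,
    "clock_rate": 1.0,
    "exchangeabilities": 1.0,
    "state_freqs": 1.0,
}
for name, p in move_probabilities(strict_moves).items():
    print(f"{name}: {p:.3f}")
```

With equal weights every move is equally likely to be chosen; raising one weight increases that move's share at the expense of the others.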

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
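
To make the burnin-with-tuning idea concrete, here is a toy Python sketch of a sliding window move that is tuned every 100 iterations toward a 40% acceptance rate. It is not RevBayes code: the target density (a standard normal), the tuning rule (multiply delta by exp(rate - target)), and all function names are assumptions chosen for illustration.

```python
import math
import random

def log_target(x):
    # Stand-in target: standard normal log density (up to a constant)
    return -0.5 * x * x

def slide_step(x, delta, rng):
    """One sliding window update; returns (new_x, accepted)."""
    proposal = x + (rng.random() - 0.5) * delta  # uniform window of width delta
    log_ratio = log_target(proposal) - log_target(x)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return x, False

def run_chain(burnin=1000, tuning_interval=100, generations=10000,
              delta=1.0, tune_target=0.4, seed=1):
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    # Burnin: adjust delta after every tuning interval
    for i in range(1, burnin + 1):
        x, ok = slide_step(x, delta, rng)
        accepted += ok
        if i % tuning_interval == 0:
            rate = accepted / tuning_interval
            delta *= math.exp(rate - tune_target)  # bolder if accepting too often
            accepted = 0
    # Sampling: delta is now held fixed
    accepted = 0
    for _ in range(generations):
        x, ok = slide_step(x, delta, rng)
        accepted += ok
    return delta, accepted / generations

delta, acceptance = run_chain()
print(f"tuned delta = {delta:.2f}, acceptance rate = {acceptance:.2f}")
```

The window starts out too narrow (almost everything is accepted), so the tuning steps widen it until the acceptance rate drifts toward the target.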

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
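The HPD intervals reported by Tracer can be understood with a small sketch. The Python below is only an illustration, not Tracer's code: it assumes the common sample-based definition (the shortest interval that contains 95% of the sorted samples), and the function name <tt>hpd_interval</tt> and the toy values are made up for this example.

```python
import math

def hpd_interval(samples, mass=0.95):
    """Return the shortest interval containing a fraction `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    # Number of samples that must fall inside the interval
    # (the small epsilon guards against floating-point round-up)
    k = min(n, max(1, math.ceil(mass * n - 1e-9)))
    best_lo, best_hi = xs[0], xs[k - 1]
    # Slide a window of k consecutive sorted samples and keep the narrowest one
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi

# Toy right-skewed "posterior sample" (made-up numbers, not RevBayes output)
toy = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4,
       1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.5, 3.0, 4.0, 9.0]
lo, hi = hpd_interval(toy)
print(lo, hi)
```

For a skewed sample like this one, the HPD interval drops the long right tail rather than trimming equal amounts from both ends.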

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
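To see why a larger alpha means more timid proposals, here is a minimal Python sketch. It assumes the proposed simplex is drawn from a Dirichlet distribution whose parameters are alpha times the current value (my understanding of how this kind of move works, not something checked against the RevBayes source), and it uses the standard formula for the variance of a Dirichlet component. The function name is made up for illustration.

```python
import math

def dirichlet_proposal_sd(current, alpha):
    """Std. dev. of each component of a draw from Dirichlet(alpha * current)."""
    # For Dirichlet(a) with a0 = sum(a): Var[x_i] = m_i * (1 - m_i) / (a0 + 1),
    # where m_i = a_i / a0. Here a = alpha * current and current sums to 1,
    # so m = current and a0 = alpha.
    return [math.sqrt(m * (1.0 - m) / (alpha + 1.0)) for m in current]

state_freqs = [0.1, 0.2, 0.3, 0.4]   # the frequencies used in the simulation
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    sds = dirichlet_proposal_sd(state_freqs, alpha)
    print(alpha, [round(s, 4) for s in sds])
```

Under this assumption, each tenfold increase in alpha shrinks the typical proposed change by roughly a factor of the square root of 10.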

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus the clock_rate that the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
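If you want to convert between the two scales yourself, the standard formulas are mean = exp(mu + sigma^2/2) and variance = (exp(sigma^2) - 1) * exp(2*mu + sigma^2). The short Python sketch below applies them; the function name is made up for illustration and the input values are arbitrary.

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of X when log(X) ~ Normal(mu, sigma)."""
    mean = math.exp(mu + 0.5 * sigma**2)
    # sd = sqrt((exp(sigma^2) - 1) * exp(2*mu + sigma^2)) = mean * sqrt(exp(sigma^2) - 1)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

# Arbitrary illustration: mu = 0 on the log scale does NOT give mean 1 on the linear scale
mean, sd = lognormal_mean_sd(0.0, 1.0)
print(mean, sd)
```

Note that the mean on the linear scale depends on sigma as well as mu, which is why mu = 0 only corresponds to a mean rate near 1 when sigma is small.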

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained from an MCMC analysis of the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-21T23:54:00Z

Paul Lewis: /* Obtaining credible intervals under the prior */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
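A tiny numerical illustration of this confounding may help. The Python sketch below uses the Jukes-Cantor (JC69) model purely because its formula is simple (it is not the model used later in this lab) and shows that the probability of observing a difference at a site depends only on the product of rate and time. The function name and the scenarios are made up for illustration.

```python
import math

def jc69_prob_different(rate, time):
    """Probability that a site differs between ancestor and descendant under JC69."""
    d = rate * time   # expected number of substitutions per site
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Three very different (rate, time) scenarios with the same product rate * time = 1
scenarios = [(1.0, 1.0), (2.0, 0.5), (0.1, 10.0)]
probs = [jc69_prob_different(r, t) for r, t in scenarios]
for (r, t), p in zip(scenarios, probs):
    print(f"rate={r}, time={t}, P(different)={p:.6f}")
```

All three scenarios predict the same data, so the likelihood alone cannot tell them apart; only the priors on rates and times can.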

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
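Here is one way to see where a value near 2.6 could come from. The Python sketch below is my own back-of-the-envelope calculation, not something taken from evolver: it assumes the tree starts with 2 lineages at the root, that the waiting time while k lineages exist is exponential with rate k times the birth rate, and that the tree is sampled just before a 21st lineage would appear.

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure birth tree under the stated assumptions."""
    # While k lineages exist, the expected wait until the next split is 1 / (k * birth_rate)
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

# Birth rate that gives expected height 1 for 20 tips: 1/2 + 1/3 + ... + 1/20
rate_for_unit_height = sum(1.0 / k for k in range(2, 21))
print(round(rate_for_unit_height, 4))
print(round(expected_yule_height(20, 2.6), 4))
```

Under these assumptions the required birth rate is about 2.6, consistent with the value used above.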

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
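The weight-to-probability rule just described is easy to check by hand. The Python sketch below follows the description above (probability equals weight divided by the sum of all weights); the move names and weights in the dictionary are hypothetical and chosen only for illustration.

```python
def move_probabilities(weights):
    """Convert move weights to selection probabilities: weight / sum of all weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical set of moves and weights (not the exact contents of any script in this lab)
weights = {
    "slide_birth_rate": 1.0,
    "slide_clock_rate": 1.0,
    "node_time_slide": 10.0,
    "topology_move": 5.0,
}
probs = move_probabilities(weights)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

A move with weight 10 is thus tried ten times as often as a move with weight 1, whatever the other weights are.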

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.
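Because the data were simulated under HKY with kappa equal to 5, it is worth knowing what exchangeabilities to expect when a GTR model is fitted to them. The Python sketch below treats HKY as a special case of GTR in which the two transitions (A-G and C-T) have relative rate kappa and the four transversions have relative rate 1, then normalizes the six values to sum to 1 as a Dirichlet-distributed simplex must. The ordering AC, AG, AT, CG, CT, GT is an assumption about how the six exchangeabilities are listed, and the function name is made up.

```python
def hky_as_gtr_exchangeabilities(kappa):
    """Normalized GTR exchangeabilities implied by an HKY model with the given kappa."""
    pairs = ["AC", "AG", "AT", "CG", "CT", "GT"]   # assumed ordering
    transitions = {"AG", "CT"}                      # purine-purine and pyrimidine-pyrimidine
    raw = {p: (kappa if p in transitions else 1.0) for p in pairs}
    total = sum(raw.values())
    return {p: r / total for p, r in raw.items()}

expected = hky_as_gtr_exchangeabilities(5.0)
for pair, value in expected.items():
    print(pair, round(value, 4))
```

The ratio of a transition exchangeability to a transversion exchangeability recovers kappa, which gives a quick sanity check on the posterior means.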

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
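To make the burnin tuning less mysterious, here is a toy Python sketch of a sliding window move whose window width is adjusted every 100 iterations toward a 40% acceptance rate. It is a rough illustration only: the target density is a stand-in (an Exponential with rate 1), the update rule for the window width is one plausible choice rather than a documented description of what RevBayes does, and all names are made up.

```python
import math
import random

def log_target(x):
    """Toy log-density: Exponential(rate=1) on x > 0, standing in for a posterior."""
    return -x if x > 0 else float("-inf")

def run_burnin(generations=1000, tuning_interval=100, delta=1.0,
               tune_target=0.4, seed=1):
    rng = random.Random(seed)
    x = 1.0            # starting value (compare setValue)
    accepted = 0
    for g in range(1, generations + 1):
        # Sliding window: propose uniformly within delta/2 on either side of x
        proposal = x + (rng.random() - 0.5) * delta
        log_ratio = log_target(proposal) - log_target(x)
        # Metropolis acceptance (the proposal is symmetric, so no Hastings ratio)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        if g % tuning_interval == 0:
            rate = accepted / tuning_interval
            # Illustrative rule: widen the window (bolder) if accepting too often,
            # narrow it (more conservative) if accepting too rarely
            if rate > tune_target:
                delta *= 1.0 + (rate - tune_target) / (1.0 - tune_target)
            else:
                delta /= 2.0 - rate / tune_target
            accepted = 0
    return delta, x

tuned_delta, last_x = run_burnin()
print(tuned_delta, last_x)
```

The window width that comes out of burnin is then held fixed for the real run, so the samples you keep are all generated by the same proposal.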

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
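
To check the answer about exchangeabilities, here is a minimal Python sketch (not part of the RevBayes analysis) that computes the GTR exchangeabilities implied by the HKY model used in the simulation. It assumes the six exchangeabilities are scaled to sum to 1, as they are under the Dirichlet prior in ''strict.Rev''; the function name is made up for illustration.

```python
# GTR exchangeabilities implied by an HKY model with a given kappa.
# Assumption: the six values are normalized to sum to 1, matching the
# Dirichlet prior placed on exchangeabilities in strict.Rev.

def hky_exchangeabilities(kappa):
    # Order: A<->C, A<->G, A<->T, C<->G, C<->T, G<->T
    # A<->G and C<->T are transitions; the other four are transversions.
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]

rates = hky_exchangeabilities(5.0)  # kappa = 5 was set in control.dat
transition = rates[1]
transversion = rates[0]
print(f"transition rate:   {transition:.3f}")
print(f"transversion rate: {transversion:.3f}")
print(f"ratio:             {transition / transversion:.2f}")
```

With kappa equal to 5, each transition gets 5/14 and each transversion 1/14 of the total, which is what the marginal densities in Tracer should be centered over.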

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
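
To see why a larger alpha means smaller proposed changes, here is a small Python sketch. It assumes the simplex move proposes a new value by drawing from a Dirichlet distribution whose parameters are alpha times the current values (my understanding of how this kind of move works, not something verified against the RevBayes source). Under that assumption, the standard deviation of each proposed component follows from the usual Dirichlet variance formula.

```python
import math

def proposal_sd(current_value, alpha):
    """Standard deviation of one component of a Dirichlet(alpha * current) draw.

    For a Dirichlet whose parameters sum to alpha, a component with mean p
    has variance p * (1 - p) / (alpha + 1).
    """
    return math.sqrt(current_value * (1.0 - current_value) / (alpha + 1.0))

# How far a proposal typically moves a state frequency currently equal to 0.4
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    print(f"alpha = {alpha:7.0f}   proposal sd = {proposal_sd(0.4, alpha):.4f}")
```

The proposal standard deviation shrinks roughly as one over the square root of alpha, so going from 100 to 10000 makes proposals about 10 times more conservative.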

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've replaced the single clock_rate parameter with 2*20-2=38 edge rate parameters. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
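
If you prefer formulas to figures, the conversion from the log scale to the linear scale can be sketched in a few lines of Python. These are the standard lognormal moment formulas; the function name and the example values (mu = 0, sigma = 1) are just for illustration.

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation (linear scale) of a Lognormal(mu, sigma) variable.

    mu and sigma are the mean and standard deviation of the LOG of the variable.
    """
    mean = math.exp(mu + 0.5 * sigma**2)
    sd = mean * math.sqrt(math.exp(sigma**2) - 1.0)
    return mean, sd

mean, sd = lognormal_mean_sd(0.0, 1.0)
print(f"Lognormal(0, 1): mean = {mean:.4f}, sd = {sd:.4f}")
```

Note that even with mu = 0 the mean on the linear scale is greater than 1, because the lognormal distribution has a long right tail.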

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
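
Taking the weights at face value as relative selection frequencies (weight divided by the sum of all weights), here is a rough Python tally of how often each kind of move would be chosen in ''divtime.Rev''. The list of moves assumes you followed the lab exactly, with 20 taxa and therefore 38 branch rate moves; the dictionary labels are just descriptive strings, not RevBayes identifiers.

```python
# Weights of the moves defined in divtime.Rev (assumes the lab was followed as written)
weights = {
    "mvNodeTimeSlideUniform(timetree)": 10.0,
    "mvNNI(timetree)": 5.0,
    "mvNarrow(timetree)": 5.0,
    "mvFNPR(timetree)": 5.0,
    "mvSlide(birth_rate)": 1.0,
    "mvSlide(ucln_mu)": 1.0,
    "mvSlide(ucln_sigma)": 1.0,
    "mvDirichletSimplex(exchangeabilities)": 1.0,
    "mvDirichletSimplex(state_freqs)": 1.0,
}

# One slide move per branch rate: 2*20 - 2 = 38 branches
for i in range(1, 39):
    weights[f"mvSlide(branch_rates[{i}])"] = 1.0

total = sum(weights.values())
probabilities = {name: w / total for name, w in weights.items()}

print(f"sum of weights: {total:.0f}")
for name in ("mvNodeTimeSlideUniform(timetree)", "mvNNI(timetree)", "mvSlide(birth_rate)"):
    print(f"{name:35s} {probabilities[name]:.3f}")
```

The tally shows that, even with the extra weight, each tree move still competes with 38 branch rate moves for attention.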

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we would have to conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data are used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior and slide move for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
#moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-21T23:46:53Z

Paul Lewis: /* Obtaining credible intervals under the prior */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
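
Here is a small Python illustration of why rates and times cannot be separated using the sequences alone. It uses the JC69 model (chosen only because its formula is simple; the lab itself simulates under HKY) to compute the probability that a site differs between two tips. The function name is made up for this sketch.

```python
import math

def jc69_prob_different(rate, time):
    """Probability that a site differs between two tips under JC69.

    `time` is the age of the tips' common ancestor and `rate` is the
    substitution rate, so the path between the tips has length 2 * rate * time.
    """
    distance = 2.0 * rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))

# Three very different rate/time scenarios with the same product
for rate, time in ((1.0, 1.0), (2.0, 0.5), (0.5, 2.0)):
    print(f"rate = {rate:.1f}, time = {time:.1f}: P(different) = {jc69_prob_different(rate, time):.6f}")
```

All three scenarios predict exactly the same data, so the likelihood cannot prefer one over the others; only the priors can.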

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
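
One way to see that a birth rate of 2.6 gives an expected height near 1 for 20 taxa is sketched below in Python. The assumption here (one plausible accounting, not necessarily the one used to arrive at 2.6) is that the height runs from the root split, when there are 2 lineages, to the moment a 21st lineage would have appeared.

```python
def expected_yule_height(birth_rate, n_tips):
    """Expected height of a pure birth tree measured from the root split.

    While k lineages exist, the waiting time to the next speciation event is
    exponential with mean 1 / (k * birth_rate). Assumption: the tree is
    stopped at the end of the interval in which n_tips lineages exist.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(2.6, 20)
print(f"expected height with birth rate 2.6 and 20 tips: {height:.3f}")
```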

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
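
As a reminder of what a sliding window move does, here is a minimal Python sketch of the proposal step alone. It assumes the proposal is drawn uniformly from a window of width delta centered on the current value, as described above; how proposals that fall outside the valid range of a parameter are handled is left out, and the function name is hypothetical.

```python
import random

def slide_proposal(current, delta, rng):
    """Propose a value uniformly from a window of width delta centered on current."""
    return current + delta * (rng.random() - 0.5)

rng = random.Random(1)      # fixed seed so the sketch is repeatable
current = 1.0               # starting value used for clock_rate
proposals = [slide_proposal(current, 1.0, rng) for _ in range(5)]
for p in proposals:
    print(f"proposed value: {p:.4f}")
```

A larger delta produces bolder proposals that are rejected more often; tuning adjusts delta until the acceptance rate is near the target.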

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions. The <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution (all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
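The exchangeability question above can be checked with a little arithmetic. The sketch below converts the HKY transition/transversion rate ratio used in the simulation (kappa = 5) into six GTR exchangeabilities that sum to 1, as required by the Dirichlet prior. The ordering AC, AG, AT, CG, CT, GT is an assumption made for labeling only; check the RevBayes documentation for the order actually used in the log file.

```python
# Relative rates under HKY with kappa = 5, rescaled to sum to 1 so that they
# are comparable to the exchangeabilities sampled under the Dirichlet prior.
kappa = 5.0

# Assumed labeling order: AC, AG, AT, CG, CT, GT.
# AG and CT are transitions; the other four are transversions.
unnormalized = {"AC": 1.0, "AG": kappa, "AT": 1.0,
                "CG": 1.0, "CT": kappa, "GT": 1.0}

total = sum(unnormalized.values())
expected = {pair: rate / total for pair, rate in unnormalized.items()}

for pair, value in expected.items():
    print(f"{pair}: {value:.4f}")
```

The two transitions each come out to 5/14 and the four transversions to 1/14, which is what the marginal densities in Tracer should be centered near.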

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
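To see why a larger alpha means smaller proposed changes, here is a minimal Python sketch of a Dirichlet-centered simplex proposal. It assumes the proposal is drawn from a Dirichlet distribution with parameters alpha times the current values; this is a simplified stand-in for illustration, not RevBayes's actual mvDirichletSimplex code, and the function names are hypothetical.

```python
import random

def propose_simplex(current, alpha, rng):
    """Draw a new simplex from Dirichlet(alpha * current) via gamma variates."""
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]

def mean_abs_change(current, alpha, rng, n_draws=2000):
    """Average absolute change per coordinate over many proposals."""
    accum = 0.0
    for _ in range(n_draws):
        proposal = propose_simplex(current, alpha, rng)
        accum += sum(abs(p - c) for p, c in zip(proposal, current)) / len(current)
    return accum / n_draws

rng = random.Random(12345)
freqs = [0.1, 0.2, 0.3, 0.4]  # the state frequencies used in the simulation

for alpha in (10.0, 100.0, 1000.0):
    print(alpha, round(mean_abs_change(freqs, alpha, rng), 4))
```

The average step size shrinks as alpha grows, so a sharply peaked posterior needs a large alpha before a reasonable fraction of proposals is accepted.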

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
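As a small numeric companion to the figure, the following Python sketch converts the lognormal parameters mu and sigma (mean and standard deviation on the log scale) into the mean and standard deviation on the linear scale, using the standard lognormal formulas. The function name and the example values are illustrative only.

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Return (mean, sd) of X on the linear scale when log(X) ~ Normal(mu, sigma)."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

# Example: mu = 0 does NOT give a mean of 0, and the mean is not even exp(0) = 1
# unless sigma is tiny, because the variance term shifts the mean upward.
for mu, sigma in [(0.0, 0.5), (0.0, 0.05), (1.0, 0.5)]:
    mean, sd = lognormal_mean_sd(mu, sigma)
    print(f"mu={mu}, sigma={sigma} -> mean={mean:.4f}, sd={sd:.4f}")
```

You can use the same function later to translate your posterior means of ucln_mu and ucln_sigma to the linear scale.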

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
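To make the effect of the weights concrete, the Python sketch below tabulates the moves defined in ''divtime.Rev'' and computes each move's selection probability as its weight divided by the sum of all weights (the rule described earlier for the strict clock moves). The move labels are just strings for display, and the tally assumes 20 taxa and that your script defines exactly the moves shown in this lab.

```python
# Weights of all moves in divtime.Rev, keyed by a descriptive label.
weights = {
    "mvSlide(birth_rate)": 1.0,
    "mvNodeTimeSlideUniform(timetree)": 10.0,
    "mvNNI(timetree)": 5.0,
    "mvNarrow(timetree)": 5.0,
    "mvFNPR(timetree)": 5.0,
    "mvSlide(ucln_mu)": 1.0,
    "mvSlide(ucln_sigma)": 1.0,
    "mvDirichletSimplex(exchangeabilities)": 1.0,
    "mvDirichletSimplex(state_freqs)": 1.0,
}

# One sliding-window move per branch rate (2*n_taxa - 2 branches).
n_taxa = 20
n_branches = 2 * n_taxa - 2
for i in range(1, n_branches + 1):
    weights[f"mvSlide(branch_rates[{i}])"] = 1.0

total_weight = sum(weights.values())
probabilities = {name: w / total_weight for name, w in weights.items()}

print("total weight:", total_weight)
for name in ("mvNodeTimeSlideUniform(timetree)", "mvNNI(timetree)",
             "mvSlide(birth_rate)"):
    print(f"{name}: {probabilities[name]:.4f}")
```

Although each tree move is individually favored, the 38 branch rate moves together still account for more than half of all attempted moves.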

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.
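The mean and variance quoted above for the birth_rate prior follow directly from the Exponential(0.01) distribution. The short Python sketch below computes them, along with the central 95% prior interval, to show how diffuse this prior is compared with the true value of 2.6. The helper function name is illustrative.

```python
import math

rate = 0.01  # rate parameter of the Exponential prior on birth_rate

prior_mean = 1.0 / rate
prior_variance = 1.0 / rate ** 2

def exponential_quantile(p, rate):
    """Inverse CDF of the exponential distribution."""
    return -math.log(1.0 - p) / rate

lower = exponential_quantile(0.025, rate)
upper = exponential_quantile(0.975, rate)

print("prior mean:", prior_mean)
print("prior variance:", prior_variance)
print(f"central 95% prior interval: ({lower:.2f}, {upper:.2f})")
```

Most of the prior mass lies far above 2.6, which is why trees drawn from this prior would look very different from the tree used for simulation.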

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

<div style="background-color: #ccccff">
* ''Does the data contain information about substitution rates?'' {{title|yes, the credible intervals are much smaller when the data is used, so the data has greatly reduced the number of rate combinations that are plausible, which is the definition of information|answer}}
</div>

== Where to continue ==

If you need to estimate divergence times, and especially if you have fossils that can help you calibrate the molecular clock (so that you don't have to pin the root time at 1 like we did here), you should continue with the tutorial [https://revbayes.github.io/tutorials/clocks/ Relaxed Clocks and Time Trees] on the RevBayes web site.

== What to turn in ==

Use FigTree to create a PDF figure of your ''divtimeMAP.tre'' with credible intervals indicated by bars and turn that in to get credit.

Phylogenetics: RevBayes Lab

2020-04-21T23:35:25Z

Paul Lewis: /* Obtaining credible intervals under the prior */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
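The confounding of rate and time can be illustrated with a trivial Python sketch: very different (rate, time) pairs give the same expected number of substitutions per site along an edge, so the sequence data alone cannot tell them apart. The scenarios listed are arbitrary illustrative values.

```python
def expected_substitutions(rate, time):
    """Expected number of substitutions per site along an edge."""
    return rate * time

# Four different rate/time scenarios for the same edge.
scenarios = [(1.0, 1.0), (2.0, 0.5), (0.5, 2.0), (10.0, 0.1)]

for rate, time in scenarios:
    print(f"rate={rate:5.1f}  time={time:4.1f}  "
          f"expected substitutions={expected_substitutions(rate, time):.3f}")
```

All four scenarios yield the same edge length in substitutions per site, so only the priors on rates and times can distinguish among them.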

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
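One way to see where a value near 2.6 could come from is sketched below in Python. Under a pure birth process, the waiting time while k lineages exist is exponential with rate k times the birth rate, so its expectation is 1/(k*lambda). The sketch assumes the tree starts with 2 lineages at the root and is stopped at the end of the interval with 20 lineages; whether this is exactly the calculation behind the 2.6 used in this lab is an assumption.

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure birth tree.

    Sums the expected waiting times 1/(k * birth_rate) for k = 2, ..., n_tips
    lineages (assumes the process starts at the root split and stops at the
    end of the interval with n_tips lineages).
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

n_tips = 20
height = expected_yule_height(n_tips, 2.6)
print(f"expected height with birth rate 2.6: {height:.4f}")

# Birth rate giving expected height exactly 1 under the same assumption.
rate_for_unit_height = expected_yule_height(n_tips, 1.0)
print(f"birth rate for expected height 1: {rate_for_unit_height:.4f}")
```

Under this assumption the expected height with a birth rate of 2.6 is very close to 1.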

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
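The relationship between window width and acceptance rate can be explored with a toy example. The Python sketch below runs a generic sliding window Metropolis sampler on a standard normal target (a stand-in chosen for illustration, not the posterior from this lab) and reports the acceptance rate for several values of delta. It is a sketch of the general technique and makes no claim about how RevBayes implements mvSlide or its tuning.

```python
import math
import random

def log_target(x):
    """Unnormalized log density of the toy target (standard normal)."""
    return -0.5 * x * x

def acceptance_rate(delta, n_iter, rng):
    """Run a sliding window Metropolis sampler and return the acceptance rate."""
    x = 0.0
    accepted = 0
    for _ in range(n_iter):
        # Propose uniformly within a window of width delta centered on x.
        proposal = x + delta * (rng.random() - 0.5)
        log_ratio = log_target(proposal) - log_target(x)
        # The proposal is symmetric, so the Hastings ratio is 1.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
            accepted += 1
    return accepted / n_iter

rng = random.Random(2020)
for delta in (0.5, 2.0, 4.0, 8.0):
    print(f"delta={delta:4.1f}  acceptance rate={acceptance_rate(delta, 20000, rng):.3f}")
```

Narrow windows are accepted almost always but explore slowly, while wide windows are rejected often; tuning looks for a delta between these extremes that hits the target acceptance rate.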

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing <tt>mymodel</tt>, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
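To make the idea of tuning during burnin more concrete, here is a small Python sketch of one plausible tuning rule. Treat it as an assumption for illustration: I am not claiming this is the rule RevBayes applies, and the function name <tt>tune_delta</tt> is hypothetical. The point is just that a window accepted too often gets wider (bolder) and a window accepted too rarely gets narrower (more conservative).

```python
def tune_delta(delta, acceptance_rate, target=0.4):
    """Return an adjusted window width given the acceptance rate observed
    over the last tuning interval (illustrative rule, assumed for this sketch)."""
    if acceptance_rate > target:
        # Accepting too often: proposals are too timid, so widen the window
        return delta * (1.0 + (acceptance_rate - target) / (1.0 - target))
    # Accepting too rarely (or exactly on target): narrow the window
    return delta / (2.0 - acceptance_rate / target)


delta = 1.0
# Pretend these acceptance rates were observed in successive 100-iteration intervals
for observed in (0.05, 0.15, 0.30, 0.45, 0.40):
    delta = tune_delta(delta, observed)
    print(f"observed acceptance {observed:.2f} -> delta {delta:.4f}")
```

With <tt>tuningInterval=100</tt> and 1000 burnin generations, an adjustment like this would happen 10 times before sampling begins.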

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
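To see why a larger alpha means smaller proposed changes, here is a Python sketch under the assumption that the proposal draws the new simplex from a Dirichlet distribution whose parameters are alpha times the current values. That assumption, and the helper names <tt>propose_simplex</tt> and <tt>mean_abs_change</tt>, are mine for illustration; consult the RevBayes documentation for the details of the real move.

```python
import random


def propose_simplex(current, alpha, rng):
    """Draw from Dirichlet(alpha * current) by normalizing independent
    Gamma draws (assumed proposal mechanism, for illustration only)."""
    draws = [rng.gammavariate(alpha * c, 1.0) for c in current]
    total = sum(draws)
    return [d / total for d in draws]


def mean_abs_change(current, alpha, n_draws, rng):
    """Average absolute change per coordinate over many proposals."""
    total = 0.0
    for _ in range(n_draws):
        proposed = propose_simplex(current, alpha, rng)
        total += sum(abs(p - c) for p, c in zip(proposed, current)) / len(current)
    return total / n_draws


rng = random.Random(42)
freqs = [0.1, 0.2, 0.3, 0.4]  # the state frequencies used in the simulation
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    change = mean_abs_change(freqs, alpha, 2000, rng)
    print(f"alpha = {alpha:>7.0f}: mean absolute change = {change:.5f}")
```

Every proposal still sums to 1, but the proposals hug the current values more tightly as alpha grows, which is what a very sharp posterior needs.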

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
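Here is a small Python sketch of the conversion from mu and sigma (log scale) to the mean and standard deviation of the lognormally-distributed variable itself (linear scale). The formulas are the standard ones for the lognormal distribution; the function name and the example values mu=0.0 and sigma=0.5 are arbitrary choices of mine.

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Convert the log-scale parameters of a lognormal distribution
    into the mean and standard deviation on the linear scale."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


# Arbitrary example values; try plugging in your own estimates later
mean, sd = lognormal_mean_sd(mu=0.0, sigma=0.5)
print(f"linear-scale mean = {mean:.4f}, standard deviation = {sd:.4f}")
```

Notice that mu = 0 does ''not'' give a mean of exactly 1 on the linear scale unless sigma is also 0.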

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
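To put numbers on those weights, here is a Python sketch that tallies the moves and converts each weight into a selection probability (weight divided by the sum of all weights). The move list assumes your ''divtime.Rev'' contains exactly the moves defined in this lab; the labels are just descriptive names I made up.

```python
# Weights of the moves assumed to be present in divtime.Rev
weights = {
    "birth_rate slide": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}

# One sliding window move per branch rate (2*20 - 2 = 38 branches)
n_taxa = 20
for i in range(1, 2 * n_taxa - 2 + 1):
    weights[f"branch_rates[{i}] slide"] = 1.0

total_weight = sum(weights.values())
probs = {name: w / total_weight for name, w in weights.items()}

print(f"sum of weights: {total_weight}")
for name in ("node time slide", "NNI", "birth_rate slide"):
    print(f"{name}: {probs[name]:.4f}")
```

Under these assumptions, the three topology moves together account for 15/68 of all attempted moves.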

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

To finish up the lab, let's see what the credible interval sizes are under the prior. While we could explore the actual prior, the results would be a little disappointing. For example, under the pure birth model we are using, sampling from the prior would yield many thousands of very different tree topologies, and the consensus of all these disparate trees would be a star tree, which would not be very interesting. Similarly, allowing the birth_rate to be sampled from its prior (which has mean 100 and variance 10000!) would produce trees that, on average, look so different from the tree we used to simulate our data that comparison of divergence time credible intervals would be difficult. So, we will fix the tree topology and birth_rate prior to their true values but keep the ucln_mu, ucln_sigma, and the branch_rates priors at their original values. This means that we will only be looking at the prior on rates, not node times.
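In case the mean and variance quoted for the birth_rate prior seem mysterious, here is the arithmetic as a tiny Python sketch. The formulas are the standard ones for an Exponential distribution with a given rate; the 95% upper bound is an extra I added to show just how vague this prior is.

```python
import math

rate = 0.01  # the rate parameter used in dnExponential(0.01)

prior_mean = 1.0 / rate           # mean of an Exponential is 1/rate
prior_variance = 1.0 / rate ** 2  # variance is 1/rate^2
upper95 = -math.log(0.05) / rate  # 95% of the prior mass lies below this value

print(f"mean = {prior_mean:.1f}, variance = {prior_variance:.1f}")
print(f"95% of the prior probability lies below {upper95:.1f}")
```

Compare these with the true birth rate of 2.6 and you can see why trees sampled from this prior would look nothing like ours.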

Copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. Change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. Comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. Comment out the existing lines setting up the prior for birth_rate and replace with a single line making the birth_rate a constant node:

#birth_rate ~ dnExponential(0.01)
#birth_rate.setValue(1.0)
birth_rate <- 2.6

4. Change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

5. Change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open both ''divpriorMAP.tre'' and ''divtimeMAP.tre'' and make the node bars equal the 95% HPD intervals in each.

Phylogenetics: RevBayes Lab

2020-04-21T18:11:44Z

Paul Lewis: /* Obtaining credible intervals under the prior */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Log in to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
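Here is a Python sketch of why rates and times are confounded. To keep it short I've swapped in the Jukes-Cantor (JC69) model rather than the HKY or GTR models used later in this lab; under JC69 the probability that two sequences differ at a site depends only on the product of rate and time. The function name is hypothetical.

```python
import math


def jc69_prob_different(rate, time):
    """Probability that a site differs between the two ends of an edge
    under JC69, where rate * time is the expected number of substitutions per site."""
    expected_subs = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * expected_subs / 3.0))


# Three very different rate/time scenarios with the same product (2.0)
scenarios = [(1.0, 2.0), (2.0, 1.0), (0.5, 4.0)]
for rate, time in scenarios:
    p = jc69_prob_different(rate, time)
    print(f"rate = {rate}, time = {time}: P(site differs) = {p:.6f}")
```

All three scenarios give identical probabilities, so the likelihood alone cannot tell them apart; only the priors on rates and times can.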

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
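Here is one way to check the value 2.6 in Python. This is my reconstruction, not necessarily the calculation used to choose the value: it assumes the process starts with 2 lineages at the root, that the waiting time while there are k lineages is exponential with mean 1/(k × birth rate), and that the tree height includes the final interval with all 20 lineages present.

```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected root-to-tip height of a pure birth tree, summing the expected
    waiting times with k = 2, 3, ..., n_taxa lineages (assumptions stated above)."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))


height = expected_yule_height(n_taxa=20, birth_rate=2.6)
print(f"expected height with birth rate 2.6: {height:.4f}")
```

Under these assumptions the expected height comes out very close to 1, consistent with the choice of 2.6.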

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
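If you would rather generate the control file than edit it by hand, here is a hypothetical Python helper that assembles the 9 lines described above. The function name, its arguments, and the toy 3-taxon tree are all made up for illustration; it also assumes the tree description is a single Newick string.

```python
def build_control(seed, newick, n_taxa=20, n_sites=10000, kappa=5,
                  freqs=(0.1, 0.2, 0.3, 0.4)):
    """Return the text of an evolver control file laid out as in this lab
    (hypothetical helper, not part of PAML)."""
    lines = [
        "2",                              # line 1: nexus output
        str(seed),                        # line 2: random number seed
        f"{n_taxa} {n_sites} 1",          # line 3: taxa, sites, data sets
        "-1",                             # line 4: use branch lengths in tree
        newick.strip(),                   # line 5: tree description
        "4",                              # line 6: HKY model
        str(kappa),                       # line 7: kappa
        "0 0",                            # line 8: no rate heterogeneity
        " ".join(str(f) for f in freqs),  # line 9: frequencies of T, C, A, G
    ]
    return "\n".join(lines) + "\n"


# Toy example with a made-up seed and a made-up 3-taxon tree
toy_tree = "((A:0.5,B:0.5):0.5,C:1.0);"
control_text = build_control(seed=12345, newick=toy_tree, n_taxa=3)
print(control_text)
```

For the real thing you would read the tree description from ''tree.txt'', keep <tt>n_taxa=20</tt>, and write the returned text to ''control.dat''.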

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.
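To make the idea of a simplex-preserving proposal concrete, here is a minimal Python sketch. It assumes the proposal draws the new simplex from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current values; that is one common way to build such a move and should be treated as an assumption about, not a description of, the RevBayes implementation. The function names are hypothetical.

```python
import random

def dirichlet_draw(params, rng=random):
    """Draw one point from a Dirichlet distribution by normalizing Gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def simplex_proposal(current, alpha, rng=random):
    """Propose a new simplex from Dirichlet(alpha * current) (assumed proposal form)."""
    return dirichlet_draw([alpha * x for x in current], rng)

rng = random.Random(12345)
current = [0.1, 0.2, 0.3, 0.4]
bold = simplex_proposal(current, alpha=10.0, rng=rng)       # can land far from current
timid = simplex_proposal(current, alpha=10000.0, rng=rng)   # stays close to current
```

Every proposed vector sums to 1 by construction, and under this assumed form a larger <tt>alpha</tt> concentrates the proposal around the current value, so proposed changes get smaller.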

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
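The burnin-then-run pattern in this section can be illustrated with a toy sampler in Python. This is a sketch under stated assumptions, not a description of RevBayes internals: the target distribution is an arbitrary sharp Normal(1.0, 0.1), and the tuning rule (multiply the window width by 1.25 if the acceptance rate is above the target, otherwise by 0.8) is hypothetical.

```python
import math
import random

def log_target(x):
    """Log density (up to a constant) of a toy target, Normal(mean=1.0, sd=0.1)."""
    return -0.5 * ((x - 1.0) / 0.1) ** 2

def run_chain(n_burnin=1000, n_run=10000, tuning_interval=100,
              delta=1.0, tune_target=0.4, seed=1):
    """Sliding-window Metropolis sampler whose window width is tuned during burnin only."""
    rng = random.Random(seed)
    x = 1.0
    tried = accepted = 0
    samples = []
    for it in range(n_burnin + n_run):
        # Symmetric sliding-window proposal of width delta
        x_new = x + delta * (rng.random() - 0.5)
        log_ratio = log_target(x_new) - log_target(x)
        tried += 1
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = x_new
            accepted += 1
        if it < n_burnin:
            if (it + 1) % tuning_interval == 0:
                # Hypothetical tuning rule: bolder if accepting too often, else more conservative
                rate = accepted / tried
                delta *= 1.25 if rate > tune_target else 0.8
                tried = accepted = 0
        else:
            samples.append(x)  # only post-burnin values are kept
    return delta, accepted / tried, samples

final_delta, acceptance_rate, samples = run_chain()
```

The returned acceptance rate covers only the post-burnin iterations, which mirrors what an operator summary reports: how often a move was tried and how often it succeeded.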

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
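When judging the exchangeability densities, it helps to know what values to expect. The data were simulated under HKY with kappa = 5, while the analysis uses GTR with six exchangeabilities that sum to 1. A small Python sketch of that conversion follows; the function name is hypothetical, and it assumes the two transition exchangeabilities equal kappa and the four transversion exchangeabilities equal 1 before normalization.

```python
def hky_as_gtr_exchangeabilities(kappa):
    """Normalized GTR exchangeabilities implied by HKY.

    Assumes 2 transitions with relative rate kappa and 4 transversions
    with relative rate 1, rescaled so that all six values sum to 1.
    """
    total = 2 * kappa + 4
    return kappa / total, 1.0 / total

transition, transversion = hky_as_gtr_exchangeabilities(5.0)
# transition is 5/14 and transversion is 1/14
```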

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
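The conversion from (mu, sigma) on the log scale to the mean and standard deviation on the linear scale uses the standard lognormal formulas: the mean is exp(mu + sigma^2/2) and the variance is (exp(sigma^2) - 1) exp(2 mu + sigma^2). Here is a small Python sketch; the function name <tt>lognormal_mean_sd</tt> is hypothetical, and the input values are arbitrary.

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation (linear scale) of a Lognormal(mu, sigma) variable,
    where mu and sigma are the mean and standard deviation of its log."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

# Even with mu = 0 the mean is not exp(0) = 1: it is exp(0.5), about 1.65
mean, sd = lognormal_mean_sd(0.0, 1.0)
```

When sigma is close to 0, the mean is close to exp(mu) and the standard deviation is close to sigma times exp(mu), so the two scales nearly agree only in that special case.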

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate? {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size as those obtained by performing the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

This has been a long lab, but there is one more thing I want you to try before you go. Let's see what the credible interval sizes are under the prior. We should not change the tree topology this time, as the prior on tree topology is flat across all possible tree topologies, so we will end up with a star tree if we allow the topology to be modified.

You should copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and make the following changes in the new file:

1. change output file names to have prefix ''divprior'' rather than ''divtime'' so that you will not overwrite previous files (don't forget to do this in the readTreeTrace and mapTree commands);

2. comment out the 3 lines setting up moves (mvNNI, mvNarrow, mvFNPR) that change tree topology;

3. change the setup for the mnScreen monitor to have <tt>printgen=10000</tt> rather than <tt>printgen=100</tt>; and

4. change the MCMC burnin and run commands to include <tt>underPrior=TRUE</tt>, and change the number of generations in the run command to 1 million (don't worry, it goes fast if you don't ever calculate a likelihood!):

mymcmc.burnin(generations=1000, tuningInterval=100, underPrior=TRUE)
mymcmc.run(generations=1000000, underPrior=TRUE)

Now run the file as usual:
rb divprior.Rev

Open the ''divpriorMAP.tre'' file in FigTree and make the node bars equal the 95% HPD intervals. This time it should look quite different!

Phylogenetics: RevBayes Lab

2020-04-21T15:45:02Z

Paul Lewis: /* Review results of the divergence time analysis */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
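The confounding of rate and time can be seen directly in a toy calculation. The Python sketch below uses the Jukes-Cantor (JC69) model rather than the models used in this lab, purely because its formula is short: the probability that a site differs across an edge depends on rate and time only through their product. The function name is hypothetical.

```python
import math

def jc69_prob_different(rate, time):
    """Probability that a site differs between the two ends of an edge under JC69.
    Only the product rate * time (the expected number of substitutions) matters."""
    return 0.75 * (1.0 - math.exp(-4.0 * rate * time / 3.0))

fast_short = jc69_prob_different(rate=2.0, time=0.5)   # fast rate, short time
slow_long = jc69_prob_different(rate=0.5, time=2.0)    # slow rate, long time
# Both edges have rate * time = 1.0, so the two probabilities are identical
```

Because the likelihood sees only the product, any information that separates rate from time has to come from somewhere else, which is the role the priors play.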

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
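A possible rationale for the value 2.6, offered here as an assumption rather than something taken from the evolver documentation: in a pure birth process with k lineages, the waiting time to the next speciation event is exponential with mean 1/(k × birth rate). Starting with the 2 lineages at the root and stopping when the 21st lineage would arise, the expected height is the sum of those means for k = 2 to 20. The Python sketch below (hypothetical function name) evaluates that sum.

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root-to-tip height of a pure birth tree, assuming the tree runs from
    the root split (2 lineages) until the moment lineage n_tips + 1 would arise."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(20, 2.6)   # very close to 1
```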

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
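If you would rather generate the control file than edit it by hand, the nine lines described above can be assembled with a short Python function. This is only a convenience sketch: the function name <tt>evolver_control</tt> is hypothetical, the seed and the tree string in the example are placeholders, and you would still supply your own seed and the tree description from your first evolver run.

```python
def evolver_control(seed, tree_description, n_taxa=20, n_sites=10000,
                    n_datasets=1, kappa=5, freqs=(0.1, 0.2, 0.3, 0.4)):
    """Return the text of a 9-line control file laid out as described in the list above."""
    lines = [
        "2",                                   # line 1: nexus output
        str(seed),                             # line 2: random number seed
        f"{n_taxa} {n_sites} {n_datasets}",    # line 3: taxa, sites, data sets
        "-1",                                  # line 4: use branch lengths in the tree
        tree_description.strip(),              # line 5: tree description
        "4",                                   # line 6: HKY model
        str(kappa),                            # line 7: kappa
        "0 0",                                 # line 8: no rate heterogeneity
        " ".join(str(f) for f in freqs),       # line 9: frequencies in T C A G order
    ]
    return "\n".join(lines) + "\n"

# Placeholder seed and 3-taxon tree, for illustration only
text = evolver_control(12345, "((A:0.5,B:0.5):0.5,C:1.0);", n_taxa=3)
```

Writing <tt>text</tt> to a file named ''control.dat'' would give the same layout as the hand-edited version.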

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing <tt>mymodel</tt>, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
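
To make the idea of tuning during burnin concrete, here is a small Python sketch of one possible tuning rule for a sliding window move. The rule and the function name <tt>tune_delta</tt> are assumptions made for illustration; the exact rule RevBayes applies is not described in this lab. The only point is the direction of the adjustment: accepting too often means the move is too timid (widen the window), and accepting too rarely means it is too bold (narrow the window).

```python
def tune_delta(delta, n_accepted, n_tried, target=0.4):
    """Hypothetical tuning rule: compare the observed acceptance rate over
    one tuning interval to the target rate and rescale the window width."""
    rate = n_accepted / n_tried
    if rate > target:
        # accepting too often: make the move bolder (wider window)
        return delta * (1.0 + (rate - target) / (1.0 - target))
    # accepting too rarely (or exactly on target): make the move more conservative
    return delta / (2.0 - rate / target)

# One tuning interval of 100 iterations, as in burnin(..., tuningInterval=100)
print(tune_delta(1.0, n_accepted=80, n_tried=100))  # wider than 1.0
print(tune_delta(1.0, n_accepted=10, n_tried=100))  # narrower than 1.0
print(tune_delta(1.0, n_accepted=40, n_tried=100))  # unchanged at 1.0
```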

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
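
If you are curious what a 95% HPD interval actually is, the following Python sketch computes one from a list of sampled values by finding the shortest interval that contains 95% of the samples. This is a common way to approximate an HPD interval from MCMC output; it is offered as an illustration only and is not claimed to match Tracer's calculation in every detail. The simulated samples and the function name <tt>hpd_interval</tt> are made up for the example.

```python
import math
import random

def hpd_interval(samples, mass=0.95):
    """Return the shortest interval containing a fraction `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))          # number of samples inside the interval
    # slide a window of k consecutive sorted samples and keep the narrowest one
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

# Stand-in for a column of the log file: 2000 draws centered on 1.0
rng = random.Random(123)
fake_clock_rate = [rng.gauss(1.0, 0.006) for _ in range(2000)]
lower, upper = hpd_interval(fake_clock_rate)
print(round(lower, 4), round(upper, 4))
```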

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
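
The claim that a larger <tt>alpha</tt> yields smaller proposed changes can be checked with a short Python sketch. It assumes the proposal draws a new simplex from a Dirichlet distribution whose parameters are <tt>alpha</tt> times the current values; that is a common way to build such a move, and it is an assumption here rather than a statement about the RevBayes source code. The function names are made up for the sketch.

```python
import random

def dirichlet_proposal(current, alpha, rng):
    """Draw a new simplex from Dirichlet(alpha * current) via gamma variates."""
    draws = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(draws)
    return [d / total for d in draws]    # normalizing keeps the sum equal to 1

def mean_step(current, alpha, rng, n=500):
    """Average Euclidean distance between the current and proposed simplex."""
    total = 0.0
    for _ in range(n):
        new = dirichlet_proposal(current, alpha, rng)
        total += sum((a - b) ** 2 for a, b in zip(current, new)) ** 0.5
    return total / n

rng = random.Random(42)
freqs = [0.1, 0.2, 0.3, 0.4]             # the state frequencies used in the simulation
for alpha in (10.0, 100.0, 1000.0):
    # the average step shrinks as alpha grows
    print(alpha, round(mean_step(freqs, alpha, rng), 4))
```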

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 50 parameters (the original 10 plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
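
The conversion between the two scales uses the standard lognormal formulas: the mean on the linear scale is exp(mu + sigma^2/2), and the standard deviation is that mean times sqrt(exp(sigma^2) - 1). Here is a minimal Python sketch (the function name <tt>lognormal_mean_sd</tt> is made up for this example):

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Convert lognormal parameters (mean and sd of the log) into the
    mean and sd of the lognormally distributed variable itself."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

mean, sd = lognormal_mean_sd(0.0, 1.0)
print(round(mean, 4), round(sd, 4))   # prints 1.6487 2.1612
```

Note that mu = 0 does ''not'' give a mean of 1 on the linear scale unless sigma is close to 0.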

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us way away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
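
To see what these weights mean in practice, the Python sketch below tallies the weights of all the moves in ''divtime.Rev'' as described in this lab (one slide move each for birth_rate, ucln_mu, ucln_sigma, and the 38 branch rates; one simplex move each for exchangeabilities and state_freqs; and the four tree moves) and converts them into selection probabilities using weight divided by the sum of all weights. The dictionary keys are just labels for this sketch.

```python
# Weights as assigned in divtime.Rev (keys are labels for this sketch only)
weights = {
    "birth_rate": 1.0,
    "mvNodeTimeSlideUniform": 10.0,
    "mvNNI": 5.0,
    "mvNarrow": 5.0,
    "mvFNPR": 5.0,
    "ucln_mu": 1.0,
    "ucln_sigma": 1.0,
    "exchangeabilities": 1.0,
    "state_freqs": 1.0,
}
# One slide move per branch rate: 2*20 - 2 = 38 of them
for i in range(1, 39):
    weights[f"branch_rates[{i}]"] = 1.0

total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

print(total)                                       # 68.0
print(round(probs["mvNodeTimeSlideUniform"], 3))   # 0.147
branch_share = sum(p for name, p in probs.items() if name.startswith("branch_rates"))
print(round(branch_share, 3))                      # 0.559
```

So even with a weight of 10, the node time slider is chosen only about 15% of the time, because the 38 branch rate moves together account for more than half of all proposals.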

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer.

Also open the ''divtimeMAP.tre'' file in FigTree, check the Node Bars checkbox, then specify age_95%_HPD for Display after expanding the Node Bars section. You may also wish to use File > New... from the FigTree main menu to open up a new window, then paste the tree description from ''tree.txt'' into the new window for comparison. It also helps to expand the Trees section of FigTree and check the Order nodes checkbox so that both trees are ''ladderized'' in the same direction.

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''Does the MAP tree have the same topology as the true tree?'' {{title|yes, it did in my case anyway|answer}}
* ''Are we more confident about recent nodes or ancient nodes in the tree?'' {{title|the credible intervals for recent nodes are smaller, so we are more confident in those|answer}}
* ''What would you conclude if these credible intervals were exactly the same size if we had performed the MCMC analysis on the prior only, with no data?'' {{title|we must conclude that there is no information in the data about divergence times; we would hope that the credible intervals would be smaller when the data is used|answer}}
</div>

== Obtaining credible intervals under the prior ==

This has been a long lab, but there is one more thing I want you to try before you go. Let's see what the credible interval sizes are under the prior. We should not change the tree topology this time, as the prior on tree topology is flat across all possible tree topologies, so we will end up with a star tree if we allow the topology to be modified.

You should copy your ''divtime.Rev'' file to create a new file named ''divprior.Rev'' and change all the file names to have prefix ''divprior'' so as not to overwrite your previous results, but other than that, the only thing that needs to be changed is the mcmc command:

Phylogenetics: RevBayes Lab

2020-04-21T15:25:31Z

Paul Lewis: /* Divergence times */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
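
The confounding of rate and time is easy to demonstrate numerically. The Python sketch below uses the Jukes-Cantor (JC69) model, chosen here only because its formula is simple (the lab itself uses HKY and GTR), to compute the expected proportion of sites that differ between two sequences. Only the product of rate and time enters the formula, so very different rate/time combinations predict exactly the same data.

```python
import math

def jc69_expected_diff(rate, time):
    """Expected proportion of differing sites under JC69 for two sequences
    separated by evolutionary distance d = rate * time."""
    d = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Three scenarios with the same product rate * time = 0.2
for rate, time in [(1.0, 0.2), (2.0, 0.1), (0.5, 0.4)]:
    print(rate, time, round(jc69_expected_diff(rate, time), 6))
```

All three lines report the same expected proportion of differences, which is why the sequence data alone cannot tell these scenarios apart.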

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
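
Here is one way to see why a birth rate of about 2.6 goes with an expected height of 1 for 20 taxa. In a pure birth process, while k lineages exist the waiting time to the next speciation event is exponentially distributed with rate k times the birth rate, so its expected length is 1/(k*lambda). The Python sketch below sums these expected waiting times starting from the 2 lineages at the root and including the interval during which all 20 lineages exist. Which intervals to include is a convention assumed for this sketch; the lab does not say how the value 2.6 was derived.

```python
def expected_yule_height(n_tips, birth_rate):
    """Sum of expected waiting times 1/(k * birth_rate) for k = 2, ..., n_tips
    lineages in a pure birth (Yule) process starting from the root split."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

print(round(expected_yule_height(20, 2.6), 3))   # prints 0.999
```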

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, <tt>diversification</tt> will show up as a column in the output even though it is not a parameter of the model itself. (The <tt>diversification</tt> node was only added here to illustrate deterministic nodes; its value will always equal <tt>birth_rate</tt> because <tt>death_rate</tt> is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing <tt>mymodel</tt>, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.
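
To make the tuning step concrete, here is a small Python sketch of one possible tuning rule. The function <tt>tune_delta</tt> and its formula are assumptions made for illustration, and the rule RevBayes actually applies may differ. It only captures the idea described above: a move that is accepted more often than the target is made bolder (wider window), and a move that is accepted less often is made more conservative (narrower window).

```python
def tune_delta(delta, n_accepted, n_tried, target=0.4):
    """Return an adjusted window width based on the acceptance rate since the last tuning."""
    if n_tried == 0:
        return delta                      # nothing to learn from yet
    rate = n_accepted / n_tried
    if rate > target:
        # Accepting too often: widen the window (at most doubling it)
        return delta * (1.0 + (rate - target) / (1.0 - target))
    # Accepting too rarely (or exactly on target): narrow the window (at most halving it)
    return delta / (2.0 - rate / target)

# Acceptance counts from three hypothetical 100-iteration tuning intervals
for accepted in (90, 40, 5):
    print(accepted, tune_delta(1.0, accepted, 100))
```

With a tuning interval of 100 and a burnin of 1000 iterations, a rule like this would be applied 10 times to each move before sampling begins.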

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
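
If you want to check your reading of the exchangeability densities, the following Python sketch works out what the six GTR exchangeabilities should look like when the data were simulated under HKY with kappa = 5. It assumes the exchangeabilities are normalized to sum to 1 (as they must be under a Dirichlet prior); the function name and the two-letter labels are made up for this example.

```python
def hky_exchangeabilities(kappa):
    """Normalized GTR exchangeabilities implied by an HKY model with the given kappa."""
    # Transitions (A<->G and C<->T) have relative rate kappa; transversions have rate 1
    raw = {"AC": 1.0, "AG": kappa, "AT": 1.0, "CG": 1.0, "CT": kappa, "GT": 1.0}
    total = sum(raw.values())
    return {pair: rate / total for pair, rate in raw.items()}

expected = hky_exchangeabilities(5.0)
for pair, value in expected.items():
    print(pair, round(value, 4))
```

The ratio of any transition value to any transversion value in this table equals kappa.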

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these parameters are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
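
The effect of alpha on the size of proposed changes can be seen in a small Python simulation. This sketch assumes the proposal is a draw from a Dirichlet distribution whose parameters are alpha times the current values; that is my understanding of how a move like this typically works, but treat it as an assumption rather than a description of the RevBayes source code. All function names are made up for this example.

```python
import random

def dirichlet_draw(params, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def simplex_proposal(current, alpha, rng):
    """Assumed proposal: Dirichlet centered on the current simplex, sharpened by alpha."""
    return dirichlet_draw([alpha * x for x in current], rng)

def mean_step(current, alpha, rng, n=2000):
    """Average total absolute change between the current and proposed simplex."""
    total = 0.0
    for _ in range(n):
        proposed = simplex_proposal(current, alpha, rng)
        total += sum(abs(p - c) for p, c in zip(proposed, current))
    return total / n

freqs = [0.1, 0.2, 0.3, 0.4]
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    print(alpha, round(mean_step(freqs, alpha, random.Random(7)), 4))
```

Every proposal still sums to 1, but the typical step shrinks as alpha grows, which is why a larger alpha would raise the acceptance rate for sharply peaked posteriors.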

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This distinguishes the approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
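
The conversion from mu and sigma (on the log scale) to the mean and standard deviation of the lognormally-distributed variable itself uses the standard lognormal formulas, mean = exp(mu + sigma^2/2) and variance = (exp(sigma^2) - 1) exp(2 mu + sigma^2). Here is a short Python sketch; the function name is made up for this example.

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation on the linear scale, given mu and sigma on the log scale."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    sd = mean * math.sqrt(math.expm1(sigma ** 2))   # expm1(x) = exp(x) - 1
    return mean, sd

# Example: mu = 0 and sigma = 1 on the log scale
mean, sd = lognormal_mean_sd(0.0, 1.0)
print(mean, sd)
```

Note that even with mu = 0 the mean on the linear scale is greater than 1, because the lognormal distribution is skewed to the right.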

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far from a reasonable constellation of parameter values, and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
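
To see what these weights mean in practice, the Python sketch below converts the weights of all moves in ''divtime.Rev'' into selection probabilities, using the rule that a move is chosen with probability equal to its weight divided by the sum of all weights. The move labels are just descriptive strings invented for this example, and the list assumes you built the script exactly as described (so the clock_rate move is gone and there are 38 branch rate moves).

```python
def move_probabilities(weights):
    """Probability of choosing each move: its weight divided by the sum of all weights."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

weights = {
    "birth_rate slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}
# One sliding window move per branch rate (2*20 - 2 = 38 of them)
for i in range(1, 39):
    weights["branch_rates[%d] slide" % i] = 1.0

probs = move_probabilities(weights)
print("total weight:", sum(weights.values()))
print("P(NNI):", probs["NNI"])
print("P(node time slide):", probs["node time slide"])
print("P(ucln_mu slide):", probs["ucln_mu slide"])
```

Even with the extra weight, each topology move is chosen on only a small fraction of iterations because there are so many branch rate moves competing for attention.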

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Add the following to the very end of the divtime.Rev file. This will read all the sampled trees (each will be different this time because we added moves to modify tree topology and node times) and create a consensus tree showing 95% credible intervals around each divergence time:
# Summarize divergence times

tt = readTreeTrace("output/divtime.trees", "clock")
tt.summarize()
mapTree(tt, "output/divtimeMAP.tre")

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a little longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''2nd question'' {{title|xxx|answer}}
* ''3rd question'' {{title|xxx|answer}}
* ''4th question'' {{title|xxx|answer}}
* ''5th question'' {{title|xxx|answer}}
</div>

Phylogenetics: RevBayes Lab

2020-04-21T14:56:32Z

Paul Lewis: /* Divergence times */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
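
The confounding of rate and time is easy to demonstrate numerically. The Python sketch below uses the Jukes-Cantor (JC69) model purely because its formula is simple (it is not the model used in this lab), and the function name is made up for this example. The probability that two sequences separated by an edge show the same base at a site depends only on the product of rate and time.

```python
import math

def jc69_prob_same(rate, time):
    """JC69 probability that a site has the same state at both ends of an edge."""
    distance = rate * time                     # expected substitutions per site
    return 0.25 + 0.75 * math.exp(-4.0 * distance / 3.0)

# Three different (rate, time) scenarios that share the same product, 0.5
for rate, time in ((1.0, 0.5), (0.5, 1.0), (2.0, 0.25)):
    print(rate, time, jc69_prob_same(rate, time))
```

All three scenarios give the same probability, so the likelihood alone cannot tell them apart; only the prior can favor one scenario over another.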

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
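
Here is one way to arrive at a number close to 2.6, sketched in Python. It rests on assumptions of mine rather than anything stated in the PAML documentation: the tree starts with 2 lineages at the root, the waiting time while there are k lineages is exponential with mean 1/(k × birth rate), and the tree is observed until the instant the 21st lineage would have appeared. The function name and the <tt>include_final_interval</tt> switch are made up for this example.

```python
def expected_yule_height(n_taxa, birth_rate, include_final_interval=True):
    """Expected root age of a pure birth tree, summing the mean waiting time at each lineage count."""
    last = n_taxa if include_final_interval else n_taxa - 1
    # With k lineages, the time to the next speciation has mean 1 / (k * birth_rate)
    return sum(1.0 / (k * birth_rate) for k in range(2, last + 1))

print(expected_yule_height(20, 2.6))                                  # close to 1
print(expected_yule_height(20, 2.6, include_final_interval=False))   # a bit less than 1
```

Under the first convention the expected height is within 0.1% of 1.0; if the tree were instead stopped at the moment the 20th lineage arises, a birth rate nearer 2.55 would be needed.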

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
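
If you would rather generate the control file than type it, the following Python sketch assembles the same 9 lines described above. The function name, its arguments, and the 3-taxon tree in the usage example are hypothetical; in the lab you would pass 20 taxa and the tree description from your own ''tree.txt'' file.

```python
def evolver_control_text(seed, tree_description, n_taxa=20, n_sites=10000,
                         kappa=5, freqs=(0.1, 0.2, 0.3, 0.4)):
    """Return the contents of an evolver control file matching the 9 lines described above."""
    lines = [
        "2",                                   # line 1: nexus output
        str(seed),                             # line 2: random number seed
        "%d %d 1" % (n_taxa, n_sites),         # line 3: taxa, sites, number of data sets
        "-1",                                  # line 4: use branch lengths in the tree
        tree_description,                      # line 5: tree description
        "4",                                   # line 6: HKY model
        str(kappa),                            # line 7: kappa
        "0 0",                                 # line 8: no rate heterogeneity
        " ".join(str(f) for f in freqs),       # line 9: frequencies in T, C, A, G order
    ]
    return "\n".join(lines) + "\n"

# Made-up 3-taxon tree used only to show the output format
text = evolver_control_text(12345, "((A:0.5,B:0.5):0.5,C:1.0);", n_taxa=3)
print(text)
```

To save the result, write <tt>text</tt> to a file named ''control.dat'' in the same directory where you will run evolver.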

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.
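
The difference between the two forms of increment can be mimicked in Python, which has no <tt>++</tt> operator. In this sketch a dictionary stands in for a vector whose indices start at 1, and the two helper functions (names made up for this example) imitate <tt>moves[nmoves++]</tt> and <tt>moves[++nmoves]</tt>.

```python
def add_post_increment(moves, counter, move):
    """Imitates moves[nmoves++] = move: use the counter as the index, then increment it."""
    moves[counter] = move
    return counter + 1

def add_pre_increment(moves, counter, move):
    """Imitates moves[++nmoves] = move: increment the counter, then use it as the index."""
    counter += 1
    moves[counter] = move
    return counter

post, n = {}, 1
n = add_post_increment(post, n, "birth_rate move")
n = add_post_increment(post, n, "clock_rate move")
print(sorted(post))    # indices actually filled

pre, n = {}, 1
n = add_pre_increment(pre, n, "birth_rate move")
n = add_pre_increment(pre, n, "clock_rate move")
print(sorted(pre))     # index 1 is never filled
```

With the counter starting at 1, the post-increment form fills indices 1 and 2, whereas the pre-increment form skips index 1.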

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing <tt>mymodel</tt>, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
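
The tuning that happens during burnin can be pictured with a toy sampler. The following Python sketch is not RevBayes code: the target (a standard normal), the function names, and the particular tuning rule are all assumptions chosen for illustration. It only shows the general idea of measuring the acceptance rate over 100 iterations and then widening or narrowing a sliding window to approach a 40% target.

```python
import math
import random

def tune_delta(delta, accept_rate, target=0.4):
    """Hypothetical tuning rule: widen the window when too many proposals
    are accepted, narrow it when too few are."""
    if accept_rate > target:
        return delta * (1.0 + (accept_rate - target) / (1.0 - target))
    return delta / (2.0 - accept_rate / target)

def burnin(generations=1000, tuning_interval=100, delta=1.0, seed=1):
    """Toy Metropolis sampler for a standard normal target that retunes
    its sliding window every tuning_interval iterations."""
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    rate = 0.0
    for g in range(1, generations + 1):
        # Propose uniformly within a window of width delta centered on x
        x_new = x + delta * (rng.random() - 0.5)
        # Log of the Metropolis acceptance ratio for a N(0,1) target
        log_ratio = 0.5 * (x * x - x_new * x_new)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = x_new
            accepted += 1
        if g % tuning_interval == 0:
            # Acceptance rate over the interval just finished
            rate = accepted / tuning_interval
            delta = tune_delta(delta, rate)
            accepted = 0
    return delta, rate

final_delta, last_rate = burnin()
print(f"window width after burnin: {final_delta:.2f}, "
      f"last acceptance rate: {last_rate:.2f}")
```

Starting from a window that is too narrow for this target, nearly every proposal is accepted, so the rule keeps widening the window until the acceptance rate drifts down toward the target.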

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306, 4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
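
One way to judge whether the exchangeability densities "make sense" is to work out what the simulation model implies. The data were simulated under HKY with kappa = 5, and the exchangeabilities in the GTR model live on a simplex (they sum to 1). The Python sketch below (the function name is made up for this illustration) computes the six values you would expect under those two assumptions.

```python
def hky_exchangeabilities(kappa):
    """Return the six GTR exchangeabilities implied by HKY with the given
    kappa, normalized to sum to 1, keyed by the pair of bases involved."""
    transitions = {"AG", "CT"}
    raw = {pair: (kappa if pair in transitions else 1.0)
           for pair in ("AC", "AG", "AT", "CG", "CT", "GT")}
    total = sum(raw.values())
    return {pair: value / total for pair, value in raw.items()}

rates = hky_exchangeabilities(kappa=5.0)
for pair, value in rates.items():
    print(pair, round(value, 4))
```

The two transitions each get 5/14 of the total and the four transversions each get 1/14, which is the pattern to look for in Tracer.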

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these parameters are being estimated quite precisely and accurately. What gives? The problem is that the data contain so much information about these parameters. Their posterior densities are very "sharp" (low variance), so proposals that move the values very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
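
The conversion from the log scale to the linear scale uses the standard formulas for the lognormal distribution: the mean is exp(mu + sigma^2/2) and the variance is (exp(sigma^2) - 1) times the squared mean. Here is a small Python sketch of that conversion (the function name is made up for this illustration; it is not part of RevBayes or Tracer).

```python
import math

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of X when log(X) ~ Normal(mu, sigma)."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd

mean, sd = lognormal_mean_sd(mu=0.0, sigma=1.0)
print(f"mean = {mean:.4f}, sd = {sd:.4f}")
```

Note that even with mu = 0 the mean on the linear scale is greater than 1, because the lognormal distribution has a long right tail.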

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values, and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence because they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
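
To see what these weights mean in practice, here is a Python sketch that turns the weights in ''divtime.Rev'' into selection probabilities. It assumes, following the description of the weight argument given for the strict clock moves in this lab, that a move is chosen with probability equal to its weight divided by the sum of all weights. The function name and the labels are made up for this illustration.

```python
def move_probabilities(weights):
    """Convert move weights into per-iteration selection probabilities."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Weights used in divtime.Rev; the 38 branch-rate sliders are lumped together
weights = {
    "birth_rate slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "branch_rates slides (38 x 1.0)": 38.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}
probabilities = move_probabilities(weights)
for name, p in probabilities.items():
    print(f"{name:32s} {p:.3f}")
```

Even with the extra weight, the tree moves together account for only 25 of the 68 units of weight, because there are so many branch rate sliders.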

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

Be prepared to wait for a while longer this time; we've added a lot of extra work to the analysis.

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''2nd question'' {{title|xxx|answer}}
* ''3rd question'' {{title|xxx|answer}}
* ''4th question'' {{title|xxx|answer}}
* ''5th question'' {{title|xxx|answer}}
</div>

Phylogenetics: RevBayes Lab

2020-04-21T14:51:27Z

Paul Lewis: /* Review results of the divergence time analysis */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
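
The confounding of rate and time is easy to demonstrate numerically. The Python sketch below uses the Jukes-Cantor model as a simple stand-in (it is not the model we simulate under in this lab, and the function name is made up) to show that the probability of the data along an edge depends only on the product of rate and time.

```python
import math

def jc69_prob_same(rate, time):
    """Probability that a site shows the same base at both ends of an edge
    under the Jukes-Cantor model, given a substitution rate and a duration."""
    expected_substitutions = rate * time
    return 0.25 + 0.75 * math.exp(-4.0 * expected_substitutions / 3.0)

# Two very different scenarios with the same product rate * time = 0.5
fast_and_young = jc69_prob_same(rate=2.0, time=0.25)
slow_and_old = jc69_prob_same(rate=0.5, time=1.0)
print(fast_and_young, slow_and_old)
```

The two printed probabilities are identical, so the likelihood alone cannot tell a fast rate over a short time from a slow rate over a long time.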

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
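
Here is one way to see where a value like 2.6 could come from. The Python sketch below assumes that the tree starts with 2 lineages at the root and is stopped just before the 21st lineage would arise, so the height is the sum of the waiting times with k = 2, 3, ..., 20 lineages, each having expected value 1/(k * birth rate). Both the stopping convention and the function names are assumptions made for this illustration.

```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected root-to-tip height of a pure birth tree, assuming the process
    starts with 2 lineages at the root and is stopped just before the
    (n_taxa + 1)th lineage would arise."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))

def birth_rate_for_unit_height(n_taxa):
    """Birth rate at which the expected height above equals 1."""
    return sum(1.0 / k for k in range(2, n_taxa + 1))

print(round(birth_rate_for_unit_height(20), 4))
print(round(expected_yule_height(20, 2.6), 4))
```

Under these assumptions the required birth rate for 20 taxa comes out very close to 2.6.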

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
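
If you would rather not edit the control file by hand, the nine lines above can be assembled with a few lines of Python. This is only a convenience sketch: the function name is made up, and the seed and tree string shown are placeholders that you would replace with your own seed and the contents of ''tree.txt''.

```python
def evolver_control_file(seed, newick_tree):
    """Assemble the nine lines of control.dat described above."""
    lines = [
        "2",                 # output format: nexus
        str(seed),           # random number seed
        "20 10000 1",        # taxa, sites, data sets
        "-1",                # use branch lengths from the tree description
        newick_tree,         # tree description from the first evolver run
        "4",                 # HKY model
        "5",                 # kappa
        "0 0",               # gamma shape, number of rate categories
        "0.1 0.2 0.3 0.4",   # frequencies of T, C, A, G
    ]
    return "\n".join(lines) + "\n"

# Placeholder inputs: substitute your own seed and the contents of tree.txt
text = evolver_control_file(seed=12345,
                            newick_tree="(paste tree description here);")
print(text, end="")
```

Redirecting the printed text to a file named ''control.dat'' gives the same result as typing the lines in an editor.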

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is supplied by <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
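
For those who would like a reminder of what a sliding window proposal looks like, here is a minimal Python sketch. It is not the RevBayes implementation of <tt>mvSlide</tt>: the function name is made up, and reflecting negative proposals back across zero is an assumption added here to keep a rate parameter positive.

```python
import random

def slide_proposal(current, delta, rng):
    """Propose a new value uniformly from a window of width delta centered
    on the current value, reflecting at 0 to keep a rate positive."""
    proposed = current + delta * (rng.random() - 0.5)
    return abs(proposed)

rng = random.Random(42)
proposals = [slide_proposal(1.0, 1.0, rng) for _ in range(5)]
print([round(p, 3) for p in proposals])
```

With a current value of 1.0 and delta of 1.0, every proposal falls between 0.5 and 1.5. A larger delta makes the move bolder, and a smaller delta makes it more conservative.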

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.
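
To get a feel for how a simplex proposal can work, here is a Python sketch that proposes a new set of frequencies from a Dirichlet distribution whose parameters are the current values multiplied by a tuning parameter alpha. This is an assumption about the general approach, not a copy of the RevBayes code, and the function names are made up for this illustration.

```python
import random

def dirichlet_draw(params, rng):
    """Draw a point on the simplex from a Dirichlet distribution by
    normalizing independent gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def simplex_proposal(current, alpha, rng):
    """Propose a new simplex centered on the current one; larger alpha
    concentrates proposals closer to the current value."""
    return dirichlet_draw([alpha * x for x in current], rng)

def mean_step(current, alpha, rng, n=2000):
    """Average absolute change in the first coordinate over n proposals."""
    return sum(abs(simplex_proposal(current, alpha, rng)[0] - current[0])
               for _ in range(n)) / n

rng = random.Random(7)
freqs = [0.1, 0.2, 0.3, 0.4]
for alpha in (10.0, 100.0, 1000.0):
    print(alpha, round(mean_step(freqs, alpha, rng), 4))
```

Every proposal still sums to 1, and the average step size shrinks as alpha grows, so in this sketch a larger alpha means a more conservative move.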

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306, 4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these parameters are being estimated quite precisely and accurately. What gives? The problem is that the data contain so much information about these parameters. Their densities are very "sharp" (low variance), so proposals that move the values very far from the optimum are rejected. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus clock_rate, which the edge rates replace, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).
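
To get a feel for how wild those default starting values could be, here is a small Python sketch (not RevBayes code, and not a description of RevBayes internals). It assumes <tt>dnNormal(0.0, 100)</tt> means mean 0 and standard deviation 100, and that <tt>dnExponential(.01)</tt> has rate 0.01 (mean 100). It draws the two hyperparameters, then one branch rate, and works on the log scale to avoid numerical overflow.

```python
import math
import random

def draw_start_log10_rate(rng):
    """Draw both hyperparameters from their hyperpriors, then one branch
    rate, and return log10(rate) (the log scale avoids overflow)."""
    ucln_mu = rng.gauss(0.0, 100.0)       # assumed: Normal(mean 0, sd 100)
    ucln_sigma = rng.expovariate(0.01)    # assumed: Exponential(rate 0.01)
    # If rate ~ Lognormal(mu, sigma), then log(rate) ~ Normal(mu, sigma)
    log_rate = rng.gauss(ucln_mu, ucln_sigma)
    return log_rate / math.log(10.0)

rng = random.Random(7)
draws = [draw_start_log10_rate(rng) for _ in range(1000)]

# Count starting rates that are more than 10-fold away from the true rate 1.0
far_off = sum(1 for d in draws if abs(d) > 1.0)
print(f"{far_off} of 1000 starting rates are more than 10-fold away from 1.0")
```

Nearly all of the simulated starting rates land orders of magnitude away from 1.0, which is why setting the starting value by hand matters here.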

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
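
If you want to see what these weights mean as probabilities, the following Python sketch tallies them. It assumes the script contains exactly the moves defined in this lab (one each for birth_rate, ucln_mu, ucln_sigma, exchangeabilities, and state_freqs, 38 branch rate sliders, and the 4 tree moves above), and that a move is chosen with probability equal to its weight divided by the sum of all weights. The dictionary keys are just labels I made up for this example.

```python
# Weights of every move in divtime.Rev (labels are for illustration only)
weights = {
    "birth_rate": 1.0,
    "ucln_mu": 1.0,
    "ucln_sigma": 1.0,
    "exchangeabilities": 1.0,
    "state_freqs": 1.0,
    "mvNodeTimeSlideUniform": 10.0,
    "mvNNI": 5.0,
    "mvNarrow": 5.0,
    "mvFNPR": 5.0,
}
# One slider per branch rate, each with weight 1 (2*20 - 2 = 38 branches)
for i in range(1, 39):
    weights[f"branch_rates[{i}]"] = 1.0

total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

tree_moves = ["mvNodeTimeSlideUniform", "mvNNI", "mvNarrow", "mvFNPR"]
tree_share = sum(probs[m] for m in tree_moves)

print(f"sum of weights      = {total}")
print(f"P(node time slider) = {probs['mvNodeTimeSlideUniform']:.3f}")
print(f"P(any tree move)    = {tree_share:.3f}")
```

The weights sum to 68, so the four tree moves together account for 25/68 of all attempted moves even though they are only 4 of the 47 moves defined.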

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''We've added moves, but no parameters or priors for the node times. Why not?'' {{title|the prior for node times is provided by dnBDP, the birth-death process distribution, which has one hyperparameter birth_rate, so the parameters and priors for node times have been there all along!|answer}}
* ''2nd question'' {{title|xxx|answer}}
* ''3rd question'' {{title|xxx|answer}}
* ''4th question'' {{title|xxx|answer}}
* ''5th question'' {{title|xxx|answer}}
</div>

Phylogenetics: RevBayes Lab

2020-04-21T14:46:04Z

Paul Lewis:

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
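
Here is a tiny Python illustration of why rates and times cannot be separated by the likelihood alone. It uses the Jukes-Cantor (JC69) model purely for simplicity (that is my choice for this sketch, not the model used later in the lab): the expected proportion of differing sites depends only on the product of rate and time.

```python
import math

def jc69_p_diff(rate, time):
    """Expected proportion of sites that differ between two sequences
    under JC69, given a substitution rate and an elapsed time."""
    d = rate * time   # expected number of substitutions per site
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

fast_young = jc69_p_diff(rate=2.0, time=0.5)   # fast rate, short time
slow_old = jc69_p_diff(rate=0.5, time=2.0)     # slow rate, long time

print(fast_young, slow_old)
print("identical:", fast_young == slow_old)
```

Both scenarios predict exactly the same data, so only prior information can favor one over the other.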

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue. PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
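
One way to check that 2.6 is a sensible choice is sketched below in Python. The assumption here (mine, for illustration) is that tree height is measured from the root split, when there are 2 lineages, until the moment a 21st lineage would arise, and that the waiting time while there are k lineages is exponential with mean 1/(k × birth rate).

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected time from the root split (2 lineages) until lineage
    n_tips + 1 would appear under a pure birth (Yule) process."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

height = expected_yule_height(n_tips=20, birth_rate=2.6)
print(f"expected height = {height:.4f}")
```

Under that assumption the expected height for 20 taxa and birth rate 2.6 comes out within 1% of 1.0.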

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0).

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).
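
To make the relationship concrete, here is a Python sketch of how a GTR rate matrix can be assembled from these two ingredients. It is not the code behind <tt>fnGTR</tt>; I am assuming the state order A, C, G, T, the exchangeability order AC, AG, AT, CG, CT, GT, and a normalization in which the average rate is 1. The example values are the HKY special case (kappa = 5) used to simulate our data.

```python
def gtr_rate_matrix(exchangeabilities, state_freqs):
    """Build a GTR rate matrix from 6 exchangeabilities (assumed order
    AC, AG, AT, CG, CT, GT) and 4 state frequencies (assumed order
    A, C, G, T), scaled so the average rate at stationarity is 1."""
    pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    q = [[0.0] * 4 for _ in range(4)]
    for (i, j), r in zip(pairs, exchangeabilities):
        q[i][j] = r * state_freqs[j]   # rate of change from state i to j
        q[j][i] = r * state_freqs[i]   # rate of change from state j to i
    for i in range(4):
        q[i][i] = -sum(q[i])           # diagonal makes each row sum to 0
    mean_rate = -sum(state_freqs[i] * q[i][i] for i in range(4))
    return [[v / mean_rate for v in row] for row in q]

# HKY with kappa = 5: transitions (AG, CT) are 5 times faster
kappa = 5.0
raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
exch = [r / sum(raw) for r in raw]     # normalized to sum to 1
freqs = [0.3, 0.2, 0.4, 0.1]           # A, C, G, T

Q = gtr_rate_matrix(exch, freqs)
for row in Q:
    print(" ".join(f"{v:8.4f}" for v in row))
```

Each row sums to zero, and the matrix is time reversible: the frequency of state i times the rate from i to j equals the frequency of state j times the rate from j to i.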

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
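
If the idea of tuning during burnin seems abstract, the Python sketch below mimics it for a single parameter updated with a sliding window move. Everything here is a stand-in chosen for illustration: the target density is a standard normal rather than a real posterior, and the <tt>tune</tt> rule is a hypothetical one that widens the window when too many proposals are accepted and narrows it when too few are. It is not meant to reproduce RevBayes' own tuning code.

```python
import math
import random

def log_target(x):
    """Toy log density (standard normal) standing in for the log posterior."""
    return -0.5 * x * x

def run_chain(x, delta, n_iter, rng):
    """Run n_iter sliding window Metropolis steps; return the final state
    and the fraction of proposals that were accepted."""
    accepted = 0
    for _ in range(n_iter):
        proposal = x + delta * (rng.random() - 0.5)  # window of width delta
        log_ratio = log_target(proposal) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
            accepted += 1
    return x, accepted / n_iter

def tune(delta, rate, target):
    """Hypothetical tuning rule: bolder if accepting too often,
    more conservative if accepting too rarely."""
    if rate > target:
        return delta * (1.0 + (rate - target) / (1.0 - target))
    return delta / (2.0 - rate / target)

rng = random.Random(2020)
x, delta, target = 0.0, 0.1, 0.4
initial_delta = delta

# Burnin: 1000 iterations, stopping to tune every 100
for _ in range(10):
    x, rate = run_chain(x, delta, 100, rng)
    delta = tune(delta, rate, target)

# Sampling: 10000 iterations with the window width now fixed
x, final_rate = run_chain(x, delta, 10000, rng)
print(f"tuned delta = {delta:.2f}, acceptance rate = {final_rate:.2f}")
```

Starting from a window that is far too narrow, ten rounds of tuning should bring the acceptance rate much closer to the 40% target than the initial setting would.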

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306, 4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these parameters are being estimated quite precisely and accurately. What gives? The problem is that the data contain so much information about these parameters. Their densities are very "sharp" (low variance), so proposals that move the values very far from the optimum are rejected. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
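
The effect of alpha on step size can be seen in the Python sketch below. I am assuming the simplex proposal draws new values from a Dirichlet distribution whose parameters are alpha times the current values; treat that as an illustration of the general idea rather than a statement about RevBayes internals. The accept/reject step is left out because only the size of the proposed change matters here.

```python
import random

def dirichlet_simplex_proposal(current, alpha, rng):
    """Propose a new simplex from Dirichlet(alpha * current), built from
    independent gamma draws that are normalized to sum to 1."""
    draws = [rng.gammavariate(alpha * c, 1.0) for c in current]
    total = sum(draws)
    return [d / total for d in draws]

def mean_step(current, alpha, rng, n=2000):
    """Average (over n proposals) of the largest change in any component."""
    steps = []
    for _ in range(n):
        new = dirichlet_simplex_proposal(current, alpha, rng)
        steps.append(max(abs(a - b) for a, b in zip(new, current)))
    return sum(steps) / n

rng = random.Random(42)
freqs = [0.3, 0.2, 0.4, 0.1]
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    print(f"alpha = {alpha:7.0f}  mean step = {mean_step(freqs, alpha, rng):.4f}")
```

Every proposal still sums to 1, and the typical step shrinks as alpha grows, which is what a very sharp posterior needs.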

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model. Because the edge rates replace clock_rate, we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10 minus clock_rate, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!
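The conversion between the two scales uses the standard lognormal moment formulas: the mean on the linear scale is exp(mu + sigma^2/2), and the standard deviation is that mean times sqrt(exp(sigma^2) - 1). Here is a minimal Python sketch of the conversion (Python is used only for illustration, and the function name is made up for this example; it is not part of RevBayes):

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Convert lognormal parameters (mean and sd of the log of the variable)
    into the mean and sd of the variable itself."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


# With mu = 0 and sigma = 1 the linear-scale mean is not exp(0) = 1:
# the sigma^2/2 term pushes it above 1.
mean, sd = lognormal_mean_sd(0.0, 1.0)
print("mean = %.4f, sd = %.4f" % (mean, sd))
```

When sigma is close to 0, the mean is approximately exp(mu) and the standard deviation is approximately sigma times that mean, which is why a lognormal with mu near 0 and a tiny sigma describes rates that are all very close to 1.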

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values, and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
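To see what these weights mean in practice, here is a small Python sketch (for illustration only) that turns the weights in ''divtime.Rev'' into selection probabilities, following the rule described for the strict clock moves: each move is chosen with probability equal to its weight divided by the sum of all weights. The dictionary keys are just labels invented for this example, and the move list assumes the script contains exactly the moves defined in this lab.

```python
def move_probabilities(weights):
    """Return each move's selection probability: weight / sum of all weights."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Weights as defined in divtime.Rev (labels are hypothetical, not RevBayes names)
weights = {
    "birth_rate slide": 1.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
}
# One slide move (weight 1) for each of the 38 branch rates
for i in range(1, 39):
    weights["branch_rates[%d] slide" % i] = 1.0

probs = move_probabilities(weights)
for name in ("node time slide", "NNI", "birth_rate slide"):
    print("%-20s %.3f" % (name, probs[name]))
```

The weights sum to 68, so the node time slider is chosen with probability 10/68 (about 0.147), each topology move with probability 5/68, and each remaining move with probability 1/68.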

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

Be sure to change your output files to have the prefix "divtime" rather than "relaxed" so that you don't overwrite the previous results, then run the new model:
rb divtime.Rev

== Review results of the divergence time analysis ==

Open the ''divtime.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''1st question'' {{title|xxx|answer}}
* ''2nd question'' {{title|xxx|answer}}
* ''3rd question'' {{title|xxx|answer}}
* ''4th question'' {{title|xxx|answer}}
* ''5th question'' {{title|xxx|answer}}
</div>

Phylogenetics: RevBayes Lab

2020-04-21T14:42:27Z

Paul Lewis: /* Relaxed clocks */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
=== Login to Xanadu ===

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and RevBayes modules:
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

=== Create a directory ===
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
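The confounding of rate and time is easy to demonstrate numerically. The Python sketch below (for illustration only) uses the Jukes-Cantor (JC69) model, chosen purely for simplicity (it is not the model used in this lab), to compute the probability that a site differs between the two ends of an edge. The function name is made up for this example.

```python
import math


def jc69_prob_different(rate, time):
    """Probability that a site differs across an edge under JC69.
    Only the product rate * time (expected substitutions per site) matters."""
    d = rate * time
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# A fast rate over a short time and a slow rate over a long time
# produce exactly the same probability, hence the same likelihood.
print(jc69_prob_different(1.0, 1.0))
print(jc69_prob_different(2.0, 0.5))
print(jc69_prob_different(0.5, 2.0))
```

All three calls give the same value because rate and time enter only through their product, so the data alone cannot tell these scenarios apart.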

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue: PAML's evolver program scales the tree to have height equal to the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
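One way to see that a birth rate near 2.6 gives an expected height near 1 for 20 taxa is to add up the expected waiting times in a pure birth process: while k lineages exist, the time to the next speciation event is exponential with rate k times the birth rate. The Python sketch below (for illustration only) assumes the height runs from the root (2 lineages) through the end of the interval with 20 lineages; this accounting is an assumption of the example and is not necessarily how the value 2.6 was originally obtained.

```python
def expected_yule_height(n_taxa, birth_rate):
    """Expected root-to-tip height of a pure birth tree, summing the mean
    waiting time 1/(k * birth_rate) for k = 2, ..., n_taxa lineages."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_taxa + 1))


print(expected_yule_height(20, 2.6))  # close to 1
print(expected_yule_height(20, 1.0))  # the birth rate that would give height 1
```

The second call returns the sum 1/2 + 1/3 + ... + 1/20, which is approximately 2.6.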

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge), all of which are fixed except for lambda, which is set to <tt>birth_rate</tt>.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.
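As a reminder of how a sliding window proposal works, here is a minimal Python sketch (for illustration only; it is not RevBayes code, and the function name is made up). It assumes the proposed value is drawn uniformly from a window of width <tt>delta</tt> centered on the current value, as described above.

```python
import random


def slide_proposal(current, delta, rng=random):
    """Propose a new value drawn uniformly from the window
    [current - delta/2, current + delta/2]."""
    u = rng.random()  # uniform on [0, 1)
    return current + delta * (u - 0.5)


rng = random.Random(12345)  # fixed seed so the example is repeatable
current = 1.0
for delta in (1.0, 0.1):
    proposals = [slide_proposal(current, delta, rng) for _ in range(5)]
    print("delta = %.1f:" % delta, ["%.3f" % p for p in proposals])
```

A larger <tt>delta</tt> makes bolder proposals that are rejected more often, and a smaller <tt>delta</tt> makes timid proposals that are accepted more often; tuning adjusts <tt>delta</tt> to steer the acceptance rate toward the target. A proposal that falls below 0 for a rate parameter has zero prior density under the Exponential prior and would simply be rejected.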

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance probabilities for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306, 4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
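To judge whether the exchangeability estimates make sense, it helps to work out what the HKY model used for simulation (kappa = 5) looks like when written as a GTR model. In HKY, the two transitions (A↔G and C↔T) have relative rate kappa and the four transversions have relative rate 1; dividing by the total makes the six values sum to 1, as they must for a simplex. The Python sketch below is for illustration only. The function name is made up, and the ordering AC, AG, AT, CG, CT, GT is an assumption of this example.

```python
def hky_as_gtr_exchangeabilities(kappa):
    """Express HKY as six GTR exchangeabilities that sum to 1.
    Assumed order: AC, AG, AT, CG, CT, GT (AG and CT are the transitions)."""
    raw = [1.0, kappa, 1.0, 1.0, kappa, 1.0]
    total = sum(raw)
    return [r / total for r in raw]


rates = hky_as_gtr_exchangeabilities(5.0)
print(["%.4f" % r for r in rates])
```

With kappa = 5 the total is 14, so each transition has expected value 5/14 (about 0.357) and each transversion 1/14 (about 0.071).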

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also noticed that these parameters are being estimated quite precisely and accurately. What gives? The problem is that there is so much information in the data about these parameters. Their densities are very "sharp" (low variance), so proposals that move the values very far from the optimum fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
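The link between alpha and the size of proposed changes can be made concrete with a short calculation. The Python sketch below (for illustration only) assumes that the move draws the proposed simplex from a Dirichlet distribution whose parameters are alpha times the current values; that is an assumption of this example, so check the RevBayes documentation for the exact behavior of <tt>mvDirichletSimplex</tt>. Under that assumption, the standard deviation of each proposed component x is sqrt(x(1-x)/(alpha+1)).

```python
import math


def dirichlet_proposal_sd(current, alpha):
    """Standard deviation of each component of a proposal drawn from
    Dirichlet(alpha * current), where current is a simplex (sums to 1)."""
    return [math.sqrt(x * (1.0 - x) / (alpha + 1.0)) for x in current]


state_freqs = [0.1, 0.2, 0.3, 0.4]  # the frequencies used in the simulation
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    sds = dirichlet_proposal_sd(state_freqs, alpha)
    print("alpha = %7.0f:" % alpha, ["%.4f" % s for s in sds])
```

Each 100-fold increase in alpha shrinks the typical proposed change by roughly a factor of 10, which is what a very sharp posterior density needs in order to get a reasonable acceptance rate.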

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
branch_rates[i].setValue(1.0)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model. Because the edge rates replace clock_rate, we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10 minus clock_rate, plus 38 edge rates and 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is tricky in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume if you are used to mu and sigma used with normal distributions. In a Lognormal distribution, mu and sigma are, instead, the mean and standard deviation of the ''log'' of the lognormally-distributed variable!

Note this line:
branch_rates[i].setValue(1.0)
This sets the starting value of all branch rate parameters to 1. This seems to be kind of important. If you let RevBayes choose starting values for branch rates, it will start by drawing values for hyperparameters ucln_mu and ucln_sigma (which, due to the high variances we've given to their hyperpriors, can result in some pretty crazy values) and then will draw values from Lognormal(ucln_mu, ucln_sigma) to serve as starting values for the branch rates. This procedure could start us far away from a reasonable constellation of parameter values, and the MCMC analysis may never find its way to a reasonable configuration, at least with the length of run we are able to manage in a lab period. It is '''always a good idea to start an MCMC analysis with all parameters set to their MLEs''', or at least reasonable values. This does not violate Bayesian principles in any way, and it saves on the amount of burnin that needs to be done. This is especially true for complex models where the amount of information for estimating parameters is low. Here I'm cheating a bit and setting the branch rates to what I know is the true value (1.0), but if we used our estimated clock_rate from the previous analysis things would work out just as well. Note that I'm not using setValue for any other parameters; the analysis seems to behave without starting those off at reasonable values, indicating either that there's enough information about those parameters (e.g. birth_rate, state_freqs, exchangeabilities) or that the parameters have less influence due to the fact that they are hyperparameters one level removed from the likelihood (e.g. ucln_mu, ucln_sigma).

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Divergence times ==

=== warning: this section is a work in progress ===

So far we've not estimated divergence times; we've assumed the true tree topology and true divergence times for all of our analyses. In reality, our main interest probably lies in estimating divergence times. We don't really care about substitution rates or how much variation there is in those rates across edges. The rates are just nuisance parameters that must be handled reasonably in order to get at what we really want, the divergence times.

In this lab, we will not be using fossil information to calibrate divergence times. We will assume that the root has age 1.0 and focus on estimating ''relative'' divergence times. There are several good tutorials on the [https://revbayes.github.io/tutorials/ RevBayes Tutorials web page] that show you how to handle fossil calibration in RevBayes. This tutorial is intended to make you aware of all the issues surrounding divergence time estimation so that you have sufficient background to fully appreciate the tutorials on the RevBayes site.

Let's continue our example by adding some moves that will modify the tree topology and branching times. Start by making a copy of your ''relaxed.Rev'' script, calling the copy ''divtime.Rev'':

cp relaxed.Rev divtime.Rev

Now add the following section just before the section entitled "# Uncorrelated Lognormal relaxed clock":
# Tree moves

# Add moves that modify all node times except the root node
moves[nmoves++] = mvNodeTimeSlideUniform(timetree, weight=10.0)

# Add several moves that modify the tree topology
moves[nmoves++] = mvNNI(timetree, weight=5.0)
moves[nmoves++] = mvNarrow(timetree, weight=5.0)
moves[nmoves++] = mvFNPR(timetree, weight=5.0)

Note that we are giving extra weight to these moves, so each tree topology move will be attempted 5 times more often, and the node time slider move will be attempted 10 times more often, than the other moves we've defined.
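
As a rough check on what these weights imply, the Python sketch below (not RevBayes code) tallies the moves defined in this lab's relaxed clock script plus the four tree moves, and converts weights to selection probabilities by dividing each weight by the sum of all weights. The move labels are just descriptive strings chosen for this example.

```python
def move_probabilities(weights):
    """Convert move weights to selection probabilities (weight / sum of weights)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

weights = {
    "birth_rate slide": 1.0,
    "node time slide": 10.0,
    "NNI": 5.0,
    "Narrow": 5.0,
    "FNPR": 5.0,
    "ucln_mu slide": 1.0,
    "ucln_sigma slide": 1.0,
    "exchangeabilities simplex": 1.0,
    "state_freqs simplex": 1.0,
}
# One sliding-window move per branch rate (38 branches for 20 taxa)
for i in range(1, 39):
    weights[f"branch_rates[{i}] slide"] = 1.0

probs = move_probabilities(weights)
for name in ("node time slide", "NNI", "birth_rate slide"):
    print(f"{name}: {probs[name]:.4f}")
```

Because there are so many weight-1 moves in the relaxed clock model, even a weight of 10 gives the node time slider only a modest share of the proposals.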

Note also that we're still starting the MCMC off with the true tree topology and node times (see the line <tt>timetree.setValue(T)</tt>). This is cheating, of course, but if you started with a maximum likelihood estimate obtained under a strict clock, the results would be no different.

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

Phylogenetics: RevBayes Lab

2020-04-20T22:44:15Z

Paul Lewis: /* Review results of the relaxed clock analysis */

{| border="0"
|-
|rowspan="2" valign="top"|[[Image:Adiantum.png|150px]]
|<span style="font-size: x-large">[http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Phylogenetics:_Syllabus EEB 5349: Phylogenetics]</span>
|-
|The goal of this lab exercise is to introduce you to Bayesian divergence time estimation using [https://revbayes.github.io RevBayes]. There are other programs that are currently more popular than RevBayes for doing this (notably [https://www.beast2.org BEAST2]), but I prefer RevBayes for this lab because it is less of a black box: every aspect of the model is explicitly defined in RevBayes.
|}

== Getting started ==
== Login to Xanadu ==

Login to Xanadu and request a machine as usual:
srun --pty -p mcbstudent --qos=mcbstudent bash

Once you are transferred to a free node, load the paml, paup, and revbayes modules
module load paml/4.9
module load paup/4.0a-166
module load RevBayes/xxx

== Create a directory ==
Use the unix <tt>mkdir</tt> command to create a directory to play in today:
cd ~ # you can omit this line if you are already in your home directory
mkdir rblab

== Simulating and analyzing under the strict clock model ==

Divergence time analyses are the trickiest type of analysis we will do in this course. That's because the sequences do not contain information about '''substitution rates''' or '''divergence times''' per se; they contain information about the '''number of substitutions''' that have occurred, and the number of substitutions is the ''product'' of rate and time. Thus, maximum likelihood methods cannot separate rates from times; doing so requires a Bayesian approach and considered use of priors, which constrain the range of rate and time scenarios considered plausible.
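
The confounding of rate and time can be seen in a small Python sketch. It uses the Jukes-Cantor (JC69) model purely for illustration (the lab itself uses HKY and GTR), because its transition probability has a simple closed form that depends only on the product of rate and time. The function name is made up for this example.

```python
import math

def jc69_prob_same(rate, time):
    """Probability that a site has the same state at both ends of an edge under JC69.

    Only the product rate * time (expected substitutions per site) enters the formula.
    """
    d = rate * time
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

# Three very different rate/time scenarios with the same product (0.2)
for rate, time in [(1.0, 0.2), (2.0, 0.1), (0.1, 2.0)]:
    print(f"rate={rate}, time={time}: P(same) = {jc69_prob_same(rate, time):.6f}")
```

All three scenarios give the same probability, so the likelihood alone cannot choose among them; only the priors on rates and times can.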

We will thus start slowly, and we will simulate data so that we know the truth. This will help guide your expectations when conducting divergence time analyses on real data.

=== PAML evolver ===
Let's use the evolver program, which is part of Ziheng Yang's PAML package, to simulate data for 10000 sites on a 20-taxon pure birth (Yule) tree using a strict clock. This will allow us to know everything: the birth rate of the tree generating process, the "clock" rate (i.e. the substitution rate that applies to the entire tree), as well as the model used for simulation.

We will each use a different random number seed, so we should all get slightly different answers.

==== Simulate a tree ====

First simulate a pure birth tree using evolver. Start evolver by simply typing evolver at the bash prompt, then enter the information provided below at the prompts (for questions that ask for multiple quantities, just separate the values by a space):

* specify that you want to generate a rooted tree by typing 2
* specify 20 species
* specify 1 tree and a random number seed of ''your'' choosing
* specify 1 to answer yes to the question about wanting branch lengths
* specify 2.6 for the birth rate, 0.0 for the death rate, 1.0 for the sampling fraction, and 1.0 for the tree height
* press 0 to quit

One thing to note before we continue: PAML's evolver program scales the tree so that its height equals the specified mutation rate (1.0, the last number we specified above). Normally pure birth trees would have different heights because of stochastic variation, but apparently this is only possible in evolver by editing the source code and making your own, ad hoc version. I've done the next best thing, which is to set the birth rate to the value (2.6) that yields a tree of ''expected'' height 1.
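
One way to see where a value near 2.6 comes from is sketched below in Python. This is my own back-of-the-envelope calculation, not necessarily how the value was originally obtained: it assumes the tree starts with 2 lineages at the root and that the waiting time while there are k lineages is exponential with mean 1/(k × birth rate), summed over k = 2, ..., 20.

```python
def expected_yule_height(n_tips, birth_rate):
    """Expected root age of a pure birth tree under the assumption stated above.

    Sums the mean waiting times 1/(k * birth_rate) for k = 2..n_tips lineages.
    """
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips + 1))

for birth_rate in (1.0, 2.6, 5.0):
    print(f"birth rate {birth_rate}: expected height "
          f"{expected_yule_height(20, birth_rate):.3f}")
```

Under this assumption a birth rate of 2.6 gives an expected height very close to 1 for a 20-taxon tree.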

You should now find a tree description in the file ''evolver.out''. '''Rename this file''' ''tree.txt'' (this will prevent your tree description from being overwritten when you run evolver again, and we will use the ''tree.txt'' file as input to RevBayes):
mv evolver.out tree.txt

==== Simulate sequences ====
The PAML evolver program requires a control file specifying everything it needs to know to perform your simulation. Create a file named ''control.dat'' with the following contents (2 lines require modification: seed and tree description):

2
seed goes here
20 10000 1
-1
tree description goes here
4
5
0 0
0.1 0.2 0.3 0.4

Here's what each of those lines does (consult the evolver section of the [http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf PAML manual] for more info about each option):
* line 1: 2 specifies that we want the output as a nexus file
* line 2: you should enter your own random number seed on the second line (can be the same as the one you used for the tree)
* line 3: 20 taxa, 10000 sites, 1 data set
* line 4: -1 says to use the branch lengths in the tree description
* line 5: tree description: paste in the tree description you generated from the first evolver run here
* line 6: 4 specifies the HKY model
* line 7: set kappa equal to 5
* line 8: set the gamma shape parameter to 0 and the number of rate categories to 0 (i.e. no rate heterogeneity)
* line 9: set state frequencies to: T=0.1, C=0.2, A=0.3, and G=0.4 (note, not in alphabetical order!)
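
If you would rather not paste by hand, the control file layout described above can be assembled with a few lines of Python. This is only a sketch: <tt>make_control</tt> is a hypothetical helper, not part of PAML, and the toy tree below is a made-up 3-taxon example used solely to show the layout. For the lab you would pass your own seed and the tree description from ''tree.txt'' (assuming that file contains only the tree description).

```python
def make_control(seed, newick, n_taxa=20, n_sites=10000):
    """Return the text of an evolver control file with the 9 lines described above."""
    lines = [
        "2",                       # line 1: write output as a nexus file
        str(seed),                 # line 2: random number seed
        f"{n_taxa} {n_sites} 1",   # line 3: taxa, sites, number of data sets
        "-1",                      # line 4: use branch lengths from the tree description
        newick.strip(),            # line 5: tree description
        "4",                       # line 6: HKY model
        "5",                       # line 7: kappa
        "0 0",                     # line 8: no among-site rate heterogeneity
        "0.1 0.2 0.3 0.4",         # line 9: frequencies in the order T, C, A, G
    ]
    return "\n".join(lines) + "\n"

# Hypothetical toy tree, only to show what the resulting file looks like
toy_tree = "((A:0.5,B:0.5):0.5,C:1.0);"
print(make_control(seed=12345, newick=toy_tree, n_taxa=3))
```

Writing the returned string to ''control.dat'' gives the same file you would have created in a text editor.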

When saving simulated data in nexus format, PAML's evolver command looks for three files (''paupstart'', ''paupend'', and ''paupblock'') specifying what text should go at the beginning, end, and following each data matrix generated, respectively. The only one of these files that needs to have anything in it is ''paupstart''. Here's the quick way to create these files:
echo "#nexus" > paupstart
touch paupblock
touch paupend
The echo command parrots what you put in quotes and the <tt>> paupstart</tt> at the end creates the file paupstart and saves the echoed contents there (you could use <tt>>> paupstart</tt> if you wanted to append other lines to the file). The touch command is intended to update the time stamp on a file, but will create an empty text file if the file specified does not exist.

Run evolver now using this control file, and selecting option (5) from the menu, which is "Simulate nucleotide data sets".
evolver 5 control.dat

If you get ''Error: err tree...'' it means that you did not follow the directions above ;)

You should now find a file named ''mc.nex'' containing the sequence data.

== Use RevBayes to estimate the birth rate and clock rate ==

In our first RevBayes analysis, we will see how well we can estimate what we already know to be true about the evolution of both the tree and the sequences. You will cheat and fix some things to their known true values, such as the tree topology and edge lengths. The idea is to take small steps so that we know what we are doing all along.

RevBayes uses an R-like language called the Rev Language to specify the model and the analysis. Rev is not R, but it is so similar to R that you will often forget that you are not using R and will try things that work in R but do not work in Rev - just a heads-up!

=== Set up the tree submodel ===
Create a new file named ''strict.Rev'' and add the following to it (I'll provide some explanation below the code block):
# Load data and tree

D <- readDiscreteCharacterData(file="mc.nex")
n_sites <- D.nchar()

T <- readTrees("tree.txt")[1]
n_taxa <- T.ntips()
taxa <- T.taxa()

# Initialize move (nmoves) and monitor (nmonitors) counters

nmoves = 1
nmonitors = 1

# Birth-death tree model

death_rate <- 0.0
birth_rate ~ dnExponential(0.01)
birth_rate.setValue(1.0)
diversification := birth_rate - death_rate
moves[nmoves++] = mvSlide(birth_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)
sampling_fraction <- 1.0
root_time <- T.rootAge()
timetree ~ dnBDP(lambda = birth_rate,
mu = death_rate,
rho = sampling_fraction,
rootAge = root_time,
samplingStrategy = "uniform",
condition = "nTaxa",
taxa = taxa)
timetree.setValue(T)

Note that we are assigning only the first tree in ''tree.txt'' to the variable <tt>T</tt> (there is only 1 tree in that file, but RevBayes stores the trees it reads in a vector, so you have to add the <tt>[1]</tt> to select the first anyway).

The functions beginning with <tt>dn</tt> (e.g. <tt>dnExponential</tt> and <tt>dnBDP</tt>) are probability distributions. Thus, <tt>birth_rate</tt> is a parameter that is assigned an Exponential prior distribution having rate 0.01, and <tt>timetree</tt> is a parameter representing the tree and its branching times that is assigned a Birth Death Process (BDP) prior distribution. The BDP is a submodel, like the +I or +G rate heterogeneity submodels: it has its own parameters (lambda, mu, rho, and rootAge) all of which are fixed except for birth_rate.

The <tt>setValue</tt> function sets the starting value of a parameter that is allowed to vary.

Each parameter in the model requires a mechanism to propose changes to its value. These are called ''moves''. A vector of moves has been created for you, so you need only add to it. The variable <tt>nmoves</tt> keeps track of how many moves we've defined. Each time a move is added to the moves vector, we increment the variable <tt>nmoves</tt> so that new moves will not overwrite previously defined moves. This increment is performed by the <tt>++</tt> in <tt>nmoves++</tt>. The fact that the <tt>++</tt> follows <tt>nmoves</tt> means that <tt>nmoves</tt> will be incremented after its value is used. If we had used <tt>++nmoves</tt> instead, <tt>nmoves</tt> would have been incremented and then used, which would be incorrect because vector indices in Rev Language, like R, start at 1, not 0.

Monitors in RevBayes handle output. We are just initializing the monitor counters now; we will add monitors toward the end of our RevBayes script.

The model comprises a DAG (Directed Acyclic Graph). The nodes of this graph are of several types and represent model inputs and outputs:
* '''Stochastic nodes''' are exemplified by <tt>birth_rate</tt> and <tt>timetree</tt>; they can be identified by the tilde (<tt>~</tt>) symbol used to assign a prior distribution.
* '''Constant nodes''' are exemplified by <tt>death_rate</tt>, <tt>sampling_fraction</tt>, and <tt>root_time</tt>; they can be identified by the assignment operator <tt><-</tt> that fixes their value to a constant.
* '''Deterministic nodes''' are exemplified by <tt>diversification</tt>; they can be identified by the assignment operator <tt>:=</tt>. These nodes represent functions of other nodes used to output quantities in a more understandable way. For example, diversification will show up as a column in the output even though it is not a parameter of the model itself. (The diversification node was only added here to illustrate deterministic nodes; its value will always equal birth_rate because death_rate is a constant 0.0.)

I will show you how to create the DAG graphically (ha!) in the form of a pdf before we run the model.

=== Set up the strict clock submodel ===

Add the following 3 lines to your growing revscript:

# Strict clock

clock_rate ~ dnExponential(0.01)
clock_rate.setValue(1.0)
moves[nmoves++] = mvSlide(clock_rate, delta=1.0, tune=true, tuneTarget=0.4, weight=1.0)

This adds a parameter <tt>clock_rate</tt> with a vague Exponential prior (rate 0.01) and starting value 1.0. The move we're using to propose new values for this parameter as well as the <tt>birth_rate</tt> parameter is a ''sliding window move'', which you are familiar with from your MCMC homework. The value <tt>delta</tt> is the width of the window centered over the current value, and we've told RevBayes to tune this proposal during the burnin period so that it achieves (if possible) an acceptance rate of 40%. The weight determines the probability that this move will be tried. At the start of the MCMC analysis, RevBayes sums the weights of all moves you've defined and uses the weight divided by the sum of all weights as the probability of selecting that particular move next.

=== Set up the substitution submodel ===

Now let's set up a GTR substitution model:

# GTR model

state_freqs ~ dnDirichlet(v(1,1,1,1))
exchangeabilities ~ dnDirichlet(v(1,1,1,1,1,1))
Q := fnGTR(exchangeabilities, state_freqs)
moves[nmoves++] = mvDirichletSimplex(exchangeabilities, alpha=10.0, tune=true, weight=1.0)
moves[nmoves++] = mvDirichletSimplex(state_freqs, alpha=10.0, tune=true, weight=1.0)

The Q matrix for the GTR model involves state frequencies and exchangeabilities. I've made both <tt>state_freqs</tt> and <tt>exchangeabilities</tt> stochastic nodes in our DAG and assigned both of them flat Dirichlet prior distributions (the <tt>v(1,1,...,1)</tt> part is the vector of parameters for the Dirichlet prior distribution; all 1s means a flat prior).

I've assigned mvDirichletSimplex moves to both of these parameters. A simplex is a set of coordinates that are constrained to sum to 1, and this proposal mechanism modifies all of the state frequencies (or exchangeabilities) simultaneously while preserving this constraint. A list of all available moves can be found in the [https://revbayes.github.io/documentation/ Documentation section of the RevBayes web site] if you want to know more.

=== Finalize the PhyloCTMC ===

It is time to collect the various submodels (<tt>timetree</tt>, <tt>Q</tt>, and <tt>clock_rate</tt>) into one big Phylogenetic Continuous Time Markov Chain (dnPhyloCTMC) distribution object and attach (''clamp'') the data matrix <tt>D</tt> to it.

# PhyloCTMC

phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=clock_rate, nSites=n_sites, type="DNA")
phySeq.clamp(D)
mymodel = model(exchangeabilities)
mymodel.graph("strict.dot", TRUE, "white")

The next-to-last line is a little obscure. RevBayes needs to have an entry point (a root node, if you will) into the DAG, and any stochastic node will suffice. Here I've supplied <tt>exchangeabilities</tt> when constructing mymodel, but I could have provided <tt>state_freqs</tt>, <tt>birth_rate</tt>, <tt>clock_rate</tt>, etc., instead.

The last line creates a file named ''strict.dot'' that contains code (in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) dot language]) for creating a plot of your DAG. The second argument (TRUE) tells the graph command to be verbose, and the last argument ("white") specifies the background color for the plot.

=== Set up monitors ===

Let's create 3 monitors to keep track of sampled parameter values, sampled trees, and screen output:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/strict.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/strict.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

The first monitor will save model parameter values to a file named ''strict.log'' in the ''output'' directory (which will be created if necessary). The second monitor will save trees to a file named ''strict.trees'' in the ''output'' directory. Note that we have to give it <tt>timetree</tt> as an argument. This is kind of silly because we've fixed the tree topology and edge lengths, so all the lines in ''output/strict.trees'' will be identical, but this saves me having to explain this later. Finally, the third monitor produces output to the console so that you can monitor progress.

Note that we are sampling only every 10th iteration for the first 2 monitors and every 100th iteration for the screen monitor.

=== Set up MCMC ===

Finally, we're ready to add the final section to our revscript. Here we create an mcmc object that combines the model, monitors, and moves and says to do just 1 MCMC analysis. We will devote the first 1000 iterations to burnin, stopping to tune the moves every 100 iterations (RevBayes collects data for 100 iterations to compute the acceptance rate for each move, then uses that to decide whether to make the move bolder or more conservative). Then we run for real for 10000 iterations and ask RevBayes to output an operator summary, which will tell us how often each of our moves was attempted and succeeded.

# MCMC

mymcmc = mcmc(mymodel, monitors, moves, nruns=1)
mymcmc.burnin(generations=1000, tuningInterval=100)
mymcmc.run(generations=10000)
mymcmc.operatorSummary()

quit()
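
The burnin-with-tuning idea described above can be sketched in a few lines of Python. This is a toy illustration, not RevBayes's actual code: the target density is a standard normal, and the rule in <tt>tune_delta</tt> is one plausible way to widen or narrow a sliding window toward a 40% acceptance rate; I have not checked it against the RevBayes source. All names here are made up for the example.

```python
import math
import random

def tune_delta(delta, accept_rate, target=0.4):
    """Widen the window if moves are accepted too often, narrow it if too rarely."""
    if accept_rate > target:
        return delta * (1.0 + (accept_rate - target) / (1.0 - target))
    return delta / (2.0 - accept_rate / target)

def log_density(x):
    """Toy target: standard normal log density, up to an additive constant."""
    return -0.5 * x * x

def burnin(x, delta, rng, generations=1000, tuning_interval=100):
    """Run sliding-window Metropolis updates, retuning delta every tuning_interval steps."""
    accepted = 0
    for gen in range(1, generations + 1):
        proposal = x + (rng.random() - 0.5) * delta  # window of width delta centered on x
        log_u = math.log(1.0 - rng.random())         # 1 - random() lies in (0, 1]
        if log_u < log_density(proposal) - log_density(x):
            x = proposal
            accepted += 1
        if gen % tuning_interval == 0:
            delta = tune_delta(delta, accepted / tuning_interval)
            accepted = 0
    return x, delta

x, delta = burnin(x=0.0, delta=1.0, rng=random.Random(1))
print("window width after burnin:", round(delta, 3))
```

Starting from a window that is too narrow for this target, the width grows during burnin because nearly every proposal is accepted at first.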

== Run RevBayes ==

To run RevBayes, just type <tt>rb</tt> at the command prompt followed by the name of your revscript file:
rb strict.Rev

If this were a long analysis, we would create a slurm script and submit the job using sbatch, but this one should be short enough that you can easily wait for it to finish while logged in.

== Reviewing the strict clock results ==

First, copy the contents of the file ''strict.dot'' and paste them into one of the online [https://www.graphviz.org Graphviz] viewers such as [https://stamm-wilbrandt.de/GraphvizFiddle/ GraphvizFiddle], [https://dreampuf.github.io/GraphvizOnline GraphvizOnline], or [http://www.webgraphviz.com WebGraphviz]. The resulting plot shows your entire model as a graph, with constant nodes in square boxes, stochastic nodes in solid-line ovals, and deterministic nodes in dotted-line ovals.

Now download the file ''strict.log'' stored in the ''output'' directory to your laptop and open it in [https://www.beast2.org/tracer-2/ Tracer].

<div style="background-color: #ccccff">
* ''What is the 95% HPD (highest posterior density) credible interval for clock_rate (use the Estimates tab in Tracer to find this information)? Does this include the true value?'' {{title|I got (0.9986, 1.0216), and yes, it includes the true value 1.0|answer}}
* ''What is the 95% HPD (highest posterior density) credible interval for birth_rate? Does this include the true value?'' {{title|I got (0.6306,4.1727); yes, the true rate was 2.6, which is close to the middle of the credible interval|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 6 exchangeabilities, do these densities make sense given what you know about the model used?'' {{title|yes, the 2 transition relative rates are about 0.358, which is 5.04 times higher than the other 4, which average 0.071|answer}}
* ''Using the Marginal Density tab in Tracer, and selecting all 4 state_freqs, do these densities make sense given what you know about the model used?'' {{title|yes, they are centered over 0.1, 0.2, 0.3, and 0.4, which are the values we specified when simulating the data using evolver|answer}}
* ''Which parameter is hardest to estimate precisely (i.e. has the broadest HPD credible interval)?'' {{title|birth_rate|answer}}
* ''For the parameter you identified in the previous question, would it help to simulate 20000 sites rather than 10000?'' {{title|no, not even an infinite amount of data would narrow down the estimate of birth_rate (do you understand why?)|answer}}
* ''What would help in reducing the HPD interval for birth_rate?'' {{title|larger trees: the data needed to estimate birth_rate lies in the lengths of the intervals between speciation events, so more speciation events mean more data for estimating birth_rate|answer}}
</div>
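
For the exchangeabilities question, it helps to know what values to expect. The data were simulated under HKY with kappa = 5, which is a special case of GTR in which the 2 transition exchangeabilities are kappa times the 4 transversion exchangeabilities. The Python sketch below (the function name is made up) normalizes those 6 values to sum to 1, as the Dirichlet prior requires.

```python
def hky_exchangeabilities(kappa):
    """Return (transition, transversion) exchangeabilities normalized to sum to 1.

    There are 2 transitions (each proportional to kappa) and 4 transversions
    (each proportional to 1), so the normalizing constant is 2*kappa + 4.
    """
    total = 2.0 * kappa + 4.0
    return kappa / total, 1.0 / total

ts, tv = hky_exchangeabilities(5.0)
print(f"transition: {ts:.3f}, transversion: {tv:.3f}, ratio: {ts / tv:.2f}")
```

The expected values are 5/14 for each transition and 1/14 for each transversion, which you can compare with the marginal densities shown in Tracer.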

You may have noticed that our effective sample sizes for the exchangeabilities and state_freqs parameters are pretty low. You no doubt also notice that these are being estimated quite precisely and accurately. What gives? The fact that there is so much information in the data about these parameters is the problem here. The densities for these parameters are very "sharp" (low variance), and proposals that move the values for these parameters away from the optimum by very much fail. Looking at the operator summary table generated after the MCMC analysis finished, you will notice that RevBayes maxed out its tuning parameter for both exchangeabilities and state_freqs at 100. Making this tuning parameter larger results in smaller proposed changes, so if we could set the tuning parameter alpha to, say, 1000 or even 10000, we could achieve better acceptance rates and higher ESSs.
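
To see why a larger alpha means smaller proposed changes, here is a Python sketch under the assumption (mine, not confirmed here against the RevBayes documentation) that the move proposes a new simplex from a Dirichlet distribution with parameters alpha times the current values. Under that assumption, the standard deviation of the proposed value for a component currently equal to x is sqrt(x(1-x)/(alpha+1)). Function names are made up for this example.

```python
import math
import random

def propose_simplex(current, alpha, rng):
    """Draw from Dirichlet(alpha * current) using normalized gamma variates."""
    gammas = [rng.gammavariate(alpha * x, 1.0) for x in current]
    total = sum(gammas)
    return [g / total for g in gammas]

def proposal_sd(x, alpha):
    """Standard deviation of one proposed component when its current value is x."""
    return math.sqrt(x * (1.0 - x) / (alpha + 1.0))

rng = random.Random(2020)  # arbitrary seed
current = [0.1, 0.2, 0.3, 0.4]
for alpha in (10.0, 100.0, 1000.0, 10000.0):
    proposed = propose_simplex(current, alpha, rng)
    print(f"alpha={alpha:>7}: sd for x=0.4 is {proposal_sd(0.4, alpha):.4f}; "
          f"one proposal = {[round(p, 3) for p in proposed]}")
```

Each tenfold increase in alpha shrinks the typical proposed change by roughly a factor of 3, which is why a sharply peaked posterior calls for a large alpha.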

== Relaxed clocks ==

It is safe to assume that a strict molecular clock almost never applies to real data. So, it would be good to allow some flexibility in rates. One common approach is to assume that the rates for each edge are drawn from a lognormal distribution. This is often called an UnCorrelated Lognormal (UCLN) relaxed clock model because the rate for each edge is independent of the rate for all other edges (that's the uncorrelated part) and all rates are lognormally distributed. This is to distinguish this approach from correlated relaxed clock models, which assume that rates are to some extent inherited from ancestors, and thus there is autocorrelation across the tree.

Copy your ''strict.Rev'' script to a file named ''relaxed.Rev'':
cp strict.Rev relaxed.Rev

Edit ''relaxed.Rev'', replacing the section entitled "Strict clock" with this relaxed clock version:

# Uncorrelated Lognormal relaxed clock

# Add hyperparameters mu and sigma
ucln_mu ~ dnNormal(0.0, 100)
ucln_sigma ~ dnExponential(.01)
moves[nmoves++] = mvSlide(ucln_mu, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)
moves[nmoves++] = mvSlide(ucln_sigma, delta=0.5, tune=true, tuneTarget=0.4, weight=1.0)

# Create a vector of stochastic nodes representing branch rate parameters
n_branches <- 2*n_taxa - 2
for(i in 1:n_branches) {
branch_rates[i] ~ dnLognormal(ucln_mu, ucln_sigma)
moves[nmoves++] = mvSlide(branch_rates[i], delta=0.5, tune=true, weight=1.0)
}

You'll also need to substitute branch_rates for clock_rate in your dnPhyloCTMC call:
phySeq ~ dnPhyloCTMC(tree=timetree, Q=Q, branchRates=branch_rates, nSites=n_sites, type="DNA")

[[Image:Lognormal.png|thumb|right]]
The model has suddenly gotten a lot more complicated, hasn't it? We now have a rate parameter for every edge in the tree, so we've added 2*20-2=38 more parameters to the model. Each of these edge rate parameters has a lognormal prior, and the 2 parameters of that distribution represent hyperparameters in what is now a hierarchical model, so we've increased the model from 10 parameters (1 clock_rate, 1 birth_rate, 3 state_freqs, 5 exchangeabilities) to 49 parameters (the original 10, minus the clock_rate that the edge rates replace, plus 38 edge rates and the 2 hyperparameters ucln_mu and ucln_sigma).

To the right is a figure (click the thumbnail to enlarge) that shows the relationship between the lognormal distribution (left) and the normal distribution (right). The lognormal distribution is strange in that its two parameters (mu and sigma) are ''not'' the mean and standard deviation of the lognormally-distributed variable, as you might be led to assume; they are the mean and standard deviation of the ''log'' of the lognormally-distributed variable!

You should also edit the monitors section so that the output file names reflect the fact that we're using a relaxed clock now:

# Monitors

monitors[nmonitors++] = mnModel(filename = "output/relaxed.log", printgen = 10, separator = TAB)
monitors[nmonitors++] = mnFile(filename = "output/relaxed.trees", printgen = 10, timetree)
monitors[nmonitors++] = mnScreen(printgen=100)

And don't forget to change the name of the dot file:

mymodel.graph("relaxed.dot", TRUE, "white")

Now run the new model:
rb relaxed.Rev

== Review results of the relaxed clock analysis ==

If you create a plot of your ''relaxed.dot'' file using one of the online Graphviz viewers, the increase in model complexity will be made very apparent!

Open the ''relaxed.log'' file in Tracer and think about the following questions before peeking at the answer:

<div style="background-color: #ccccff">
* ''What is the true rate for any given edge in the tree?'' {{title|1.0 (remember, we are still using the data set simulated under a strict clock with rate 1.0)|answer}}
* ''Looking across the 38 branch rate parameters, do any of them get very far from 1.0?'' {{title|no, they are all very close to 1.0|answer}}
* ''What are the mean values of ucln_mu and ucln_sigma, our two hyperparameters that govern the assumed lognormal prior applied to each branch rate?'' {{title|I got 0.0116 for ucln_mu and 0.01828 for ucln_sigma|answer}}
* ''What do these values translate to on the linear scale (consult the figure)?'' {{title|mean equals 1.0118, standard deviation equals 0.01850|answer}}
* ''Do the values on the linear scale make sense?'' {{title|yes, the mean is near 1, which was the clock rate simulated, and the standard deviation is near 0, which means there is essentially no variation among branches in substitution rate (i.e., a strict clock), which is also what we simulated|answer}}
</div>

Phylogenetics: RevBayes Lab

2020-04-20T22:27:43Z

Paul Lewis: /* Reviewing results of the relaxed clock analysis */

Phylogenetics: RevBayes Lab

2020-04-20T22:23:23Z

Paul Lewis: /* Relaxed clocks */

Phylogenetics: RevBayes Lab

2020-04-20T22:21:44Z

Paul Lewis: /* Relaxed clocks */

Phylogenetics: RevBayes Lab

2020-04-20T21:59:10Z

Paul Lewis: /* Relaxed clocks */

Phylogenetics: RevBayes Lab

2020-04-20T20:08:45Z

Paul Lewis: /* Relaxed clocks */

Phylogenetics: RevBayes Lab

2020-04-20T20:06:28Z

Paul Lewis: /* Relaxed clocks */

Phylogenetics: RevBayes Lab

2020-04-20T20:05:14Z

Paul Lewis: /* Finalize the PhyloCTMC */

Phylogenetics: RevBayes Lab

2020-04-20T20:03:53Z

Paul Lewis: /* Reviewing the strict clock results */

Phylogenetics: RevBayes Lab

2020-04-20T19:45:19Z

Paul Lewis: /* Finalize the PhyloCTMC */

Phylogenetics: RevBayes Lab

2020-04-20T19:41:29Z

Paul Lewis: /* Relaxed clocks */

Phylogenetics: RevBayes Lab

2020-04-20T17:12:45Z

Paul Lewis: /* Relaxed clocks */

Phylogenetics: RevBayes Lab

2020-04-20T17:09:42Z

Paul Lewis: /* Relaxed clocks */

Phylogenetics: RevBayes Lab

2020-04-20T17:09:02Z

Paul Lewis:

File:Lognormal.png

2020-04-20T17:04:37Z

Paul Lewis:

Phylogenetics: RevBayes Lab

2020-04-20T17:04:14Z

Paul Lewis: