This file documents major bugs in Phycas. A major bug is loosely defined as something
that prevents a user from using Phycas to analyze a particular data set, or something
that causes Phycas to generate incorrect results. Major bugs usually take a day or more
to fix, and documenting them helps to avoid repeating history. Minor bugs represent 
cosmetic glitches that are easy to work around and do not take long to fix. Minor bugs 
are best documented in SVN comments.

Bug 7: the "underflow" bug (fixed in revision 1050 on 13 Mar 2009)
------------------------------------------------------------------
Thanks to Federico Plazzi and Mark Clements for alerting us to this problem.
Symptoms of this bug were that the likelihood was incorrectly computed when the pinvar
submodel was used and underflow protection was active. This turned out to be three
unrelated bugs. 

The first bug was that UnderflowType was typedef'd to an unsigned integer
in cond_likelihood.hpp. Because the maximum conditional likelihood for a pattern sometimes
needs to be made smaller, the log correction factors sometimes must be negative, hence
the problem. The UnderflowType is now a (signed) long. 

The second bug had to do with the pinvar model. The overall site likelihood x for the 
pinvar model is computed as follows:

x = p*i + (1 - p)*v

where p is pinvar, i is the site likelihood given invariability and v is the site 
likelihood given variability. If underflow correction has been applied, 

x = p*i + (1 - p)*v*c

where c is the cumulative correction factor that has been applied to keep v from
underflowing. In non-pinvar models, we were subtracting the log of c from the log 
site likelihood, but obviously taking the log of x above and subtracting log c will
not work, hence the incorrect log-likelihoods. To correct this, we now compute an
underflow correction factor f for i if underflow correction has been applied to v:

x = p*i*f + (1 - p)*v*c

If f < c, we effectively replace v*c with v*f by multiplying by f/c:

x = p*i*f + (1 - p)*v*c*(f/c)
  = p*i*f + (1 - p)*v*f
  
Now, log f can later be subtracted from the log x to give the correct site likelihood.
We need not worry that f/c might underflow, because if it does it means the variable 
component of the site-likelihood was negligible anyway.

If f < c, we effectively replace i*f with i*c by multiplying by c/f:

x = p*i*f*(c/d) + (1 - p)*v*c
  = p*i*c + (1 - p)*v*c
  
In this case, log c can later be subtracted from the log x to give the correct site
likelihood (i.e., the usual procedure). If c/f underflows, it means that f was 
incredibly large, which can only happen if the frequency of the constant state was 
so small that it would make the invariable part of the site likelihood zero anyway.
An assert was put in place that tests whether i > z, where z = underflow_max/DBL_MAX, 
the smallest value of i such that the correction factor f will not overflow.

The third bug surfaced after the first two were fixed when comparing the Paradox
example output when mcmc.uf_num_edges was set to 5 to that when mcmc.uf_num_edges
was set to 50 (the default). In the latter case, underflow protection does not kick
in because the data set has only 17 taxa (and thus 17*2 - 3 = 31 edges). This turned
out to be an incompletely-implemented PatternSpecificUnderflowPolicy::check() 
function.

Bug 6: the "leaky" bug (fixed in revision 994 on 24 Dec 2008)
-------------------------------------------------------------
This bug turned out to be two bugs, both causing memory leaks during that part of 
path sampling analyses when the prior is being sampled directly. The first bug turned
out to be relatively easy to fix: the ScopedThreeDMatrix destructor in nxsallocatematrix.h
consisted of this line

if (!ptr) DeleteThreeDArray<T>(ptr);

which of course should have been

if (ptr) DeleteThreeDArray<T>(ptr);

The second memory leak was much harder to find and fix. First of all, the Tree destructor
(in basic_tree.cpp) was popping nodes off the internalNodeStorage and tipStorage stacks
but not actually deleting the nodes. This meant that not only were TreeNode objects
not being deleted, but the large TipData and even larger InternalData structures that
were stored in these TreeNode objects were also not getting deleted. To fix this,
"delete nd" statements were added to the Tree destructor, but this led to a complication:
each TreeNode object was storing a smart pointer to its Tree object. Unfortunately, these
smart pointers were being created upon TreeNode construction and thus each of these had
a use count of just 1 when the node was deleted, resulting in a second call of the Tree
destructor. The Tree destructor was thus being called recursively, which led to lots of
problems with freeing the same object multiple times. Since the smart pointer to Tree
inside TreeNode objects was not ever being used, this data member was simply removed to
solve this problem. One final glitch remained, however, and this was revealed when 
doctestall.py was run in debug mode. The first doctest example in _TreeLikelihood.py
crashed when the following section was executed:

>>> likelihood = Likelihood.TreeLikelihood(model)
>>> likelihood.copyDataFromDiscreteMatrix(data_matrix)
>>> for t in reader.getTrees():
...     tree = Phylogeny.Tree(t)
...     likelihood.prepareForLikelihood(tree)
...     lnL = likelihood.calcLnL(tree)
...     print '%.5f' % lnL
-7812.79213

The problem here was that the TreeLikelihood object's destructor was called before the
Tree destructor. TreeNode objects that have been used to compute likelihoods have
either TipData or InternalData objects attached to them, and the InternalData objects
each have (or had, I should say) a reference to the CondLikelihoodStorage data member 
(cla_pool) of the TreeLikelihood class, so when TreeLikelihood was deleted,
these references got hosed. Then, when Tree::~Tree was called, it began deleting nodes
and the internal nodes began deleting their InternalData structures, which in turn
began trying to stuff their CondLikelihood objects back into cla_pool, and...you see
the problem. To fix this, TreeLikelihood creates a new CondLikelihoodStorage
object in its constructor:

cla_pool = CondLikelihoodStorageShPtr(new CondLikelihoodStorage());

and from this point on, shared pointers are used by InternalData objects. That way,
the object pointed to by cla_pool is not deleted until all references to it are 
deleted.

Bug 5: the "pinvar" bug (fixed in revision 760 on 17 July 2008)
---------------------------------------------------------------
Phycas failed the LikelihoodTest test when that test was reinstated by MTH on 16 July.
This bug caused the log-likelihoods for +I models to be slightly off (5th decimal place).
It turns out that this bug was caused by the "Rota" bug fix. Rather than implement a
workaround, POL decided to take the opportunity to redo the +I model likelihood 
calculations in a more efficient manner (+I was previously being implemented as simple
one more rate category with a 0.0 rate). Phycas now does not require an additional
rate category for +I models and instead uses a vector (TreeLikelihood::constant_states)
that stores the state(s) of potentially constant sites (and also keeps track of 
sites that are definitely variable). The constant_states vector is built in the function
TreeLikelihood::buildConstantStatesVector, and the information in constant_states is
used in TreeLikelihood::harvestLnLFromValidEdge.

Bug 4: the "Yang" bug (fixed in revision 591 on 5 May 2008)
-----------------------------------------------------------
Reported by Ziheng Yang on 3 May 2008. The immediate problem that Ziheng had 
when running the Paradox.py example was

Files\Python25\lib\site-packages\phycas\Phycas\MCMCManager.py", line 377, in
setupChain
    self.bush_move.viewProposedMove(self.phycas.bush_move_debug)
AttributeError: 'BushMove' object has no attribute 'viewProposedMove'

However, when that was fixed (by removing the call to viewProposedMove, which 
no longer exists), the program crashed almost immediately (with an unidentified
C++ exception) when running the Paradox example. 

When the three of us (MTH, POL and DLS) got together in Kansas (20-26 October 2007)
to implement SAMC, we inadvertently created this bug by eliminating Tree::nodeStorage 
and replacing it with Tree::tipStorage and Tree::internalNodeStorage. The Paradox
example in SVN revision 455 was found to work fine, but failed in revision 456 
(after unrelated compile-time errors were fixed). In revision 455, any node that 
was pruned from a tree was stored in one place, whereas in 456 onward it is stored in 
one of two places depending on whether it is a tip or an internal node. In BushMove,
when an edge is deleted to create a polytomy, BushMove::proposeDeleteEdgeMove calls
the function TreeManipulator::DeleteLeaf to remove the node (now a tip even if it 
was originally internal because all of its children have been made children of its
parent). DeleteLeaf stored the node in tipStorage, even if originally it was an
internal node. When BushMove::proposeAddEdgeMove was called later, it checks 
internalNodeStorage to see if any nodes are available before creating one de novo, 
but of course internalNodeStorage was empty because previous delete edge moves had
been storing pruned nodes in tipStorage instead. Thus, a new node was created each
time BushMove::proposeAddEdgeMove was called. This too would not be problematic if
the function TreeLikelihood::prepareInternalNodeForLikelihood were working correctly.
This function had this body:
	if (!nd)
		{
		TreeNode::InternalDataDeleter	cl_deleter	= &deallocateInternalData;
		InternalData * cl = allocateInternalData();
		nd->SetInternalData(cl, cl_deleter);
		}
The conditional should have been "if (nd)..." not "if (!nd)...", so newly-minted nodes
were not being adorned by internal node data structures necessary for carrying out 
likelihood calculations and the program crashed the first time these data structures
were needed. Search for POLPY_NEWWAY in revision 591 for all the places where code
was changed to fix this bug. The POLPY_NEWWAY conditionals were removed in revision
592.

Bug 3: the "Rota" bug (fixed in revision 434 on 19-Oct-2007)
------------------------------------------------------------
Named for the person (Jadranka Rota) whose data set triggered this investigation.
A combination of high rate heterogeneity, small branch lengths and quite unequal
base frequencies caused phycas to hang during MCMC analyses involving models that
allowed base frequencies to be unequal. Checks were in place to prevent edge lengths
from becomming too small (see TreeNode::SetEdgeLen), but when there is high rate
heterogeneity, these already small edge lengths are multiplied by a relative rate
that results in a really tiny edge length for the first rate category. This, in
combination with quite unequal base frequencies can lead to very small negative 
transition probabilities in HKY::calcPMat and QMatrix::recalcPMat. The fix was to
check edge lengths in HKY::calcPMat and QMatrix::recalcPMat, as that is where
the minimum edge length should be enforced.

Bug 2: the "topology prior table" bug (fixed in revision 421 on 13-Aug-2007)
----------------------------------------------------------------------------
This bug was first reported by Stephano Mona on 21 July 2007.
Using the polytomy model with a data set of 186 taxa resulted in this error:

Topology prior:
  Prior type:
polytomy prior
  Prior strength (C): 2.71828182846
  Expected prior probability for each resolution class:
   class        prior
  -------------------
       1      0.00000
       2      0.00000
...
      38      0.00000
      39      0.00000
Traceback (most recent call last):
  File "mona.py", line 89, in <module>
    phycas.run()
  File "C:\Synchronized\Projects\phycasdev_trunk\phycas\Phycas\Phycas.py", line 935, in run
    self.showTopoPriorInfo()
  File "C:\Synchronized\Projects\phycasdev_trunk\phycas\Phycas\Phycas.py", line 475, in showTopoPriorInfo
    self.output('%8d %12.5f' % (i,math.exp(v - denom)))
ValueError: math domain error

Resolution:
-----------
The problem was in the display of the realized prior probabilities for topologies. The numerator 
of each of these probabilities can get large because it involves the number of tree topologies 
for a given number of internal nodes. For this problem (186 taxa), the numerator overflowed for 
trees having more than 39 internal nodes and the crash occurred when it tried to subtract 
1.#INF from 1.#INF.

Fixing the bug required reworking parts of the TopoPriorCalculator class 
(phycas/src/topo_prior_calculator.cpp). In particular, the RecalcCountsAndPriorsImpl function was
reworked so that the vector counts is now associated with a vector nfactors, the ith element of 
which contains the number of times a factor needed to be removed in order to keep the magnitude
of counts[i] reasonable. The factor taken out is scaling_factor (saved as the log in a new variable
named log_scaling_factor). Thus, if scaling_factor = 10, nfactors[i] = 3, and counts[i] = 1.5, the 
value of the log_scaling_factor variable would be log(10) and the actual count would be 1.5*10^3.
Even with this system, the log of the total topology count is still difficult to obtain because
each log count must be exponentiated in order to add it to the sum. Thus, the largest log count is
determined and factored out of the sum, then its log is added back in at the end. Many of the 
smaller counts will disappear, but they are negligible anyway given that the total is never used 
in any MCMC calculations. A similar factoring out of the largest value was required in the function 
GetRealizedResClassPriorsVect. Now, a valid table of resolution class priors is presented to the
user prior to an analysis involving polytomies. For large numbers of taxa (e.g. 186), many of the
values will appear to be zero. A note has been added to the output apprising users of the fact that
even though probabilities are reported as 0.00000000, they are not in fact exactly zero and this
result simply means that the value is less than 0.000000005 and rounds off to zero when only 8
decimal places are used.

Bug 1: the "total_count" bug (fixed in revision 292 on 3-Nov-2006)
------------------------------------------------------------------
In the Gelfand-Ghosh calculations, a map in which patterns are keys and pattern counts are
values is generated that maintains the sum of all posterior predictive datasets generated
during the run. At the end of the run, each count is divided by the number of data sets added,
so that this map, called gg_mu, ends up being best described as the "average" posterior predictive
data set. The problem here was that the counts were stored as floats, and adding enough of these
together finally overflows the float value, creating errors in the total number of counts over
all patterns. The main symptom was that adding a value such as 1 to total_count yielded an updated
total_count that was not 1 more than its previous value! The example below explains why this 
happens. This bug mostly affected the SimData class (files sim_data.hpp, sim_data.inl and sim_data.cpp).

Here is an example of how the overflow is manifested. A count of 1139.0 is added to the existing
total count 16776298.0, which should yield 16777437.0, but instead yields 16777436.0.

total_count before:  16776298.0 is  111111111111110001101010 in binary
added amount:            1139.0 is               10001110011 in binary
----------------------------------------------------------------------
total_count after should be:       1000000000000000011011101 in binary

Single precision floats are stored as a 23 bit mantissa and an 8 bit biased exponent, plus a 
sign bit. For example, to represent 1139.0, start with the binary representation, 10001110011,
and move the decimal point left until achieving the normalized representation

 10001110011 = 1.0001110011 * 2^{10}
 
Adding the bias (127) to the exponent (biasing is necessary to allow negative exponents to be 
represented as unsigned values) yields 127 + 10 = 137, which is 10001001 binary. Thus, the
sign bit is 0 (because the number is positive), the 8 bit exponent is 10001001 and the 23 bit
mantissa (padded on the right with 13 0s) is 00011100110000000000000:

 1139 = 0|10001001|00011100110000000000000
 
Doing the same with total_count after the addtion, we have 
 
 16777437.0 = 1.000000000000000011011101 * 2^{24} = 0|10010111|000000000000000011011101
 
Unfortunately, the mantissa is 24 bits long rather than the 23 allowed, so the trailing
1 gets lopped off, giving us the following instead:

 16777436.0 = 0|10010111|00000000000000001101110

This is the source of the confusion surrounding this bug. The value of total_count was 
accurate until enough simulated datasets had been added to gg_mu to cause overflow of this
sort, at which point adding 1 didn't necessarily increase total_count by 1! 

Solution:

The best solution seems to keep the counts as floats, but instead of waiting until all
posterior predictive data sets have been added to gg_mu to before doing the division, keep
a running average so that the total number of counts does not grow too large.

For example, suppose there are only four possible patterns, and the following pattern counts 
are generated by performing three posterior predictive simulations:

[213, 287,   0,   0]
[177,   0, 323,   0]
[103,   0,  97, 300]

In each case, the sum of all pattern counts is 500, so the total count is 1500 after all 
three have been added. The strategy before would be to add values for each pattern over
all three replicates, then divide by 3 at the end, so mu vector would be:

[164.333, 95.667, 140, 100]

The alternative is to add each pair in turn to a running average as follows:

Add first posterior predictive replicate (r1) to mu:
p = 1/1 (p is always 1/(number of posterior predictive datasets added)
mu = [ 0,     0, 0, 0 ] =>   [ 0*(1-p), 0*(1-p), 0*(1-p), 0*(1-p) ]
r1 = [ 213, 287, 0, 0 ] => + [   213*p,   287*p,     0*p,     0*p ]
                    new mu = [     213,     287,       0,       0 ]

Add second posterior predictive replicate (r2) to mu:
p = 1/2
mu = [ 213, 287,   0, 0 ] =>   [ 213*(1-p), 287*(1-p), 0*(1-p), 0*(1-p) ]
r2 = [ 177,   0, 323, 0 ] => + [     177*p,       0*p,   323*p,     0*p ]
                      new mu = [       195,     143.5,   161.5,       0 ]

Add third posterior predictive replicate (r3) to mu:
p = 1/3
mu = [ 195, 143.5, 161.5,   0 ] =>   [ 195*(1-p), 143.5*(1-p), 161.5*(1-p), 0*(1-p) ]
r3 = [ 103,     0,    97, 300 ] => + [     103*p,         0*p,        97*p,   300*p ]
                            new mu = [   164.333,      95.667,         140,     100 ]

To implement the above averaging scheme required changing only four lines, all 
in Phycas.py. These three lines were added just before the contents of tmp_simdata
are added to gg_mu in the recordSample function:

  p = 1.0/self.gg_num_post_pred_reps
  self.gg_mu.multBy(1.0 - p)
  self.tmp_simdata.multBy(p)

The following line, which formerly did the final division, was removed from the 
run function:

  self.gg_mu.divideBy(float(self.gg_total)