At the end of my previous post, I highlighted a discrepancy between the output produced by the Boost.Accumulators implementation of the P2 algorithm and the figures published in the original paper. It turns out the discrepancy was due to a typo in the original paper and not in the Boost.Accumulators implementation as I had originally suspected.
All my objections listed in that post to the testing method used by the library are still valid, but I thought it was important to emphasize that my suspicions about an implementation error were misguided. Of course, had the authors tested their implementation against figures provided in the original paper, they may have discovered the typo in that paper a long time ago. That in and of itself serves to emphasize the importance of testing one’s code against known results.
I consider my more serious blog entries to be the equivalent of working papers: This is where I hash out ideas so others with similar interests can learn from my experience and/or point out my errors. This is a process of discovery: Explaining my process of how I discovered an interesting problem does not necessarily mean I have a solution at the time of discovery. It means I have reached a point where it is useful to step back and summarize what I have done.
When I do this, without fail, some individuals demand bug reports and pull requests. To me, a pull request is more like submitting a draft to a journal for publication: it comes after I outline the basics in a working paper, present it, and get some feedback.
Back to the Boost.Accumulators implementation of the P2 algorithm. At the end of my previous post, I showed the following output:
calculated= 0.74 expected= 0.74
calculated= 0.74 expected= 0.74
calculated= 2.06167 expected= 2.18
calculated= 4.55176 expected= 4.75
calculated= 4.55176 expected= 4.75
calculated= 9.15196 expected= 9.28
calculated= 9.15196 expected= 9.28
calculated= 9.15196 expected= 9.28
calculated= 9.15196 expected= 9.28
calculated= 6.17976 expected= 6.3
calculated= 6.17976 expected= 6.3
calculated= 6.17976 expected= 6.3
calculated= 6.17976 expected= 6.3
calculated= 4.24624 expected= 4.44
calculated= 4.24624 expected= 4.44
In this output, calculated values are produced by the Boost.Accumulators implementation and the expected values come from Table I in the paper. According to the paper, these values were produced from the following sequence of observations:
0.02, 0.5, 0.74, 3.39, 0.83,
22.37, 10.15, 15.43, 38.62, 15.92, 34.60, 10.28, 1.47, 0.40, 0.05, 11.39, 0.27, 0.42, 0.09, 11.37
where the first five observations are listed on page 1078:
These observations are used to set up the algorithm. Then, Table I in the paper summarizes the mechanics of the algorithm.
In the absence of evidence to the contrary, my initial instinct was to assume that the figures in a peer-reviewed paper published 30 years ago were correct. After all, I have been calculating percentiles for large data sets (where large means something larger every year) for more than 2/3 of that period, and, therefore, have been directly or indirectly using the algorithm outlined in this paper all that time.
That meant trying to figure out what subtle error might be lurking in the implementation. I probably re-wrote the calculation of the parabolic interpolation in that code six and a half different ways. I used different combinations of intermediate calculations, compared output from gcc, VC++, and clang, compared output using floats versus doubles, single-stepped through code, printf-ed every intermediate step, etc., and could not get the output to match what was shown in the paper.
I decided it was time to verify the calculations by hand. And, I do mean by hand. Handling x6 = 22.37 was easy. With x7 = 10.15, I had to calculate only one of the marker heights, q4 = 4.465. I noticed this was listed as 4.47 in the paper, and I briefly wondered if the discrepancy could be due to someone using the rounded figure in subsequent calculations rather than the exact number. Then, it was time to handle x8 = 15.43. This meant that q3 (that is, our approximation to the median of the sample {0.02, 0.5, 0.74, 3.39, 0.83, 22.37, 10.15}) needed to be updated. In this case, the parabolic interpolation formula on page 1078 of the paper reduces to:
0.74 + (1/3) * ( 2 * (4.465 - 0.74) / 2 + 1 * ( 0.74 - 0.5) / 1 )
= 0.74 + (1/3) * ( 3.725 + 0.24 )
= 0.74 + (1/3) * 3.965
= 2.06166666666667
and this matches the output from the Boost.Accumulators implementation. On the other hand, Table I in the original paper says the new value q3 after this step should be 2.18.
This made me stop in my tracks and start staring helplessly at the paper. For q3 to take on the value 2.18, the last parenthesized expression had to add up to 4.32 rather than 3.965. That is, q3 - q2 = 0.74 - 0.5 = 0.24 was too small by approximately 0.35.
That’s when I saw it.
If you look carefully at the top row of Table I in the paper, you’ll notice it, too, if I zoom in:
While the text says the second observation is 0.5, Table I shows that the computations were based on the second observation being 0.15. It was easy to verify that this would fix the calculation of q3 after the arrival of x8. To make sure this typo was indeed the source of the discrepancy between the Boost.Accumulators implementation and the data shown in the original paper, I updated my original demo program, and ran it:
#include <algorithm>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/median.hpp>

namespace bacc = boost::accumulators;

int main(void)
{
    bacc::accumulator_set<double,
        bacc::stats<bacc::tag::median(bacc::with_p_square_quantile)> > acc;

    // See http://www.cse.wustl.edu/~jain/papers/psqr.htm
    // First five observations
    acc(0.02);
    acc(0.15);
    acc(0.74);
    acc(3.39);
    acc(0.83);

    const std::vector<std::pair<double, double> > jain_chlamtac {
        {22.37, 0.74},
        {10.15, 0.74},
        {15.43, 2.18},
        {38.62, 4.75},
        {15.92, 4.75},
        {34.60, 9.28},
        {10.28, 9.28},
        {1.47, 9.28},
        {0.40, 9.28},
        {0.05, 6.30},
        {11.39, 6.30},
        {0.27, 6.30},
        {0.42, 6.30},
        {0.09, 4.44},
        {11.37, 4.44},
    };

    for (auto p: jain_chlamtac)
    {
        acc(p.first);
        std::printf("calculated= %.3f\texpected= %.2f\n", bacc::median(acc), p.second);
    }
    return 0;
}
When run, this program produced the following output:
calculated= 0.740 expected= 0.74
calculated= 0.740 expected= 0.74
calculated= 2.178 expected= 2.18
calculated= 4.753 expected= 4.75
calculated= 4.753 expected= 4.75
calculated= 9.275 expected= 9.28
calculated= 9.275 expected= 9.28
calculated= 9.275 expected= 9.28
calculated= 9.275 expected= 9.28
calculated= 6.297 expected= 6.30
calculated= 6.297 expected= 6.30
calculated= 6.297 expected= 6.30
calculated= 6.297 expected= 6.30
calculated= 4.441 expected= 4.44
calculated= 4.441 expected= 4.44
That is, the output from the Boost.Accumulators implementation of the algorithm exactly matches the figures in the original paper once rounded to the nearest two decimal places.
I fired off a quick email to Dr. Jain explaining my findings. He immediately annotated the online version of the paper to note the typo:
Now that we have cleared that up, we can use the example worked out in the original paper to put together a test of the implementation of the algorithm. This exercise increased my confidence in the implementation of the P2 algorithm in Boost.Accumulators. Obviously, this simple typo does not invalidate any of the conclusions of Jain & Chlamtac (1985).
I am still puzzled by some output I am getting from the other two median algorithms in Boost.Accumulators, and I’ll turn my attention to those next.