Tinderbox Performance Legend
This document explains how to read and interpret the results on:
- The table on the Tinderbox page
- Results for the day (you get here by clicking any of the testname links, for example '#1 Startup')
- 30 day trend graphs (click the 'trends' link)
- Numerical trends for the last 30 days (click 'Numerical Trends' link)
All measurements are in seconds.
Typically we run each test 3 times and pick the median value for reporting.
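As a minimal sketch of the "run 3 times, report the median" rule, assuming the run times are already available as a list of seconds (the function name reported_time is hypothetical and not part of the Tinderbox scripts):

```python
from statistics import median


def reported_time(run_times_seconds):
    """Return the value to report for one test: the median of its runs."""
    if not run_times_seconds:
        raise ValueError("at least one run is required")
    return median(run_times_seconds)


# Three runs of the same test; one slow outlier does not affect the result.
print(reported_time([1.92, 2.41, 1.98]))
```

Using the median rather than the mean keeps a single unusually slow or fast run from skewing the reported number.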
Tinderbox table
The headers show when the latest results came in, the target release number, and which svn revision numbers were compared (note that it is possible to have several runs with the same svn revision). There will always be some variation even with the same svn revision number, and these differences can usually be ignored.
Each row lists a single test, the target time for the test, and then for each platform the actual time it took to run the test, the change since the last result (both % and absolute value), and the standard deviation.
Standard deviation is calculated from all of the results for the day. This value can usually be ignored. However, a large std.dev means that there is a lot of variation in the results; in other words, the numbers are not very precise. The std.dev can also be large if there is a real, big change in performance that day. We use the std.dev value to help color the cells (it serves as the error margin when determining if we need to color something).
The measured value from the test is a number on a white background, possibly surrounded by a colored box. Color legend:
- green: the measured value is faster than our target, good
- no box: the measured value is within the error margin of the target
- orange dashed box: 1-100% slower than the target
- red box: 101% or more slower than the target
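The box legend above can be sketched roughly as follows. This is an illustration, not the actual Tinderbox code: the function name value_box is hypothetical, the error margin check is assumed to come first (so a value slightly faster than the target but within std.dev gets no box), and anything more than 100% slower is assumed to be red, since the legend does not say how values between 100% and 101% are handled.

```python
def value_box(measured, target, std_dev):
    """Pick the box color for a measured time (seconds) against its target.

    std_dev is the day's standard deviation, used as the error margin.
    """
    diff = measured - target
    if abs(diff) <= std_dev:
        return "none"      # within the error margin of the target
    if diff < 0:
        return "green"     # faster than the target
    percent_slower = 100.0 * diff / target
    # Assumption: the orange/red boundary sits at exactly 100% slower.
    return "red" if percent_slower > 100.0 else "orange"


print(value_box(3.0, 2.0, 0.1))  # 50% slower than a 2 second target
```

For example, with a 2 second target and a 0.1 second std.dev, a 3 second result is 50% slower and would get the orange dashed box under these assumptions.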
The change (delta) values are shown on a white, green, orange, or red background. Color legend:
- green: the test ran faster than the previous time
- white: the change is too little to know if it was real (under std.dev)
- orange: 1-10% slower than previous time
- red: 11% or more slower than previous time
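A rough sketch of the delta coloring, under stated assumptions: the function name delta_color is hypothetical, a change equal to the std.dev is treated as "too little" (white), and anything above 10% slower is treated as red because the legend does not define the range between 10% and 11%.

```python
def delta_color(current, previous, std_dev):
    """Pick the background color for the change since the previous result.

    All times are in seconds; std_dev is the day's standard deviation.
    """
    change = current - previous
    if abs(change) <= std_dev:
        return "white"     # too small to tell whether the change is real
    if change < 0:
        return "green"     # faster than the previous time
    percent_slower = 100.0 * change / previous
    # Assumption: the orange/red boundary sits at exactly 10% slower.
    return "red" if percent_slower > 10.0 else "orange"


print(delta_color(2.1, 2.0, 0.05))  # about 5% slower than the previous run
```

Note that the comparison here is against the previous result, not against the target, so a cell can have a green box (faster than target) and still show a red delta.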
At the end of the page we list the times of the latest results for each platform separately. NOTE: It is possible for the table to stop updating even if the performance tinderboxes are working correctly, so pay attention to the latest result times! If you notice the results are not updating, contact the release engineer.
Day graphs and tables
The day graphs and tables page is started fresh each day. For each test there is one graph showing the day trend (we pick the median value for each revision number for the graph), and one table per platform per test. The tables are grouped next to the graph.
The colors are determined the same way as for the Tinderbox table.
Whenever there has been a checkin, the time on that row becomes a link to Bonsai. Clicking that link will quickly show you the checkins that happened in that time period.
NOTE: Even though we store all the old daily pages, we only store the latest daily graphs. So if you go to a previous day's page, the raw table data is correct but the graphs show the latest results from today.
Trend Graphs
The trend graphs page graphs the results of each test over the last 30 days, picking the median value from each day for a point on the graph.
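The per-day median used for the trend graph points could be computed along these lines. This is only a sketch: the function name daily_medians, the (day, seconds) input shape, and the example dates are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def daily_medians(results):
    """Map each day to the median of that day's results.

    results is an iterable of (day, seconds) pairs in any order.
    The returned dict is ordered by day.
    """
    by_day = defaultdict(list)
    for day, seconds in results:
        by_day[day].append(seconds)
    return {day: median(times) for day, times in sorted(by_day.items())}


# Hypothetical results for one test on one platform.
sample = [
    ("2007-01-02", 2.0),
    ("2007-01-01", 1.0),
    ("2007-01-01", 3.0),
    ("2007-01-01", 2.0),
]
print(daily_medians(sample))
```

Grouping by revision number instead of by day would give the points for the day graphs in the same way.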
It is not possible to figure out from the trend graph when checkins happened, but you can click the link above each graph image and it will take you to the Trend Tables page, explained below.
Trend tables
The trend tables list the raw data in table format from the last 30 days.
Coloring is the same as on the other pages.
Also, if there were checkins at a certain time, the time will be a link to Bonsai that quickly shows checkins that happened.
Interpreting results
You should always monitor the performance numbers after checkins. Small changes are normal, and so are even big spikes (you can see those in the day and trend graphs). What is a cause for concern (or joy, depending on the direction of change) is when the values jump from some old value to some new value and stay there. You can see these easily in the graphs if the change is relatively big.
Most real performance problems happen on all platforms (so you will see all of the graphs jump), but there are also platform-specific cases, so seeing a change in just some graphs might still indicate a problem.
There are some known cases where values jump to a new level without any checkins:
- The Windows performance box updates its antivirus database every night. While this is in progress, the whole machine slows down and consequently the test results show a slowdown. Sometimes the update has taken up to 14 hours! Once the update finishes, the values drop back down.
- The Linux performance box has occasionally started slowing down, most likely due to accumulating zombie processes (haven't yet figured out why this happens). A reboot fixed this for about a week, after which it started getting slower and slower again. Currently this does not seem to be a problem.
- Typically all of the machines are slightly faster right after a reboot, and gradually slow down a little bit. This is barely visible in the 30 day trend graphs.
So, what this all means is that if you see a slowdown after a checkin, check whether it stays at the new level. If it does, determine if one of the known issues might be the cause. If not, file a performance regression bug.
Also, if you are checking in supposed performance improvements, make sure that Tinderbox also reports a change. It is a good idea to mention in the bug how much the measurements improved after a checkin.
Process when performance numbers change
We don't have absolute rules regarding changes yet, so these are just guidelines.
All persistent, significant changes should be investigated.
Some tests have a natural variation of +/-2%, so even a consistent 2% change would be clearly visible with that test. Some other tests vary by 30, 50 or even over 100% in the runs over a normal day, but even these tend to have pretty consistent median values that show as plateaus on the graphs. If the plateau changes, it is a reason to investigate.
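One way to check for a plateau change numerically, rather than by eye, is to compare the median of the most recent results with the median of the results just before them. The sketch below is an illustration only; the function name plateau_shift, the 5-point window, and the 2% threshold are assumptions, and the threshold would need to be tuned per test.

```python
from statistics import median


def plateau_shift(values, window=5, natural_variation=0.02):
    """Return the relative change between the two most recent plateaus.

    Compares the median of the last `window` values with the median of
    the `window` values before them. Returns None when there is not
    enough data or the change is within the natural variation.
    """
    if len(values) < 2 * window:
        return None
    older = median(values[-2 * window:-window])
    newer = median(values[-window:])
    change = (newer - older) / older
    return change if abs(change) > natural_variation else None


# A single spike is ignored because it does not move the median...
print(plateau_shift([2.0] * 5 + [2.0, 2.0, 5.0, 2.0, 2.0]))
# ...but a jump that stays at the new level is reported.
print(plateau_shift([2.0, 2.02, 1.99, 2.01, 2.0, 2.4, 2.41, 2.39, 2.4, 2.42]))
```

A positive result means the test got slower and a negative one means it got faster; either way, a non-None result is a prompt to look at the checkins in that period.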
Even if we get an improvement, we should understand why that happened. There has already been at least one checkin which made a performance test faster, but that was because the checkin introduced a bug that made the test go faster.
All real, significant slowdowns should get bugs filed for them.
There are some exceptions:
- The checkin was for a new feature.
- The slowdown made no difference regarding the targets (it is still faster than the target, or it always was slower than the target anyway). Use this excuse with caution...
- The slowdown was minor, and the benefit of the checkin seems to outweigh the slowdown.
In all of the exception cases we still need to understand why the slowdown happened, and discuss and decide if the slowdown is acceptable.