At the heart of the proposed ISU judging system is a fundamental shift in the method of evaluating skating performances -- a change from a relative method of comparing performances to the evaluation of performances on an absolute point scale. This change in approach places such severe constraints on the required skill of the judges that it is questionable whether a point-based approach to scoring can ever work.

The former ordinal method of scoring and the current OBO system are both based on the recognition that humans can make relative judgements with greater precision than absolute judgements.

Prior to the use of electronic equipment, and before that photography, all scientific experiments relied on human observation to determine the results -- basically from the dawn of the scientific method to the middle of the 19th century, a period of about 350 years. During that period academic types became well versed in the limitations and quirks of human judgement. In general it is found that well-trained human observers can make absolute quantitative observational judgements to a limit of about 5 to 10% accuracy at best, and often worse than that. Humans are much better at making relative judgements, and can do so with an accuracy of about 1 to 2%, and sometimes better than that.

Both the ordinal and OBO scoring methods ask the judges only to come up with an order of finish, not to specify numerically how much better one performance is than another. The marks given are merely a tool to obtain that order of finish.

Statistical analysis of the judges' marks over many years shows that, when specifying an order of finish, a single judge has an uncertainty of plus/minus one place for the top and bottom skaters in a group, and about plus/minus two places in the middle half. This means that a single judge's placement is only believable to give or take one place at the top or bottom, and give or take two places in the middle. In terms of numerical accuracy this corresponds to about 2%, in agreement with the general ability of humans to make relative judgements, as described above.

A precision of give or take one or two places is not good enough to decide who should win a competition, so a panel of judges is used. If the uncertainty of one judge is give or take one place, then the uncertainty of the panel's combined result is give or take one place divided by the square root of the number of judges. For a nine member panel this means that the uncertainty in the result is give or take one-third of a place for the top and bottom, and two-thirds of a place for the middle of the group. With a difference of one full place between adjacent skaters, this means that at the top and bottom the result is determined at about the 95% confidence level. In other words, for the top and bottom skaters one can be confident that 95% of the time this is the "right" answer. For the middle group of skaters, however, the confidence level is only about 50%. To increase the confidence level one would have to increase the size of the panel. Reducing the size of the panel, on the other hand, reduces the confidence level and the believability of the result.
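The panel arithmetic above can be reproduced with a short calculation. This is a minimal sketch, assuming a simple model in which each judge's placement error is independent, with the single-judge uncertainties quoted above:

```python
import math

def panel_uncertainty(judge_sigma_places, n_judges):
    """Uncertainty (in places) of a panel's combined placement, assuming
    independent judges of equal skill: the single-judge uncertainty
    divided by the square root of the panel size."""
    return judge_sigma_places / math.sqrt(n_judges)

# Single-judge uncertainty: ~1 place at the top and bottom of the
# field, ~2 places in the middle half.
print(panel_uncertainty(1.0, 9))  # top/bottom: 1/3 of a place
print(panel_uncertainty(2.0, 9))  # middle:     2/3 of a place
```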

Whether it was by sagacious choice or dumb luck I do not know, but the choice of relative judging combined with a nine member judging panel is just about the optimum choice one could possibly have made, from a statistical point of view, in setting up a scoring system for skating. It is worth noting that skating did not start out with a relative judging system, but began with a point system. It was abandoned long, long ago because an absolute point system that involves human judgement has many limitations. It is unfortunate that some in the ISU know so little of the history of the sport that they are apparently unaware of that.

In the proposed ISU judging system, the judges will now be asked to evaluate performances on an absolute point scale without comparison to any other performance. To understand the potential problems with this consider the following.

Let us say for the moment that the point system will be set up so that the champion program will earn about 180 points and the last place performance about 90 points. This is based on some of the numbers the ISU has provided and a scaling of the current point system. In any event, the point to be made here is not affected by the exact numbers used. Using numbers, however, just makes it easier to follow (I hope).

So, in our example, a range of about 90 points spans the results from first to last, with about 30 skaters in the singles events at the World or European Championships. That means the average difference from one place to the next will be about 3 points. Further, we can expect that at the top the skaters may be even more closely bunched. But let's stick with three points, to be generous.

If a single judge can only judge to an absolute accuracy of 5% at best, that corresponds to 9 points on a 180-point scale. Nine points, however, corresponds to three places, so already we are worse off than before. If we were to use a nine member panel the accuracy of the result would be increased by a factor of three, so the uncertainty in the result of the panel would be give or take three points, or one full place. In other words, the results will be statistically meaningless. To obtain the same statistical accuracy in the results, and the same confidence level, as in the current system, one would have to increase the size of the panel by a factor of nine -- requiring 81 judges -- or find nine humans who can judge on an absolute scale to an accuracy of about 2%; something experience says is highly unlikely. To make matters worse, the ISU is planning to use an approach in which only seven, and perhaps as few as five, numbers will go into determining the final results.
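The same arithmetic can be sketched in point terms. The 180-point scale and 3-point spacing are the illustrative assumptions from the example above, not official ISU figures:

```python
import math

SCALE = 180.0           # assumed winning score, from the example above
POINTS_PER_PLACE = 3.0  # assumed average gap between adjacent places

def panel_points_error(judge_accuracy, n_judges):
    """Panel uncertainty in points, for judges with a given fractional
    absolute accuracy (0.05 = 5%), improving as the square root of
    the panel size."""
    return judge_accuracy * SCALE / math.sqrt(n_judges)

print(panel_points_error(0.05, 1))   # one judge at 5%: 9 points, ~3 places
print(panel_points_error(0.05, 9))   # nine judges:     3 points, 1 full place
print(panel_points_error(0.05, 81))  # 81 judges:       1 point, 1/3 place
print(panel_points_error(0.02, 9))   # nine 2% judges:  1.2 points
```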

For a five member panel, judges would have to mark with an absolute accuracy of about 1%. Again, this assumes the skaters are separated by about 3 points between places. If you want to separate skaters who differ by one point, which might well occur for the top skaters, the constraint on the judges is even tighter -- absolute judging with 1/3 of 1% accuracy. One should note, by the way, that having to measure a difference of only 1% between places on an absolute scale is not unusual for many individual performance sports, where the margin of victory between first and second place is usually about 1% or less, and sometimes as little as 1/10 of 1%. So expecting that the top skaters may well be separated by only one point or less is not unreasonable.

As a final example to illustrate this problem, imagine for the moment you are a speedskating referee in an event where each skater performs against a clock, one after the other. I have for you a high tech sounding clock that reads out to 1/1000 of a second. Thing is, however, the second hand of the clock occasionally sticks or randomly jumps forward in an unpredictable way. When you are done timing each performance you can never be sure of the correct time to better than 5% with this clock; say, 5 seconds in a 100 second race. Finally, suppose the skill of the skaters is such that the time differences from one skater to the next are typically only 1 second, or maybe even 1/10 of a second or less. No one would use such a clock in timing a race. Why would one even consider using such a "clock" in figure skating?

In the absence of a rigorous study involving a large number of judges (dozens), a large number of skaters (hundreds) and a large number of competitions, showing that these statistical limitations can be overcome, one has to conclude that any results produced by an absolute point model will be statistically meaningless, and one might just as well give out the medals by pulling names out of a hat.

Changing gears, now ...

Last fall, the ISU said it was looking at combining the individual judges' assessments by averaging the points or taking the median. In the most recent document distributed by the ISU, in December 2002, it said averages would be used. In February 2003 the ISU indicated it was also looking at a trimmed mean approach. The original plan was to have 14 judges and randomly select 7. The ISU is now toying with randomly selecting 9 judges and throwing away the top two and bottom two marks. As a result, only five marks would go into determining the average. As noted above, using only five numbers on an absolute scale, compared to nine or seven, makes an intolerable situation even worse. It has other implications also.

This trimmed mean is a doubled-up version of the old drop-the-top-and-bottom-score-and-take-the-average approach that circulated a few years ago. The argument is that the top and bottom marks are likely to be biased, and that by throwing them out you filter out the effects of bias. Because the statistical properties of averages, medians and trimmed means have been well studied for many years, the following points can be made.

- The top and bottom marks are not necessarily biased, and in a well policed judging pool most likely will not be biased.
- A judge can consistently mark every skater slightly higher or lower than the rest of the panel and yet have placed every skater in the same order as everyone else. One can expect that a trimmed mean will throw out some of the most reliable marks much of the time.
- Because the marks remain somewhat subjective, the high or low marks may well be the right marks compared to other judges on the panel who may be "protocol" judging.
- A clever cheat will manipulate their marks in the middle of the range to evade detection. It isn't difficult to do and the trimmed mean will not filter it out.
- When comparing averages, medians and trimmed means, the median is hands down the best choice for filtering out biased marks.
- Numerical analysis of averages, medians and trimmed means shows that median marks have outstanding performance for filtering out one or two biased marks on a nine member panel, and continue to offer some filtering against bias for three or four biased marks. Trimmed means, however, fail to filter bias a significant fraction of the time for two or more biased marks.
- Rather than resorting to numerical mumbo-jumbo that only sounds plausible but really doesn't work, the skaters would be better served by counting all 14 judges the ISU wants to use and combining their assessments using median marks.
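The bias-filtering comparison above can be illustrated with a small Monte Carlo sketch. The numbers here (honest marks near a hypothetical 100 points with unit random scatter, a cheat adding 5 points) are invented for illustration, not taken from any real protocol:

```python
import random
import statistics

def combine(marks, method):
    """Combine nine marks by mean, median, or the trimmed mean under
    discussion (drop top two and bottom two, average the middle five)."""
    if method == "mean":
        return statistics.mean(marks)
    if method == "median":
        return statistics.median(marks)
    return statistics.mean(sorted(marks)[2:-2])

def average_shift(n_biased, bias=5.0, trials=20_000, seed=1):
    """Average shift in the combined mark when n_biased of nine judges
    inflate their marks by `bias` points above an honest ~100."""
    rng = random.Random(seed)
    totals = {"mean": 0.0, "median": 0.0, "trimmed": 0.0}
    for _ in range(trials):
        marks = [100 + rng.gauss(0, 1) for _ in range(9)]
        for i in range(n_biased):
            marks[i] += bias
        for method in totals:
            totals[method] += combine(marks, method) - 100
    return {m: t / trials for m, t in totals.items()}

for k in (1, 2, 3):
    print(k, "biased judge(s):", average_shift(k))
```

In this toy model both the median and the trimmed mean resist a single biased mark well, but once three judges collude the median's shift stays noticeably smaller than the trimmed mean's, consistent with the points above.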

And, changing gears one more time ...

The two subjective technical marks and three subjective presentation marks in the proposed system will be specified on a numerical scale with a granularity of one part in 20; i.e., 5%. For the presentation marks, one step in each presentation assessment will amount to about one place in the standings. Further, 3 quality points on one trick, or one trick more or less than the competition, will probably correspond to about one place in the standings, or more. Thus, a single entry in the point total can alter a skater's result by one place as determined by a single judge. This puts a huge burden on the judges to ensure every assessment they make is perfect -- something humans are not well known for.

Another way to look at this is in the context of the judging accuracy required to obtain a believable result. The granularity (quantization error) in the number system used introduces an uncertainty of about 5%, but to believably separate the skaters in the placements one can only tolerate an uncertainty of less than 1%. The only way to overcome this is to increase the panel size dramatically. For example, 25 judges would bring this source of uncertainty down to 1%. This source of error, by the way, does not exist at all in the Ordinal and OBO relative judging systems.
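The quantization argument can be sketched the same way, treating one 5% step as the per-judge uncertainty (the simplification used above):

```python
import math

STEP = 0.05  # mark granularity: one part in 20, i.e. 5%

def quantization_uncertainty(n_judges):
    """Residual uncertainty from mark granularity after averaging an
    n-judge panel, treating one step as the per-judge uncertainty."""
    return STEP / math.sqrt(n_judges)

print(quantization_uncertainty(9))   # nine judges: ~1.7%
print(quantization_uncertainty(25))  # 25 judges:   1.0%
```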

How easy will it be to cheat the proposed system?

Very easy, it turns out. One assessment level in the presentation marks, one set of quality points, one trick, is the difference of one place or more in any judge's result for a skater. How many places do you want to help your skater? Just give that skater a few ticks up here or there, or the opposition skater a few ticks down, and you give your skater a huge boost. Do it judiciously and you will not get caught. Do it extravagantly and you can pull the whole panel your way, if the ISU sticks with averaging the judges' assessments or uses trimmed means.

*Copyright 2003 by George S. Rossano*