## Statistical Performance of Ice Skating Scoring Methods

The following is a lengthy, fairly technical analysis of the way different methods of scoring ice skating competitions perform under a variety of circumstances.  If you are interested in the mathematics of how scoring systems work, this is it.  If you are not much for numbers you may just want to skim this - or run away screaming.

Introduction

For the past few years, the scoring system used in competitive ice skating has come under attack from one quarter or another for a variety of reasons. In 1996 serious consideration was given to changing to the system used in professional skating, in which the high and low marks are thrown out and the remaining marks averaged. Most recently the current ordinal system has been criticized by both the President of the ISU and the President of the IOC as being too difficult for the public to understand, and as unfair because it allows the results of higher placed skaters to change depending on the results of lower placed skaters who compete later in a competition. In order to compare the characteristics of the ordinal method and several alternate scoring methods, a Monte Carlo analysis of 13 scoring methods was undertaken. Compared here are the characteristics of these different methods, with descriptions of how they behave in the presence of random errors and systematic errors (biases) in the assigning of marks.

In choosing which methods to consider it was decided to confine the study to methods which do not allow the results of later skaters in a group to change the results of skaters who have already competed, as this seems to be the major concern of the ISU at the time. Some consideration was also given to limiting the analysis to methods that are easy to understand, but not all the methods studied might be so described. Given the large number of spectators at events who record the marks and calculate results as events unfold, it is not clear that the current method is actually beyond the comprehension of the public, at least the public that attends competitive events. Among the TV viewing public confusion is probably more widespread, but that may be as much the fault of the ISU and the media, who make no effort to adequately explain the method to the public, as anything else. The methods studied were compared to the current ordinal method in terms of their response to marks subject only to random errors, marks subject to small systematic biases by 1 to 3 judges, and marks subject to large systematic biases by 1 to 3 judges. The ordinal method is the reference method against which the others are compared.

While not perfect, the ordinal system is, nevertheless, a remarkable system thanks to the two fundamental characteristics on which it is based, relative judging and the use of the median ordinal.

It is well known that human perceptions are much better at making comparative judgments than absolute judgments. Human senses can frequently compare and contrast observations to better than 1% but generally can only make absolute judgments to a few percent, and often no better than 5-10%. In the ordinal method the marks are only a preliminary step to establishing an order of finish for the competitors by each judge. The marks from the individual judges need not be on the same absolute scale (or any absolute scale at all) in order to produce consistent orders of finish among the judges. In determining the winner, the judges only have to decide who is the best in the group, but not how the skater rates in any absolute sense.

It is the use of relative judging, however, that gives rise to the occasional place switching which now concerns the ISU. Because the ordinal method is a relative judging method, it only gives the correct final result after everyone has skated and the placements of all the skaters relative to each other are calculated. Intermediate calculations will not necessarily match the final result since the relative placements of all the skaters are not known until it is all over. The place switching effect is not a flaw in the method, but rather a flaw in the process of releasing intermediate results. Since the practice of releasing intermediate results is expected by the public, one must either accept the existence of place swapping and try to explain it to the public, or change to a method that does not allow it to occur in the first place. The latter course of action means changing to a system in which the judges mark on an absolute scale so that intermediate calculations are not altered by the marks of subsequent skaters. As noted above, since humans are not as good at assigning absolute marks as they are relative marks, this change will add some uncertainty to the results produced by the scoring method.

Changing to absolute scoring will also have a second impact on the way marks are assigned. No longer would it be possible to give two skaters the same total mark and let the first or second mark break the tie in determining the ordinal. If placement is based in some fashion on the total mark, then more marks will be needed to separate the skaters. For example, in the ordinal method there are 15 places using marks between 5.6 and 6.0, but only 5 places using the corresponding total marks of 11.6 through 12.0. Consequently, switching to an absolute scoring method will require marking competitions more finely than the current 0.1 basis. At a minimum marking to the nearest 0.05 would be necessary. Fortunately, this level of precision appears both numerically adequate and humanly practical.
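The 15-versus-5 count can be checked by enumeration. The sketch below assumes one reading of the example: each of the two marks takes a value from 5.6 to 6.0 in steps of 0.1, and a "place" under the ordinal method is any distinct pair of marks that a tie-break mark can order. The variable names are illustrative only.

```python
from itertools import product

# Marks are held in integer tenths (56 means 5.6) to avoid floating-point noise.
marks = range(56, 61)  # 5.6, 5.7, 5.8, 5.9, 6.0

# Every (first mark, second mark) pair whose total lies between 11.6 and 12.0.
pairs = [(a, b) for a, b in product(marks, repeat=2) if 116 <= a + b <= 120]

# With a tie-break on one of the marks, each distinct pair can hold its own place.
places_with_tie_break = len(pairs)

# On the total alone, only distinct totals can be separated.
places_on_total_only = len({a + b for a, b in pairs})

print(places_with_tie_break, places_on_total_only)  # prints: 15 5
```

Under this reading, the same range of totals supports three times as many placements when the individual marks can break ties, which is why an absolute method needs a finer marking step.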

Although it is not obvious, and usually not pointed out when explaining the workings of the ordinal method, the key element of the method which makes it successful in minimizing the effects of judges' bias is that the method is based on the median of the ordinals assigned each skater; i.e., when you read a results sheet the majority ordinal is actually the median ordinal. When you see 5/3 for a skater's majority what this means is that the median ordinal for the skater was third, and five judges placed the skater at or above the median ordinal.

In the ordinal method the median ordinal is the primary determinant for the skaters' places; everything else is tie breakers. If two skaters have the same median ordinal the number of judges placing the skater at or above the median ordinal breaks the tie. If the skaters have the same number of judges at or above the median ordinal then the sum of the ordinals for the judges in the majority breaks the tie. If that number is the same, then the sum of all the ordinals breaks the tie. If that number is the same the skaters are tied.
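The tie-breaking chain just described maps naturally onto a sort key. What follows is a minimal sketch, assuming an odd-sized panel and ordinals already derived from each judge's marks; `ordinal_key` and the sample `sheet` are hypothetical names and data, not part of any official scoring software.

```python
from statistics import median_low

def ordinal_key(ordinals):
    """Sort key for one skater's ordinals; a smaller key means a better place."""
    med = median_low(ordinals)                     # median (majority) ordinal
    majority = [o for o in ordinals if o <= med]   # judges at or above the median place
    return (
        med,               # 1. median ordinal, lower is better
        -len(majority),    # 2. more judges in the majority is better
        sum(majority),     # 3. lower sum of majority ordinals is better
        sum(ordinals),     # 4. lower sum of all ordinals is better
    )

# Hypothetical ordinals from a 9-judge panel for three skaters.
sheet = {
    "A": [1, 1, 2, 1, 3, 2, 1, 2, 1],
    "B": [2, 2, 1, 2, 1, 1, 2, 1, 3],
    "C": [3, 3, 3, 3, 2, 3, 3, 3, 2],
}
placement = sorted(sheet, key=lambda s: ordinal_key(sheet[s]))
print(placement)
```

Skaters whose keys are equal in all four components remain tied, matching the last rule in the chain.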

This use of the median to filter out judges' bias is a well-known tool in numerical analysis. Whenever a measured parameter is subject to random errors and occasional systematic errors, the median mark is frequently a better gauge of the characteristic value of the parameter than is a simple average. For a parameter in which positive random errors are as common as negative random errors, the median value and the average value for the data will be the same in the absence of systematic errors. If, however, systematic errors are present, one large systematic error (a noise spike) can substantially skew the average value, but will have little impact on the median value, thus making it preferable to use the median value. Because the use of median values can be such a powerful tool to filter out systematic errors, several of the methods studied here rely heavily on the use of the median mark.
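A small numerical illustration of the noise-spike argument, using made-up marks for a nine-judge panel (the values are assumptions chosen for the example, not data from the study):

```python
from statistics import mean, median

# Hypothetical panel of nine marks with symmetric random scatter about 5.8.
fair = [5.6, 5.7, 5.7, 5.8, 5.8, 5.8, 5.9, 5.9, 6.0]

# Same panel, except one judge who gave 5.8 now marks a full point low.
spiked = [5.6, 5.7, 5.7, 4.8, 5.8, 5.8, 5.9, 5.9, 6.0]

print("fair:  ", round(mean(fair), 3), median(fair))
print("spiked:", round(mean(spiked), 3), median(spiked))
```

With the symmetric panel the mean and median agree at 5.8. After the single low mark the mean falls to about 5.69, while the median is still 5.8.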

Methods Studied

The following methods have thus far been studied.

Ordinal

This is the current method and serves as a reference against which the other methods are compared.

Simple Mean

This consists of taking the average of the judges' total marks. This method is not of practical use since it has no immunity to the effects of bias. It is, however, the best method to use when marks are only affected by random errors and, thus, a good reference against which to compare the performance of other methods in dealing with random errors.

Clipped Mean

This consists of dropping out the high and low marks and then averaging the remaining marks. The first and second marks can be clipped and averaged separately and then summed, or the total marks can be clipped and averaged. Both approaches have been studied.
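A minimal sketch of the clipped mean and its two variations follows. The function name and the sample marks are hypothetical; exactly one highest and one lowest mark are dropped, as the description states.

```python
def clipped_mean(marks):
    """Average the marks after discarding one highest and one lowest mark."""
    if len(marks) < 3:
        raise ValueError("need at least three marks to clip both ends")
    trimmed = sorted(marks)[1:-1]
    return sum(trimmed) / len(trimmed)

# Hypothetical first and second marks from a 9-judge panel (last judge marks low).
first = [5.7, 5.8, 5.8, 5.9, 5.8, 5.7, 5.9, 5.8, 5.2]
second = [5.8, 5.8, 5.9, 5.9, 5.7, 5.8, 5.9, 5.8, 5.3]

# Variation 1: clip each mark separately, then sum the two clipped means.
per_mark = clipped_mean(first) + clipped_mean(second)

# Variation 2: form each judge's total first, then clip the totals.
on_totals = clipped_mean([a + b for a, b in zip(first, second)])

print(round(per_mark, 3), round(on_totals, 3))
```

The two variations need not agree in general, because the judge with the extreme first mark is not always the judge with the extreme second mark.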

Median Mark

Placement based solely on the median mark. The median mark can be calculated for the first and second marks and then summed or the median of the total marks can be used. Both approaches have been studied.

Gaussian Mean

The judges' marks are sorted into a histogram. The mean mark is calculated assuming the histograms are gaussian distributions (a common noise model for random errors). The gaussian means can be calculated for the first and second marks and then summed, or the gaussian mean for the total marks can be calculated. Both approaches have been studied.

Weighted Mean

The median mark is calculated first. It is then assumed that the farther a judge's mark is from the median the less likely it is to be correct and thus should be given less weight in an average. The average of the judges' marks is calculated weighted by how far they depart from the median mark. Weighting the scores by the reciprocal of the deviation from the median marks, and the square of the reciprocal of the deviation from the median marks were both tried, with marks within 0.1 of the median marks having a weight of 1.0. This process can be applied to the individual marks first which are then added, or to the total marks. Both approaches have been studied.
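One way the weighting might be written is sketched below. The text does not give the exact normalization, so this sketch assumes that a mark at deviation d beyond 0.1 receives weight (0.1/d) for the reciprocal rule and (0.1/d) squared for the squared rule, which keeps the weight continuous at the 0.1 boundary. The function name and parameters are hypothetical.

```python
from statistics import median

def weighted_mean(marks, power=1, floor=0.1):
    """Average of marks, down-weighted by distance from the median mark.

    Assumption: weight is 1.0 within `floor` of the median and
    (floor / deviation) ** power beyond it (power=1 or 2).
    """
    med = median(marks)
    weights = []
    for m in marks:
        dev = abs(m - med)
        weights.append(1.0 if dev <= floor else (floor / dev) ** power)
    return sum(w * m for w, m in zip(weights, marks)) / sum(weights)

# Hypothetical panel: eight judges agree, one marks a full point low.
marks = [5.8] * 8 + [4.8]
print(round(weighted_mean(marks, power=1), 4))
print(round(weighted_mean(marks, power=2), 4))
```

Under this assumption the outlying mark still contributes, but only with about a tenth (or a hundredth) of the weight of the marks near the median.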

Median Range

The median mark is calculated first. It is then assumed that marks that are too far off from the median are wrong and should not be counted at all and marks that are within an acceptable range are valid and should be included in an average; i.e., the average is taken only for those marks within the selected range. This process can be applied to the individual marks first with the results added, or to the total marks. Both approaches were studied.
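A minimal sketch of the median range calculation follows, assuming the permitted range is a half-width on either side of the median and that the panel has an odd number of judges, so the median is itself one of the marks. The function name, the default of 0.4 for total marks, and the small tolerance used to absorb floating-point rounding at the boundary are choices made for this illustration.

```python
from statistics import median

def median_range_mean(marks, permitted=0.4):
    """Average only the marks lying within `permitted` of the median mark."""
    med = median(marks)
    tolerance = 1e-9  # absorbs floating-point rounding at the edge of the range
    kept = [m for m in marks if abs(m - med) <= permitted + tolerance]
    return sum(kept) / len(kept)

# Hypothetical total marks; the last judge is a full point below the rest.
totals = [11.6, 11.7, 11.6, 11.5, 11.6, 11.7, 11.5, 11.6, 10.6]
print(round(median_range_mean(totals), 3))
```

Here the 10.6 is discarded and the remaining eight marks are averaged, so no valid mark is thrown away merely for being the highest or lowest.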

Median Mark with Tie Breakers

This method is strictly analogous to the ordinal method, except it makes use of the median mark. The median mark is calculated first for the total marks. Placement is first based on the median mark (the skater with the highest median total mark wins). If any skaters have the same median mark, the number of judges assigning the median total mark or higher breaks the tie. If skaters have the same number of judges giving them the median total mark or higher, the average of the total marks in the majority breaks the tie. If skaters are still tied, the average of all the judges' total marks breaks the tie. This method is no easier to understand than the ordinal method, but it represents the least radical change from the current method, and by using the marks as assigned without comparing them to other skaters it precludes place swapping during an event.
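The same chain of rules can be sketched as a sort key on total marks. This is an illustration under stated assumptions: an odd-sized panel, totals held as integer tenths (116 means 11.6) so that equal medians compare exactly, and hypothetical names and data throughout.

```python
from statistics import median_low

def median_mark_key(totals):
    """Sort key for one skater's total marks; a larger key means a better place."""
    med = median_low(totals)                      # median total mark
    majority = [t for t in totals if t >= med]    # judges at the median mark or higher
    return (
        med,                              # 1. median total mark
        len(majority),                    # 2. number of judges at or above the median
        sum(majority) / len(majority),    # 3. average of the marks in the majority
        sum(totals) / len(totals),        # 4. average of all total marks
    )

# Hypothetical totals in tenths of a point from a 9-judge panel.
panel = {
    "A": [116, 117, 116, 115, 116, 118, 116, 117, 114],
    "B": [116, 115, 116, 116, 117, 115, 114, 116, 115],
    "C": [113, 114, 114, 115, 113, 114, 114, 112, 114],
}
placement = sorted(panel, key=lambda s: median_mark_key(panel[s]), reverse=True)
print(placement)
```

In this sample, A and B share a median of 11.6, and A places first because seven judges gave 11.6 or higher against five for B. Each key depends only on that skater's own marks, so a later skater cannot change the relative order of skaters already scored.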

Method of Analysis

Each of the scoring methods was studied by calculating the results of a large number of synthetic competitions for a variety of scenarios. The following parameters are adjustable for each scenario in the software used to do the calculations:

1. The number of skaters in the competition is selectable to a maximum of 30. Most test cases were run for either 6 or 12 skaters. Since in each method the maximum error in a skater's place was never greater than five places, there was no need to consider larger groups.

2. The number of judges on the panel is selectable as any odd number from 3 through 15. The results here describe the performance of the different methods for a standard panel of 9 judges. A few other cases were run to see how things vary for different size panels.

3. The spread in the judges' marks due to random errors is adjustable in a range of 0.1 through 1.0. Most cases were run for spreads of ±0.1 or ±0.2, which is typical for the judges' marks in senior level competitions. The random errors could be selected to be uniformly distributed or to have a gaussian distribution. A few cases of very large spreads were run to see how the methods perform when a panel's marks are "all over the place", as is frequently found in lower level events.

4. The precision with which the marks are assigned is selectable as either 0.1 or 0.05. Most cases were run for 0.1. In general, the need to use a precision of 0.05 is driven by the need to have enough marks to separate the skaters on an absolute scale, and not the mathematical properties of the methods.

5. The permitted range of valid marks for the Median Range method is selectable from 0.1 through 0.5. For most examples a spread of 0.2 for individual marks and 0.4 for total marks was used.

To run a scenario, marks were first assigned to the skaters on an absolute scale to define the "truth". These are the marks the skaters would receive if all judges were equally skilled, used identical judgment in exact accordance with the rules, and their marks were completely error free and bias free. The various adjustable parameters were then assigned values and marks created for some number of synthetic competitions. In general, 1000 synthetic competitions were run for each scenario. For each synthetic competition the judges' marks were assigned by generating a random error which was applied to the truth marks. Systematic errors were also applied to the marks of individual judges to test the impact of systematic biases. From the results of the synthetic competitions several statistics were then examined to gauge the performance of the methods. These statistics included the fraction of the cases in which skaters ended up in their correct place on a place by place basis, the fraction of the cases in which all skaters in the group ended up in the correct place (a "perfect sheet"), and the fraction of the cases that produced ties.
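The procedure can be sketched in a few lines. This is not the software used for the study; it is a simplified illustration that works on total marks only, holds marks in integer tenths, draws uniformly distributed errors, and reports just the perfect-sheet and tie fractions. The function `run_scenario` and all of its parameters are hypothetical.

```python
import random
from statistics import mean, median

def run_scenario(method, truth, judges=9, spread=2.0, bias=0, biased_judges=0,
                 trials=1000, seed=1):
    """Return (perfect-sheet fraction, tie fraction) over synthetic competitions.

    truth: true total marks in tenths of a point, best skater first.
    spread: half-width of the uniform random error, in tenths.
    bias: amount (tenths) by which each biased judge lowers the first skater
          and raises the second skater.
    """
    rng = random.Random(seed)
    perfect = ties = 0
    for _ in range(trials):
        scores = []
        for place, true_mark in enumerate(truth):
            marks = []
            for judge in range(judges):
                mark = true_mark + rng.uniform(-spread, spread)
                if judge < biased_judges:
                    if place == 0:
                        mark -= bias
                    elif place == 1:
                        mark += bias
                marks.append(round(mark))   # judges mark to the nearest tenth
            scores.append(method(marks))
        if len(set(scores)) < len(scores):
            ties += 1                       # at least two skaters share a score
        elif all(a > b for a, b in zip(scores, scores[1:])):
            perfect += 1                    # every skater in the correct place
    return perfect / trials, ties / trials

# Six skaters separated by 0.1 in total marks, judges' spread of +/-0.2.
truth = [116, 115, 114, 113, 112, 111]
for name, method in (("simple mean", mean), ("median mark", median)):
    print(name, run_scenario(method, truth))
```

Any scoring function that maps one skater's list of marks to a single score can be passed as `method`, so the same harness can compare several methods on identical random draws by reusing the seed.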

The following scenarios were studied.

1. Random errors only, with skaters separated by 0.1 or 0.2 points in total marks, and the spread in the judges' marks also either ±0.1 or ±0.2. The purpose of these tests was to see how well the different methods give the correct result when the judges are marking a competition on an absolute scale and only make random errors of judgment. [The ranges of marks used here are typical of what one finds in the marks for senior level events at Nationals and Worlds. Further, the spread in the ordinals calculated here from the spread in the marks is also consistent with what is found for the ordinals at Nationals and Worlds. One concludes, then, that judges at Nationals and Worlds mark on a roughly absolute (or at least consistent) scale to a level of about 0.2 points (about 3.3% absolute).]

2. The same parameters for random errors as in item 1, with the addition of small systematic biases of 0.1 in each mark nudging the first skater down and the second skater up. Tests were run for 1 through 3 judges biasing their marks in this way. The purpose of these tests was to see how easy it is for a small group of judges to change the outcome for a given place (first place in particular) by making small adjustments to their marks.

3. The same parameters for random errors as in item 1, with the addition of large systematic biases of 0.5 or greater. Tests were run for 1 through 3 judges biasing their marks in this way. The purpose of these tests was to see how the methods perform when a small group of judges bias their marks by a significant amount.

4. Random errors only, with skaters separated by 0.1 or 0.2 points in total marks and the spread in the judges' marks ±0.4, to see the effect of large random errors on the results determined by the methods.

5. Various random errors combined with various permitted ranges for the median range method to begin to estimate the optimum value for permitted range that works best for that method.

Results

The following describes the performance of each method and the limiting factors that drive that performance. The relative merits of each method are then compared in tabular form which provides a quick-look qualitative comparison of the methods. Note that for all the methods studied which have two variations, the variation in which a method is applied to the total marks, as opposed to the individual marks first and then summing them, was found superior and is the variation to which the comments below specifically apply.

Ordinal Method

The ordinal method gives reasonable, though hardly outstanding, performance in dealing with random errors, small biases, and large biases. For each of these general situations it is neither exceptionally good nor exceptionally bad, merely adequate. For skaters separated by 0.1 in total marks and a small spread in the judges' marks of ±0.1 the skaters are correctly placed 80-90% of the time and the method produces only a small number of ties. When the range in the judges' marks is ±0.2 the skaters are correctly placed 65-80% of the time. For larger spreads in the marks the ordinal method begins to degrade rapidly. For a spread of ±0.4 in the marks the method gives the correct answer only 40-70% of the time, and for panels of fewer than 9 judges it is far worse.

As a sidelight to this study, it is found that for lower level events where there is a large spread in the judges' marks the practice of using fewer than 9 judges strongly degrades the quality of the results. In such competitions, the practice of using panels of 5 judges should probably be abandoned.

When small biases are added to the mix, the ordinal method does fairly well if only one judge biases his/her marks and the range in the judges' marks is only ±0.1, with first place still correctly determined 81% of the time. But if 2 or 3 judges bias their marks by 0.1, or if the natural spread in the judges' marks is ±0.2 and 1 or more judges bias their marks by 0.1, then first place is incorrectly determined as much as 78% of the time in the worst case. This poor performance is common to all the methods studied for these situations. Because the skaters tend to be separated by 0.1-0.2 points in total marks and the natural spread in the judges' marks is also 0.1-0.2, it is impossible to filter out small biases based on a single set of marks since a small bias of 0.1 cannot be distinguished from the natural spread in the marks.

When large biases are applied the ordinal method deteriorates as the number of judges with errors increases. With a large bias from one judge the ordinal method gives the correct place for the skater affected about 60% of the time; for two judges about 40%, and for three judges about 20%. Thus, the method is moderately successful at accommodating one large error, but not very good when there are two or more.

Simple Mean

The simple mean is the best method for dealing with random errors only. For a small spread in the marks (±0.1) the simple mean gives the correct answer more than 96% of the time, and for a moderate spread (±0.2) 76-89% of the time. For a large spread in the marks (±0.4) it still performs the best, giving the correct answer about 20% more often than does the ordinal method.

With small biases the simple mean does slightly better than the ordinal method for most of the cases studied and slightly worse for the remainder. Overall, for small biases the results for the simple mean and the ordinal method can be considered nearly equivalent. For large biases, however, it is another story.

With a large bias from one judge (0.5 each mark) the simple mean gives the correct place for the skater affected only 41% of the time, and for more than one judge virtually never. Because the simple mean has no immunity to the effects of large errors it is not an appropriate method for use in actual competitions; it does, however, set the standard for how a method should perform in the presence of random errors alone.

Clipped Mean

The idea behind the clipped mean is that if the high and low mark are thrown away it will dissuade judges from consciously biasing their marks, and will filter out large errors in the marks by throwing away the extreme marks. The drawbacks to the method are that there is no guarantee that the high and low marks are erroneous, in most cases it throws away marks that are not in error, and it does not protect against more than one judge biasing their marks in the same direction.

In terms of random errors the clipped mean performs identically to the simple mean, but with two fewer judges. In other words, the clipped mean with 9 judges is the same as the simple mean with 7 judges so far as random errors are concerned.

For small biases the clipped mean does worse than the simple mean. Like all the methods it does not filter out the effects of small biases, and with 2 fewer judges going into the average the effects of random errors do not cancel out as well.

With large biases, the clipped mean performs better than both the ordinal method and the simple mean for 1 judge, but for 2 or more judges it is far worse than the ordinal method and almost, though not quite, as bad as the simple mean. In some of the cases studied, when two judges gang up on a skater, that skater can be incorrectly placed as much as 89% of the time.

Median Mark

The median mark method performs fairly well for most of the cases studied but is limited by the fact that it produces an excessive number of ties. For most scenarios it produces ties for the majority of the synthetic competitions. This results from the fact that the marks are too coarse and the number of judges too small to get sufficiently precise values for the median marks. To obtain more precise median marks would require marking to a precision of 0.025 or finer and the addition of more judges than would be practical to employ in a competition. Nevertheless, variations on the use of the median mark can overcome this weakness, and are discussed below.
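The tendency of a bare median to produce ties can be seen in a small simulation. The sketch below is an illustration under assumed conditions (six skaters separated by 0.1 in total marks, nine judges, uniform errors of ±0.2, marks rounded to 0.1 and held as integer tenths); it is not the study's own code, and `tie_fraction` is a hypothetical name.

```python
import random
from statistics import mean, median

def tie_fraction(score, skaters=6, judges=9, spread=2.0, trials=2000, seed=7):
    """Fraction of synthetic competitions in which at least two skaters tie."""
    rng = random.Random(seed)
    tied = 0
    for _ in range(trials):
        scores = []
        for s in range(skaters):
            truth = 110 + s   # true totals 11.0, 11.1, ... in tenths of a point
            marks = [round(truth + rng.uniform(-spread, spread))
                     for _ in range(judges)]
            scores.append(score(marks))
        if len(set(scores)) < skaters:
            tied += 1
    return tied / trials

print("median mark:", tie_fraction(median))
print("simple mean:", tie_fraction(mean))
```

Because the median of nine marks rounded to 0.1 can only land on a multiple of 0.1, skaters whose true marks differ by 0.1 collide often, whereas the mean of nine marks can take values in ninths of a tenth and collides far less frequently.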

Gaussian Mean

This method works fairly well for random errors only, intermediate between the ordinal method and the simple mean. In the case of biases, however, it does not do as well. For small biases it does worse than the other methods, and for large biases it does not filter out the effects of marks that are way out of line as effectively. This method gives lower weight to marks that are greatly in error but does not totally ignore them, which results in the unsatisfactory performance found. It is also adversely affected if the histogram of marks does not actually correspond to a gaussian distribution.

Weighted Mean

For random errors the weighted mean does better than the ordinal method, and nearly as well as the simple mean. For small biases it also does well when only one judge's marks are biased. If the marks of 2 or more judges have small biases, however, the results are worse than for both the ordinal method and the simple mean. This method does fairly well dealing with large biases, performing better than the simple mean, the clipped mean, and the ordinal method. If the performance of the method were a little better in the case of small biases this might be a viable method to use, but since small biases are probably a more common occurrence in judging than large biases, its deficiency in dealing with small biases is a serious limitation.

Median Range

This method is second best of all the methods studied when it comes to dealing with random errors, performing as well or nearly as well as the simple mean. For small biases it is comparable to the other methods, doing slightly better in some cases, slightly worse in others. It is the best of all the methods in dealing with large biases, placing a skater correctly more than 80% of the time even when three judges bias their marks substantially. The only weakness to this method is that the range selected over which the marks are averaged must be carefully matched to the expected consistency of the judges' marks. For these tests several ranges were compared. It was found that averaging total marks within ±0.3 or ±0.4 worked well. A narrower range does not work as well, as it begins to filter out valid marks within the naturally expected spread, which reduces the statistical accuracy of the average, while a wider spread of ±0.5 gives the judges too much latitude to add moderate biases to their marks. A related concern is that for lower levels the natural range of the judges' marks is frequently greater than ±0.2.

As the natural range of the judges' marks increases, the number of judges' marks falling outside the acceptable range will increase, and the number going into the average decreases, reducing the statistical accuracy of the results. To test the performance of this method for events with a large range in the judges' marks, cases were run where random errors of ±0.4 were applied to events where the skaters were separated by 0.1 in total marks. The results, while not pretty, were comparable to the other methods studied and slightly better than the ordinal method. As in the case of the ordinal method, performance also degrades with the use of panels with fewer than 9 judges.

Based on the overall performance of this method, this approach to scoring appears to be a viable alternative to the ordinal method.

Median Mark with Tie Breakers

Because this method is so closely analogous to the ordinal method, it is not surprising to find that this method performs nearly identically to the ordinal method. In terms of random errors it actually does about 5-10% better than the ordinal method. For small biases it also does slightly better than the ordinal method in most cases; however, in a few cases it is slightly worse. For large biases it performs 25-50% better than the ordinal method.

This method represents a small but significant improvement over the current ordinal method. It does not allow place switching, represents the smallest conceptual change from the current system, and overall performs better than the current system. In principle, the main tie breaking rules could be represented numerically, which would allow the posting of a single composite score that the public could easily understand. This method appears to be a viable alternative to the current scoring system.

Relative Performance

The following table gives a qualitative comparison of how the different methods studied perform for the cases tested. In the columns for small biases and large biases two grades are given. The first is for biases with a small range of random errors and the second is for biases with a moderate range of random errors (only the second case was run for the gaussian mean method). The grade given is based primarily on the frequency with which the methods give the correct place for the skaters under each condition.

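| Method | Random Errors: Small Range | Random Errors: Moderate Range | Random Errors: Large Range | Small Biases: 1 Judge | Small Biases: 2 Judges | Small Biases: 3 Judges | Large Biases: 1 Judge | Large Biases: 2 Judges | Large Biases: 3 Judges |
|---|---|---|---|---|---|---|---|---|---|
| Ordinal | A- | B | D+ | B / C | C+ / D | D / F | B+ / C | C+ / D | D / F |
| Simple Mean | A+ | B+ | C- | A / B | C / C- | F / F | D+ / D | F / F | F / F |
| Clipped Mean | A+ | B+ | C- | A / B | C- / D+ | F / F | A / B- | F / F | F / F |
| Median Mark | B | B- | C- | B+ / C+ | C / D | D / F | A- / B- | B+ / C | C+ / D+ |
| Gaussian Mean | A | B | D+ | - / C | - / F | - / F | - / C+ | - / D | - / F |
| Weighted Mean | A | B | D+ | A- / C | C / D- | D- / F | A / B- | A / B- | A- / C- |
| Median Range | A+ | B+ | D+ | A / B | C / C- | F / F | A+ / B+ | A+ / B+ | A+ / B+ |
| Median w. Tie Breakers | A | B | C- | A- / C+ | C / D | F / F | A / C+ | B / C | C / D |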

Conclusions

Based on the test cases tabulated above, the current ordinal method of scoring competitions appears to be superior to the simple mean, the clipped mean, the median mark, and the gaussian mean methods. Although the median mark method has higher scores overall than the ordinal method, the large number of ties generated by the median mark method renders it useless in its basic form.

The median mark with tie breakers performs roughly 20% better than the ordinal method and is conceptually closest to that method. The weighted mean performs better, in general, than all but the median range method, but its somewhat weaker performance in dealing with small biases makes it less attractive. In terms of overall performance and ease of understanding for the public, the median range method seems the clear winner of the methods tested. It offers the easily understandable deterrence of throwing away all marks that are too out of line, but is smart enough not to throw away marks unnecessarily. In this respect it can be viewed as a more sophisticated version of the clipped mean method.

A limited number of test cases were run with panels of other than 9 judges. For the cases run, no significant benefit was found in increasing panels in size, up to a total of 15 judges. For smaller panels results degrade significantly, especially with moderate to large ranges in the judges' marks. This effect is least significant for upper level events where the consistency of the marks is fairly high, but for lower level events where the judges' marks span a greater range it is a great disservice to the skaters to use panels of fewer than 7 judges.