
Numbers Management



The judge's job is to rank and rate. The ratings are supposed to reflect the extent to which the quality of one performance can be discriminated from the quality of another performance. It is sheer stupidity to use the extreme end of the scale except in extremely rare instances. If you have a 10-point scale, then 5 should be the most common score and the vast majority should be between 2 and 8. Only about 5% of the scores should be above 8.
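Taken at face value, the "5 most common, only ~5% above 8" claim pins down a bell curve. A quick sketch of what that distribution would look like (Python; the normal shape and the back-solved standard deviation are my assumptions, not anything printed on the sheets):

```python
from statistics import NormalDist

# Hypothetical model of the proposed 10-point scale: scores centred on 5,
# with the standard deviation back-solved so only ~5% land above 8.
sd = 3 / NormalDist().inv_cdf(0.95)        # ~1.82 (assumption)
scale = NormalDist(mu=5, sigma=sd)

above_8 = 1 - scale.cdf(8)                 # fraction of scores above 8
within_2_8 = scale.cdf(8) - scale.cdf(2)   # the "vast majority" between 2 and 8

print(f"sd = {sd:.2f}")                    # sd = 1.82
print(f"P(score > 8)   = {above_8:.1%}")   # 5.0%
print(f"P(2 <= s <= 8) = {within_2_8:.1%}")  # 90.0%
```

So under this reading, "vast majority between 2 and 8" works out to about 90% of scores, which is consistent with the 5%-above-8 figure by symmetry.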

Why do the scores have to fit a standard normal distribution?! Given the scale on the back of the sheets, it just doesn't make any sense to do it that way.

The five boxes are distributed across the 100-point score range as follows:

Box 1: 1-29 (Rarely/Poor)

Box 2: 30-49 (Infrequently/Fair)

Box 3: 50-69 (Sometimes/Good)

Box 4: 70-89 (Usually/Excellent)

Box 5: 90-100 (Always/Superior)

The back of the sheet is very clear (especially after they were changed for the 2007 season). Each criterion is listed and the judges must determine whether, for instance:

The individuals demonstrate MUSICIANSHIP within ENVIRONMENTAL and PHYSICAL CHALLENGES rarely, infrequently, sometimes, usually, or always. These descriptors apply to very specific ranges in score for a given caption.

That being said, judges rarely, even in the beginning of the season, dip into box 2.

When you routinely give out 9.9's and 10's, you are not judging, you are cheerleading. You are letting your emotions overwhelm your intellect. If you can honestly say that a performance is so close to perfection, then why not really reward it by shifting the rest of the scores down and making the gaps bigger?

Can you tell me where on the sheet it says anything about perfection? Even in box five there is room for different levels of "always" and "superior." It's about achievement, wherein very high scores reflect a performance that is always at an extremely superior level. The word perfect just isn't on the sheet. More about the gaps can be found below...

This was never a problem when the top score at Finals was around 90. When the scores shot up to the 98 range, things got ridiculous. Especially when you have a 0.2 difference in total score between first and second. There is just no way a difference that small can be meaningful.

Actually, a gap of 0.2 has a VERY specific meaning. In fact, in a section of the judges' training material, this topic is discussed specifically (I think the section is even titled 'numbers management'). Gaps of 0.1, 0.2, and 0.3 have specific meanings and, contrary to days of old, a 0.3 spread in a single caption is a significant judging statement these days.


Actually, a gap of 0.2 has a VERY specific meaning. In fact, in a section of the judges' training material, this topic is discussed specifically (I think the section is even titled 'numbers management'). Gaps of 0.1, 0.2, and 0.3 have specific meanings and, contrary to days of old, a 0.3 spread in a single caption is a significant judging statement these days.

True. It sends a specific message to the corps about who they're in competition with, who they're not even in the same league as, and how likely it is that they'll "catch" the next corps up in that caption.


For example, if you let the drum judge assign a 38 instead of a 19, that drum judge has given his score twice as much weight, and, all else being equal, his caption would then count twice as much as brass, et al.
While that's certainly a possibility, I doubt it would happen and certainly expect that there would be verbiage in the rules to prohibit that kind of behavior just in case. We don't see judges handing out 19's the first week of the season now, so I wouldn't expect them to get carried away at the end of the year either.
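The doubling concern in the quote above is simple arithmetic: a judge who uses a doubled range doubles the influence of every spread in his caption. A toy sketch (all caption scores are invented, out of 20 each):

```python
# Two corps, identical except a 0.1 percussion edge for corps A.
a = {"brass": 18.5, "percussion": 19.0, "visual": 18.7}
b = {"brass": 18.5, "percussion": 18.9, "visual": 18.7}

gap = round(sum(a.values()) - sum(b.values()), 1)
print(gap)  # 0.1 -- percussion decides the total by one tenth

# If the percussion judge scores on a doubled range (38 instead of 19),
# the same relative edge moves the total by two tenths instead:
a2 = {**a, "percussion": a["percussion"] * 2}
b2 = {**b, "percussion": b["percussion"] * 2}
print(round(sum(a2.values()) - sum(b2.values()), 1))  # 0.2
```

The corps and numbers are hypothetical; the point is only that inflating one caption's range inflates that caption's effective weight in the total.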

a 0.3 spread in a single caption is a significant judging statement these days.

To defend this statement you need to account for all sources of variability. Within judge variability - how consistent is a particular judge with him/herself. You might think that people would be perfectly consistent, but many studies show that if you ask someone to rate the same event repeatedly, they will give different ratings at different times. It is perfectly reasonable that a judge might view the same performance twice, but see different things each time and therefore give different scores. It's a sampling issue.

Then you have between-judge variability. How consistent are different judges in rating the same performance?

You also have serial order effects that artificially reduce variability. If a judge sees the same corps twice in a short time frame, and if he/she remembers the first score, the second score has to be related to the first. If the second performance is better, then the second number has to be higher (assuming the judge is honest). Thus, the two scores are not independent.

No measurements are available for either source of variability, but my guess is that if you remove serial order effects, they probably add up to about 0.5.
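The "add up to about 0.5" guess reads naturally as independent noise sources combining in quadrature. A minimal sketch, where both standard deviations are assumptions chosen for illustration, not measurements:

```python
import math

# Assumed standard deviations for the two noise sources (illustrative only):
within_judge = 0.3   # same judge re-rating the same performance
between_judge = 0.4  # different judges rating the same performance

# Independent sources of variability add in quadrature:
combined = math.sqrt(within_judge**2 + between_judge**2)
print(round(combined, 2))  # 0.5
```

Under those assumed inputs, a 0.3 caption spread would sit well inside the noise, which is the thrust of the argument above.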


What if we broke Quarters, Semis, and Finals into equal blocks of corps. Have different panels for each block and don't let the different panels share any information or results between themselves. They would only know the total scores that have been announced up to that point (on Thursday and Friday), but they won't know the numbers handed out in their particular captions. I understand it would be unreasonable to do this kind of thing at every show, and perhaps tough to do at every regional, but surely they could arrange it for championships. I suspect the numbers might be interesting...


To take your idea to the extreme, why not just do away with numbers altogether and simply assign placements?

Because we have several different captions being judged at the same time. If corps A beats corps B by a full point in drums, while losing brass by a tenth, corps A should win. Without the point spreads, the system loses that decision-making capability.


What you could do is uncap Box 5 and give arbitrary scores, and then at the end normalize the scores to be out of 20. Then maybe 18.0 won't necessarily be the bottom of Box 5, but maybe 17.6, if you had to initially give someone a theoretical 20.4. The spreads are preserved, you'd just need some extra words on the sheet saying "You got a 17.6, but we put you in Box 5."
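The shift-down normalization described above can be sketched in a few lines. The raw scores and the 20-point cap are placeholders matching the example in the post:

```python
def normalize(raw_scores, cap=20.0):
    """Shift uncapped raw caption scores down so the top score lands on the
    cap, preserving every spread. (Sketch of the proposal above; shifting
    keeps absolute gaps intact, which is the stated goal.)"""
    shift = max(raw_scores) - cap          # how far the top score overshoots
    return [round(s - shift, 1) for s in raw_scores]

raw = [20.4, 19.8, 18.0]                  # hypothetical uncapped scores
print(normalize(raw))  # [20.0, 19.4, 17.6] -- all spreads preserved
```

Note the 18.0 raw score maps to 17.6, exactly the "you got a 17.6, but we put you in Box 5" case the post describes: the box thresholds shift down along with the scores.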

Either that or just lower all the thresholds to begin with. Make a total of 16 or 17 Box 5. I mean, no one really uses Box 1, am I right?


Here's something I thought about while attending Championships this year... Why do we necessarily have to score groups "__ out of 100"? In a year such as this when everyone in the top 7 was so strong, it seemed as if the judges were forced to make a decision early in the evening as to where they were going to place a corps in a given caption. Essentially they had to decide if Corps X was going to be the gold standard for the night or if they were going to leave room for the 4, 5, or 6 corps to follow.

What if the scoring cap were to be removed, so that there is no limit to how good you can be? Leave the box criteria in place but take the cap off of box 5. That way if corps number 6 comes out with a blazing brass line, a judge can confidently give them a 19.8 or 19.9 without fear of having no room for the rest. If the rest of the corps come out on fire as well, maybe your top brass line ends up scoring 20.5. Who cares? We all say it's not about the score, but the spread. But when you get into championships week and the top corps are maxing out their shows, the current scoring system only allows for accurate ranking. Scores AND spreads go out the window in order to keep room at the top.

Does this make sense? Would this be a good idea?

Your thought about the problem of judges "leaving room" is a valid one. However, I don't think that we have to go above the 100th percentile... there actually is a simpler solution: adding a zero. So each judge is scoring out of 2000 instead of 200. All the formulas remain intact, and you still have a final score out of 100. However, now instead of the judge having 20 numbers from the 90th percentile to perfection, the judge has 200. Plenty for any show. The judge can now give out a very high percentile number at any time, with plenty of numbers left in the same range. When you figure that from Semis on each judge only uses 40-45 numbers, now they would have over 400 in the same percentile range.

GB
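The "adding a zero" proposal amounts to judging in hundredths instead of tenths. A small sketch of why the extra digit separates tightly packed scores (all the scores are invented):

```python
# Three corps scored with an extra decimal place (0.01 steps):
fine = [19.84, 19.82, 19.79]

# Rounded back to the old tenths-only scale:
coarse = [round(s, 1) for s in fine]

print(coarse)              # [19.8, 19.8, 19.8] -- the old scale sees a tie
print(len(set(fine)))      # 3 -- the finer scale keeps all three distinct
```

The same spread that collapses to a three-way tie in tenths stays fully ordered in hundredths, which is the "plenty of numbers left at the top" effect GB describes.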


To defend this statement you need to account for all sources of variability. Within judge variability - how consistent is a particular judge with him/herself. You might think that people would be perfectly consistent, but many studies show that if you ask someone to rate the same event repeatedly, they will give different ratings at different times. It is perfectly reasonable that a judge might view the same performance twice, but see different things each time and therefore give different scores. It's a sampling issue.

Then you have between-judge variability. How consistent are different judges in rating the same performance?

You also have serial order effects that artificially reduce variability. If a judge sees the same corps twice in a short time frame, and if he/she remembers the first score, the second score has to be related to the first. If the second performance is better, then the second number has to be higher (assuming the judge is honest). Thus, the two scores are not independent.

No measurements are available for either source of variability, but my guess is that if you remove serial order effects, they probably add up to about 0.5.

I don't understand what this has to do with the fact that 0.3 has a specific meaning in the judging community. I'm loosely paraphrasing here, but the judges' training material says essentially that a three-tenths spread means that the lower corps is 'not competitive' with the higher corps in that caption... on that night. Three tenths is saying you're in a different league... it's a pretty significant statement in the judging community.

Now, if you're saying that a spread that small is meaningless over multiple nights from multiple (or even the same) judges, that's fine. Except that when a judge puts a corps three tenths down, it's a statement that the rest of the judging community notices, and that spread can be difficult to overcome late in the season.

The judges are TRAINED to use the point spreads this way. So regardless of variability, on a single given night, one, two, and three tenth spreads have a clear meaning.

