Calculating inter-rater reliability when raters don't fully overlap and the number of raters per candidate varies?

by EconoQ, last updated January 11, 2019 12:19 PM

I want to calculate the degree to which the gymnastics judges agree on balance beam scores, i.e., "inter-rater reliability". However, not all judges judge the same candidates, and the number of judges per candidate also varies. There are around 30 judges making roughly 1500 observations.

The data looks like this:

[image of the data, not available]

Can you please tell me how to do this statistically, perhaps using Cronbach's alpha?

Stata set-up advice would help as well.

Answers 1

Here's an example using the kappa-statistic measure of interrater agreement. Before proceeding, we need to reshape the data so that each row is a gymnast and each score variable corresponds to a single judge.

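clear
* Long format: one row per (gymnast, judge) score
input byte Gymnast str9 Judge double Score
1 Smith 5.5
1 Bartlet 6
1 Baily 8
2 Smith 10
2 Patterson 9.5
3 Baily 8
3 Patterson 7
3 Smith 7.5
4 Bartlet 7.5
end
* Rename so the reshaped variables become Score_Smith, Score_Baily, ...
rename Score Score_
* Wide format: one row per gymnast, one Score_ variable per judge
reshape wide Score_, i(Gymnast) j(Judge, string)
* Kappa across the per-judge score variables
kap Score_*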

The combined kappa is -0.1912, which would be considered poor. The Stata manual lists the following rules of thumb (attributed to Landis and Koch, 1977) for summarizing agreement:

below 0.00     Poor
0.00 – 0.20    Slight
0.21 – 0.40    Fair
0.41 – 0.60    Moderate
0.61 – 0.80    Substantial
0.81 – 1.00    Almost perfect
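
The scale above can be applied mechanically to any kappa value. The following is only a minimal sketch: the local macro names `k` and `label` are made up for illustration, and the kappa value is typed in by hand (it is the combined kappa reported above) rather than read from stored results. Run it as one do-file, since local macros do not survive between separately executed selections.

```stata
* Hypothetical helper: map a kappa value onto the agreement scale above.
* Assumption: the value is entered manually, not taken from kap's stored results.
local k = -0.1912

* Values that fall between two printed bands (e.g. 0.205) go to the higher band.
if `k' < 0 {
    local label "Poor"
}
else if `k' <= 0.20 {
    local label "Slight"
}
else if `k' <= 0.40 {
    local label "Fair"
}
else if `k' <= 0.60 {
    local label "Moderate"
}
else if `k' <= 0.80 {
    local label "Substantial"
}
else {
    local label "Almost perfect"
}

display "kappa = " %7.4f `k' " -> `label'"
```

With the value used here, the first branch is taken, so the label is "Poor", matching the reading given above.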
Dimitriy V. Masterov
November 10, 2015 19:59 PM
