I want to calculate the degree to which the gymnastics judges agree on balance beam scores, i.e., "inter-rater reliability". However, not all judges judge the same candidates, and the number of judges per candidate also varies. There are around 30 judges making roughly 1500 observations.
The data looks like this:
Can you please tell me how to do this statistically, perhaps using Cronbach's alpha?
STATA set-up advice would help, as well.
Here's an example using the kappa-statistic measure of inter-rater agreement. Before proceeding, we need to reshape the data so that each row is a gymnast and each score variable corresponds to a single judge.
    clear
    input byte Gymnast str9 Judge double Score
    1 Smith 5.5
    1 Bartlet 6
    1 Baily 8
    2 Smith 10
    2 Patterson 9.5
    3 Baily 8
    3 Patterson 7
    3 Smith 7.5
    4 Bartlet 7.5
    end
    rename Score Score_
    reshape wide Score_, i(Gymnast) j(Judge, string)
    kap Score_*
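If it helps to see the reshape step outside of Stata, here is a pandas sketch of the same long-to-wide pivot on the toy data above (the `Score_` prefix matches the renamed Stata variables; judges who did not score a gymnast come out as NaN):

```python
import pandas as pd

# Long format: one row per (gymnast, judge) observation,
# mirroring the Stata `input` block above.
long = pd.DataFrame({
    "Gymnast": [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "Judge":   ["Smith", "Bartlet", "Baily", "Smith", "Patterson",
                "Baily", "Patterson", "Smith", "Bartlet"],
    "Score":   [5.5, 6, 8, 10, 9.5, 8, 7, 7.5, 7.5],
})

# Wide format: one row per gymnast, one Score_<Judge> column per judge;
# missing gymnast-judge pairs become NaN.
wide = (long.pivot(index="Gymnast", columns="Judge", values="Score")
            .add_prefix("Score_"))
print(wide)
```

This is only an illustration of the data layout that `kap` expects, not a replacement for the Stata commands.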
The combined kappa is -0.1912, which would be considered poor. Stata's documentation recommends the following rules of thumb for summarizing agreement:
    below 0.00    Poor
    0.00 – 0.20   Slight
    0.21 – 0.40   Fair
    0.41 – 0.60   Moderate
    0.61 – 0.80   Substantial
    0.81 – 1.00   Almost perfect
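If you want to apply those benchmarks programmatically, they can be expressed as a small helper; this is just a sketch, and `interpret_kappa` is a hypothetical name, not a Stata or library function:

```python
def interpret_kappa(kappa):
    """Map a kappa value to the agreement label from the table above."""
    if kappa < 0.00:
        return "Poor"
    # Each entry is (upper bound of the interval, label).
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial"),
                         (1.00, "Almost perfect")]:
        if kappa <= upper:
            return label
    return "Almost perfect"

print(interpret_kappa(-0.1912))  # the combined kappa above falls in "Poor"
```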