Crowdsourcing expertise: Using Amazon's Mechanical Turk to develop scoring keys for situational judgment tests
It is common practice to rely on a convenience sample of subject matter experts (SMEs) when developing a scoring key for situational judgment tests (SJTs). However, the defining characteristics of what constitutes an SME are often ambiguous and inconsistent across studies. Other research fields have adopted crowdsourcing methods to replace or reproduce judgments thought to require subject matter expertise. We therefore conducted the current study to compare crowdsourced scoring keys to SME-based scoring keys for three SJTs in different domains, each varying in job-relatedness. Our results indicate that scoring keys derived from crowdsourced samples are likely to converge with keys based on SME judgment, regardless of test content (correlations ranging from r = .88 to .94). We observed the weakest agreement between individual MTurker and SME ratings for the more job-specific Medical SJT (classification consistency = 61%), but the aggregate scoring keys remained highly correlated. We observed stronger agreement in response option rankings for the Military and Communication SJTs (80% and 85%, respectively), which were both designed to require less procedural knowledge. Although general mental ability and conscientiousness were each related to greater expert similarity among MTurkers, the average crowd rating outperformed nearly all individual MTurk raters. Based on an analysis of randomly drawn bootstrapped samples of MTurker ratings in each of the three samples, we found that as few as 30 to 40 raters may provide adequate estimates of SME judgments for most SJT items. We hope that these findings encourage others to consider crowdsourcing methods as an alternative to SME samples.
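The bootstrap analysis described above (resampling subsets of MTurker ratings and checking how well the aggregated key tracks the SME key) can be sketched as follows. This is a minimal illustration in Python, not the authors' procedure: the study's data are not reproduced here, so the rating matrix is simulated, and all names (bootstrap_key_agreement, mturk_ratings, sme_key) are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_key_agreement(mturk_ratings, sme_key, n_raters, n_boot=1000):
        """Correlate bootstrapped crowd keys with an SME-based key.

        mturk_ratings : (n_total_raters, n_options) effectiveness ratings
        sme_key       : (n_options,) SME-based scoring key
        n_raters      : size of each bootstrapped rater subsample
        """
        n_total = mturk_ratings.shape[0]
        cors = np.empty(n_boot)
        for b in range(n_boot):
            # Resample raters with replacement, then aggregate into a crowd key
            idx = rng.choice(n_total, size=n_raters, replace=True)
            crowd_key = mturk_ratings[idx].mean(axis=0)
            cors[b] = np.corrcoef(crowd_key, sme_key)[0, 1]
        return cors

    # Simulated stand-in data: 200 hypothetical MTurkers rating 40 response
    # options, where each rating is the SME key plus individual noise.
    sme_key = rng.normal(size=40)
    mturk_ratings = sme_key + rng.normal(scale=1.0, size=(200, 40))

    for k in (10, 20, 30, 40, 60):
        cors = bootstrap_key_agreement(mturk_ratings, sme_key, n_raters=k, n_boot=500)
        print(f"n_raters={k:3d}  mean r={cors.mean():.3f}  "
              f"5th pctile={np.percentile(cors, 5):.3f}")

Plotting the mean (or lower percentile) of these correlations against the subsample size shows where agreement with the SME key plateaus, which is the logic behind the 30-to-40-rater estimate reported above.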