Researchers at HumRRO have produced one of the first practitioner-oriented guides to developing situational judgment tests (SJTs). Drawing on the scientific literature and their own extensive research and real-world experience developing and implementing SJTs in high-stakes assessment contexts for public- and private-sector clients, Deborah L. Whetzel, Ph.D., Taylor S. Sullivan, Ph.D., and Rodney A. McCloy, Ph.D., wrote “Situational Judgment Tests: An Overview of Development Practices and Psychometric Characteristics,” published in the journal Personnel Assessment and Decisions. According to ScholarWorks@BGSU, the article has been downloaded nearly 400 times in the United States and internationally since its publication in March.
SJTs assess individual judgment by presenting examinees with problems to solve via scenarios, each accompanied by a list of plausible response options. Examinees then evaluate how well each response option addresses the problem described in the scenario.
The paper discusses a variety of issues that affect SJTs, including reliability, validity, group differences, presentation modes, faking, and coaching, and provides best-practice guidance for practitioners.
“Consistent with HumRRO’s mission to give back to the profession, we are sharing experience- and evidence-based conclusions and suggestions for improving the development of SJTs,” said Sullivan.
It is clear from both psychometric properties and examinee response behavior that not all SJT designs are equally effective, and not all designs may be appropriate for all intended uses and assessment goals. To help practitioners and researchers alike, the authors provide best practices for developing SJTs:
SJT Best-Practice Guidelines
The use of critical incidents to develop SJT scenarios enhances their realism.
Specific scenarios rely on fewer assumptions, yielding higher levels of validity.
Brief scenarios reduce candidate reading load, which may reduce group differences.
Avoid sensitive topics and balance diversity of characters.
Avoid overly simplistic scenarios that yield only one plausible response.
Avoid overly complex scenarios that provide more information than needed.
Generate response options that have a range of effectiveness levels.
If developing a construct-based SJT, be careful about option transparency.
List only one action in each response option (avoid double-barreled responses).
Distinguish between active bad (do something wrong) and passive bad (do nothing).
Check for tone (use of loaded words can give clues as to effectiveness).
Use knowledge-based (“should do”) instructions for high-stakes settings. (Candidates will engage in impression management, responding based on what they think should be done even if they would personally act differently.)
Use behavioral tendency (“would do”) instructions if assessing non-cognitive constructs, such as personality.
Use a format where examinees rate each option, as this method provides the most information for a given scenario, yields higher reliability, and elicits the most favorable candidate reactions.
Single-response SJTs are easily classified into dimensions and have reliability and validity comparable to other SJTs, but they can have higher reading load given each scenario is associated with a single response.
Empirical and rational keys have similar levels of reliability and validity.
Rational keys based on SME input are used most often.
Develop “overlength” forms (more scenarios and options per scenario than you will need) and score only those items that function properly.
Use 10–12 raters with a diversity of perspectives to establish the scoring key. Outliers may skew results if fewer raters are used.
Use means and standard deviations to select options (means will provide effectiveness levels; standard deviation will provide level of SME agreement).
Coefficient alpha (internal consistency) is not appropriate for multidimensional SJTs.
Use a split-half approach, with a Spearman-Brown correction, assuming the SJT content is balanced.
Because SJTs have small incremental validity over cognitive ability and personality, consider using them in tandem with other assessments to boost validity.
SJTs have been used effectively in military settings for selection and promotion.
SJTs likely measure a general personality factor.
SJTs correlate with other constructs, such as cognitive ability and personality.
SJTs have smaller racial group differences than cognitive ability tests.
Women perform slightly better than men on SJTs on average.
Behavioral tendency instructions have smaller group differences than knowledge instructions.
Rating all options has lower group differences than ranking or selecting best/worst.
Avatar- and video-based SJTs have several advantages, including higher face validity and lower group differences, but they may have lower reliability because richer media can introduce construct-irrelevant contextual information.
Using avatars may be less costly, but developers should consider the uncanny valley effect when using three-dimensional human images.
Faking does affect rank ordering of candidates and who is hired.
Particularly in high-stakes settings, knowledge-based (“should do”) instructions appear to mitigate faking better than behavioral tendency (“would do”) instructions.
SJTs generally appear less vulnerable to faking than traditional personality measures.
Use scoring adjustments, such as key stretching and within-person standardization, to reduce the effect of coaching examinees on how to maximize SJT responses.
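The within-person standardization mentioned in the final guideline can be sketched in a few lines. All examinee names and ratings below are invented for illustration: each person's option ratings are re-expressed relative to that person's own mean and standard deviation, so uniform score inflation (e.g., from coaching candidates to rate everything as highly effective) drops out while the relative ordering of options is preserved.

```python
import statistics

# Hypothetical raw effectiveness ratings (1-7) given by two examinees
# to the same four response options; examinee B rates everything high,
# inflating raw scores without changing the relative profile.
raw = {
    "examinee_A": [2, 5, 3, 6],
    "examinee_B": [5, 7, 6, 7],  # uniformly elevated ratings
}

def within_person_standardize(ratings):
    """Z-score one examinee's ratings around their own mean and SD,
    removing person-level elevation and scale-use differences.
    Assumes the ratings are not all identical (nonzero SD)."""
    m = statistics.mean(ratings)
    s = statistics.pstdev(ratings)
    return [(r - m) / s for r in ratings]

standardized = {who: within_person_standardize(r) for who, r in raw.items()}
for who, z in standardized.items():
    print(who, [f"{v:+.2f}" for v in z])
```

After standardization, both examinees' profiles carry the same mean (0) and spread (1), so scores reflect relative judgments among options rather than overall rating elevation.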
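Two of the quantitative guidelines above reduce to simple statistics. The sketch below (all option names, ratings, and cutoffs are illustrative, not drawn from the article) shows how SME rating means and standard deviations can drive option selection, and how a half-test correlation is stepped up with the Spearman-Brown formula; it assumes Python 3.10+ for `statistics.correlation`.

```python
import statistics

# --- SME scoring-key statistics (means and standard deviations) ---
# Hypothetical data: 10 SMEs rate each response option's
# effectiveness on a 1-7 scale.
sme_ratings = {
    "ask_supervisor":    [6, 7, 6, 6, 5, 7, 6, 6, 7, 6],
    "ignore_problem":    [1, 2, 1, 2, 1, 1, 2, 1, 1, 2],
    "confront_coworker": [3, 5, 2, 6, 1, 4, 2, 5, 3, 6],  # low agreement
}

for option, ratings in sme_ratings.items():
    mean = statistics.mean(ratings)  # effectiveness level -> key weight
    sd = statistics.stdev(ratings)   # SME (dis)agreement
    flag = "drop/revise" if sd > 1.0 else "retain"  # illustrative cutoff
    print(f"{option:18s} mean={mean:.1f} sd={sd:.2f} -> {flag}")

# --- Split-half reliability with Spearman-Brown correction ---
# Hypothetical per-examinee total scores on odd- vs. even-numbered scenarios.
odd_half  = [22, 30, 25, 28, 35, 27, 31, 24]
even_half = [20, 32, 24, 27, 33, 29, 30, 22]

r_half = statistics.correlation(odd_half, even_half)  # Python 3.10+
r_full = 2 * r_half / (1 + r_half)                    # Spearman-Brown step-up
print(f"half-test r = {r_half:.2f}, corrected reliability = {r_full:.2f}")
```

The mean supplies each option's effectiveness level for the key; a large standard deviation signals SME disagreement, flagging the option for revision or removal before the form is finalized.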