The Measurement Illusion in Executive Assessment

Published on: March 9, 2026 | Categories: Assessment, Bias

The Idiosyncratic Rater Effect: Why Leadership Ratings Measure the Rater

In executive selection, promotion, and removal, organizations often behave as if ratings were measurement instruments. A 4 out of 5 on “strategic thinking,” “leadership potential,” or “executive presence” is treated as a reasonably objective reading. Decades of research suggest otherwise. In many settings, ratings capture more variance from the rater than from the leader being rated. This is the essence of the Idiosyncratic Rater Effect: the score reflects the evaluator’s personal standards, biases, affect, and interpretation habits at least as much as the candidate’s actual capability.

The best-known empirical demonstration comes from Scullen, Mount, and Goff’s study of multisource managerial ratings. Using two very large datasets, they decomposed rating variance into several sources and found that idiosyncratic rater effects accounted for 62% and 53% of the variance, while the combined contribution of the manager’s actual general and dimensional performance was only 21% and 25%. Random error added another 11% and 18%. Across studies, estimates of idiosyncratic rater effects range from roughly 53% to 71% of rating variance, with one widely cited estimate at the upper end, around 71%. In plain language: in these datasets, the dominant driver of the score was not the manager’s performance but the psychology of the person doing the rating.
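To make the proportions concrete, the following minimal sketch does nothing more than arithmetic on the percentages quoted above. The dictionary keys and the "unlisted remainder" label are names chosen for this illustration, not terms taken from the study.

```python
# Variance shares (percent) exactly as quoted in the paragraph above.
# Key names and the "unlisted_remainder" label are illustrative assumptions.
shares = {
    "dataset_1": {"idiosyncratic_rater": 62, "actual_performance": 21, "random_error": 11},
    "dataset_2": {"idiosyncratic_rater": 53, "actual_performance": 25, "random_error": 18},
}


def summarize(parts):
    """Return the share not itemized above and the rater-to-performance ratio."""
    listed = sum(parts.values())
    ratio = parts["idiosyncratic_rater"] / parts["actual_performance"]
    return {
        "unlisted_remainder": 100 - listed,
        "rater_to_performance_ratio": round(ratio, 2),
    }


summary = {name: summarize(parts) for name, parts in shares.items()}
for name, result in summary.items():
    print(name, result)
```

On these figures, the rater-specific share is roughly two to three times the share attributed to the manager's actual performance, and only a few percentage points are left for sources not itemized in the text.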

That finding matters enormously in board- and executive-level decisions, because the traits that matter most in leadership selection are often abstract: judgment, adaptability, influence, learning agility, strategic navigation, composure under uncertainty. Abstract constructs create room for subjective interpretation. One rater’s “decisive” is another’s “reckless.” One board member’s “executive presence” is another’s “performative confidence.” The more ambiguous the construct, the more space there is for the rater to import their own internal benchmark.

This is not a new problem. Research on rater bias goes back more than a century, including Thorndike’s work on the halo effect. Halo means that a general positive or negative impression spills over into supposedly separate dimensions. If a leader is seen as polished, articulate, or successful, raters may unconsciously inflate ratings on unrelated dimensions as well. That is one reason why competency grids and leadership scorecards often look more precise than they really are.

The problem becomes even more serious in 360-degree reviews, because many organizations assume that more viewpoints must produce more truth. The logic sounds appealing: if one rater is biased, multiple raters should cancel each other out. But that assumption only works if the errors are mostly random, independent, and centered around a shared standard. In leadership ratings, that condition is often not met. Research on multisource ratings has shown that rater source effects remain substantial, and that multisource data are still strongly shaped by who is doing the rating and from which vantage point.
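The cancellation argument can be checked with a small simulation. The sketch below is a toy model, not a reconstruction of any cited study: it assumes each rating error is the sum of a component shared by all raters of the same leader (for example, reputation or halo) and a component unique to each rater. The function name `rmse_of_average` and all standard deviations are hypothetical values chosen for illustration.

```python
import random
import statistics


def rmse_of_average(n_raters, shared_sd, own_sd, n_leaders=5000, seed=0):
    """Root-mean-square error of the averaged rating across simulated leaders.

    shared_sd: spread of the error common to every rater of a given leader.
    own_sd:    spread of the error that is independent for each rater.
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_leaders):
        shared_error = rng.gauss(0.0, shared_sd)  # identical for all raters of this leader
        errors = [shared_error + rng.gauss(0.0, own_sd) for _ in range(n_raters)]
        squared_errors.append(statistics.fmean(errors) ** 2)
    return statistics.fmean(squared_errors) ** 0.5


# Both scenarios give a single rater the same total error spread (1.0),
# since 0.6**2 + 0.8**2 == 1.0; only the shared portion differs.
for k in (1, 3, 10, 30):
    independent = rmse_of_average(k, shared_sd=0.0, own_sd=1.0)
    correlated = rmse_of_average(k, shared_sd=0.6, own_sd=0.8)
    print(f"{k:>2} raters  independent errors: {independent:.2f}  shared component: {correlated:.2f}")
```

Under these assumptions, the error of the average keeps shrinking as raters are added only when the errors are fully independent. When part of the error is shared, the average settles near the size of the shared component (0.6 here) no matter how many raters contribute.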

Statistically, adding raters can sometimes improve the reliability of the average. Yet reliability is not the same as validity. A more stable average of biased judgments is still a biased judgment. Research on 360-degree feedback shows that a surprisingly high number of raters may be needed just to reach acceptable reliability for some developmental targets, and common practice with only a few peers often leaves reliability low. Even then, acceptable reliability does not prove that the resulting score is measuring the intended leadership construct accurately.
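One common way to reason about "how many raters are enough" is the Spearman-Brown formula, which projects the reliability of an average of k interchangeable raters from the reliability of a single rater. The sketch below is a minimal illustration under that assumption; the single-rater reliabilities and the 0.70 target are hypothetical values, not figures from the research mentioned above.

```python
import math


def spearman_brown(single_rater_reliability, n_raters):
    """Projected reliability of the mean of n_raters interchangeable raters."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def raters_needed(single_rater_reliability, target_reliability):
    """Smallest whole number of raters whose average reaches the target."""
    r, target = single_rater_reliability, target_reliability
    return math.ceil(target * (1 - r) / (r * (1 - target)))


# Hypothetical single-rater reliabilities, chosen only for illustration.
for r in (0.2, 0.3, 0.5):
    print(
        f"single-rater reliability {r:.1f}: "
        f"3 raters give {spearman_brown(r, 3):.2f}, "
        f"{raters_needed(r, 0.70)} raters needed for 0.70"
    )
```

The formula speaks only to the stability of the average. It says nothing about whether that average tracks the intended leadership construct, which is the validity question raised above.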

There is also evidence that 360 ratings are influenced by interpersonal affect. Antonioni found that a rater’s affect toward the person being rated influenced downward, upward, and peer ratings in 360 feedback. In other words, how much I like or dislike you can shape my “assessment” of your capability. That is especially dangerous in executive contexts, where political coalitions, visibility, prior reputation, and stylistic fit often matter as much as observed behavior.

Why, then, do organizations still rely on ratings if the evidence is so uncomfortable? The answer is not that the science has vindicated them. A better explanation is institutional convenience. Performance appraisals and ratings remain embedded in employment systems because they are tied to pay, bonuses, promotions, demotions, dismissals, and talent processes. Cappelli and Tavis showed that appraisal scores still shape important employment outcomes, and Murphy’s review argues that performance evaluation persists even though it is deeply problematic. The most reasonable inference is that organizations keep ratings because they are administratively useful and socially familiar, not because they provide clean measurement.

There is also a psychological reason. Ratings create an illusion of control. A number in a box feels more governable than an uncomfortable admission such as “We do not have enough evidence yet,” or “Our judgment is contaminated by role expectations, charisma, similarity bias, and politics.” In boards and executive committees, numbers can create false confidence because they look comparable, auditable, and defensible. Marcus Buckingham and Ashley Goodall argued that managers are poor at rating people’s performance and that most HR data is therefore “bad data.” Their point was not anti-measurement; it was anti-pretend-measurement.

This helps explain why organizations often behave as though they can “manage the bias” even when they cannot. Training raters, calibrating definitions, and adding more forms may reduce some variability, but the research does not justify confidence that these fixes solve the underlying problem in high-stakes leadership evaluation. Frame-of-reference training can improve rating accuracy in controlled conditions, but the broader literature still shows that rater-specific variance remains stubbornly important in real organizational settings.

The implication is not that all ratings are useless. It is that ratings should be treated with epistemic humility. They are weak evidence when the construct is abstract, the stakes are high, and the raters differ in standards, exposure, and incentives. They are especially risky when they are used to infer leadership capabilities from retrospective social judgments.

A better approach is to shift from broad impression ratings toward behaviorally anchored, role-relevant evidence. That means clearer behavioral indicators, structured evidence collection, work-sample logic where possible, and separating developmental feedback from high-stakes selection decisions. Behaviorally anchored rating scales can reduce some halo effects by focusing evaluation on observable behavior rather than vague trait labels.

In executive assessment, the central question should not be, “How many people rated this leader highly?” It should be, “What, exactly, did they observe, under what conditions, against what standard, and how much of this judgment is actually about the rater?” Until organizations become more disciplined about that distinction, many leadership ratings will continue to measure confidence, familiarity, and projection more than actual leadership capability.
