Consensus scoring and empirical option weighting of performance-based Emotional Intelligence (EI) tests

https://doi.org/10.1016/S0191-8869(03)00123-5

Abstract

Faces and Designs (N=102) from the Mayer–Salovey–Caruso Emotional Intelligence Test (MSCEIT) were scored using five different consensual scoring methods: proportion, mode, lenient mode, distance, and adjusted distance. The aim was to determine which scoring methods were superior in terms of reliability, discriminability (distribution shape), and validity. Where possible, the Method of Reciprocal Averages (MRA), used previously to improve the reliability of dichotomously scored aptitude tests, was applied to the consensus scores. Psychometric analyses suggested that the most promising techniques were proportion and mode scoring, with MRA scaling ameliorating some potential weaknesses apparent in these forms of consensual scoring. Faces and Designs showed weak correlations with pro-social personality dimensions, with crystallized intelligence, and with visualization abilities. The study concludes with suggested remedies for addressing measurement problems endemic to EI research.

Introduction

The scientific investigation of emotional intelligence (EI) is still in its infancy, having first been proposed as a quantifiable attribute only a decade ago (see Mayer, DiPaolo, & Salovey, 1990). There are currently several models of EI in the literature (and many more in the popular press), though these can be roughly classified under two distinct frameworks. The first approach, which tends to rely on self-report techniques, suggests that EI is primarily dispositional (i.e. representing a conglomerate of cognitive, personality, motivational, and affective attributes). Examples of measurement approaches subscribing to this framework include the EQ-i (Bar-On, 1997), the EQ-map (Cooper, 1996), and the Schutte Self-Report Index (SSRI; Schutte et al., 1998). The second approach upholds a cognitive view of EI, which in turn suggests that its measurement should conform to ability models. Examples of this approach include the four-branch hierarchical structure of EI, measured empirically by the Emotional Accuracy Research Scale (EARS; Geher, Warner, & Brown, 2001), the Multi-factor Emotional Intelligence Scale (MEIS; Mayer, Caruso, & Salovey, 1999) and its successor, the Mayer–Salovey–Caruso Emotional Intelligence Test (MSCEIT; Mayer, Salovey, Caruso, & Sitarenios, 2003).

Dispositional models of EI have problems with virtually all forms of validity (Matthews, Zeidner, & Roberts, 2002). In particular, they rely on self-report techniques, and therefore assess self-perceptions of intelligence, which relate only weakly to actual intelligence levels (Mabe & West, 1982). Indeed, the correlation between self-reported EI and traditional forms of intelligence is near zero (see e.g. Derksen, Kramer, & Katzko, 2002), suggesting that it cannot legitimately constitute a form of intelligence (Bowman, Markham, & Roberts, 2002). Moreover, many self-report measures lack divergent validity in their relation to established personality traits (Davies, Stankov, & Roberts, 1998). For example, the SSRI correlates about 0.50 with Openness to Experience (Schutte et al., 1998), while the EQ-i correlates close to 0.50 with all of the Big Five personality factors, and particularly highly with Neuroticism (Dawda & Hart, 2000). With respect to predictive validity, while several studies claim that this has been established (e.g. van der Zee, Thijs, & Schakel, 2002), the fact remains that there are noteworthy problems with the criteria used for this purpose, which often share conceptual overlap with the predictor (see e.g. Zeidner, Matthews, & Roberts, in press).

In contrast, performance-based EI appears, at least, to demonstrate construct validity. The four-branch EI model (Mayer & Salovey, 1997), as indexed by the MEIS, is reasonably distinct from personality (Ciarrochi et al., 2000, Garcia et al., in preparation, Roberts et al., 2001), indicating divergent validity. Equally important, the MEIS correlates moderately with measures of crystallized intelligence (Gc), such as the Armed Services Vocational Aptitude Battery (ASVAB; Roberts et al., 2001) and the Army Alpha Vocabulary Test (Mayer et al., 1999). Since the four-branch model indexes four abilities (perception of emotion, emotional facilitation of thought, understanding of emotion, and emotion regulation; see Mayer & Salovey, 1997), this has been taken to indicate convergent validity (Matthews et al., 2002). Interestingly, the MEIS has shown no relation to Raven's Progressive Matrices (Ciarrochi et al., 2000), placing cognitive EI closer to crystallized (rather than fluid, i.e. Gf) intelligence within Gf/Gc theory (see Carroll, 1993; Horn & Cattell, 1966).

Because these relationships have received minimal investigation with newly developed measures of performance-based EI, one aim of the current investigation was to test the proposition that the MSCEIT behaves similarly to the MEIS in its relationships with personality and intelligence. Specifically, the MSCEIT's relation to the Five-factor Model of personality (see e.g. Costa & McCrae, 1992), Eysenck's Psychoticism, and measures of Gf and Gc was examined. Since the MSCEIT's emotion perception tests involve the processing of visual stimuli, the MSCEIT's relation to broad visualization (Gv) was also examined. This follows from a suggestion that visuo-spatial capabilities may relate to the emotion perception branch largely because of shared information-processing requirements (common to the Gv construct, rather than to any other cognitive ability factor; see Kaufman & Kaufman, 2001).

A major issue for performance-based EI is the “problem of the correct answer” (Mayer, Caruso, & Salovey, 2000). Intelligence tests of numerical, reasoning, or spatial ability employ mathematical, logical, or geometrical systems to determine the accuracy of response unequivocally (Matthews et al., 2002, Roberts et al., 2001). However, emotion-based systems have no corresponding algorithm to determine, for example, the degree of anger present in a design, or the similarity of the feeling of love to the physical sensation of heat. In everyday contexts, people determine the appropriateness (or correctness) of an emotional response (or judgment) by agreement with the rest of the group interacting in the same emotional system. The primary method of scoring both the MEIS and MSCEIT uses this idea—that the correct answer is what the group agrees upon (consensus scoring). Alternative scoring techniques are to consult experts in the field of EI (expert scoring) or, for items where this is possible, to ask the creator of the item (the target) what the correct answer is (target scoring).

Both target and expert scoring suffer from certain theoretical problems. In particular, there are no criteria for who is an ‘expert’ in EI, and some evidence suggests that scores are higher for test-takers similar to the experts—white males scored more highly under expert scoring for the MEIS when the experts were white males (Roberts et al., 2001). Problems with target scoring are that the targets themselves may not be able to express the emotion that they are feeling accurately, or that they may report only pleasant (or pro-social) emotions when they are in fact feeling something else (see Mayer & Geher, 1996 for empirical support of the latter view).

The MEIS and MSCEIT are most commonly scored by consensus. Legree (1995) outlines the logic of consensus scoring, presenting it as an extension of expert scoring. Legree reasons that, on tasks of tacit or social knowledge, the judgments of non-experts are equivalent to those of experts, except that non-experts are less consistent and therefore less reliable. Non-expert judgments can be considered as composed of two parts: common variance (shared with ‘expert’ opinion) and unique variance, which ought to be random. If this is so, an aggregate of non-expert judgments is equivalent to expert judgment.
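
This aggregation argument can be illustrated with a brief simulation, under the simplifying assumption that each non-expert rating equals a latent ‘expert’ value plus independent random error; the sample sizes, error variance, and names below are illustrative rather than drawn from the present data.

```python
import numpy as np

# Minimal sketch of Legree's (1995) aggregation argument: non-expert
# judgments = latent 'expert' value + random error, so their average
# converges on the expert standard.
rng = np.random.default_rng(0)

n_items, n_raters = 20, 100
expert = rng.uniform(1, 5, size=n_items)            # hypothetical expert ratings
noise = rng.normal(0, 1.5, size=(n_raters, n_items))
nonexpert = expert + noise                           # each rater = expert + error

pooled = nonexpert.mean(axis=0)                      # aggregate of non-expert judgments

# The pooled judgment tracks the expert standard far more closely than
# any single non-expert rater does.
single_r = np.mean([np.corrcoef(nonexpert[i], expert)[0, 1] for i in range(n_raters)])
pooled_r = np.corrcoef(pooled, expert)[0, 1]
print(f"mean single-rater r = {single_r:.2f}, pooled r = {pooled_r:.2f}")
```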

A major weakness with consensus scoring is that distributions of consensus scores, if subscribing to certain forms of reliability, cannot logically be normally distributed. On any given question, the majority must score the highest mark. If the tests are to maintain high levels of internal consistency reliability, then more or less the same group of ‘high ability’ people (who form a majority) will answer the ‘best’ option on most items, and the distribution of test scores will have negative skew and a high degree of kurtosis (i.e. most scores will form a highly peaked cluster at the top end of the distribution). This distribution shape is problematic in terms of the multivariate normality assumptions of statistical procedures used in experimental research, and also in terms of discriminating between people of average and high EI. The scores of an EI ‘genius’ and someone with merely adequate EI both fall within the same peaked clump at the high end of the distribution. Another aim of the current paper was to determine which variants of consensus scoring are the most susceptible to this effect; in short, to ascertain the most empirically sound form of consensus score available from those currently used in the literature. A description of five common methods of scoring by consensus is given in Table 1.
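
As a concrete illustration of two of the methods in Table 1, the following sketch implements proportion and mode consensus scoring for a generic matrix of integer-coded responses. It is a minimal reading of those definitions, not the MSCEIT's actual scoring routine; the function names and data format are assumptions made for the example.

```python
import numpy as np

def proportion_scores(responses):
    """Proportion consensus scoring: each response earns the proportion
    of the sample endorsing that same option on that item.
    `responses` is an (n_persons, n_items) array of integer option codes
    (e.g. ratings 1-5)."""
    n_persons, n_items = responses.shape
    scores = np.zeros_like(responses, dtype=float)
    for j in range(n_items):
        options, counts = np.unique(responses[:, j], return_counts=True)
        weight = dict(zip(options, counts / n_persons))
        scores[:, j] = [weight[r] for r in responses[:, j]]
    return scores.sum(axis=1)

def mode_scores(responses):
    """Mode consensus scoring: a response scores 1 if it matches the
    modal (most frequently endorsed) option for that item, else 0."""
    n_persons, n_items = responses.shape
    scores = np.zeros((n_persons, n_items))
    for j in range(n_items):
        options, counts = np.unique(responses[:, j], return_counts=True)
        modal = options[np.argmax(counts)]
        scores[:, j] = (responses[:, j] == modal).astype(float)
    return scores.sum(axis=1)
```

Under either method, a respondent who consistently endorses the majority option obtains the maximum possible score, which is why total-score distributions pile up at the upper end in the manner described above.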

Proportion scoring is used in the MEIS and MSCEIT, where reported (internal consistency) reliabilities for the Emotion Perception branch tests routinely exceed 0.80 (Ciarrochi et al., 2000, Mayer et al., 1999, Roberts et al., 2001); although other tests are quite unreliable (e.g. Blends, which is less than 0.50 in all three aforementioned studies). Mode scoring has been used in studies examining the EARS, where reliability varied from 0.53 (Mayer & Geher, 1996) to 0.77 (Geher et al., 2001), although the forced-choice format utilized in this test may be less reliable than a Likert-scale format (see Legree, 1995). Lenient mode scoring was used in the forerunner to the MEIS test, and resulted in quite low reliability estimates (0.36–0.55 in Davies et al., 1998; 0.61 in Mayer et al., 1990).

Traditionally, reliability has been improved by increasing test length, and there is some evidence that a more fine-grained response scale (e.g. a 40-point rather than five-point scale) may also boost reliability estimates (Legree, Martin, & Psotka, 2000). A third method found to improve reliability in multiple-choice aptitude tests is to weight distractor options (Willson, 1982). A technique used to this end, known as the Method of Reciprocal Averages (MRA), adjusts option weights according to the total test score of the individuals choosing them. This scaling procedure will be applied to tests of EI in the present investigation with a view to improving reliabilities. It is envisaged that a consequence of making the option weights more reliable may be a change in the distribution of test scores, with greater variability in the top range.

Under MRA scaling, each category of an item (in this case, each possible rating on a five-point scale) is weighted by the total score of participants endorsing that category. An important set of assumptions underlying this procedure requires that all items measure a single ability to a greater or lesser extent and that measurement of this ability becomes more reliable by partialling out variance due to other sources (Lawshe & Harris, 1958). It follows that the MRA technique can only be used on uni-dimensional tests (although it can certainly be used with multi-dimensional test batteries composed of uni-dimensional tests, as is the case with the MSCEIT). The uni-dimensional nature of Faces and Designs under various scoring procedures was examined empirically before MRA was applied.
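
A simplified sketch of this weighting logic, assuming integer-coded responses and setting aside the cross-validation design described below, might look as follows. It iterates between weighting categories by their endorsers' mean total scores and recomputing total scores from those weights, which is one plausible reading of the reciprocal-averaging procedure rather than the authors' exact implementation.

```python
import numpy as np

def mra_weights(responses, n_iter=20):
    """Simplified sketch of the Method of Reciprocal Averages (MRA).

    Each category of each item is weighted by the mean total score of the
    participants who endorsed it; total scores are then recomputed from
    the new weights, and the process is iterated until the weights settle.
    `responses` is an (n_persons, n_items) array of integer option codes
    (e.g. ratings 1-5)."""
    n_persons, n_items = responses.shape
    total = responses.astype(float).sum(axis=1)        # provisional total scores
    weights = [dict() for _ in range(n_items)]
    for _ in range(n_iter):
        # Re-weight every category by the mean total score of its endorsers.
        for j in range(n_items):
            for opt in np.unique(responses[:, j]):
                endorsers = responses[:, j] == opt
                weights[j][opt] = total[endorsers].mean()
        # Recompute total scores from the new category weights.
        total = np.array([
            sum(weights[j][responses[i, j]] for j in range(n_items))
            for i in range(n_persons)
        ])
        # Standardise to keep the scale from drifting between iterations.
        total = (total - total.mean()) / total.std()
    return weights, total
```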

To date, MRA has mainly been applied to dichotomously-scored, multiple-choice aptitude tests, attitude scales, and personnel data. In general, increases in reliability have been observed. For example, applying MRA scaling, Mitzel and Hoyt (1954) found reliability to increase for an attitude scale (changing from 0.57 to 0.65), while Davis and Fifer (1959) report similar findings for a mathematics aptitude test (i.e. an increase in reliability from 0.68 to 0.76). However, several researchers note that while MRA increases reliability, corresponding increases in validity are often not found—in some cases the predictive validity of the measures even appears to decrease slightly under scaling (see Hendrickson, 1971, Reilly & Jackson, 1973). In the present investigation, another aim was to examine changes in reliability and validity accompanying MRA scaling of consensually scored EI. In the present context, changes in convergent validity can be indexed via the change in correlation between the Faces and Designs tests (since these purportedly measure the same construct, Emotion Perception).

Obviously, the same sample cannot be used both to determine the category weights and to assess validity and reliability, as weights would be tailored to the idiosyncrasies of the sample, capitalizing on chance to produce higher reliabilities (Anastasi & Urbina, 1997, Cureton, 1950). To avoid decreasing the sample size, and yet retain statistical independence, a double cross-validation design was used in the present study. To this end, each half of the sample functioned as a screening sample, from which weights were calculated and then applied to the other half of the sample (see Lord & Novick, 1968 for detailed description of these terms).
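
The double cross-validation design can be sketched as follows, reusing the hypothetical mra_weights function from the previous example; the apply_weights helper and the simulated response matrix are likewise illustrative assumptions rather than the study's actual data or code.

```python
import numpy as np

def apply_weights(responses, weights):
    """Score a response matrix with category weights derived elsewhere.
    Options not seen in the screening half fall back to the mean weight
    for that item."""
    n_persons, n_items = responses.shape
    totals = np.zeros(n_persons)
    for j in range(n_items):
        fallback = np.mean(list(weights[j].values()))
        totals += [weights[j].get(r, fallback) for r in responses[:, j]]
    return totals

# Placeholder data: 102 people responding to 30 five-point items.
rng = np.random.default_rng(1)
responses = rng.integers(1, 6, size=(102, 30))

# Double cross-validation: each half of the sample serves as the screening
# sample whose weights score the *other* half, so scored data remain
# statistically independent of the weight derivation.
idx = rng.permutation(len(responses))
half_a, half_b = np.array_split(idx, 2)

weights_a, _ = mra_weights(responses[half_a])            # derived on half A
weights_b, _ = mra_weights(responses[half_b])            # derived on half B

scores_b = apply_weights(responses[half_b], weights_a)   # A's weights score B
scores_a = apply_weights(responses[half_a], weights_b)   # B's weights score A
```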

By examining five different consensus-scoring methods, and the additional scaling of such scores by MRA, the present study aimed to refine the methodology for producing scales of socio-emotional skill (and other constructs currently consensus-scored) so that they are more reliable, more informative, and more valid. A subsidiary aim was to collect information about the MSCEIT scale, concerning its psychometric properties and correlates. Thus far, published studies of EI have reported on its predecessor, the MEIS, rather than this newer instrument. It is theorized that Faces and Designs will be related to Gc and possibly Gv, but not to Gf.

Participants

Psychology undergraduates (N=102) from the University of Sydney completed a 2 h paper-and-pencil test battery to fulfil course requirements. The sample was predominantly female (n=92), with an age range of 18–38 years (M=20.00, SD=3.61).

Emotion perception measures

Two tests of EI were taken from the Emotion Perception branch of the MSCEIT (Research Version 1.1). These sub-tests were consensually scored in five ways (see Table 1), using the current sample (N=102) to determine category weights.

Results

Principal components analyses were conducted on Faces and Designs (as variously scored) to ascertain whether eigenvalue plots indicated uni-dimensionality (i.e. a test of the underlying assumptions of MRA). Eigenvalue plots of proportion and mode scores for Faces and Designs, and lenient mode-scored Faces, supported a view of uni-dimensionality. Plots for distance and adjusted distance scores for Faces and Designs, and lenient mode-scored Designs indicated two or more dimensions were likely.
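
This dimensionality check can be illustrated with a short sketch that computes the eigenvalues an analyst would inspect in such a plot; the function name and the assumption of an item-level score matrix are illustrative only.

```python
import numpy as np

def item_eigenvalues(item_scores):
    """Eigenvalues of the inter-item correlation matrix, as plotted in a
    scree (eigenvalue) plot to judge uni-dimensionality. `item_scores` is
    an (n_persons, n_items) matrix of item-level consensus scores. A
    dominant first eigenvalue with the remainder trailing off suggests a
    single underlying dimension."""
    corr = np.corrcoef(item_scores, rowvar=False)
    return np.linalg.eigvalsh(corr)[::-1]   # sorted largest first
```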

Discussion

Popular interest notwithstanding (see e.g. Matthews et al., 2002, Mayer et al., 2000), available evidence indicates that current EI measures are not yet at a stage where their widespread use as psychological assessment devices is viable. In particular, they appear to lack acceptable degrees of reliability and are of limited utility in discriminating between individuals at high levels of skill (Zeidner, Matthews, & Roberts, 2001). The current investigation examined the way that scoring

Acknowledgements

Thank you to Peter Legree for his invaluable suggestions about the importance of different scoring mechanisms and to Patricia Mikolajski for editing an earlier version of this manuscript.

References (51)

  • Schutte, N. S., et al. (1998). Development and validation of a measure of Emotional Intelligence. Personality and Individual Differences.
  • Anastasi, A., et al. (1997). Psychological testing.
  • Bar-On, R. (1997). Bar-On Emotional Quotient Inventory (EQ-i): technical manual.
  • Bowman, D., Markham, P. M., & Roberts, R. D. (2002). Expanding the frontier of human cognitive abilities: so much more...
  • Carroll, J. B. (1993). Human cognitive abilities. New York: Cambridge University...
  • Christal, R. E. (1994). Non-cognitive research involving systems of testing and learning. Final Research and...
  • Cooper, R. K. (1996/1997). EQ map. San Francisco, CA: AIT and Essi...
  • Cureton, E. E. (1950). Validity, reliability, and baloney. Educational and Psychological Measurement.
  • Davies, M., et al. (1998). Emotional Intelligence (EI): in search of an elusive construct. Journal of Personality and Social Psychology.
  • Davis, F. B., et al. (1959). The effects on test reliability of scoring aptitude and achievement tests with weights for every choice. Educational and Psychological Measurement.
  • Ekstrom, R. B., et al. (1976). Manual for kit of factor-referenced cognitive tests.
  • Eysenck, H. J., et al. (1991). Adult EPQ-R.
  • French, J. W., et al. (1963). Manual for kit of reference tests for cognitive factors.
  • Garcia, A., Roberts, R. D., Rouse, J. R., MacCann, C. E., Matthews, G., & Zeidner, M. Performance-based emotional...
  • Hendrickson, G. F. (1971). The effect of differential option weighting on multiple-choice objective tests. Journal of Educational Measurement.
1. Data were collected as part of MacCann's 4th year honors empirical project at the Individual Differences and Computerised Assessment Unit, School of Psychology, University of Sydney. Further research is now being funded by an Australian Postgraduate Award stipend.
