
Descriptive Account

When Majority Doesn’t Rule: The Use of Discrimination Indices to Improve the Quality of MCQs

Neville Chiavaroli¹ and Mary Familari²

¹Medical Education Unit, Medical School and ²Department of Zoology, Faculty of Science, University of Melbourne

Date received: 16/02/11      Date accepted: 17/05/11

Abstract

This paper outlines the use of item analysis to assist examiners in evaluating the quality and validity of their MCQ exam questions. The generation of item analysis, particularly discrimination index, has long been established practice in professional testing and credentialing organisations and some disciplines in tertiary education, but its use appears to be inconsistent among Bioscience departments. We argue that generating some form of discrimination index is an essential part of the validation of exams, in particular to help identify errors in scoring, to identify potentially flawed questions, or to confirm the validity of challenging questions. We demonstrate each of these uses through examples drawn from first year Biology exams, with interpretations of the questions in the light of post-examination item analysis.

Keywords: MCQs, discrimination index, assessment quality, validity, examinations

Introduction

The use of MCQs to assess students’ knowledge is ubiquitous in science faculties, partially due to the need to assess broad domains of knowledge with large cohorts of students. MCQs are efficient to mark (but not necessarily to develop) and this efficiency allows a high degree of ‘sampling’ (i.e. numbers of questions), which can produce reliable exams. They are also amenable to statistical analysis which, when used appropriately, can provide important evidence about the validity of the exam content and the design of the questions.

Crisp and Palmer (2007) found that many academics are unfamiliar with, or disinclined to engage with, item analysis in a systematic way. This may in part be because most academics are not specialists in educational theory, so locally established or discipline-based practices are more likely to determine what level of post-examination review takes place. In other words, validation of exams and their results tends to be based around ‘academic acumen rather than quantitative evidence’ (Crisp and Palmer, 2007). As important as academic judgement may be in the setting and reviewing of exams, obtaining quantitative evidence about assessment would seem just as vital, especially when so much value is placed on the results of examinations.

A recent discussion of this issue with first-year Biology lecturers at three Australian universities revealed that only the proportion correct (or ‘facility’) statistic was generally used in reviewing the quality of questions. This statistic allows examiners to gauge how well the students responded to each question as a group. Sometimes, the proportion correct is also used to make decisions about the quality or ‘acceptability’ of the question. For example, a question on which only a small proportion of students responded correctly might be deemed to be unacceptably difficult. Other examiners additionally conduct a version of item ‘distractor analysis’, whereby the proportion of students selecting each option (the incorrect options are usually known as ‘distractors’ in educational terminology) is also considered in evaluating the quality of the question. This information might show, for instance, that a particular distractor was selected by very few students and therefore ought to be replaced with a more plausible one, as is usually recommended by MCQ writing guidelines (for discussion see Haladyna and Downing, 1989; Kehoe, 1995). This paper will show how much more value can be gained from including consideration of other relevant statistical information as a key indicator of the validity of individual MCQs as summative assessments. We draw on specific examples from the biosciences to illustrate and emphasise this point.
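The facility statistic and the basic distractor analysis described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the ten-student response list are hypothetical and are not drawn from the exams discussed in this paper.

```python
from collections import Counter


def option_proportions(responses, options="ABCD"):
    """Proportion of students selecting each option (distractor analysis)."""
    counts = Counter(responses)
    n = len(responses)
    return {opt: counts[opt] / n for opt in options}


def facility(responses, key):
    """Proportion of students who selected the keyed (correct) option."""
    return sum(1 for r in responses if r == key) / len(responses)


# Hypothetical responses from ten students to one question keyed 'B'.
responses = list("ABBCBABDBB")
print(option_proportions(responses))  # share of students per option
print(facility(responses, key="B"))   # 0.6
```

Both figures describe only how many students chose each option, not which students did so, which is the gap the discrimination index is intended to fill.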

Item Analysis in Practice

In many departments it is common to review questions for which a large proportion of students (say, one third or more) have selected a particular incorrect option, on the basis that any distractor which draws such a large number of students is likely to represent a flawed or difficult question. This might subsequently lead to the question being omitted from final scoring and revised for future assessments. For example, the following MCQ was used in a first year Biology exam, and the relative proportions of students selecting each option are shown. The correct response is indicated with an asterisk.

Question I:

In plant propagation from tissue culture, which two plant hormones, in different concentrations, stimulate the roots and shoots?

   

Option   Response                        Percentage of students selecting option
A        auxin and gibberellin           41%
B*       auxin and cytokinin             54%
C        abscisic acid and ethylene      2%
D        abscisic acid and gibberellin   4%

Clearly, the vast majority of the students recognised auxin as a key plant hormone in this situation, but there was uncertainty about the other hormone. Some examiners view a response pattern such as this — with approximately 40% of students selecting a single incorrect option — as indicative of a flawed question, even though qualitative review of the question might confirm that gibberellin is not involved in the named process. They would therefore conclude that either the question as a whole was misleading (for example, the double use of gibberellin as a distractor may have ‘seduced’ many uncertain students), or the point at issue was not adequately taught. In either case, they might argue to remove the question rather than unfairly punish the unwitting student.

While such reasoning is laudable from the perspective of fairness and accountability in assessment, it may not be justifiable from a pedagogical perspective. The real issue is whether the question reveals a genuine difference in understanding between the group of students who chose A (and therefore gibberellin as the second hormone) and the group who chose B (cytokinin). Obviously it is just as important to reward the second group of students for making this correct distinction as it would be to avoid penalising the first group for a potentially flawed question. Academic acumen will go some way in determining the proper judgement here, but there is also quantitative evidence available which can provide important information about the students’ response patterns. Such information may reveal that the question is in fact an entirely valid and appropriate assessment of the students’ knowledge. The most common statistic used for this purpose is the ‘discrimination index’.

The Nature of the Discrimination Index

The discrimination index (DI) provides an indication of the ability of the group of students who select each option, in terms of how they perform (as a group) on the examination overall. Accordingly, the DI allows examiners to evaluate their MCQs in terms of who selected each option, information which is arguably more valuable than simply knowing the proportion selecting each option. The DI statistic is most valuable when the content of the examination is relatively homogeneous, that is, when it forms a meaningful and relatively coherent whole, whether that be at the broad disciplinary level (such as ‘Biology’), or at a more specific topic level (like ‘Biomolecules and Cells’). In such circumstances, we can reasonably expect that performance on one question, at the group level, will be related to performance on other questions, and that when this is not the case, something is amiss. This is most commonly a result of how students have interpreted the question, what the question is testing, or how it has been worded.

The simplest form of DI involves calculating the difference in proportion correct for a question between a high-performing group of students and a low-performing group (Haladyna, 2004). Commonly referred to as the ‘U-L index’, for ‘upper minus lower’, it is typically based on the quartiles of scorers at either end (Burton, 2001). A more statistically complex method is to calculate the correlation between scores on a given question and the total score for the whole exam. Various correlation coefficients can be used for this but the most common for MCQ items is the so-called ‘point-biserial correlation’ (Johnson et al., 2009). Both approaches to calculating the DI produce a figure which lies between -1.0 and +1.0, and are interpreted in a similar way. The different item analysis software programmes will usually specify which method is used and both methods are used in this article, as indicated.
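Both calculations can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than a description of how any particular item analysis package works: the function names and sample data are hypothetical, the upper and lower groups are taken as the top and bottom quarter of scorers, the total score includes the question being analysed, and the same calculation is applied to any option by passing a 0/1 indicator of whether each student selected it.

```python
from math import sqrt


def point_biserial(selected, totals):
    """Pearson correlation between a 0/1 indicator and total exam scores."""
    n = len(totals)
    mean_x = sum(selected) / n
    mean_y = sum(totals) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(selected, totals))
    var_x = sum((x - mean_x) ** 2 for x in selected)
    var_y = sum((y - mean_y) ** 2 for y in totals)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation undefined; returning 0.0 is an assumption
    return cov / sqrt(var_x * var_y)


def upper_lower_index(selected, totals, fraction=0.25):
    """Proportion selecting the option in the top group minus the bottom group."""
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    k = max(1, int(len(totals) * fraction))
    lower, upper = order[:k], order[-k:]
    p_upper = sum(selected[i] for i in upper) / k
    p_lower = sum(selected[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data: total scores for eight students, and whether each
# selected the keyed option (1) or not (0) on one question.
totals = [10, 12, 15, 18, 20, 22, 25, 28]
chose_key = [0, 0, 0, 1, 0, 1, 1, 1]

print(upper_lower_index(chose_key, totals))         # 1.0
print(round(point_biserial(chose_key, totals), 2))  # roughly 0.77
```

In this made-up example the keyed option was chosen mostly by higher scorers, so both indices are strongly positive. With real cohorts, tied total scores at the group boundaries would need a deliberate rule, which this sketch does not attempt.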

Guidelines for interpretation of Discrimination Indices

While guidelines for interpreting the DI depend to a certain extent on the actual statistic used, the following two rules of thumb are most commonly applicable:

  1. The DI for the key should be positive, at least of the order of 0.2 or above; and
  2. The DIs for the distractors should be well below that of the key, and preferably negative (though a positive value close to zero might also be acceptable, depending on the nature of the question and the domain being tested).

While other thresholds have been suggested (for example, 0.3 by Abdel-Hameed et al., 2005, and 0.4 by McAlpine, 2002), much depends on other psychometric attributes of the exam, such as the number of questions, score distribution, general level of difficulty, and overall homogeneity. Even the nature of the question itself will influence the DI; questions addressing contentious or debatable content areas would generally produce lower indices, due to the less clear-cut nature of the responses. For this reason, a lower threshold is normally justified, framed by the considerations given in point 2 above. More importantly, independent evaluation of the question and academic judgement are still required to confirm the correctness of the key and incorrectness of the distractors.
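The two rules of thumb can be expressed as a simple screening check. The sketch below is an assumption-laden illustration, not a prescribed procedure: the function name `review_item` is hypothetical, the 0.2 threshold is the one given in point 1, and the `margin` of 0.1 used to decide whether a distractor is "well below" the key is an arbitrary value chosen for illustration. The DI values are the point-biserials reported in this article for Question I and Question III.

```python
def review_item(di_by_option, key, key_threshold=0.2, margin=0.1):
    """Return a list of reasons why a question may need closer review."""
    flags = []
    key_di = di_by_option[key]
    # Rule 1: the key should have a DI of the order of 0.2 or above.
    if key_di < key_threshold:
        flags.append(f"key {key}: DI {key_di:+.2f} is below {key_threshold}")
    # Rule 2: each distractor's DI should be well below that of the key.
    for option, di in di_by_option.items():
        if option != key and di > key_di - margin:
            flags.append(f"distractor {option}: DI {di:+.2f} is not well below the key")
    return flags


question_1 = {"A": -0.29, "B": 0.39, "C": -0.17, "D": -0.14}
question_3 = {"A": 0.07, "B": -0.08, "C": 0.06, "D": -0.07}

print(review_item(question_1, key="B"))  # no flags
print(review_item(question_3, key="A"))  # flags the key and distractor C
```

A flag from such a check is only a prompt for qualitative review; it does not by itself show that a question is flawed.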

The Usefulness of Discrimination Indices

There are three main uses of the DI for evaluating the quality of MCQs:

  1. Identifying ‘miskeys’
  2. Identifying potentially flawed questions
  3. Confirming correct answers

Identifying ‘miskeys’

Most examiners will have fielded queries from students regarding the correctness of some of their MCQs following an exam. When several complaints converge upon a particular question, it is not uncommon to find that the question was scored with the incorrect ‘key’ (or designated correct answer). A fundamental benefit of item analysis therefore is the efficient identification of incorrectly keyed responses (‘miskeys’), in what should otherwise have been a relatively straightforward question. For example, in the same first-year exam from which Question I was taken, the following question was scored with C as the correct answer. The item analysis is shown below, with the key as supplied during automatic scanning indicated by the asterisk. The point-biserials reported in this article were generated using the Quest software programme (Adams and Khoo, 1996).

Question II:

Parthenogenesis refers to:

  A. the ability to self-fertilise
  B. species which show no external sexual dimorphism
  C. individuals that can change sex at a given size or age
  D. the development of eggs without being fertilized by sperm

 

 

Option   Facility   DI (point-biserial)
A        16%        -0.18
B        3%         -0.17
C*       10%        -0.08
D        71%        0.26

 

Whilst the large percentage of students selecting D might have prompted suspicions, the DI provides very strong evidence of an erroneously marked key. Here, the DI indicates that the group of students who chose D were, in general, higher performers on this exam than the remaining 29% of students. This was especially so in comparison with those who chose the nominally ‘correct’ answer C. Accordingly, the question was reviewed, option D confirmed to be the correct key, and scores were re-calculated.
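The pattern seen in Question II, a non-positive DI for the supplied key alongside a clearly positive DI for another option, lends itself to an automated screen. The following Python sketch uses a hypothetical function name and a deliberately simple rule, which is an assumption made for illustration only; any option it suggests still needs to be confirmed by content review, as was done here.

```python
def likely_miskey(di_by_option, key):
    """Return the option that looks like the true key, or None if none stands out."""
    best = max(di_by_option, key=di_by_option.get)
    # Suspect a miskey only when the supplied key does not discriminate
    # positively and a different option does.
    if best != key and di_by_option[best] > 0 and di_by_option[key] <= 0:
        return best
    return None


# Point-biserials reported for Question II, scored with C as the key.
question_2 = {"A": -0.18, "B": -0.17, "C": -0.08, "D": 0.26}
print(likely_miskey(question_2, key="C"))  # D
```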

Identifying potentially flawed questions

A second major benefit of using item analysis to review MCQs is to help identify questions which are genuinely flawed, either conceptually, or in their drafting (e.g. ambiguous or unclear expression). The following example is also from a first-year Biology exam.

Question III:

Which of the following is a feature of molluscs?

  A. an open circulatory system and complete gut
  B. a tripartite body plan and exoskeleton
  C. an open circulatory system and gills
  D. marine only and possessing a mantle that often secretes a shell
 

Option   Facility   DI (point-biserial)
A*       43%        0.07
B        24%        -0.08
C        18%        0.06
D        15%        -0.07

To judge from the percentages alone, this might appear to be simply a case of a challenging question, with fewer than half of the students selecting the answer designated as correct. However, the values of the DIs tell a different story. The DI for the key (A), while positive, is low, and very similar to that of distractor C, while the DIs for the remaining two distractors, although both negative, are not very much lower. Therefore, the DI tells us that the two groups of students who selected A and C are essentially indistinguishable in terms of performance on this exam, even though the proportion choosing A was considerably larger than the proportion choosing C, and that the students who selected B or D performed only marginally worse. This information raises doubts about the correctness of A and/or the incorrectness of C, and therefore the integrity of the whole question. Those familiar with the anatomy of molluscs will have realised that this question has been poorly drafted. Both A and C are correct, while both B and D also have correct elements. What the DI statistics reveal is a group of students collectively struggling to make sense of which option is the best. Thus, there is no single, strongly positive DI for the key, and relatively little difference (performance-wise) in the spread of students across the options. This question should not have made it onto the exam, and following review in the light of the above item analysis, was subsequently removed from scoring.

Confirming correct answers

A third major advantage of item analysis is the availability of evidence to support the examiner’s interpretation of the correct response, especially when attempting to write intentionally challenging questions, as in the following question. The DI in this case, a U-L index, was generated using the Exam System II software programme (NCS Pearson, 2002).

Question IV:

ATP and ADP are interconverted by which of the following enzymes?

  A. Adenyl cyclase
  B. ATP synthase
  C. Adenylate kinase
  D. ATPase
 

Option   Facility   DI (U-L index)
A        7%         0.0
B        40%        -0.1
C*       20%        0.2
D        33%        -0.1

Following the review of the exam, this question was deemed suspect on the basis of the proportions of students selecting each option, with options B and D drawing considerably more students than the purported correct answer C. Yet from a content perspective (the crucial term being ‘interconverted’), the correct answer itself was apparently uncontentious. In this case, the value of the DI provided important supporting evidence that those students who selected C did not do so fortuitously, say by guessing or some other strategy, but rather understood the underlying scientific concepts. Conversely, the negative DIs for options B and D suggested that students selecting these distractors did not. The data are consistent with the conclusion that the students selecting the distractors did not have the required knowledge and instead were perhaps drawn by the superficial plausibility of the term ‘ATP’. This confirmed the content-based argument that this question was a highly discriminating one, and therefore a valid test of this area of bioscience.

The above data also illustrate how, every now and then, the majority of students can and do get things wrong. In other words, a high proportion of students selecting a particular option is, of itself, meaningless in terms of evidence of the validity of the question. To remove such a question from an exam post-administration solely on the grounds of percentages would be a major disservice to that minority of students who correctly understand the requisite science, and would result in the loss of valuable information about the development of students’ understanding within a discipline.

We are now in a better position to re-evaluate Question I, which we presented at the beginning of this paper and which exemplified a dilemma often facing examiners. This time, the relevant DIs have been added, again using the Quest program of Adams and Khoo (1996) to generate the point-biserials.

Question I revisited:

In plant propagation from tissue culture, which two plant hormones, in different concentrations, stimulate the roots and shoots?

   

Option   Response                        Percentage of students selecting option   Discrimination Index
A        auxin and gibberellin           41%                                       -0.29
B*       auxin and cytokinin             54%                                       0.39
C        abscisic acid and ethylene      2%                                        -0.17
D        abscisic acid and gibberellin   4%                                        -0.14

Now we can see that the DI strongly supports the purported key (B), and this concurs with textbook knowledge on the subject. Therefore, we conclude that this question legitimately and validly assesses this point, and that the students who failed to make the correct distinction between gibberellin and cytokinin (i.e. the 41% selecting option A) most likely lacked the relevant knowledge. In other words, the question is sound and appropriately discriminating.

The Limitations of the Discrimination Index

In the above examples, consideration of the DIs of each option led to reviews which resulted in different conclusions; one question was preserved with a corrected key, another was deleted from scoring, while the others were left entirely unchanged. This reinforces the point that the DI statistic ought not to be used in an absolute way. Rather, its greatest value is in identifying questions which are discrepant in some way with respect to other questions on the exam. The source of the discrepancy needs to be interpreted by the examiner(s), since those sources are potentially many and varied. In some cases, the source is a poorly drafted question which has no correct answer or more than one. In other cases, discrepant response patterns may arise when a question is too difficult for the cohort or the topic has been inadequately taught, and as a consequence students guess or use other strategies to select their response. One such strategy is relying on the phrasing of the question, which may help a student ‘pick’ the correct response when they might otherwise not have been able to, or lead them to miss a question they ought to have answered correctly.

Whatever the underlying cause, the DI helps draw attention to questions which are either flawed or particularly susceptible to guessing, and therefore in need of post-examination review. While qualitative review of questions should be an integral part of any examination process, the DI serves to identify questions which require closer analysis. Ultimately, the value of the DI lies in the examiner’s capacity to understand what the statistic is indicating, in the specific teaching and assessment context for which the question was written. Nevertheless, several writers have rightly counselled against the overuse of the DI as an indicator of question quality (Burton, 2001; Pyrczak, 1973). Arguments also exist over the most appropriate statistic to use as a DI (see Attali and Fraenkel, 2000). Such considerations are beyond the scope of this paper, except to say that a proper understanding of the nature of the statistic which one uses to generate quantitative data is obviously a key aspect of its interpretation.

Conclusions

Assessment is arguably the most important factor for student learning and is undoubtedly a powerful motivator (Gibbs and Simpson, 2004). Students are driven by the prospect of having the extent of their knowledge and understanding evaluated by their teachers, and more positively, by having the results of such assessment fed back to them in a timely, educative way. But while the majority may rule in democratic elections, this does not apply to student assessment. Sometimes the majority of students can be wrong, and arguably, for the sake of discriminating between different levels of understanding, should be wrong. The challenge for educators is to identify those questions which appropriately discriminate between the (relatively few) higher achievers and the majority still struggling with the concepts. Item analysis, in particular the discrimination index and associated distractor analysis, can support us in reviewing our exam questions and in making more valid decisions about them.

Communicating Author

Neville Chiavaroli, M.Ed, Medical Education Unit, Melbourne Medical School, Level 7, North Wing Medical Building, The University of Melbourne, Victoria 3010 Australia
Tel: +61 3 8344 8238; Fax: +61 3 8344 0188
n.chiavaroli@unimelb.edu.au

References

Abdel-Hameed, A.A., Al-Faris, E.A., Alorainy, I.A., and Al-Rukban, M.O. (2005) The criteria and analysis of good multiple choice questions in a health professional setting. Saudi Medical Journal, 26 (10), 1505–10

Adams, R. J. and Khoo, S. (1996) Quest: The Interactive Test Analysis System (computer programme, version 2.1), Melbourne: ACER Press

Attali, Y. and Fraenkel, T. (2000) The Point-Biserial as a Discrimination Index for Distractors in Multiple-Choice Items: Deficiencies in Usage and an Alternative. Journal of Educational Measurement, 37 (1), 77–86
http://dx.doi.org/10.1111/j.1745-3984.2000.tb01077.x

Burton, R. (2001) Do Item-discrimination indices really help us to improve our tests? Assessment and Evaluation in Higher Education, 26 (3), 213–20
http://dx.doi.org/10.1080/02602930120052378

Crisp, G.T. and Palmer, E.J. (2007) Engaging Academics with a Simplified Analysis of their Multiple-Choice Question (MCQ) Assessment Results. Journal of University Teaching and Learning Practice, 4 (2), 88–106

Gibbs, G. and Simpson, C. (2004) Conditions under which assessment supports students’ learning. Learning and Teaching in Higher Education, 1 (2004–2005), 3–31

Haladyna, T.M. (2004) Developing and validating multiple-choice test items. 3rd edn. New Jersey: Lawrence Erlbaum.

Haladyna, T.M. and Downing, S.M. (1989) A Taxonomy of Multiple-Choice Item-Writing Rules. Applied Measurement in Education, 2 (1), 37–50
http://dx.doi.org/10.1207/s15324818ame0201_3

Johnson, R.L., Penny, J.A. and Gordon, B. (2009) Assessing Performance: designing, scoring and validating performance tasks. New York: Guilford.

Kehoe, J. (1995) Writing Multiple-Choice Test Items. Practical Assessment, Research & Evaluation, 4(9), http://pareonline.net/getvn.asp?v=4&n=9 (accessed 5 February, 2011)

McAlpine, M. (2002) A Summary of Methods of Item Analysis, CAA Centre,
http://caacentre.lboro.ac.uk/dldocs/bp2final.pdf (accessed 6 February 2011)

NCS Pearson (2002) Exam System II (computer programme, version 2.1)

Pyrczak, F. (1973) Validity of the Discrimination Index as a Measure of Item Quality. Journal of Educational Measurement, 10 (3), 227–31
http://dx.doi.org/10.1111/j.1745-3984.1973.tb00801.x

DOI: 10.3108/beej.17.8