
## About this book

A theme running through this book is that of making inference about sources of variation or uncertainty, and the author shows how information about these sources can be used for improved estimation of certain elementary quantities. Amongst the topics covered are: essay rating, summarizing item-level properties, equating of tests, small-area estimation, and incomplete longitudinal studies. Throughout, examples are given using real data sets which exemplify these applications.

## Table of contents

### Frontmatter

Abstract
The advent and proliferation of computing technology is continually making us rethink the meaning of the adjectives ‘large’, ‘extensive’, and ‘complex’, as in large data, extensive computing, and complex computational algorithms. Our appetite for information has been enhanced to such a degree that our systems for digesting it are often under strain, and they crave for the substance of information to be presented in a more condensed manner.
Nicholas T. Longford

### 2. Reliability of essay rating

Abstract
Standardized educational tests have until recently been associated almost exclusively with multiple-choice items. In such a test, the examinee is presented items comprising a reading passage stating the question or describing the problem to be solved, followed by a set of response options. One of the options is the correct answer; the others are incorrect. The examinee’s task is to identify the correct response. With such an item format, examinees can be given a large number of items in a relatively short time, say, 40 items in a half-hour test. Scoring the test, that is, recording the correctness of each response, can be done reliably by machines at a moderate cost. A serious criticism of this item format is the limited variety of items that can be administered and that certain aspects of skills and abilities cannot be tested by such items. Many items can be solved more effectively by eliminating all the incorrect response options than by deriving the correct response directly. Certainly, problems that can be formulated as multiple-choice items are much rarer in real life; in many respects, it would be preferable to use items with realistic problems that require the examinees to construct their responses.
Nicholas T. Longford

### 3. Adjusting subjectively rated scores

Abstract
In the context of educational testing, estimation of the variance components and of the reliability coefficients is at best of secondary importance to the pivotal task: assigning scores to the essays (performances, problem solutions, or the like) in a way that reflects their quality as faithfully as possible. In ideal circumstances, this would correspond to reconstructing the true score $\alpha_i$ for each essay. A more realistic target is to get as close to $\alpha_i$ as possible. This chapter discusses improvements on the trivial estimator of the true score, the mean score over the $K$ sessions, $${y_{i,.}} = ({y_{i,{j_{i1}}}} + \ldots + {y_{i,{j_{iK}}}})/K,$$ by means of several adjustment schemes. The variance components $\sigma_a^2$, $\sigma_b^2$, and $\sigma_e^2$ play an important role in these schemes. To motivate them, consider the following extreme cases: when everybody has the same true score, $\sigma_a^2 = 0$, each examinee should be given the same score, irrespective of the grades given by the raters. Similarly, when the raters vary a great deal in their severities (large $\sigma_b^2$), or the rating is very inconsistent (large $\sigma_e^2$), an extreme score (say, 0 or 9 on the scale 0–9) is not strong evidence of very poor or very high quality of the essay. On average, it may be prudent to ‘pull’ the extreme scores closer to the mean, so as to hedge our bets against the largest possible errors.
Nicholas T. Longford
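
The idea of ‘pulling’ extreme scores toward the mean can be illustrated with a minimal sketch. It is not taken from the book: it assumes an additive model in which each grade is the sum of a true score (variance $\sigma_a^2$), a rater severity (variance $\sigma_b^2$), and an inconsistency term (variance $\sigma_e^2$), that the $K$ grades of an essay come from $K$ different raters, and that the overall mean and the variance components are known. Under those assumptions the usual linear shrinkage factor is $\sigma_a^2 / \{\sigma_a^2 + (\sigma_b^2 + \sigma_e^2)/K\}$. The function name `shrunken_score` and its parameters are hypothetical.

```python
def shrunken_score(mean_score, overall_mean, var_a, var_b, var_e, k):
    """Shrink an essay's mean score toward the overall mean.

    Illustrative assumption (not the book's scheme): grades follow
    y = true score + rater severity + inconsistency, with variances
    var_a, var_b and var_e, and the k grades come from k distinct raters.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if min(var_a, var_b, var_e) < 0:
        raise ValueError("variance components must be non-negative")

    # Variance of the mean score around the true score of the essay.
    noise = (var_b + var_e) / k
    total = var_a + noise
    if total == 0:
        # No variation at all: the mean score equals the true score.
        return mean_score

    # Shrinkage factor between 0 (ignore the grades) and 1 (no adjustment).
    u = var_a / total
    return overall_mean + u * (mean_score - overall_mean)


# The two extreme cases from the abstract, plus an intermediate one.
same_true_scores = shrunken_score(9.0, 5.0, 0.0, 0.5, 0.5, 2)
perfect_raters = shrunken_score(9.0, 5.0, 1.0, 0.0, 0.0, 2)
intermediate = shrunken_score(9.0, 5.0, 1.0, 0.5, 0.5, 2)
```

With $\sigma_a^2 = 0$ every essay receives the overall mean, and with $\sigma_b^2 = \sigma_e^2 = 0$ the mean score is left unchanged; in between, an extreme mean score is moved part of the way toward the overall mean.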

### 4. Rating several essays

Abstract
Chapters 2 and 3 were concerned with the measurement of examinees’ performances on a single item. Even though an essay written by an examinee may be much more informative about his/her ability than a multiple-choice item, a single essay is unlikely to provide sufficient information, especially in the presence of imperfect grading by the raters. Also, the examinee’s performance may be unduly affected by the essay topic, and no single topic can represent the tested domain in a balanced way.
Nicholas T. Longford

### 5. Summarizing item-level properties

Abstract
A typical educational test consists of a large number of multiple-choice items. In the construction and pretesting of a test form, particular attention is paid to every item of the form. Each item is screened for a number of possible faults, such as non-existence or non-uniqueness of the correct answer among the response options, use of words or expressions that may have a different connotation or may be more (or less) familiar to an ethnic category of examinees, may require non-elementary knowledge from a different subject area, and the like.
Nicholas T. Longford

### 6. Equating and equivalence of tests

Abstract
A test form serves its purpose of assessing an individual’s skill, ability, or knowledge of a well-defined domain only if the test scores satisfy a number of conditions. First, the test scores have to be strongly associated (highly correlated) with the measured trait. This trait is an inherently unobservable quantity. How closely it is approximated by a test score (on average, in a population of examinees) is in most cases a matter of conjecture. At best, some faith in a test score can be generated by pursuing and documenting the usual elements of quality control: review of the composition of the test and scrutiny of every step in the test assembly.
Nicholas T. Longford

### 7. Inference from surveys with complex sampling design

Abstract
Monitoring the quality of education is an important concern for educational researchers, politicians, and economists, as well as parents and organizations representing them. Large-scale surveys are probably the most important source of inference about the educational system of a country. Not infrequently, conclusions drawn from educational surveys make the news headlines and are used to support and justify policies that affect the educational system.
Nicholas T. Longford

### 8. Small-area estimation

Abstract
Large-scale educational and literacy surveys, such as the National Assessment of Educational Progress (NAEP), the National Education Longitudinal Survey (NELS), and the National Adult Literacy Survey (NALS), often provide sufficient information about their target populations and certain well-represented subpopulations, but they cannot be used directly for inferences about small areas, such as states, counties, or other geographical units. National surveys do contain some information about small areas, but such information may be difficult to extract and, on its own, may be insufficient for inferences with a desired level of confidence.
Nicholas T. Longford

### 9. Cut scores for pass/fail decisions

Abstract
Many educational and licensing tests are used for classifying examinees into a small number of categories, such as ‘satisfactory’, ‘borderline’, and ‘unsatisfactory’. In such tests, each possible response pattern has to be assigned to one of the categories. This task is often simplified by defining a scoring rule for the responses, so that the examinees are ordered on a unidimensional scale. Then it suffices to set the cut scores that separate the categories. When several test forms are used, separate sets of cut scores have to be set for each of them because the forms are likely to have different score distributions.
Nicholas T. Longford

### 10. Incomplete longitudinal data

Abstract
In most statistical applications, we expect that a set of relevant variables is available for each subject. However, human subjects are notorious for imperfect cooperation with surveys, especially when they have little or no stake in the outcome of the data collection exercise. Pervasive examples of imperfect cooperation are failure to answer an item from the background questionnaire and, more generally, failure to adhere to the protocol of the survey. For instance, examinees may lose motivation half-way through the test and abandon the test without attending to a segment of items, or they may mark the responses to these items arbitrarily. More radical forms of incomplete cooperation are not turning up for the appointment and rejecting the approach of the data collector. Such instances are not uncommon because educational surveys typically demand a substantial commitment, in terms of time and mental effort, from the examinees. Our concern in this chapter is not with alleviating these problems, such as providing incentives to examinees and improving the presentation of the survey instruments, but rather with facing the problem of missing data as a fact of life and devising methods of analysis that make full use of data from all subjects, however incomplete their records may be.
Nicholas T. Longford

### Backmatter

Further information