Explanatory Item Response Theory Models: Impact on Validity and Test Development?

Many explanatory item response theory (IRT) models have been developed since Fischer’s (Acta Psychologica 37:359–374, 1973) linear logistic test model was published. However, despite their applicability to typical test data, actual impact on test development and validation has been limited. The purpose of this chapter is to explicate the importance of explanatory IRT models in the context of a framework that interrelates the five aspects of validity (Embretson in Educ Meas Issues Pract 35, 6–22, 2016). In this framework, the response processes aspect of validity impacts other aspects. Studies on a fluid intelligence test are presented to illustrate the relevancy of explanatory IRT models to validity, as well as to test development.

Susan Embretson

A Taxonomy of Item Response Models in Psychometrika

The main aim of this study is to report on the frequency with which different item response theory models are employed in Psychometrika articles. Articles relevant to item response theory modeling in Psychometrika for 82 years (1936–2017) are sorted based on the classification framework by Thissen and Steinberg (Item response theory: Parameter estimation techniques. Dekker, New York, 1986). A sorting of the item response theory models used by authors of 367 research and review articles in Volumes 1–82 of Psychometrika indicates that the usual unidimensional parametric item response theory models for dichotomous items were employed in 51% of the articles. The usual unidimensional parametric item response theory models for polytomous items were employed in 21% of the articles. The multidimensional item response theory models were employed in 11% of the articles. Familiarity with each of the more complicated item response theory models may gradually increase the percentage of accessible articles. Another classification based on recent articles is proposed and discussed. Guiding principles for the taxonomy are also discussed.

Seock-Ho Kim, Minho Kwak, Meina Bian, Zachary Feldberg, Travis Henry, Juyeon Lee, Ibrahim Burak Olmez, Yawei Shen, Yanyan Tan, Victoria Tanaka, Jue Wang, Jiajun Xu, Allan S. Cohen

NUTS for Mixture IRT Models

The No-U-Turn Sampler (NUTS) is a relatively new Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior that common MCMC algorithms such as Gibbs sampling or Metropolis-Hastings usually exhibit. Because NUTS can efficiently explore the entire space of the target distribution, the sampler converges to high-dimensional target distributions more quickly than other MCMC algorithms and is hence less computationally expensive. The focus of this study is on applying NUTS to one of the complex IRT models, specifically the two-parameter mixture IRT (Mix2PL) model, and on examining its performance in estimating model parameters when sample size, test length, and number of latent classes are manipulated. The results indicate that overall, NUTS performs well in recovering model parameters. However, the recovery of the class membership of individual persons is not satisfactory for the three-class conditions. Findings from this investigation provide empirical evidence on the performance of NUTS in fitting Mix2PL models and suggest that researchers and practitioners in educational and psychological measurement should benefit from using NUTS in estimating parameters of complex IRT models.

Rehab Al Hakmani, Yanyan Sheng

Controlling Acquiescence Bias with Multidimensional IRT Modeling

Acquiescence is a commonly observed response style that may distort respondent scores. One approach to control for acquiescence involves creating a balanced scale and computing sum scores. Other model-based approaches may explicitly include an acquiescence factor as part of a factor analysis or multidimensional item response model. Under certain assumptions, both approaches may result in acquiescence-controlled scores for each respondent. However, the validity of the resulting scores is one issue that is sometimes ignored. In this paper, we present an application of these approaches under both balanced and unbalanced scales, and we report changes in criterion validity and respondent scores.

Ricardo Primi, Nelson Hauck-Filho, Felipe Valentini, Daniel Santos, Carl F. Falk

IRT Scales for Self-reported Test-Taking Motivation of Swedish Students in International Surveys

This study aims to model the self-reported test-taking motivation items in the PISA and TIMSS Advanced studies for Swedish students using an IRT approach. In the last two cycles of the assessments, six test-specific items were included in the Swedish student questionnaires to evaluate pupils' effort, motivation, and how they perceived the importance of the tests. Using a Multiple-Group Generalized Partial Credit model (MG-GPCM), we created an IRT motivation scale for each assessment. We also investigated measurement invariance for the two cycles of PISA (i.e., 2012 and 2015) and of TIMSS Advanced (i.e., 2008 and 2015). Results indicated that the proposed scales refer to unidimensional constructs and reliably measure students' motivation (Cronbach's alpha above 0.78). Differential item functioning across assessment cycles was restricted to two criteria (RMSD and DSF) and had more impact on the latent motivation scale for PISA than for TIMSS Advanced. Overall, the test-taking motivation items serve the purpose of diagnosing test-taking motivation in these two surveys well, and the proposed scales highlighted the slight increase in pupils' motivation across the assessment cycles.

Denise Reis Costa, Hanna Eklöf

A Modification of the IRT-Based Standard Setting Method

We present a modification of the IRT-based standard setting method proposed by García, Abad, Olea & Aguado (Psicothema 25(2):238–244, 2013), which we have combined with the cloud Delphi method (Yang, Zeng, & Zhang in IJUFKBS 20(1):77–97, 2012). García et al. (Psicothema 25(2):238–244, 2013) calculate the average characteristic curve of each level to determine cutoff scores on the basis of the joint characteristic curve. In the proposed new method, the influence of each item on the average item characteristic curve (ICC) is weighted according to its proximity to the next level. Performance levels are placed on a continuous scale, with each judge asked to determine an interval for each item. The cloud Delphi method is used until a stable final interval is achieved. From these judgments, the weights of each item in the scale are calculated. Then, a family of weighted average characteristic curves is calculated and, in the next step, joint weighted average ICCs are calculated. The cutoff score is determined by finding the ability at which the joint weighted average ICCs reach a certain predefined probability level. This paper compares the performance of this new procedure for a math test with the classic Bookmark method. We will show that this modification to the method improves cutoff score estimation.

Pilar Rodríguez, Mario Luzardo

Model Selection for Monotonic Polynomial Item Response Models

One flexible approach for item response modeling involves use of a monotonic polynomial in place of the linear predictor for commonly used parametric item response models. Since polynomial order may vary across items, model selection can be difficult. For polynomial orders greater than one, the number of possible order combinations increases exponentially with test length. I reframe this issue as a combinatorial optimization problem and apply an algorithm known as simulated annealing to aid in finding a suitable model. Simulated annealing resembles Metropolis-Hastings: A random perturbation of polynomial order for some item is generated and acceptance depends on the change in model fit and the current algorithm state. Simulations suggest that this approach is often a feasible way to select a better fitting model.

Carl F. Falk
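
The annealing search summarized in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the chapter's implementation: the function name `anneal_orders`, the integer order index per item, the geometric cooling schedule, and the user-supplied `fit` criterion (lower is better, for instance an information criterion) are all hypothetical choices.

```python
import math
import random


def anneal_orders(n_items, fit, max_order=3, n_iter=2000,
                  t0=1.0, cooling=0.995, seed=0):
    """Search per-item order indices that minimize fit(orders).

    `fit` is a hypothetical callable returning a model-fit criterion
    (lower is better); the abstract does not say which criterion is used.
    """
    rng = random.Random(seed)
    orders = [0] * n_items            # assumed start: lowest order everywhere
    current = fit(orders)
    best_orders, best = list(orders), current
    temp = t0                         # the "current algorithm state"
    for _ in range(n_iter):
        # Randomly perturb the order of one item by a single step.
        j = rng.randrange(n_items)
        proposal = list(orders)
        step = rng.choice((-1, 1))
        proposal[j] = min(max_order, max(0, proposal[j] + step))
        candidate = fit(proposal)
        delta = candidate - current
        # Improvements are always accepted; worse fits are accepted with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            orders, current = proposal, candidate
            if current < best:
                best_orders, best = list(orders), current
        temp *= cooling               # assumed geometric cooling schedule
    return best_orders, best


# Toy usage: the "fit" is just the squared distance to a made-up target.
target = [0, 2, 1, 3, 0]
toy_fit = lambda o: float(sum((a - b) ** 2 for a, b in zip(o, target)))
orders, value = anneal_orders(len(target), toy_fit)
```

In a real application `fit` would refit the item response model for each proposed set of orders, so the number of iterations is the main cost driver.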

TestGardener: A Program for Optimal Scoring and Graphical Analysis

The aim of this paper is to demonstrate how to use TestGardener to analyze testing data with various item types and to explain some of its main displays. TestGardener is software designed to aid the development, evaluation, and use of multiple choice examinations, psychological scales, questionnaires, and similar types of data. This software implements the optimal scoring of binary and multi-option items, and uses spline smoothing to obtain item characteristic curves (ICCs) that better fit the real data. Using TestGardener does not require any programming skill or formal statistical knowledge, which will make optimal scoring and item response theory more approachable for test analysts, test developers, researchers, and the general public.

Juan Li, James O. Ramsay, Marie Wiberg

Item Selection Algorithms in Computerized Adaptive Test Comparison Using Items Modeled with Nonparametric Isotonic Model

A computerized adaptive test (CAT) is used in this paper where the item bank is calibrated by using the nonparametric isotonic model proposed by Luzardo and Rodríguez (Quantitative psychology research. Springer International Publishing, Switzerland, pp. 99–108, 2015). The model is based on the estimation of the inverse of the item characteristic curves (ICC), and it uses a two-stage process. First, it uses the Ramsay nonparametric estimator of the ICC (Ramsay in Psychometrika 56:611–630, 1991) and then it estimates the density function of the inverse ICC by using Ramsay's estimator. By integrating the density function and then symmetrizing it, we obtain the result. Xu and Douglas (Psychometrika 71:121–137, 2006) studied the possibility of using Ramsay's nonparametric model in a CAT. They explored the possible methods of item selection but they did not use Fisher's maximum information method because the derivatives of the ICC may not be estimated well. We present, for the isotonic model, a suitable way to estimate the derivatives of the ICCs and obtain a formula for item information that allows us to use the maximum information criterion. This work focuses on comparing three methods for selecting items in the CAT: random selection, the maximum Fisher information criterion with the isotonic model, and the Kullback-Leibler information criterion.

Mario Luzardo

Utilizing Response Time in On-the-Fly Multistage Adaptive Testing

On-the-fly multistage adaptive testing (OMST), which integrates computerized adaptive testing (CAT) and multistage testing (MST), has recently gained popularity. While CAT selects each item on-the-fly and MST bundles items into pre-assembled modules, OMST assembles modules on-the-fly after the first stage. Since item selection algorithms play a crucial role in latent trait estimation and test security in CAT designs, and given the availability of response time (RT) in the current testing era, researchers have been actively striving to incorporate RT into item selection algorithms. However, most such algorithms have been applied only to CAT, whereas little is known about RT's role in the domain of OMST. Building upon previous research on RT-oriented item selection procedures, this research applies RT-oriented item selection algorithms to OMST. This study found that the relative performance of RT-oriented item selection methods in OMST was consistent with that in CAT. However, the underlying item bank structure and test design features can make a substantial difference with respect to estimation accuracy and test security.

Yang Du, Anqi Li, Hua-Hua Chang

Heuristic Assembly of a Classification Multistage Test with Testlets

In addition to the advantages of shortening tests and balancing item bank usage, multistage testing (MST) has the unique merit of incorporating testlets. A testlet is a group of items sharing the same stimulus. As MST can include an entire testlet in one module, fewer stimuli are required than items. In contrast, computerized adaptive testing (CAT) selects items one by one, which excludes the possibility of several items sharing the same stimulus. In this way, testlets in MST save stimulus processing time and facilitate ability estimation. To utilize the advantages brought by testlets, a classification MST was designed to upgrade an operational listening test. A heuristic top-down module assembly procedure incorporating testlets was developed based on the modified normalized weighted absolute deviation heuristic (NWADH). A three-stage classification MST with a 1-3-5 panel design was assembled to classify examinees into six levels. A real data-based simulation study was conducted to compare the performance of the classification MST and the operational linear test in terms of ability recovery and classification accuracy. The bi-factor model was used in item parameter calibration and examinee scoring. Results show that the 30-item MST performed similarly to the 44-item linear test with prior knowledge of examinee ability and outperformed the 44-item linear test without prior information, in both ability recovery and classification accuracy. In conclusion, the classification MST can shorten the test while maintaining good accuracy.

Zhuoran Wang, Ying Li, Werner Wothke

Statistical Considerations for Subscore Reporting in Multistage Testing

This study examines factors that influence the reliability of subscores and the accuracy of subscore estimates in multistage testing (MST). The factors considered in the study include the number of subtests, subtest length, correlations among subscores, item pool characteristics such as item pool size relative to the number of items required for an MST and statistical properties of items. Results indicated that the factors that most influenced subscore reliability and subscore estimates were subtest length, item pool size, and the degree of item discrimination.

Yanming Jiang

Investigation of the Item Selection Methods in Variable-Length CD-CAT

Cognitive diagnostic computerized adaptive testing (CD-CAT) provides useful cognitive diagnostic information for assessment and evaluation. At present, only a limited number of studies have investigated how to optimally assemble cognitive diagnostic tests. The cognitive discrimination index (CDI) and attribute-level discrimination index (ADI) are commonly used to select items for cognitive diagnostic tests. The CDI measures an item's overall discrimination power, and the ADI measures an item's discrimination power for a specific attribute. Su (Quantitative psychology research. Springer, Switzerland, pp. 41–53, 2018) integrated the constraint-weighted procedure with the posterior-weighted CDI and ADI for item selection in fixed-length CD-CAT, and found that examinees were measured with different precision. In reality, if the same precision of test results is required for all examinees, some examinees need to take more items and some need to take fewer items than others. To achieve the same precision for all examinees, this study investigated the performance of the constraint-weighted procedure with the posterior-weighted CDI and ADI for item selection in variable-length CD-CAT through simulations.

Ya-Hui Su

A Copula Model for Residual Dependency in DINA Model

Cognitive diagnosis models (CDMs) have received increasing attention in educational and psychological assessment. In practice, most CDMs are not robust to violations of local item independence. Many approaches have been proposed to deal with local item dependence (LID), such as conditioning on other responses and adding random effects (Hansen in Hierarchical item response models for cognitive diagnosis. University of California, LA, 2013); however, these have some drawbacks, such as non-reproducibility of marginal probabilities and problems of interpretation. Braeken et al. (Psychometrika 72(3):393–411, 2007) introduced a new class of marginal models that makes use of copula functions to capture the residual dependence in item response models. In this paper, we apply the copula methodology to model the item dependencies in the DINA model. It is shown that the proposed copula model can overcome some of the dependency problems in CDMs, and simulations show that the model parameters are recovered well. Furthermore, we have extended the R package CDM to fit the proposed copula DINA model.

Zhihui Fu, Ya-Hui Su, Jian Tao

A Cross-Disciplinary Look at Non-cognitive Assessments

The past two decades have seen an increasing interest in studying non-cognitive skills across disciplines. Despite the shared popularity, non-cognitive skills have been assessed variously across disciplines with different assumptions and target populations. Synthesizing across the commonalities, differences, and limitations in these various approaches will have important implications for the development and interpretation of non-cognitive assessments. In this project, we review the ways in which non-cognitive skills have been conceptualized and measured across psychology and education, and use self-control as an example to address the challenges to various types of assessments that are commonly seen in these disciplines. We will draw implications from a cross-disciplinary perspective on the validity and reliability of the non-cognitive assessments.

Vanessa R. Simmering, Lu Ou, Maria Bolsinova

An Attribute-Specific Item Discrimination Index in Cognitive Diagnosis

An item quality index that measures an item's correct classification rates of attributes is currently lacking. The purpose of this study is to propose an attribute-specific item discrimination index as a measure of the correct classification rate of attributes based on a q-vector, item parameters, and the distribution of attribute patterns. First, an attribute-specific item discrimination index is introduced. Second, a heuristic method for test construction using the new index is presented. The results of the first simulation study showed that the new index performed well in that its values closely matched the simulated correct classification rates of attributes across different conditions. The results of the second simulation study showed that the heuristic method based on the sum of the attributes' indices yielded performance comparable to that of the widely used CDI. The new index provides test developers with a useful tool to evaluate the quality of diagnostic items. It will be valuable to explore the applications and advantages of using the new index for developing an item selection algorithm or a termination rule in cognitive diagnostic computerized adaptive testing.

Lihong Song, Wenyi Wang

Assessing the Dimensionality of the Latent Attribute Space in Cognitive Diagnosis Through Testing for Conditional Independence

Cognitive diagnosis seeks to assess an examinee’s mastery of a set of cognitive skills called (latent) attributes. The entire set of attributes characterizing a particular ability domain is often referred to as the latent attribute space. The correct specification of the latent attribute space is essential in cognitive diagnosis because misspecifications of the latent attribute space result in inaccurate parameter estimates, and ultimately, in the incorrect assessment of examinees’ ability. Misspecifications of the latent attribute space typically lead to violations of conditional independence. In this article, the Mantel-Haenszel statistic (Lim & Drasgow in J Classif, 2019) is implemented to detect possible misspecifications of the latent attribute space by checking for conditional independence of the items of a test with parametric cognitive diagnosis models. The performance of the Mantel-Haenszel statistic is evaluated in simulation studies based on its Type-I-error rate and power.

Youn Seon Lim, Fritz Drasgow

Comparison of Three Unidimensional Approaches to Represent a Two-Dimensional Latent Ability Space

All test data represent the interaction of examinee abilities with individual test items. It has been argued that for most tests these interactions result in, either unintentionally or intentionally, multidimensional response data. Despite this realization, many standardized tests report a single score which follows from fitting a unidimensional model to the response data. This process is justified with the understanding that the response data, when analyzed, say for example by a principal component analysis, have a strong, valid, and content identifiable first component and weaker minor inconsequential components. It is believed that the resulting observed score scale represents primarily a valid composite of abilities that are intended to be measured. This study examines three approaches which estimate unidimensional item and ability parameters based on the parameters obtained from a two-dimensional calibration of the response data. The goal of this study is to compare the results of the different approaches to see which best captures the results of the two-dimensional calibration.

Terry Ackerman, Ye Ma, Edward Ip

Comparison of Hyperpriors for Modeling the Intertrait Correlation in a Multidimensional IRT Model

Markov chain Monte Carlo (MCMC) algorithms have made the estimation of multidimensional item response theory (MIRT) models possible under a fully Bayesian framework. An important goal in fitting a MIRT model is to accurately estimate the interrelationship among multiple latent traits. In Bayesian hierarchical modeling, this is realized through modeling the covariance matrix, which is typically done via the use of an inverse Wishart prior distribution due to its conjugacy property. Studies in the Bayesian literature have pointed out limitations of such specifications. The purpose of this study is to compare the inverse Wishart prior with other alternatives such as the scaled inverse Wishart, the hierarchical half-t, and the LKJ priors on parameter estimation and model adequacy of one form of the MIRT model through Monte Carlo simulations. Results suggest that the inverse Wishart prior performs worse than the other priors on parameter recovery and model-data adequacy across most of the simulation conditions when variance for person parameters is small. Findings from this study provide a set of guidelines on using these priors in estimating the Bayesian MIRT models.

Meng-I Chang, Yanyan Sheng

On Extended Guttman Condition in High Dimensional Factor Analysis

It is well known that factor analysis and principal component analysis often yield similar estimated loading matrices. Guttman (Psychometrika 21:273–285, 1956) identified a condition under which the two matrices are close to each other at the population level. We discuss the matrix version of the Guttman condition for closeness between the two methods. It can be considered an extension of the original Guttman condition in the sense that the matrix version involves not only the diagonal elements but also the off-diagonal elements of the inverse matrices of variance-covariances and unique variances. We also discuss some implications of the extended Guttman condition, including how to obtain approximate estimates of the inverse of the covariance matrix under high dimensions.

Kentaro Hayashi, Ke-Hai Yuan, Ge (Gabriella) Jiang

Equivalence Testing for Factor Invariance Assessment with Categorical Indicators

Factorial invariance assessment is central in the development of educational and psychological instruments. Establishing factor structure invariance is key for building a strong validity argument and establishing the fairness of score use. Fit indices and guidelines for judging a lack of invariance are an ever-developing line of research. An equivalence testing approach to invariance assessment, based on the RMSEA, has been introduced. Simulation work demonstrated that this technique is effective for identifying loading and intercept noninvariance under a variety of conditions when indicator variables are continuous and normally distributed. However, in many applications indicators are categorical (e.g., ordinal items). Equivalence testing based on the RMSEA must be adjusted to account for the presence of ordinal data to ensure accuracy of the procedures. The purpose of this simulation study is to investigate the performance of three alternatives for making such adjustments, based on work by Yuan and Bentler (Sociological Methodology, 30(1):165–200, 2000) and Maydeu-Olivares and Joe (Psychometrika 71(4):713–732, 2006). Equivalence testing procedures based on the RMSEA using these adjustments are investigated and compared with the Chi-square difference test. Manipulated factors included sample size, magnitude of noninvariance, proportion of noninvariant indicators, model parameter (loading or intercept), and number of indicators, and the outcomes of interest were Type I error and power rates. Results demonstrated that the $T_{3}$ statistic (Asparouhov & Muthén, 2010) in conjunction with diagonally weighted least squares estimation yielded the most accurate invariance testing outcome.

W. Holmes Finch, Brian F. French

Canonical Correlation Analysis with Missing Values: A Structural Equation Modeling Approach

Canonical correlation analysis (CCA) is a generalization of multiple correlation that examines the relationship between two sets of variables. When there are missing values, spectral decomposition in CCA becomes complicated and difficult to implement. This article investigates a structural equation modeling approach to canonical correlation analysis when data have missing values.

Zhenqiu (Laura) Lu

Small-Variance Priors Can Prevent Detecting Important Misspecifications in Bayesian Confirmatory Factor Analysis

We simulated Bayesian CFA models to investigate the power of the posterior predictive p-value (PPP) to detect model misspecification by manipulating sample size, strongly and weakly informative priors for nontarget parameters, degree of misspecification, and whether data were generated and analyzed as normal or ordinal. Rejection rates indicate that PPP lacks power to reject an inappropriate model unless priors are unrealistically restrictive (essentially equivalent to fixing nontarget parameters to zero) and both sample size and misspecification are quite large. We suggest researchers evaluate global fit without priors for nontarget parameters, then search for neglected parameters if PPP indicates poor fit.

Terrence D. Jorgensen, Mauricio Garnier-Villarreal, Sunthud Pornprasertmanit, Jaehoon Lee

Measuring the Heterogeneity of Treatment Effects with Multilevel Observational Data

Multilevel latent class analysis and mixture propensity score models have been implemented to account for heterogeneous selection mechanisms and for proper causal inference with observational multilevel data (Kim & Steiner in Quantitative Psychology Research. Springer, Cham, pp. 293–306, 2015). The scenarios imply the existence of multiple selection classes, and if class membership is unknown, homogeneous classes can usually be identified via multilevel logistic latent class models. Although latent class random-effects logistic models are frequently used, linear models and fixed-effects models can be alternatives for identifying multiple selection classes and estimating class-specific treatment effects (Kim & Suk in Specifying Multilevel Mixture Models in Propensity Score Analysis. International Meeting of Psychometric Society, New York, 2018). Using the Korea TIMSS 2015 eighth-grade student data, this study examined the potentially heterogeneous treatment effects of private science lessons by inspecting multiple selection classes (e.g., different motivations to receive the lessons) using four types of selection models: random-effects logistic, random-effects linear, fixed-effects logistic, and fixed-effects linear models. Implications of identifying selection classes in causal inference with multilevel assessment data are discussed.

Youmi Suk, Jee-Seon Kim

Specifying Multilevel Mixture Selection Models in Propensity Score Analysis

Causal inference with observational data is challenging, as the assignment to treatment is often not random and people may have different reasons to receive or to be assigned to the treatment. Moreover, the analyst may not have access to all of the important variables and may face omitted variable bias as well as selection bias in nonexperimental studies. It is known that fixed effects models are robust against unobserved cluster variables while random effects models provide biased estimates of model parameters in the presence of omitted variables. This study further investigates the properties of fixed effects models as an alternative to the common random effects models for identifying and classifying subpopulations or “latent classes” when selection or outcome processes are heterogeneous. A recent study by Suk and Kim (2018) found that linear probability models outperform standard logistic selection models in terms of the extraction of the correct number of latent classes, and the authors continue to search for optimal model specifications of mixture selection models across different conditions, such as strong and weak selection, various numbers of clusters and cluster sizes. It is found that fixed-effects models outperform random effects models in terms of classifying units and estimating treatment effects when cluster size is small.

Jee-Seon Kim, Youmi Suk

The Effect of Using Principal Components to Create Plausible Values

In all large-scale educational surveys such as PISA and TIMSS, the distribution of student abilities is estimated using the method of plausible values. This method treats student abilities within each country as missing variables that should be imputed based upon both student responses to cognitive items and a conditioning model using background information from questionnaires. Previous research has shown that, in contrast to creating single estimates of ability for each individual student, this technique will lead to unbiased population parameters in any subsequent analyses, provided the conditioning model is correctly specified (Wu in Studies in Educational Evaluation 31:114–128, 2005). More recent research has shown that, even if the conditioning model is incorrectly specified, the approach will provide a good approximation to population parameters as long as sufficient cognitive items are answered by each student (Marsman, Maris, Bechger, & Glas in Psychometrika 81:274–289, 2016). However, given the very large amount of background information collected in studies such as PISA, background variables are not all individually included in the conditioning model, and a smaller number of principal components are used instead. Furthermore, since no individual student answers cognitive items from every dimension of ability, we cannot rely on sufficient items having been answered to ignore possible resulting misspecification in the conditioning model. This article uses a simple simulation to illustrate how relying upon principal components within the conditioning model could potentially lead to bias in later estimates. A real example of this issue is provided based upon analysis of regional differences in performance in PISA 2015 within the UK.

Tom Benton

Adopting the Multi-process Approach to Detect Differential Item Functioning in Likert Scales

The current study compared the performance of the logistic regression (LR) and the odds ratio (OR) approaches in differential item functioning (DIF) detection in which the three processes of an IRTree model were considered for a five-point response scale. Three sets of binary pseudo items (BPI) were generated to indicate an intention of endorsing the midpoint response, a positive/negative attitude toward an item, and a tendency to use extreme categories, respectively. Missing values inevitably appeared in the last two sets of BPI. We manipulated the DIF patterns, the percentages of DIF items, and the purification procedure (with/without). The results suggested that (1) both the LR and OR methods performed well in detecting DIF when BPI did not include missing values; (2) the OR method generally outperformed the LR method when BPI included missing values; (3) the OR method performed fairly well without a purification procedure, but the purification procedure improved the performance of the LR approach, especially when the number of DIF items was large.

Kuan-Yu Jin, Yi-Jhen Wu, Hui-Fang Chen
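
The decomposition of a five-point response into three binary pseudo items, as described in the abstract above, can be sketched as follows. This is only a minimal illustration under assumed conventions: the category numbering (1 to 5, with 3 as the midpoint), the 0/1 coding, and the function name `to_pseudo_items` are hypothetical and are not taken from the study.

```python
from typing import Optional, Tuple


def to_pseudo_items(response: int) -> Tuple[int, Optional[int], Optional[int]]:
    """Split one five-point response (1..5) into three binary pseudo items.

    Returns (midpoint, direction, extremity); None marks a missing value.
    The coding is an assumed convention, not the authors' specification.
    """
    if response not in (1, 2, 3, 4, 5):
        raise ValueError("response must be an integer from 1 to 5")
    if response == 3:
        # Midpoint chosen: direction and extremity are undefined, hence missing.
        return 1, None, None
    direction = 1 if response > 3 else 0        # 1 = positive side, 0 = negative side
    extremity = 1 if response in (1, 5) else 0  # 1 = extreme category chosen
    return 0, direction, extremity


# Code every possible response once.
coded = [to_pseudo_items(r) for r in (1, 2, 3, 4, 5)]
```

Under this coding, only the second and third pseudo items can be missing, which matches the abstract's remark that missing values appear in the last two sets of BPI.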

Detection of Differential Item Functioning via the Credible Intervals and Odds Ratios Methods

Differential item functioning (DIF) analysis is an essential procedure for educational and psychological tests to identify items that exhibit varying degrees of DIF. DIF means that the assumption of measurement invariance is violated, so that test scores are not comparable for individuals of the same ability level from different groups, which substantially threatens test validity. In this paper, we investigated the credible intervals (CI) and odds ratios (OR) methods to detect uniform DIF within the framework of the Rasch model through a series of simulations. The results showed that the CI method performed better than the OR method in identifying DIF items under the balanced DIF conditions. However, the CI method yielded inflated false positive rates under the unbalanced DIF conditions. The effectiveness of these two approaches was illustrated with an empirical example.

Ya-Hui Su, Henghsiu Tsai

Psychometric Properties of the Highest and the Super Composite Scores

For students who took college admissions tests multiple times, institutions may have different policies for using the multiple sets of test scores in decision making. For example, some may use the most recent score, and others may use the average, the highest, or even the super composite score, which combines the highest score on each subject test across administrations. Previous research on these different score use policies mainly focused on their predictive validity, with little discussion of their psychometric properties. Through both theoretical and empirical investigations, this study showed how the bias, the standard error of measurement, and the reliability of scores for these different policies compare with each other and how these properties change for each score type as the number of test events increases.

Dongmei Li
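
The four score use policies named in the abstract above can be illustrated with a small sketch. It assumes, purely for illustration, that a composite is the mean of the subject scores in one administration; the function name `score_policies`, the subject names, and the score values are hypothetical and do not come from the study.

```python
from statistics import mean
from typing import Dict, List


def score_policies(administrations: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute four summary scores for one student.

    `administrations` is ordered oldest to newest; each entry maps a
    subject name to that administration's score. The composite is assumed
    to be the mean of the subject scores (an illustrative assumption).
    """
    if not administrations:
        raise ValueError("at least one administration is required")
    composites = [mean(a.values()) for a in administrations]
    subjects = administrations[0].keys()
    # Super composite: take the best score per subject across administrations.
    super_composite = mean(max(a[s] for a in administrations) for s in subjects)
    return {
        "most_recent": composites[-1],
        "average": mean(composites),
        "highest": max(composites),
        "super_composite": super_composite,
    }


# Hypothetical student with two administrations.
result = score_policies([
    {"math": 20, "reading": 30},
    {"math": 26, "reading": 24},
])
```

In this made-up example both administrations have the same composite, yet the super composite is higher because it draws the best subject score from each sitting. By construction it can never fall below the highest composite.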

A New Equating Method Through Latent Variables

Ensuring the comparability of measurements is important in many fields. In educational measurement, equating methods are used to obtain comparable scores from different test forms. In the case of sum scores, equated scores are obtained using the equating transformation, which maps the scores on the scale of one test form into their equivalents on the scale of another. This transformation has typically been computed using continuous approximations of the score distributions, leading to equated scores that are not necessarily defined on the original discrete scale. Considering scores as ordinal random variables, we propose a latent variable formulation based on a flexible Bayesian nonparametric model to perform an equipercentile-like equating that is capable of producing equated scores on the original discrete scale. The performance of our model is assessed using simulated data under the equivalent groups equating design. The results show that the proposed method has better performance with respect to a discrete version of estimated equated scores from traditional equating methods.

Inés Varas, Jorge González, Fernando A. Quintana

Comparison of Two Item Preknowledge Detection Approaches Using Response Time

Response time (RT) has been demonstrated to be effective in identifying compromised items and test takers with item preknowledge. This study compared the performance of the effective response time (ERT) approach and the residual based on the lognormal response time model (RES) approach in detecting examinees with item preknowledge using item response time in a linear test. Three factors were considered in this study: the percentage of examinees with item preknowledge, the percentage of breached items, and the percent decrease of response time of the breached items. The results suggest that the RES approach not only controls the Type I error rate below 0.05 for all investigated conditions, but also flags the examinees with item preknowledge sensitively.

Chunyan Liu

Identifying and Comparing Writing Process Patterns Using Keystroke Logs

There is a growing literature on the use of process data in digitally delivered assessments. In this study, we analyzed students’ essay writing processes using keystroke logs. Using four basic writing performance indicators, writers were grouped into four clusters, representing groups from fluent to struggling. The clusters differed significantly on the mean essay score, mean total time spent on task, and mean total number of words in the final submissions. Two of the four clusters were significantly different on the aforementioned three dimensions but not on typing skill. The higher scoring group even showed signs of less fluency than the lower scoring group, suggesting that task engagement and writing efforts might play an important role in generating better quality text. The four identified clusters further showed distinct sequential patterns over the course of the writing session on three process characteristics and, as well, differed on their editing behaviors during the writing process.

Mo Zhang, Mengxiao Zhu, Paul Deane, Hongwen Guo

Modeling Examinee Heterogeneity in Discrete Option Multiple Choice Items

A new format for computer-based administration of multiple-choice items, the discrete option multiple choice (DOMC) item, is receiving growing attention due to potential advantages related both to item security and control of testwiseness. A unique feature of the DOMC format is the potential for an examinee to respond incorrectly to an item for different reasons: either failure to select a correct response or selection of a distractor response. This feature motivates consideration of a new item response model that introduces an individual differences trait related to general proclivity to select response options. Using empirical data from an actual DOMC test, we validate the model by demonstrating the statistical presence of such a trait and discuss its implications for test equity in DOMC tests and the potential value for added item administration constraints.

Nana Kim, Daniel M. Bolt, James Wollack, Yiqin Pan, Carol Eckerly, John Sowles

Simulation Study of Scoring Methods for Various Multiple-Multiple-Choice Items

The multiple-choice (MC) format is the most widely used format in objective testing. The “select all the choices that are true” items, also called multiple-multiple-choice (MMC) items, are a variation of the MC format that gives no instructions about how many correct choices may be selected. Although many studies have been conducted and various scoring methods for MMC items have been compared, the results have often been inconsistent. Arai and Miyano (Bull Data Anal Japan Classif Soc 6:101–112, 2017) proposed new scoring methods and compared their scoring features by conducting numerical simulations of a few MMC item patterns. In this study, we conducted numerical simulations of all other plausible MMC item patterns to examine the relationships between examinees’ abilities (true scores) and the scores given by the scoring methods. We illustrated the effects of the total number of choices and the number of correct choices for each scoring method.

Sayaka Arai, Hisao Miyano

Additive Trees for Fitting Three-Way (Multiple Source) Proximity Data

Additive trees are graph-theoretic models that can be used for constructing network representations of pairwise proximity data observed on a set of N objects. Each object is represented as a terminal node in a connected graph; the length of the paths connecting the nodes reflects the inter-object proximities. Carroll, Clark, and DeSarbo (J Classif 1:25–74, 1984) developed the INDTREES algorithm for fitting additive trees to analyze individual differences of proximity data collected from multiple sources. INDTREES is a mathematical programming algorithm that uses a conjugate gradient strategy for minimizing a least-squares loss function augmented by a penalty term to account for violations of the constraints imposed by the underlying tree model. This article presents an alternative method for fitting additive trees to three-way two-mode proximity data that relies neither on gradient-based optimization nor on penalty terms, but uses an iterative projection algorithm. A real-world data set consisting of 22 proximity matrices showed that the proposed method gave results virtually identical to those of the INDTREES method.

Hans-Friedrich Köhn, Justin L. Kern

A Comparison of Ideal-Point and Dominance Response Processes with a Trust in Science Thurstone Scale

The purpose of this study is to compare the dominance and ideal-point response process models for a trust in science measure developed from Thurstone’s (Am J Sociol 33(4):529–554, 1928; Psychol Rev 36(3):222–241, 1929) scaling procedures. The trust in science scale was scored in four different ways: (1) a dominance response approach using observed scores, (2) a dominance response approach using model-based trait estimates, (3) an ideal-point response observed score approach using Thurstone scoring, and (4) an ideal-point response approach using model-based trait estimates. Comparisons were made between the four approaches in terms of psychometric properties and correlations with political beliefs, education level, and beliefs about scientific consensus in a convenience sample of 401 adults. Results suggest that both the ideal-point and two-parameter IRT models fit equally well in terms of overall model fit. However, two items demonstrated poor item fit in the two-parameter model. Correlations with political beliefs, education level, and science-related items revealed very small differences in magnitude across the four scoring procedures. This study shows support for the flexibility of the ideal-point IRT model for capturing non-ideal-point response patterns. The study also demonstrates the use of IRT to examine item parameters and item fit.

Samuel Wilgus, Justin Travis

Rumor Scale Development

Rumor refers to an unsubstantiated story or piece of information in circulation. Although greater integrity of the source implies greater reliability of a rumor, not everything that seems reliable would be adjudged valid. There has been a cogent need for rumor validity assessment, but a dearth of construct-relevant scales hampers empirical data collection. Considering that psychological scales are indispensable for assessment, the present study developed a suitable and psychometrically sound scale, using a cross-sectional design and 570 randomly sampled participants. The psychometric properties are based on reliability and validity. Reliability (α = 0.78) was determined by item-total statistics, while validity was based on content validity indexes, principal component analysis, and the compatibility of the factor model with the data. Seven extracted factors accounted for 92% of the total scale variance. The rumor intensity score (R = 80) corroborated the scale's suitability. However, although the newly developed 50-item Rumor Scale is suitable for adaptation among different populations in various settings, there is a need for confirmatory factor analysis (CFA), which was not implemented in the initial scale development study. Further validations, suggested to include cross-cultural and trans-national adaptations using CFA and other competing analysis models, can help to establish sufficient norms.

Joshua Chiroma Gandi

An Application of a Topic Model to Two Educational Assessments

A topic model is a statistical model for extracting latent clusters or themes from the text in a collection of documents. The purpose of this study was to apply a topic model to two educational assessments. In the first study, the model was applied to students’ written responses to an extended response item on an English Language Arts (ELA) test. In the second study, a topic model was applied to the errors students made on a fractions computation test. The results for the first study showed that five distinct writing patterns were detected in students’ writing on the ELA test. Two of the patterns were related to low scores, two patterns were associated with high scores, and one pattern was unrelated to the score on the test. In the second study, five error patterns (i.e., latent topics) were detected on the pre-test and six error patterns were detected on the post-test for the fractions computation test. The results for Study 2 also yielded evidence of instructional effects on students’ fractions computation ability. Following instruction, more students in the experimental instruction condition made fewer errors than students in the business-as-usual condition.

Hye-Jeong Choi, Minho Kwak, Seohyun Kim, Jiawei Xiong, Allan S. Cohen, Brian A. Bottge
