
2024 | Book

Quantitative Psychology

The 88th Annual Meeting of the Psychometric Society, Maryland, USA, 2023


About this book

This book includes presentations given at the 88th annual meeting of the Psychometric Society, held in Maryland, USA on July 24–28, 2023.

The proceedings cover a diverse set of psychometric topics, including, but not limited to, item response theory, cognitive diagnostic models, Bayesian estimation, validity and reliability issues, and applications within several different fields. The authors come from all over the world, work in different areas of psychometrics, and bring diverse professional and academic experiences.

Table of Contents

Frontmatter
A Family of Discrete Kernels for Presmoothing Test Score Distributions

In the fields of educational measurement and testing, score distributions are often estimated by the sample relative frequency distribution. As many score distributions are discrete and may have irregularities, it has been common practice to use presmoothing techniques to correct for such irregularities. A common way to conduct presmoothing has been to use log-linear models. In this chapter, we introduce a novel class of discrete kernels that can effectively estimate the probability mass function of scores, providing a presmoothing solution. The chapter includes an empirical illustration demonstrating that the proposed discrete kernel estimates perform as well as or better than existing methods like log-linear models in presmoothing score distributions. The practical implications of this finding are discussed, highlighting the potential benefits of using discrete kernels in educational measurement contexts. Additionally, the chapter identifies several areas for further research, indicating opportunities for advancing the field's methodology and practices.
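To fix ideas, a generic discrete kernel presmoother replaces the sample relative frequencies with a weighted average over neighboring score points; the particular kernel family proposed in the chapter differs in its details, so the following is only an illustrative form:

$$\hat{p}(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} k_h(x, X_i), \qquad x \in \{0, 1, \dots, K\},$$

where $X_1,\dots,X_n$ are the observed test scores and $k_h(x,\cdot)$ is a discrete kernel (a probability mass function concentrated near $x$) with a bandwidth-type smoothing parameter $h$, chosen so that $\hat{p}$ sums to one over the score range.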

Jorge González, Marie Wiberg
Priors in Bayesian Estimation Under the Graded Response Model

The purpose of this chapter is to review various priors used in Bayesian estimation under the graded response model with clear mathematical definitions of the prior distributions. A Bayesian estimation method, Gibbs sampling, was compared with the marginal Bayesian estimation method using empirical data. The effects of the priors and their specifications on both item and ability parameter estimates are demonstrated. Issues in Bayesian estimation, use of priors in item response theory, and selection of item response theory models are discussed.
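For readers unfamiliar with the model, the graded response model specifies cumulative category probabilities through a logistic link; a common (illustrative, not necessarily the chapter's exact) prior specification looks like this:

$$P(X_{ij} \ge k \mid \theta_i) \;=\; \frac{1}{1 + \exp\{-a_j(\theta_i - b_{jk})\}}, \qquad k = 1, \dots, m_j - 1,$$

with, for example, $a_j \sim \text{Lognormal}(0, 1)$, $b_{jk} \sim N(0, 2^2)$ subject to the order constraint $b_{j1} < \dots < b_{j,m_j-1}$, and $\theta_i \sim N(0, 1)$.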

Seock-Ho Kim
Identifiability Conditions in Cognitive Diagnosis: Implications for Q-Matrix Estimation Algorithms

The Q-matrix of a cognitive diagnosis (CD) assessment documents the item-attribute associations and is thus a key component of any CD test. However, the true Q-matrix underlying a CD assessment is never known; it must be estimated. In practice, this task is typically performed by content experts, which, however, can result in the misspecification of the Q-matrix, causing examinees to be misclassified. In response to these difficulties, algorithms have been developed for estimating the entire Q-matrix based on the item responses. Extant algorithms for estimating the Q-matrix under the conjunctive Deterministic Input Noisy "AND" Gate (DINA) model either impose the identifiability conditions from Chen et al. (J Amer Statist Assoc 110:850–866, 2015) or do not. The debate about which is the "right" way to proceed is ongoing, especially because these conditions are sufficient but not necessary, which means that viable alternative Q-matrix estimates may be ignored. The goal of this chapter was to compare the estimated Q-matrices obtained from three algorithms that do not impose the identifiability conditions on the Q-matrix estimator with the estimated Q-matrices obtained from two algorithms that do impose the identifiability conditions. Simulations were conducted using data conforming to the DINA model generated using an identifiable "true" Q-matrix. Three factors affecting Q-matrix estimation were controlled: the length of the test, the number of attributes, and the amount of error perturbation added to the data. The estimated Q-matrices were evaluated on whether they met the identifiability conditions and on their capacity to enable the correct classification of examinees. The results show there is essentially no difference in the rates of correctly classified examinees between Q-matrix estimates obtained from algorithms imposing the identifiability conditions and those that do not.
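As background, the DINA model referenced throughout this abstract links item j to the attribute profile of examinee i through the Q-matrix:

$$\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}}, \qquad P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{\,1 - \eta_{ij}},$$

where $q_{jk}$ indicates whether item j requires attribute k, $s_j$ is the slip parameter, and $g_j$ is the guessing parameter. The identifiability conditions of Chen et al. (2015) are restrictions on the structure of the matrix $Q = (q_{jk})$.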

Hyunjoo Kim, Hans Friedrich Köhn, Chia-Yi Chiu
A Two-Stage Approach to a Latent Variable Mixed-Effects Location-Scale Model

Understanding within- and between-subject variation in repeated measures is central to longitudinal behavioral investigations. Mixed-effects location-scale models include distinct variance models to permit study of heterogeneity of within- and between-subject variation. Recent developments have extended the model to address measurement error in the longitudinal response. Accounting for variation in the response that is due to measurement error is especially important in studies that focus efforts to understand the within-subject variation. Relative to a mixed-effects location-scale model for a variable assumed to be measured without error, the latent variable version of the model is more complicated, and this complexity can be carried over to increased computational demands. One approach to the estimation of the latent variable version of the model simplifies the calculation by analytically removing the random scale effect from the marginal response distribution, resulting in a substantial reduction in the computational burden using maximum likelihood estimation. This paper proposes a two-stage approach in which factor scores and their corresponding standard errors of measurement are estimated and then incorporated into a mixed-effects location-scale model. This paper considers the two approaches in the context of daily diary data from a large sample of adults in the United States.
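A minimal sketch of a mixed-effects location-scale model of the kind described here (without the latent-variable measurement part, which the chapter adds on top) is:

$$y_{ti} = \mathbf{x}_{ti}'\boldsymbol{\beta} + v_i + \varepsilon_{ti}, \qquad v_i \sim N(0, \sigma_v^2), \qquad \varepsilon_{ti} \sim N\!\left(0, \sigma_{\varepsilon_{ti}}^2\right),$$
$$\log \sigma_{\varepsilon_{ti}}^2 = \mathbf{w}_{ti}'\boldsymbol{\tau} + \omega_i,$$

where the random scale effect $\omega_i$ allows the within-subject variance to differ across subjects; the latent variable version replaces $y_{ti}$ with a factor measured by multiple indicators, which is where the two-stage approach with factor scores and their standard errors of measurement enters.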

Shelley A. Blozis, Mark H. C. Lai
A Hierarchical Prior for Bayesian Variable Selection with Interactions

Selecting subsets of variables has always been a vital and challenging topic in educational and psychological settings. In many cases, the probability that an interaction is active is influenced by whether the related variables are active. In this chapter, we propose a hierarchical prior for Bayesian variable selection to account for a structural relationship between variables and their interactions. Specifically, an interaction is more likely to be active when all the associated variables are active and is more likely to be inactive when at least one variable is inactive. The proposed hierarchical prior is based upon the deterministic inputs, noisy "and" gate model and is implemented in the stochastic search variable selection approach (George and McCulloch (J Amer Statist Assoc 88(423):881–889, 1993)). A Metropolis-within-Gibbs algorithm is used to uncover the selected variables and to estimate the coefficients. Simulation studies were conducted under different conditions, and a real data example is analyzed. The performance of the proposed hierarchical prior was compared with widely adopted independent priors in Bayesian variable selection approaches, including the traditional stochastic search variable selection prior, Dirac spike and slab priors (Mitchell and Beauchamp (J Amer Statist Assoc 83(404):1023–1032, 1988)), and the hyper-g prior (Liang et al. (J Amer Statist Assoc 103(481):410–423, 2008)).
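A stochastic search variable selection prior of the kind being extended here places a two-component mixture on each coefficient; the hierarchical idea can be sketched (in illustrative notation, not necessarily the authors' exact parameterization) by tying the inclusion probability of an interaction to the indicators of its parent main effects:

$$\beta_j \mid \gamma_j \;\sim\; (1-\gamma_j)\,N(0, \tau_j^2) + \gamma_j\,N(0, c_j^2 \tau_j^2),$$
$$P(\gamma_{jk} = 1 \mid \gamma_j, \gamma_k) \;=\; \begin{cases} p_1 & \text{if } \gamma_j = \gamma_k = 1,\\ p_0 & \text{otherwise,} \end{cases} \qquad p_1 > p_0,$$

where $\gamma_j \in \{0,1\}$ indicates whether main effect j is active and $\gamma_{jk}$ indicates whether the j-by-k interaction is active.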

Anqi Li, Steven Andrew Culpepper
Application of Topic Modeling Techniques in Meta-analysis Studies

The latent Dirichlet allocation (LDA) model is a topic modeling technique that reveals the semantic structure underlying a collection of documents. An advantage of the LDA model is that it provides an interpretation of a large body of textual data supported by Bayesian statistical evidence. We analyzed 198 abstracts from two journals, Review of Educational Research and Psychological Bulletin, published from 2008 to 2018. Based on perplexity, we extracted four topics underlying the 198 abstracts. Using the top 10 representative abstracts among the single-topic abstracts and the top 20 words of each topic, we labeled the four topics as genetic or environmental characteristics of individuals, students' achievement with intervention, cognitive psychology, and behavioral psychology. We thus demonstrate the applicability of the LDA model for understanding journal articles and provide guidelines for applying the LDA model to topic modeling in social science studies.
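For readers who want to try this kind of analysis, a minimal sketch using scikit-learn (an assumed toolchain; the chapter does not specify software, and abstracts.txt is a hypothetical input file with one abstract per line) looks like this:

# Minimal LDA topic-modeling sketch with scikit-learn (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

with open("abstracts.txt", encoding="utf-8") as f:          # hypothetical input file
    abstracts = [line.strip() for line in f if line.strip()]

# Document-term matrix of unigram counts, dropping very rare and very common terms.
vectorizer = CountVectorizer(stop_words="english", min_df=2, max_df=0.95)
dtm = vectorizer.fit_transform(abstracts)

# Fit a 4-topic model; in practice the number of topics would be chosen via perplexity.
lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topic = lda.fit_transform(dtm)                           # document-by-topic proportions

# Print the top 20 words for each topic, mirroring the labeling step described above.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:20]
    print(f"Topic {k}:", ", ".join(terms[i] for i in top))

print("Perplexity:", lda.perplexity(dtm))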

Minju Hong, Sunyoung Park
Comparing Maximum Likelihood to Markov Chain Monte Carlo Estimation of the Multivariate Social Relations Model

The social relations model (SRM) is a linear random-effects model applied to dyadic data within social networks (i.e., round-robin data). Such data have a unique nesting structure in that dyads (pairs) are cross-classified within individuals, who can also be nested in different networks. The SRM is used to examine basic multivariate relations between components of dyadic variables at two levels: individual-level random effects and dyad-level residuals. The current "gold standard" for estimating multivariate SRMs is maximum likelihood (ML) estimation. However, Bayesian approaches, such as Markov chain Monte Carlo (MCMC) estimators, may provide some practical advantages for estimating complex or computationally intensive models. In this chapter, we report a small simulation study comparing the accuracy and efficiency of ML and MCMC point (and interval) estimates of a trivariate SRM in the ideal scenario: normally distributed, complete round-robin data. We found that ML outperformed MCMC at both levels. MCMC greatly underestimated parameters and displayed poor coverage rates at the individual level but was relatively accurate at the dyad level.
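For orientation, the univariate social relations model decomposes each directed dyadic observation as

$$y_{ij} = \mu + a_i + b_j + \varepsilon_{ij},$$

where $a_i$ is the actor (perceiver) effect of person i, $b_j$ is the partner (target) effect of person j, and $\varepsilon_{ij}$ is the relationship residual; the trivariate SRM studied here specifies such a decomposition for three variables jointly, with covariances among the individual-level effects and among the dyad-level residuals.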

Aditi M. Bhangale, Terrence D. Jorgensen
Exploring Attenuation of Reliability in Categorical Subscore Reporting

Research on subscores has consistently advocated discontinuing their reporting when they lack sufficient psychometric properties, yet many educational agencies and operational testing programs have not changed their practice. This may be due to several real-world complications such as user demand, competitors, or contractual obligations. Given these challenges, some test providers have continued to report subscores but in a categorical format to mitigate misinterpretation of small differences likely due to error. However, there also exists robust literature on how continuous scores grouped into categories can be less reliable than the scores from which they were constructed. Using a resampling design based on real data, a variation on the Lord–Wingersky recursion described by Feinberg and von Davier (J Educ Behav Stat 45(5):515–533, 2020) was applied to compare two different approaches to discretizing subscores into categories, relative-to-self and relative-to-average. Results support categorical subscores as a promising alternative when a continuous subscore is too imprecise to report as a numeric score. Implications for practice, operational utility, and the extent to which categorical subscore reporting represents an appropriate compromise are discussed.
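The Lord–Wingersky recursion underlying the variation used here builds the distribution of a summed score one item at a time from the items' correct-response probabilities; a bare-bones sketch for dichotomous items at a fixed ability value (ignoring the chapter's resampling design) is:

# Lord-Wingersky recursion: distribution of the summed score over dichotomous items,
# given each item's probability of a correct response at a fixed ability value.
def lord_wingersky(p_correct):
    dist = [1.0]                              # P(summed score = 0) before any item
    for p in p_correct:
        new = [0.0] * (len(dist) + 1)
        for score, prob in enumerate(dist):
            new[score] += prob * (1.0 - p)    # item answered incorrectly
            new[score + 1] += prob * p        # item answered correctly
        dist = new
    return dist                               # dist[s] = P(summed score = s)

# Example: three items with hypothetical correct-response probabilities.
print(lord_wingersky([0.8, 0.6, 0.4]))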

Richard A. Feinberg
Assessing Cross-Level Interactions in Clustered Data Using CATE Estimation Methods

Treatment effect heterogeneity is a critical issue in causal inference, as a one-size-fits-all approach is not sufficient and can even be detrimental for many treatments and interventions. In environments where individuals are clustered within communities, effect heterogeneity is commonplace rather than an exception, as characteristics of communities often interact with a treatment implemented on members within the communities, and such interactions result in treatment effect heterogeneity. This chapter demonstrates how various nonparametric methods for estimating conditional average treatment effects (CATEs) can be used to examine cross-level interaction effects between cluster-level variables and treatments implemented at the individual level. The pool of considered methods includes causal forests, Bayesian additive regression trees (BARTs), and X-Learners (using random forests and BART as base learners). We apply these methods to the Trends in International Mathematics and Science Study data, a widely recognized large-scale assessment dataset in education. In educational settings, cross-level interactions have garnered significant attention, as they can address the moderating effects of school-level resources and actions on student outcomes. Understanding these interactions is crucial for making informed policy decisions to enhance educational effectiveness. This chapter concludes by discussing remaining issues and future directions in employing CATE with clustered observational data.
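As an illustration of one of the meta-learners named above, a bare-bones X-learner with random forest base learners (using only scikit-learn; the chapter's actual implementations, base learners, and tuning are not specified here) can be sketched as follows:

# Minimal X-learner sketch with random forest base learners (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def x_learner_cate(X, y, w):
    """X: covariate array, y: outcome array, w: 0/1 treatment array. Returns CATE estimates."""
    treated, control = w == 1, w == 0

    # Stage 1: separate outcome models for treated and control units.
    mu1 = RandomForestRegressor(random_state=0).fit(X[treated], y[treated])
    mu0 = RandomForestRegressor(random_state=0).fit(X[control], y[control])

    # Stage 2: imputed individual treatment effects.
    d1 = y[treated] - mu0.predict(X[treated])   # treated: observed minus predicted control outcome
    d0 = mu1.predict(X[control]) - y[control]   # control: predicted treated outcome minus observed

    # Stage 3: model the imputed effects as functions of the covariates.
    tau1 = RandomForestRegressor(random_state=0).fit(X[treated], d1)
    tau0 = RandomForestRegressor(random_state=0).fit(X[control], d0)

    # Combine the two CATE models, weighting by an estimated propensity score.
    e = RandomForestClassifier(random_state=0).fit(X, w).predict_proba(X)[:, 1]
    return e * tau0.predict(X) + (1 - e) * tau1.predict(X)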

Jee-Seon Kim, Xiangyi Liao, Wen Wei Loh
A Comparison of Full Information Maximum Likelihood and Machine Learning Missing Data Analytical Methods in Growth Curve Modeling

Missing data are inevitable in longitudinal studies. Traditional methods, such as the full information maximum likelihood (FIML), are commonly used to handle ignorable missing data. However, they may lead to biased model estimation due to missing not at random data that often appear in longitudinal studies. Recently, machine learning methods, such as random forest (RF) and K-nearest neighbors (KNN) imputation methods, have been proposed to cope with missing values. Although machine learning imputation methods have been gaining popularity, few studies have investigated the tenability and utility of these methods in longitudinal research. Through Monte Carlo simulations, this chapter evaluates and compares the performance of traditional and machine learning approaches (FIML, RF, and KNN) in growth curve modeling. The effects of sample size, the rate of missingness, and missing data mechanism on model estimation are investigated. Results indicate that FIML is a better choice than the two machine learning imputation methods in terms of model estimation accuracy and efficiency.
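For concreteness, the two machine-learning imputation strategies compared here can be reproduced in spirit with scikit-learn (an assumed toolchain; the simulation's actual settings are not specified here):

# KNN and random-forest-based imputation of a repeated-measures data matrix (illustrative).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 5))                 # hypothetical wide-format repeated measures
Y[rng.random(Y.shape) < 0.2] = np.nan         # 20% of values set missing for illustration

Y_knn = KNNImputer(n_neighbors=5).fit_transform(Y)
Y_rf = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    random_state=0,
).fit_transform(Y)
# Either completed matrix could then be passed to growth curve modeling software,
# whereas FIML would instead use the incomplete matrix Y directly.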

Dandan Tang, Xin Tong
Investigating Variable Selection Techniques Under Missing Data: A Simulation Study

Variable selection is one of the most pervasive problems researchers face, especially with the increased ease in data collection arising from online data collection strategies. Machine learning methods such as LASSO and elastic net regression have gained traction in the field but are limited in the types of problems for which they are suitable. As such, researchers have pulled more complex techniques, such as the genetic algorithm, from fields like computer science. Although there is strong support in the literature for the use of each of these methods on complete data (McNeish. Multivar Behav Res 50(5):471–484, 2015. https://doi.org/10.1080/00273171.2015.1036965 ; Schroeders et al. PLoS One 11(11):e0167110, 2016. https://doi.org/10.1371/journal.pone.0167110 ), less is known about their relative performance in the presence of missing data. Using a large-scale Monte Carlo simulation, the performance of the LASSO, elastic net, and the genetic algorithm is reviewed for solving variable selection problems in the presence of ignorable missing data. In particular, this chapter incorporates multiple imputation, a state-of-the-art missing data handling technique, into the studied tools. All techniques were found to perform at satisfactory levels (as measured by MSE, precision, false positive rate, and computation time) under MCAR and MAR conditions. The genetic algorithm was the most robust to changes in the data.
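As a pared-down illustration of two of the three techniques compared (LASSO and elastic net, here run after a single mean imputation rather than the chapter's multiple imputation, and omitting the genetic algorithm), one might write:

# LASSO and elastic net variable selection on singly-imputed data (illustrative sketch only;
# the chapter combines these tools with multiple imputation, not a single fill-in).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV, ElasticNetCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = 1.5 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=300)   # only predictors 0 and 3 are active
X[rng.random(X.shape) < 0.1] = np.nan                       # 10% MCAR missingness

X_imp = SimpleImputer(strategy="mean").fit_transform(X)

lasso = LassoCV(cv=5, random_state=0).fit(X_imp, y)
enet = ElasticNetCV(cv=5, l1_ratio=0.5, random_state=0).fit(X_imp, y)

print("LASSO keeps predictors:", np.flatnonzero(lasso.coef_ != 0))
print("Elastic net keeps predictors:", np.flatnonzero(enet.coef_ != 0))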

Catherine Bain, Dingjing Shi
Comparison of DIF Detection Methods

Differential item functioning (DIF) has been one of the important considerations in measurement practices over the last three decades. Various statistical methods have been developed to detect DIF, and as a result, many simulation studies have been conducted to evaluate the performance of two or three methods under different test situations. To obtain an overall picture of how these methods compare for tests with dichotomous items with and without DIF, this study carries out a comprehensive comparison of the methods by evaluating their Type I error and power rates while controlling for sample sizes, test lengths, DIF sizes, and DIF proportions. Results of the study provide a set of guidelines for researchers and practitioners on the use of each DIF detection method in different test situations where uniform or nonuniform DIF is present.

Yevgeniy Ptukhin, Yanyan Sheng
Validity Evidence for an ECE Classroom Observation Tool

A lack of studies examining the quality of early childhood education (ECE) classrooms and the use of teaching practices across low- and middle-income countries limits what we know about the extent to which effective ECE teachers in different contexts use similar teaching practices. This limitation is, in part, due to the lack of standardized ECE classroom observation tools with technically sound psychometric properties available for low- and middle-income contexts. In response to this global gap, the objective of this study was to explore reliability and validity evidence for the constructs measured by a classroom observation tool across contexts worldwide. Data from five low- and middle-income countries located in four world regions were used to conduct this validation study. Results showed that the tool meets adequate levels of reliability, as captured by Cronbach's alpha coefficient, and that item-total correlations are all positive and above 0.40. In terms of validity, multiple theory-driven confirmatory factor analysis models were estimated and compared; consistent with the theoretical framework behind the development of the tool, model fit favored a three-factor model solution that uses parcels as observed variables. Results are discussed in terms of the policy implications for monitoring ECE teaching quality worldwide, and next steps for the validity research agenda on this classroom observation tool are proposed.

Elaine Ding, Adelle Pushparatnam, Jonathan Seiden, Estefania Avedaño, Ezequiel Molina, Marie-Helene Cloutier, Diego Luna Bazaldua, Laura Gregory
Enhancing Multilevel Models Through Supervised Machine Learning

Clustered data are common in various fields, such as social sciences (multiple individual measurements) and machine learning (city-wise weather forecasts, regional house price predictions). In such data, observations within clusters exhibit dependencies, violating assumptions of independence and identical distribution. To address this, multilevel random effects are used instead of fixed effects. Starting from linear approaches, typical multilevel frameworks extend to non-linear approaches by direct specifications (e.g., products of variables), guided by theory or trial and error in model comparisons. However, when the primary aim is to find the best prediction model in a multilevel context, this approach can be cumbersome, and non-linear models may lack flexibility. Here, we introduce mixed-effects machine learning (mixedML), incorporating multilevel effects into supervised regression machine learning models. This framework enhances flexibility in prediction model functional forms. We discuss its applicability in multilevel modeling, following its publication and presentation at IMPS 2023. By popular request after the presentation, we explain how to apply mixedML from a traditional multilevel modeling perspective. For detailed technical information, please refer to the original publication.

Pascal Kilian, Augustin Kelava
Assessing the Effects of a Yearly Renewable Education Program Through Causal Mediation Analysis

When education programs are renewed yearly, participation in such programs can vary over time, resulting in multiple patterns of participation. One such example is the national Head Start program administered for two consecutive years, serving children aged 3 and 4. Even though there are four possible patterns of Head Start attendance, Head Start has often been examined as a one-time event in early childhood education literature, with only the effect of early Head Start attendance at age 3 being evaluated. In this study, we propose to apply causal mediation analysis to the study of yearly renewable education programs, separating the effect of initial program attendance into sequential effects of the programs over time and long-term effects of initial program attendance. We adopt a parametric closed-form estimation that combines regression models to examine the effect of Head Start on children’s receptive vocabulary using data from the Head Start Impact Study as an illustration. Our analysis exemplifies how the effect of a yearly renewable education program can be attributed to different program attendance histories and invites further research on studying time-varying treatment effects as causal mediation effects.

Hanna Kim, Jee-Seon Kim
Gumbel-Reverse Gumbel (GRG) Model: A New Asymmetric IRT Model for Binary Data

We propose a novel asymmetric item response theory (AsymIRT) model based on a convex combination of the complementary log–log and log–log links. These two links are the cumulative distribution functions (CDFs) of the Gumbel-min and Gumbel-max extreme value distributions, respectively. The resulting Gumbel-Reverse Gumbel (GRG) mixture model has one additional parameter. We illustrate using intelligence data taken from the Synthetic Aperture Personality Assessment (SAPA). In particular, we illustrate how nonparametric bootstrapping can be used to study model identification. We conclude with a discussion of how the GRG model fits in the AsymIRT literature more broadly as well as a call for scholars to work on comparing the AsymIRT models to each other in a more rigorous manner, in particular focusing on identification issues within AsymIRT models that do not fix a tail direction a priori.
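Concretely, with a linear predictor $\eta = a(\theta - b)$, the two component links and their convex combination (illustrative notation; the chapter's parameterization may differ in details) are

$$F_{\text{cloglog}}(\eta) = 1 - \exp\{-\exp(\eta)\}, \qquad F_{\text{loglog}}(\eta) = \exp\{-\exp(-\eta)\},$$
$$P(X = 1 \mid \theta) = \pi\, F_{\text{cloglog}}(\eta) + (1 - \pi)\, F_{\text{loglog}}(\eta), \qquad \pi \in [0, 1],$$

where the mixing weight $\pi$ is the one additional parameter governing the direction and degree of asymmetry of the item response function.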

Jay Verkuilen, Peter J. Johnson
Fisher Information-Based Item Difficulty and Discrimination Indices for Binary Item Response Models

While difficulty and discrimination parameters have appealing and intuitive meanings in the 2PL model, the parameters in IRT models beyond the 2PL are much harder to interpret. For example, even adding the pseudo-guessing parameter in the 3PL model means that the difficulty and discrimination parameters no longer have the meaning they have in the 2PL, and they are not directly comparable when the pseudo-guessing parameter differs. Increasingly, models even more complicated than the 3PL, such as the 4PL or various asymmetric IRF models, e.g., the Logistic Positive Exponent (LPE), Heteroscedastic Residuals (HR), Complementary Log–Log (CLL), etc., have been considered. These models help resolve some issues encountered in IRT, but unfortunately, they sacrifice the interpretable nature of difficulty and discrimination that the 2PL provides. We propose to use two properties of Fisher information—the maximizer and a transformation of the information at the maximum—in analogy to the 2PL model, for which the model parameters and the Fisher information function are in close correspondence, as measures of effective difficulty and discrimination, respectively.
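The analogy to the 2PL rests on the fact that its item information function has a simple closed form:

$$I_j(\theta) = a_j^2\, P_j(\theta)\,[1 - P_j(\theta)], \qquad P_j(\theta) = \frac{1}{1 + \exp\{-a_j(\theta - b_j)\}},$$

so the information is maximized at $\theta = b_j$ (the difficulty) with maximum value $a_j^2/4$ (a monotone function of discrimination). For more complex models, the location of the information maximum and a transformation of its height play the corresponding roles as effective difficulty and effective discrimination.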

Peter J. Johnson, Jay Verkuilen
Investigating the Impact of Equating on Measurement Error Using Generalizability Theory

The paper discusses the relationship between equating error and measurement error in educational measurement research, which are traditionally treated as independent sources of error with potential differential impacts on individual scores and group means. This paper proposes a perspective shift using generalizability theory. It argues that equating, when integrated into the measurement process, should be viewed as one among various sources contributing to measurement error. The key assertion is that the impact of equating can be assessed alongside other sources of error through appropriate generalizability study designs. The paper acknowledges potential challenges and offers a framework for classifying applications of generalizability theory, providing empirical examples of study designs to investigate the impact of equating and other errors either separately or simultaneously. The paper also delves into the differential impact of equating error on individual scores and group means by decomposing the variance components in generalizability theory, particularly when individuals or schools are the object of measurement. By adopting a unified view of equating error and measurement error within the generalizability theory framework, the paper aims to facilitate both the conceptual discussions and practical estimation of various error sources. Ultimately, this approach is expected to enhance the interpretation of scores in educational measurement.

Dongmei Li
Fitting a Drift–Diffusion Item Response Theory Model to Complex Cognition Response Times

Drift–Diffusion Models (DDMs) have been widely successful in modeling fast decision response times. The DDM describes the underlying (cognitive) decision process as a function of a diffusion process drifting toward a decision threshold. A few studies have shown that introducing within-trial variability in DDM parameters or describing DDM parameters as a function of item properties improves the DDM model fit for response times of the Complex Decision Task (CDT) as well. One such extension of the DDM is the item response theory-based Q-diffusion model (QDM). The QDM has been successful in modeling response times of CDTs such as chess ability assessment. The current study further examined whether the QDM can fit response times corresponding to certain problem-solving tasks. First, the drift rate parameter of the standard DDM was extended to approximate the within-trial variability in the reasoning process, as discussed in existing meta-reasoning studies that examine such within-trial dynamics for problem-solving tasks. Response times were then simulated using the standard DDM and this extension, and the goodness of fit of the QDM was examined using a Bayesian model fit method, the posterior predictive check (PPC). The PPC analysis revealed that the fitted QDM was able to effectively describe the mean of the simulated response times. However, the fitted QDM was not able to describe their variance.

Ritesh K. Malaiya
Comparing Correlation Tests

The Pearson product-moment correlation is a widely used statistic for exploring the association between two variables. To test whether the population correlation is zero, the traditional parametric procedure utilizing Fisher's z transformation can be applied. However, this method relies on the assumption of normality, which is often violated in real-world scenarios. Bootstrapping, a resampling technique, provides more accurate and reliable solutions when data are not ideally distributed or have small sample sizes. But the questions of whether various bootstrap testing methods possess equal efficacy and how to determine the best among them remain unanswered. More importantly, the fourth-order moment significantly impacts the distribution of the correlation, yet this subject is typically overlooked by most researchers. There are further inquiries to consider regarding these testing methods, but there is a lack of literature addressing these issues. This project investigates and compares the performance of four correlation testing methods—the traditional parametric procedure using Fisher's z transformation, bivariate bootstrapping, univariate bootstrapping, and bootstrap hypothesis testing—through theoretical derivation and Monte Carlo simulations. We focus on the inference of the correlation, including data generation methods, covariance matrix distributions, and deriving the correlation distribution. Simulation studies were conducted by applying the four methods to datasets having either high or regular kurtosis, generated from normal or non-normal distributions. The miscoverage rates across various scenarios were summarized and compared. Drawing insights from the simulation results, the project offers conclusive observations and recommendations, aiding in the selection of the most appropriate method for specific scenarios.
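Two of the four compared procedures, the Fisher z test and a simple bivariate percentile-bootstrap test, can be sketched as follows (illustrative code only; the chapter's exact bootstrap variants and simulation settings are not reproduced here):

# Fisher z test and a bivariate percentile-bootstrap test of H0: rho = 0 (illustrative).
import numpy as np
from scipy import stats

def fisher_z_test(x, y):
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    z = np.arctanh(r) * np.sqrt(n - 3)            # Fisher's z, standardized under H0: rho = 0
    return r, 2 * stats.norm.sf(abs(z))           # two-sided p-value

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample (x, y) pairs jointly
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.quantile(rs, [alpha / 2, 1 - alpha / 2])   # reject H0 if 0 lies outside

x = np.random.default_rng(1).standard_t(df=5, size=50)          # heavy-tailed example data
y = 0.3 * x + np.random.default_rng(2).standard_t(df=5, size=50)
print(fisher_z_test(x, y), bootstrap_ci(x, y))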

Zhenqiu Laura Lu, Ke-Hai Yuan
Optimizing Maximum Likelihood Estimation in Performance Factor Analysis: A Comparative Study of Estimation Methods

Assessment methods impact learning and ensure alignment with course goals. However, formative assessments face model selection challenges influenced by factors like class size and item availability. These factors compromise assessment validity in item response models for small populations. Performance factor analysis, an alternative model class, effectively provides detailed student performance information and parameter estimation per learning object or latent attribute. To address challenges with sample size and the multidimensionality of latent attribute-item matrices in formative assessments, this study explores the limited-memory Broyden–Fletcher–Goldfarb–Shanno with bound constraints (L-BFGS-B) and Nelder–Mead optimization methods for maximum likelihood estimation in performance factor analysis. A comparison of their accuracy using various criteria indicates that the L-BFGS-B and Nelder–Mead methods are robust for handling small to moderate sample sizes in both unidimensional and multidimensional Q-matrix scenarios.
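The two optimizers compared are both available through scipy.optimize.minimize; a toy illustration of maximizing a Bernoulli log-likelihood with each method (not the chapter's performance factor analysis model) is:

# Comparing L-BFGS-B and Nelder-Mead on a toy maximum likelihood problem (illustrative).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.random(500) < expit(0.5 + 1.2 * x)        # Bernoulli data from a logistic model

def neg_loglik(params):
    b0, b1 = params
    p = expit(b0 + b1 * x)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

start = np.zeros(2)
fit_lbfgsb = minimize(neg_loglik, start, method="L-BFGS-B",
                      bounds=[(-10, 10), (-10, 10)])          # box constraints on both parameters
fit_nm = minimize(neg_loglik, start, method="Nelder-Mead")

print("L-BFGS-B:", fit_lbfgsb.x, "Nelder-Mead:", fit_nm.x)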

A. Mehrabi, O. Altintas, J. W. Morphew
Validation of the Household Food Security Survey Module (HFSSM) Using Factor Analysis and Rasch Measurement Theory

The purpose of this study is to use factor analysis (exploratory factor analysis and bifactor confirmatory factor analysis) and Rasch measurement theory to analyze the Household Food Security Survey Module (HFSSM). The goal is to achieve three objectives: (a) to verify construct validity of the survey by examining the factor structure, (b) to evaluate the item effectiveness in measuring the continuum of household food insecurity based on Rasch measurement theory, and (c) to combine the outcomes of both models to establish their consistency. Our findings reveal 18 items of HFSSM that contribute to assessing household food security as a unidimensional construct. The methods used in this study can be applied to validate and assess scales in various other areas.

Jing Li, Seock-Ho Kim, George Engelhard Jr.
Are We Playing the Same Game? Translating Fairness Content

Traduttore, traditore. An act of translation is always an act of betrayal. For years, language has been shown to impact individual perception and decision-making. Some argue that preferences and evaluations may be modulated as a result of linguistic contexts (Vidal et al., 2021). Therefore, when translating assessments across languages and cultures, it is important to verify that the validity of these tests is preserved. With the rise of automation, it has become more common for hiring processes to include various assessments (Assessments, 2023). These assessments include gamified implementations of the Trust Game (Berg et al., Games Econ Behav 10:122–142, 1995) and the Dictator Game (Forsythe et al., Games Econ Behav 6:347–369, 1994), which follow a three-factor model. Though utilized globally, these assessments are built on English language response patterns. Therefore, this research aims to disentangle the effects of language in observed behaviors of fairness, altruism, and decision-making speed. Factorial invariance testing reveals overall equal factor loadings when comparing Spanish to English across six countries. Small intercept differences in fairness ratings were identified in Spain for European Spanish-speaking participants compared with English-speaking participants. Possible elaborations and future mitigations of these differences in other populations are briefly discussed.

Amy Li, Ambar Kleinbort, Janelle Szary, Anne Thissen-Roe
Diagnosing Skills and Misconceptions with Bayesian Networks Applied to Diagnostic Multiple-Choice Tests

We discuss the use of Bayesian networks as a general framework for diagnostic classification in educational assessments, showing how they can accommodate sophisticated capabilities useful in diagnostic assessment, including modeling hierarchical structure among latent attributes, diagnostic use of information from incorrect alternatives in multiple-choice items, and simultaneous diagnosis of both subskills and misconceptions. These capabilities are illustrated with an application reported by Lee (2003) and Lee and Corter (2003, 2011), who proposed using Bayesian networks as the inference engine to learn from test data and diagnose individuals' misconceptions or bugs in the domain of multicolumn subtraction. Lee and Corter demonstrated that diagnosis of misconceptions or bugs is most effective when information from incorrect alternatives in multiple-choice items is used and when both bugs and skills are assessed simultaneously, with a hierarchical structure assumed for subskills and misconceptions. More recently, these innovations and issues have been investigated in the context of traditional CDM models. In this paper, we describe the approach taken by Lee and Corter and discuss some advantages and disadvantages of using Bayesian networks for diagnostic assessment.

James E. Corter, Jihyun Lee
Exploring Conceptual Differences Among Nonparametric Estimators of Treatment Heterogeneity in the Context of Clustered Data

One aim of educational research is to evaluate interventions developed to improve student learning and behavioral outcomes. Estimating an intervention’s treatment effect is one way to evaluate its efficacy. This treatment effect is not always constant, and heterogeneity arises when not all students respond to interventions similarly, particularly between subgroups with varying characteristics. Sample differences in sociodemographic features or individual covariates can identify key subgroups, which can be used to estimate heterogeneity via the conditional average treatment effect (CATE) for individuals with those characteristics. Nonparametric methods are an increasingly popular choice for estimating CATE because of their flexibility to model complex relationships between many covariates and treatment status. Clustered data is a common occurrence in educational research, but many nonparametric methods do not explicitly account for clustered data structure. To better understand the role of clustered data structure in estimating heterogeneous treatment effects with nonparametric methods, we conduct a simulation study that compares the performance of different popular nonparametric methods as measured by their recovery of individual treatment effects under a potential outcomes framework. We examine methods’ performance across varying levels of intraclass correlation (ICC) and number of clusters sampled. Finally, we discuss the practice of accounting for clustered data structure, how conceptual differences between methods might correspond to differences in performance, and pose questions for future research.

Graham Buhrman, Xiangyi Liao, Jee-Seon Kim
Assessment of Testlet Effects: Testing it All at Once

A testlet is a cluster of items that shares a common stimulus (e.g., a set of questions all related to the same text passage). Testlets are commonly used in educational and psychological assessments for their appealing features regarding test development and administration. Yet, bundling items into testlets calls into question one of the key statistical assumptions underlying any assessment: local independence of the test item responses. This article presents a condensed version of Lim (2024), which proposed a new index, the parametric bootstrap Mantel–Haenszel statistic $$\text{MH}\chi^2_{testlet}$$, as a device for detecting the presence of testlet effects at the level of an entire testlet and not just for pairs of items. The description of the theoretical foundation of the parametric bootstrap $$\text{MH}\chi^2_{testlet}$$ is augmented by simulation studies assessing the performance of the $$\text{MH}\chi^2_{testlet}$$ statistic under diverse conditions.

Youn Seon Lim
Item Response Theory Modeling with Response Times: Some Issues

The increased prevalence of item response time (RT) data along with item responses has made applications of several joint item response theory (IRT) models feasible. Molenaar et al. (Multivar Behav Res 50:56–74, 2015) unified several IRT models with joint response accuracy and RT into a common hierarchical framework to possibly increase the measurement precision of the trait. Depending on the model, the assumed relationship between response accuracy and response time may be either positive or negative. However, a previous study (Embretson, 2021) found that examinees differed substantially in the relationship of their item response times to both item difficulty and test position on a spatial ability test. Such differences could impact the advantages of the various joint models. In this chapter, two additional types of tests, analytic reasoning and mathematics achievement, were studied for examinee differences in their item response time relationships within the test. For each examinee, item response times were correlated with both item difficulty and item test position. Broad distributions of these examinee correlations, ranging from strongly negative to strongly positive, were found for both tests. Hence, models that assume a uniform relationship of item response time to item difficulty across examinees may not be appropriate for many types of tests.

Susan E. Embretson, Clifford E. Hauenstein
DIF Detection in a Response Time Measure: A Likelihood Ratio Test Method

For assessments used in hiring decisions, it is essential to address discrepancies in measurement across the subgroups from which job candidates may be compared for the same position, so that scores and subsequent decisions are fair and valid. Computer-administered employment tests today commonly use objective items that capture a response reflecting a job-relevant construct and a response time. Response times may be used to assess individual speed, to calibrate the difficulty of a speeded test, or for other purposes. In order to improve the fairness and validity of speed measures and time limits, we want to check for parametric differential item functioning (DIF) not only in item responses but also in response times. While many joint models of item responses and response times fall into the category of complex multidimensional systems in which DIF is challenging to interpret, the hierarchical framework proposed by van der Linden (2006, 2007) isolates the response and response time models sufficiently that, to the extent that the framework holds, the possibility of DIF in responses and response times may be evaluated separately. Using this framework, likelihood ratio tests of parametric DIF are applied to a three-parameter form of the lognormal response time model. The lognormal response time model can be fit and interpreted as a transformation of a factor model (Finger and Chuah, 2009). Through this relationship, the accepted procedures of factorial invariance testing suggest an order for nested parameter constraint tests in a DIF sweep, corresponding to sequential tests of loadings, intercepts, and unique variances. This facilitates interpretation of DIF findings.
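As background, the two-item-parameter lognormal response time model that the three-parameter form extends is commonly written as

$$f(t_{ij}) = \frac{\alpha_j}{t_{ij}\sqrt{2\pi}} \exp\!\left\{-\tfrac{1}{2}\left[\alpha_j\left(\ln t_{ij} - (\beta_j - \tau_i)\right)\right]^2\right\},$$

with examinee speed $\tau_i$, item time intensity $\beta_j$, and item time discrimination $\alpha_j$; the three-parameter form examined in the chapter adds a further item parameter (not reproduced here), and DIF in response times corresponds to item time parameters that differ across subgroups at equal speed.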

Anne Thissen-Roe
Revisiting the 1PL-AG Item Response Model: Bayesian Estimation and Application

This article provides a review of the One Parameter Logistic Ability-based Guessing (1PL-AG) model (San Martín et al., 2006), which belongs to the family of ability-based guessing models within the Item Response Theory (IRT) framework. The model considers both the characteristics of the test items and the abilities of the individuals when estimating the probability of a correct guess and incorporates a general discrimination parameter to account for item difficulty. A comprehensive model that encompasses the 1PL-AG model as a specific instance, employing a general cumulative distribution function (CDF) as the item characteristic curve (ICC), is introduced. Additionally, we explore another case, referred to as the One Parameter Normal Ogive Ability-based Guessing (1PNO-AG) model, which employs the standard normal distribution as its link function; a Bayesian approach is then developed. Simulation results indicated that the R code developed with the use of JAGS successfully recovered the true parameter values. From an applied perspective, we compare the results obtained from applying various alternative models to a real dataset and observe that the 1PNO-AG model exhibited superior performance in terms of the Deviance Information Criterion (DIC).

Paula Fariña, Jorge Luis Bazán
MAP Estimation Using a Possibly Misspecified Parameter Redundant Model

In this paper, new theorems are proved which show how in some cases the asymptotic distribution of Maximum A Posteriori (MAP) estimates can be obtained for parameter redundant probability models which are possibly misspecified. The new methods are then empirically investigated in a simulation study investigating confidence interval coverage for Cognitive Diagnostic Models (CDMs). The empirical results are shown to be relevant in the application of CDMs to small sample size situations.

Richard M. Golden
Global Validity of Assessments: Location and Currency Effects

As assessments are used in an increasingly multicultural and connected world, there is a growing need to verify that they are equally valid across different populations. More specifically, when using hiring assessments to select people for jobs, it is important to corroborate that direct comparisons of individuals from different populations are valid, leading to fair and accurate hires. Populations differ in many interesting ways, but in this chapter, we examined how cultural group differences affect assessment behavior. Thus, we set out to disentangle the effects of location and currency, as elements of cultural behavior, on constructs used in hiring assessments: fairness, altruism, and decision-making speed. These constructs are measured in our gamified implementation of the Trust Game (Berg et al. (Games Econ Behav 10:122–142, 1995)) and Dictator Game (Savin and Sefton (Games Econ Behav 6:347–369, 1994)). We had data from job candidates in many world regions, who responded in various languages and game money currencies. Using this, we tested the factorial invariance of the measures from the two games. We compared large groups across different regions (controlling for language and currency), namely the United States and China, and across different currencies (controlling for language and region), specifically euros and reales. While the general factor structure held across all groups, we found differences in the observed variables, which varied by group. The findings highlight the importance of considering cultural influences when interpreting assessment results and underscore the significance of measurement invariance in promoting fairness and accuracy in hiring processes.

Ambar Kleinbort, Amy Li, Janelle Szary, Anne Thissen-Roe
The Deconstruction of Measurement Invariance (and DIF)

Measurement invariance holds if the distribution of the observed variables (e.g., test items) is conditionally independent of group membership for every value of the latent variable. A DIF model describes the group-specific differences by group-specific item parameters. Based on a geometric perspective on measurement, it is argued that measurement invariance holds if the generalized true scores (i.e., the link function of conditional expectations) of all groups are spread across the same affine subspace (of all conceivable generalized true scores). A taxonomy of patterns of DIF and measurement invariance is introduced that relies on the degree of overlap and parallelism of the group-specific affine subspaces associated with the group-specific measurement models. It is argued that each DIF model can be transformed into a higher-dimensional measurement model with measurement invariance and with constraints on the group-specific distribution of the latent variables. It turns out that DIF implies the existence of a latent variable that is constant in one group but varies across groups. Consequently, the DIF approach relies on postulating complete segregation of prespecified groups in the latent space, which is inherently discriminatory. It is concluded that DIF (and the idea of an absence of measurement invariance) is a chimera that does not rest on a viable conceptual basis but refers to extremely implausible limiting cases of the group-specific distributions in latent space.

Safir Yousfi
Assessment of Misspecification in CDMs Using a Generalized Information Matrix Test

If the probability model is correctly specified, then we can estimate the covariance matrix of the asymptotic maximum likelihood estimate distribution using either the first or second derivatives of the likelihood function. Therefore, if the determinants of these two different covariance matrix estimates differ, this indicates model misspecification. This misspecification detection strategy is the basis of the Determinant Information Matrix Test ($$GIMT_{Det}$$). To investigate the performance of the $$GIMT_{Det}$$, a Deterministic Input Noisy And gate (DINA) Cognitive Diagnostic Model (CDM) was fit to the fraction-subtraction data set. Next, various misspecified versions of the original DINA CDM were fit to bootstrap data sets generated by sampling from the original fitted DINA CDM. The $$GIMT_{Det}$$ showed good discrimination performance for larger levels of misspecification. In addition, it did not detect model misspecification when misspecification was not present, nor when the level of misspecification was very low. However, its discrimination performance was highly variable across different misspecification strategies when the misspecification level was moderately sized. The proposed misspecification detection methodology is promising, but additional empirical studies are required to further characterize its strengths and limitations.
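The test exploits the classical information matrix equality: under correct specification, with log-likelihood contributions $\ell_i(\theta)$,

$$A(\theta_0) = -\,E\!\left[\nabla^2 \ell_i(\theta_0)\right] \;=\; E\!\left[\nabla \ell_i(\theta_0)\, \nabla \ell_i(\theta_0)'\right] = B(\theta_0),$$

so both the inverse-Hessian estimate $\hat{A}^{-1}$ and the outer-product-of-gradients estimate $\hat{B}^{-1}$ target the same asymptotic covariance matrix. The determinant-based GIMT compares the determinants of these two estimates (or an equivalent function of the two), and a large discrepancy signals misspecification.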

Reyhaneh Hosseinpourkhoshkbari, Richard M. Golden
The Impact of Generating Model on Preknowledge Detection in CAT

Recent years have seen a growing interest in the development of methods for detecting examinees with preknowledge, especially in the context of computerized adaptive testing (CAT). Because it is difficult to obtain real data in which the examinees with preknowledge and the compromised items are known with absolute certainty, the performance of such methods is typically evaluated using simulation studies where models are used to generate the data. However, with different researchers making different choices regarding which models to use and how the data should be generated, it becomes challenging, if not impossible, to find ways to compare the results of one simulation study to another. In this chapter, we examine the impact of generating model on preknowledge detection in CAT. Results indicate that the use of different generating models has the potential to greatly impact detection results.

Kylie Gorney, Jianshen Chen, Luz Bay
Empirical Comparisons Among Models in Detecting Extreme Response Style

The models that have been proposed within the framework of item response theory (IRT) to identify extreme response style (ERS) may be categorized into three groups. The first group treats ERS as an explicit additional dimension that influences item responses and is distinct from the latent ability that the items intend to measure, e.g., the multidimensional nominal response model (MNRM) for response styles. The second group uses a weighting parameter for the thresholds of each item category to account for individuals' ERS, for example, the modified generalized partial credit model for ERS (ERS-GPCM). The third group incorporates a tree-like procedure into IRT to differentiate participants' latent ability from ERS (e.g., the tree model with a dominance model and an ideal-point model). To facilitate the practical use of these approaches, the present study compared the performance of these methods against conventional IRT. The combined findings from model-fit indexes, estimates of reliability, latent ability, and ERS, and the estimated relationship between latent ability and ERS suggested that the MNRM might be the better option for differentiating normal participants from ERS respondents.

Hui-Fang Chen, Jianheng Huang
Backmatter
Metadata
Title
Quantitative Psychology
Editors
Marie Wiberg
Jee-Seon Kim
Heungsun Hwang
Hao Wu
Tracy Sweet
Copyright Year
2024
Electronic ISBN
978-3-031-55548-0
Print ISBN
978-3-031-55547-3
DOI
https://doi.org/10.1007/978-3-031-55548-0
