2018 | Book

Quantitative Psychology

The 82nd Annual Meeting of the Psychometric Society, Zurich, Switzerland, 2017

Editors: Prof. Marie Wiberg, Steven Culpepper, Rianne Janssen, Jorge González, Dylan Molenaar

Publisher: Springer International Publishing

Book Series: Springer Proceedings in Mathematics & Statistics

About this book

This proceedings book highlights the latest research and developments in psychometrics and statistics. Featuring contributions presented at the 82nd Annual Meeting of the Psychometric Society (IMPS), organized by the University of Zurich and held in Zurich, Switzerland from July 17 to 21, 2017, its 34 chapters address a diverse range of psychometric topics including item response theory, factor analysis, causal inference, Bayesian statistics, test equating, cognitive diagnostic models and multistage adaptive testing.

The IMPS is one of the largest international meetings on quantitative measurement in psychology, education and the social sciences, attracting over 500 participants and 250 paper presentations from around the world every year. This book gathers the contributions of selected presenters, which were subsequently expanded and peer-reviewed.

Table of Contents

Frontmatter
Optimal Scores as an Alternative to Sum Scores

This paper discusses the use of optimal scores as an alternative to sum scores and expected sum scores when analyzing test data. Optimal scores are built on nonparametric methods and use the interaction between the test takers’ responses on each item and the impact of the corresponding items on the estimate of their performance. Both theoretical arguments for optimal scores and arguments based on simulation results are given. The paper claims that in order to achieve the same accuracy in terms of mean squared error and root mean squared error, an optimally scored test needs substantially fewer items than a sum-scored test. The top-performing test takers and the bottom 5% of test takers are by far the groups that benefit most from using optimal scores.

Marie Wiberg, James O. Ramsay, Juan Li
Disentangling Treatment and Placebo Effects in Randomized Experiments Using Principal Stratification—An Introduction

Although randomized controlled trials (RCTs) are generally considered the gold standard for estimating causal effects, for example of pharmaceutical treatments, the valid analysis of RCTs is more complicated with human units than with plants and other such objects. One potential complication that arises with human subjects is the possible existence of placebo effects in RCTs with placebo controls, where a treatment, say a new drug, is compared to a placebo, and for approval, the treatment must demonstrate better outcomes than the placebo. In such trials, the causal estimand of interest is the medical effect of the drug compared to placebo. But in practice, when a drug is prescribed by a doctor and the patient is aware of the prescription received, the patient can be expected to receive both a placebo effect and the active effect of the drug. An important issue for practice concerns how to disentangle the medical effect of the drug from the placebo effect of being treated using data arising in a placebo-controlled RCT. Our proposal uses principal stratification as the key statistical tool. The method is applied to initial data from an actual experiment to illustrate important ideas.

Reagan Mozer, Rob Kessels, Donald B. Rubin
Some Measures of the Amount of Adaptation for Computerized Adaptive Tests

Computerized Adaptive Testing (CAT) is gaining wide acceptance with the ready availability of computer technology. The general intent of CAT is to adapt the difficulty of the test to the capabilities of the examinee so that measurement accuracy is improved over fixed tests, and the entire testing process is more efficient. However, many computer administration designs, such as two-stage tests, stratified adaptive tests, and those with content balancing and exposure control, are called adaptive, yet the amount of adaptation varies greatly. In this paper, several measures of the amount of adaptation for a CAT are presented, along with information about their sensitivity to item pool size, the distribution of item difficulty, and exposure control. A real data application shows the level of adaptation of a mature, operational CAT. Some guidelines are provided for how much adaptation should take place to merit the label of an “adaptive test.”
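A minimal sketch of two intuitive adaptation measures, assumed here for illustration rather than taken from the chapter's exact indices: (a) the correlation between examinees' final ability estimates and the mean difficulty of the items they were administered, and (b) the ratio of the variance of administered item difficulties to the variance of difficulties in the pool. Values near zero suggest a fixed test; values near one suggest strong adaptation.

```python
# Toy adaptation measures for a CAT (hypothetical data and measures).
import numpy as np

rng = np.random.default_rng(1)
n_examinees, test_length = 500, 20
pool_b = rng.normal(0.0, 1.0, 300)             # item difficulties in the pool
theta_hat = rng.normal(0.0, 1.0, n_examinees)  # final ability estimates

# Hypothetical administered difficulties: a strongly adaptive test gives
# each examinee items centered on his or her ability.
administered_b = theta_hat[:, None] + rng.normal(0.0, 0.3, (n_examinees, test_length))

mean_b = administered_b.mean(axis=1)
r_adapt = np.corrcoef(theta_hat, mean_b)[0, 1]   # measure (a)
var_ratio = mean_b.var() / pool_b.var()          # measure (b)
print(f"difficulty-ability correlation: {r_adapt:.2f}")
print(f"variance ratio: {var_ratio:.2f}")
```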

Mark D. Reckase, Unhee Ju, Sewon Kim
Investigating the Constrained-Weighted Item Selection Methods for CD-CAT

Cognitive diagnostic computerized adaptive testing (CD-CAT) not only provides the useful cognitive diagnostic information measured in psychological or educational assessments, but also gains the efficiency of computerized adaptive testing. At present, only a limited number of studies have examined how to optimally construct cognitive diagnostic tests. The cognitive diagnostic discrimination index (CDI) and the attribute-level discrimination index (ADI) have been proposed for item selection in cognitive diagnostic tests. Zheng and Chang (Appl Psychol Measure 40:608–624, 2016) proposed modified versions of these two indices, an extension of the Kullback-Leibler (KL) and posterior-weighted KL (PWKL) methods, and suggested that they could be integrated with the constraint management procedure for item selection in CD-CAT. However, this constraint management procedure has not yet been investigated in CD-CAT. Therefore, the aim of this study is twofold: (a) to integrate the indices with the constraint management procedure for item selection, and (b) to investigate the efficiency of these item selection methods in CD-CAT. The constraint-weighted indices performed much better than their unweighted counterparts in terms of constraint management and exposure control while maintaining similar measurement precision.

Ya-Hui Su
Modeling Accidental Mistakes in Multistage Testing: A Simulation Study

Stress during testing may cause individuals to underperform. In an adaptive test context, early mistakes due to stress can raise the risk of administering inadequate items to examinees, leading to an underestimation of their ability. In this paper, the effects of accidental mistakes on the first stage of a Multistage Adaptive Test (MST) were analyzed in a simulation study. Two item response theory models were used: the Two-Parameter Logistic and the Logistic Positive Exponent models. Two groups were created: one with a probability of making accidental mistakes and one without. The accuracy of the latent trait estimates and the impact on the MST's item selection process (routing) were compared between the two models. Results show that both models performed similarly, with slight differences depending on the procedures used to simulate the responses.

Thales A. M. Ricarte, Mariana Cúri, Alina A. von Davier
On the Usefulness of Interrater Reliability Coefficients

For four data sets of different measurement levels, we computed 20 coefficients that estimate interrater reliability. The results show that the coefficients provide very different numerical values when applied to the same data. We discuss possible explanations for the differences among coefficients and suggest further research that is needed to clarify which coefficient a researcher should use to estimate interrater reliability.

Debby ten Hove, Terrence D. Jorgensen, L. Andries van der Ark
An Evaluation of Rater Agreement Indices Using Generalizability Theory

This study compared several rater agreement indices using data simulated using a generalizability theory framework. Information from previous generalizability studies conducted with data from large-scale writing assessments was used to inform the variance components in the simulations. Rater agreement indices, including percent agreement, weighted and unweighted kappa, polychoric, Pearson, Spearman, and intraclass correlations, and Gwet’s AC1 and AC2, were compared with each other and with the generalizability coefficients. Results showed that some indices performed similarly while others had values that ranged from below 0.4 to over 0.8. The impact of the underlying score distributions, the number of score categories, rater/prompt variability, and rater/prompt assignment on these indices was also investigated.
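A minimal sketch of two of the simpler indices compared in the chapter, exact percent agreement and unweighted Cohen's kappa, computed for two raters on toy data (the chapter's simulations, which also cover AC1/AC2 and correlation-based indices, are far richer):

```python
# Percent agreement and unweighted Cohen's kappa for two raters (toy data).
import numpy as np

rater1 = np.array([1, 2, 2, 3, 4, 4, 2, 3, 1, 4])
rater2 = np.array([1, 2, 3, 3, 4, 3, 2, 3, 2, 4])

p_agree = np.mean(rater1 == rater2)              # exact percent agreement

# Kappa corrects observed agreement for chance agreement, estimated from
# the raters' marginal category proportions.
categories = np.union1d(rater1, rater2)
p1 = np.array([np.mean(rater1 == c) for c in categories])
p2 = np.array([np.mean(rater2 == c) for c in categories])
p_chance = np.sum(p1 * p2)
kappa = (p_agree - p_chance) / (1 - p_chance)
print(f"percent agreement = {p_agree:.2f}, kappa = {kappa:.2f}")
```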

Dongmei Li, Qing Yi, Benjamin Andrews
How to Select the Bandwidth in Kernel Equating—An Evaluation of Five Different Methods

When using kernel equating to equate two test forms, a bandwidth needs to be selected. The bandwidth parameter determines the smoothness of the continuized score distributions and has been shown to have a large effect on the kernel density estimate. There are a number of suggested criteria for selecting the bandwidth, and currently four of them have been implemented in kernel equating. In this paper, all four of the existing bandwidth selectors suggested for kernel equating are evaluated and compared against each other using real test data together with a new criterion that implements leave-one-out cross-validation. Although the bandwidth methods generally were similar in terms of equated scores, there were potentially important differences in the upper part of the score scale where critical admission decisions are typically made.
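A minimal sketch of the leave-one-out cross-validation idea behind the new criterion, applied here to an ordinary Gaussian kernel density estimate of toy test scores (the chapter's criterion is adapted to the continuized score distributions of kernel equating, so the details there differ):

```python
# Leave-one-out cross-validated bandwidth for a Gaussian kernel density.
import numpy as np

def loo_log_likelihood(x, h):
    """Sum of log leave-one-out density estimates for bandwidth h."""
    n = len(x)
    diffs = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(k, 0.0)                    # leave each point out
    f_loo = k.sum(axis=1) / ((n - 1) * h)
    return np.sum(np.log(f_loo + 1e-300))

rng = np.random.default_rng(7)
scores = rng.normal(20, 5, 400)                 # toy test scores
grid = np.linspace(0.3, 5.0, 60)
best_h = max(grid, key=lambda h: loo_log_likelihood(scores, h))
print(f"LOO-CV bandwidth: {best_h:.2f}")
```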

Gabriel Wallin, Jenny Häggström, Marie Wiberg
Evaluating Equating Transformations from Different Frameworks

Test equating is used to ensure that test scores from different test forms can be used interchangeably. This paper aims to compare the statistical and computational properties of three equating frameworks: item response theory observed-score equating (IRTOSE), kernel equating, and kernel IRTOSE. The real data applications suggest that the IRT-based frameworks tend to provide more stable and accurate results than kernel equating. Nonetheless, kernel equating can provide satisfactory results if a good model for the data can be found, while also being much faster than the IRT-based frameworks. Our general recommendation is to try all methods and examine how much the equated scores change, always ensuring that the assumptions are met and that a good model for the data can be found.

Waldir Leôncio, Marie Wiberg
An Alternative View on the NEAT Design in Test Equating

Assuming a “synthetic population” and imposing strong assumptions to estimate score distributions has been the traditional practice when performing equating under the nonequivalent groups with anchor test design (NEAT). In this paper, we use the concept of partial identification of probability distributions to offer an alternative to this traditional practice in NEAT equating. Under this approach, the score probability distributions used to obtain the equating transformation are bounded on a region where they are identified by the data. The advantages of this approach are twofold: first, there is no need to define a synthetic population, and second, no particular assumptions are needed to obtain bounds for the score probability distributions that are used to build the equating transformation. The results show that the uncertainty about the score probability distributions, reflected in the width of the bounds, can be very large and can thus have a big impact on equating.

Jorge González, Ernesto San Martín
Simultaneous Equating of Multiple Forms

When test forms are calibrated separately, item response theory parameters are not comparable because they are expressed on different measurement scales. The equating process converts the item parameter estimates to a common scale and provides comparable test scores. Various statistical methods have been proposed to perform equating between two test forms. However, many testing programs use several forms of a test and require the comparability of the scores of each form. To this end, Haberman (ETS Res Rep Ser 2009(2):i–9, 2009) developed a regression procedure that generalizes the mean-geometric mean method to the case of multiple test forms. A generalization to multiple test forms of the mean-mean, the Haebara, and the Stocking-Lord methods was proposed in Battauz (Psychometrika 82:610–636, 2017b). In this paper, the methods proposed in the literature to equate multiple test forms are reviewed, and an application of these methods to data collected for the Trends in International Mathematics and Science Study will be presented.

Michela Battauz
Incorporating Information Functions in IRT Scaling

Item response theory (IRT) scaling via a set of items common to two test forms assumes that those items’ parameters are invariant with respect to a linear transformation. Characteristic curve methods rely on this assumption; scale transformations are conducted by minimizing a loss function between item characteristic curves (ICCs), as in Haebara (1980), or test characteristic curves (TCCs), as in Stocking and Lord (1983). However, minimizing the loss function between characteristic curves does not guarantee that the same will hold for information functions. This study introduces two new scaling methodologies: one combines the ICC methodology of Haebara (1980) with item information functions (IIFs); the other combines the TCC methodology of Stocking and Lord (1983) with test information functions (TIFs). In a simulation experiment, Haebara’s (1980) and Stocking and Lord’s (1983) methodologies, as well as the two new scaling methodologies, were applied to simulated administrations of a fixed form under different latent trait distributions. Results suggest that IRT scaling by combining TCCs with TIFs yields some benefits over the existing characteristic curve methodologies; however, combining ICCs with IIFs did not perform as well as the other three scaling methodologies.
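One plausible form of such a combined criterion, sketched here as an assumption for illustration (the chapter defines its own loss and weighting), augments the Stocking-Lord TCC loss over a grid of ability points θ_q with a matching term for the test information functions:

```latex
F(A, B) = \sum_{q} \left[ T_{\mathrm{ref}}(\theta_q) - T^{*}_{\mathrm{new}}(\theta_q) \right]^2
        + \lambda \sum_{q} \left[ I_{\mathrm{ref}}(\theta_q) - I^{*}_{\mathrm{new}}(\theta_q) \right]^2
```

where T* and I* denote the new form's test characteristic and test information functions after the linear transformation θ* = Aθ + B, and λ ≥ 0 balances fidelity of the characteristic curves against fidelity of the information functions.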

Alexander Weissman
Reducing Conditional Error Variance Differences in IRT Scaling

A Hybrid scaling method that takes into account the differences between test characteristic curves as well as differences between conditional error variances when estimating transformation constants is proposed, and its performance is evaluated. Results are evaluated and discussed in relation to the Stocking-Lord method. Findings from a Monte Carlo simulation suggest that when the two forms being scaled are parallel, the Hybrid method and the Stocking-Lord test characteristic curve method lead to similar results. However, when the forms being scaled have similar test characteristic curves but different conditional error variances, the Hybrid method does better near the mean of the ability distribution, especially for the test information function.

Tammy J. Trierweiler, Charles Lewis, Robert L. Smith
An IRT Analysis of the Growth Mindset Scale

Growth mindset has gained popularity in the fields of psychology and education, yet there is surprisingly little research on the psychometric properties of the Growth Mindset Scale. This research presents an item response theory analysis of the Growth Mindset Scale when used among college students in the United States. Growth mindset is the belief that success comes through hard work and effort rather than fixed intelligence. Having a growth mindset is believed to be important for academic success among historically marginalized groups; therefore, it is important to know whether the Growth Mindset Scale functions well among first-generation college students. The sample consists of 1260 individuals who completed the Growth Mindset Scale on one of five surveys. The Growth Mindset Scale consists of 8 items, with responses ranging from strongly disagree (1) to strongly agree (5). IRT analysis is used to assess item fit, scale dimensionality, local dependence, and differential item functioning (DIF). Due to local dependence within the 8-item scale, the final IRT model fit 4 items to a unidimensional model. The 4-item scale did not exhibit any local dependence or DIF among known groups within the sample. The 4-item scale also had high marginal reliability (0.90) and high total information. Cronbach’s alpha for the 4-item scale was α = 0.89. A discussion of the local dependence issues within the 8-item scale is provided.

Brooke Midkiff, Michelle Langer, Cynthia Demetriou, A. T. Panter
Considering Local Dependencies: Person Parameter Estimation for IRT Models of Forced-Choice Data

The Thurstonian IRT model of Brown and Maydeu-Olivares (Educ Psychol Meas 71:460–502, 2011) was a breakthrough in estimating the structural parameters of IRT models for forced-choice data of arbitrary block size. However, local dependencies of pairwise comparisons within blocks of more than two items are considered only for item parameter estimates and are explicitly ignored by the proposed methods of person parameter estimation. A general analysis of the likelihood function of the binary response indicators (used by Brown and Maydeu-Olivares) for arbitrary IRT models of forced-choice questionnaires reveals that Fisher information is overestimated by Brown and Maydeu-Olivares’ approach to person parameter estimation. Increasing block size beyond 3 leads to only a slight increase in measurement precision. Finally, an approach that adequately considers local dependencies within blocks is outlined. It allows for maximum likelihood and Bayesian modal estimation and numerical computation of the observed Fisher information.

Safir Yousfi
Elimination Scoring Versus Correction for Guessing: A Simulation Study

Administering multiple-choice questions with correction for guessing fails to take into account partial knowledge and may introduce a bias, as examinees may differ in their willingness to risk guessing the correct answer when not having full knowledge. Elimination scoring gives examinees the opportunity to express their partial knowledge, as this alternative scoring procedure requires examinees to eliminate all the response alternatives they consider to be incorrect. The current simulation study investigates how these two scoring procedures affect the response behavior of examinees who differ not only in ability but also in their attitude toward risk. Combining a psychometric model accounting for ability and item difficulty with decision theory accounting for individual differences in risk aversion, a two-step response-generating model is proposed to predict the expected answering patterns on given multiple-choice questions. The simulations show that overall there are no substantial differences in the answering patterns for examinees at either end of the ability continuum under the two scoring procedures, suggesting that ability has a predominant effect on the response patterns. Compared to correction for guessing, elimination scoring leads to fewer full-score responses and more demonstration of partial knowledge, especially for examinees with intermediate success probabilities on the items. Only for those examinees does risk aversion have a decisive impact on the expected answering patterns.
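A minimal sketch contrasting one common variant of each scoring rule for a k-option multiple-choice item; the specific rules below (the standard formula score and a Coombs-style elimination rule) are assumptions for illustration, and the chapter's response-generating model is considerably richer:

```python
# Toy scoring rules for a k-option multiple-choice item (assumed variants).
def guessing_corrected_score(chosen, key, k):
    """Formula score: +1 if correct, -1/(k-1) if wrong, 0 if omitted."""
    if chosen is None:
        return 0.0
    return 1.0 if chosen == key else -1.0 / (k - 1)

def elimination_score(eliminated, key, k):
    """One point per eliminated distractor; heavy penalty of -(k-1)
    if the correct answer is among the eliminated options."""
    if key in eliminated:
        return -(k - 1)
    return len(eliminated)

# Partial knowledge: the examinee can rule out two of four options.
print(guessing_corrected_score(None, key="B", k=4))  # omit -> 0.0
print(elimination_score({"C", "D"}, key="B", k=4))   # -> 2 points
```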

Qian Wu, Tinne De Laet, Rianne Janssen
Three-Way Generalized Structured Component Analysis

Generalized structured component analysis (GSCA) is a component-based approach to structural equation modeling, where components of observed variables are used as proxies for latent variables. GSCA has thus far focused on analyzing two-way (e.g., subjects by variables) data. In this paper, GSCA is extended to deal with three-way data that contain three different types of entities (e.g., subjects, variables, and occasions) simultaneously. The proposed method, called three-way GSCA, permits each latent variable to be loaded on two types of entities, such as variables and occasions, in the measurement model. This makes it possible to investigate how these entities are associated with the latent variable. The method estimates parameters by minimizing a single least squares criterion, and an alternating least squares algorithm is developed for this purpose. We conduct a simulation study to evaluate the performance of three-way GSCA. We also apply three-way GSCA to real data to demonstrate its empirical usefulness.

Ji Yeh Choi, Seungmi Yang, Arthur Tenenhaus, Heungsun Hwang
Combining Factors from Different Factor Analyses Based on Factor Congruence

While factor analysis is one of the most often used techniques in psychometrics, comparing or combining solutions from different factor analyses can be cumbersome, even though it is necessary in several situations. For example, when applying multiple imputation (to account for incompleteness) or multiple outputation (which can be used to deal with clustering in multilevel data), often tens or hundreds of results have to be combined into one final solution. While different solutions have been in use, we propose a simple, easy-to-implement method to match factors from different analyses based on factor congruence. To demonstrate this method, the Big Five Inventory data collected under the auspices of the Divorce in Flanders study were analysed, combining multiple outputation and factor analysis. This multilevel sample consists of 7533 individuals from 4460 families, with about 10% missing values.
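A minimal sketch of factor matching based on Tucker's congruence coefficient, φ(x, y) = Σxy / √(Σx² · Σy²), the standard congruence measure between loading vectors; the greedy pairing below is an illustrative simplification, not necessarily the chapter's exact algorithm:

```python
# Match factors across two loading matrices by maximal congruence.
import numpy as np

def congruence(x, y):
    """Tucker's congruence coefficient between two loading vectors."""
    return x @ y / np.sqrt((x @ x) * (y @ y))

def match_factors(L1, L2):
    """Pair each column of L2 with its most congruent column of L1
    (greedy sketch; absolute values handle reflected factors)."""
    pairs = []
    for j in range(L2.shape[1]):
        phis = [congruence(L1[:, i], L2[:, j]) for i in range(L1.shape[1])]
        best = int(np.argmax(np.abs(phis)))
        pairs.append((best, j, round(float(phis[best]), 3)))
    return pairs

rng = np.random.default_rng(0)
L1 = rng.normal(size=(10, 3))
L2 = L1[:, [2, 0, 1]] + rng.normal(scale=0.1, size=(10, 3))  # permuted copy
print(match_factors(L1, L2))   # recovers the column permutation
```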

Anikó Lovik, Vahid Nassiri, Geert Verbeke, Geert Molenberghs
On the Bias in Eigenvalues of Sample Covariance Matrix

Principal component analysis (PCA) is a multivariate statistical technique frequently employed in behavioral and social science research, and the results of PCA are often used to approximate those of exploratory factor analysis (EFA) because the former is easier to implement. In practice, the number of components or factors needed is often determined by the size of the first few eigenvalues of the sample covariance/correlation matrix. Lawley (1956) showed that if the eigenvalues of the population covariance matrix are distinct, then each sample eigenvalue contains a bias of order 1/N, which is typically ignored in practice. This article further shows that, under some regularity conditions, the order of the bias term is p/N. Thus, when p is large, the bias term is no longer negligible even when N is large.
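A minimal simulation illustrating why the bias matters when p is large relative to N. Note the setup is an assumption for illustration: it uses an identity population covariance, whose eigenvalues are all equal to one (rather than distinct, as in Lawley's setting), yet the largest sample eigenvalue is still systematically inflated, with the inflation growing in p/N:

```python
# Inflation of the largest sample eigenvalue as p/N grows (toy simulation).
import numpy as np

rng = np.random.default_rng(3)
N = 200
for p in (5, 50, 150):
    top = [np.linalg.eigvalsh(np.cov(rng.normal(size=(N, p)), rowvar=False))[-1]
           for _ in range(200)]   # population value is 1.0 in every case
    print(f"p/N = {p/N:.2f}: mean largest sample eigenvalue = {np.mean(top):.2f}")
```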

Kentaro Hayashi, Ke-Hai Yuan, Lu Liang
Using Product Indicators in Restricted Factor Analysis Models to Detect Nonuniform Measurement Bias

When sample sizes are too small to support multiple-group models, an alternative method to evaluate measurement invariance is restricted factor analysis (RFA), which is statistically equivalent to the more common multiple-indicator multiple-cause (MIMIC) model. Although these methods traditionally were capable of detecting only uniform measurement bias, RFA can be extended with latent moderated structural equations (LMS) to assess nonuniform measurement bias. As LMS is implemented in limited structural equation modeling (SEM) computer programs (e.g., Mplus), we propose the use of the product indicator (PI) method in RFA models, which is available in any SEM software. Using simulated data, we illustrate how to apply this method to test for measurement bias, and we compare the conclusions with those reached using LMS in Mplus. Both methods obtain comparable results, indicating that the PI method is a viable alternative to LMS for researchers without access to SEM software featuring LMS.

Laura Kolbe, Terrence D. Jorgensen
Polychoric Correlations for Ordered Categories Using the EM Algorithm

A new method for the estimation of polychoric correlations is proposed in this paper, which uses the Expectation-Maximization (EM) algorithm and the Conditional Covariance Formula. Simulation results show that this method attains the same level of accuracy as other methods, and is robust to deteriorated data quality.

Kenpei Shiina, Takashi Ueda, Saori Kubo
A Structural Equation Modeling Approach to Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a generalization of multiple correlation that examines the relationship between two sets of variables. Spectral decomposition can be applied to obtain the canonical correlations and canonical weights. Anderson (2003) also provided the asymptotic distribution of the canonical weights under the normality assumption. In this article, we propose a Structural Equation Modeling (SEM) approach to CCA. Mathematical forms are presented to show the equivalence between these models. The weight matrix is obtained as the inverse of the loading matrix, and the variances or standard errors of the weights are calculated through the delta method. Popular SEM software packages such as lavaan, Mplus, and EQS are used to illustrate the application, and the results are compared with those obtained from Anderson’s (2003) formula. Related issues are discussed in the final section.

Zhenqiu (Laura) Lu, Fei Gu
Dealing with Person Differential Item Functioning in Social-Emotional Skill Assessment Using Anchoring Vignettes

When analyzed via item response theory, Likert-type items are modeled by estimating a set of thresholds (i.e., parameters that indicate the latent trait level required for endorsing a given scale option) that are assumed to be invariant across the population of individuals. If persons vary in response styles, this assumption may not hold; this is called person differential item functioning (PDIF). Anchoring vignettes offer an approach to learning how individuals translate the latent trait into Likert responses, and a method to assess potential variability in item thresholds across individuals. A vignette presents hypothetical persons differing on the attribute of interest (usually low, medium, and high) and asks respondents to rate these hypothetical persons on the same Likert scale used for self-assessment. This can then be used to resolve PDIF, potentially producing measures that are more comparable. We investigated whether the patterns of responses to vignettes show a developmental trend and whether they are related to cognitive capacity, using data from a large-scale educational assessment. We then investigated whether anchor-adjusted scores produce more reliable and valid measures.

Ricardo Primi, Daniel Santos, Oliver P. John, Filip De Fruyt, Nelson Hauck-Filho
Random Permutation Tests of Nonuniform Differential Item Functioning in Multigroup Item Factor Analysis

The purpose of the present research was to introduce and evaluate random permutation testing applied to measurement invariance testing with ordered-categorical data. The random permutation test builds a reference distribution from the observed data that is used to calculate a p value for the observed (Δ)χ² statistic. The reference distribution is built by repeatedly shuffling the grouping variable and then saving the Δχ² statistic between the two models fitted to the resulting data. The present research consisted of two Monte Carlo simulations. The first simulation was designed to evaluate random permutation testing across a variety of conditions with scalar invariance testing in comparison to an existing analytical solution: the robust mean- and variance-adjusted Δχ² test. The second simulation was designed to evaluate the random permutation test applied to testing configural invariance by evaluating overall model fit (the χ² fit statistic). Simulation results and suggestions for the use of the random permutation test are provided.
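A minimal sketch of the permutation logic: shuffle the grouping variable, recompute the statistic under each shuffle, and take the proportion of shuffled statistics at least as extreme as the observed one as the p value. A simple mean difference stands in here for the Δχ² of the actual invariance tests, which require refitting factor models at each step:

```python
# Generic random permutation test of a group-difference statistic.
import numpy as np

def permutation_p(y, groups, stat, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = stat(y, groups)
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(groups)   # break any true group effect
        exceed += stat(y, shuffled) >= observed
    return (exceed + 1) / (n_perm + 1)       # add-one p value

mean_diff = lambda y, g: abs(y[g == 0].mean() - y[g == 1].mean())
rng = np.random.default_rng(1)
y = np.r_[rng.normal(0, 1, 100), rng.normal(0.4, 1, 100)]
groups = np.r_[np.zeros(100, int), np.ones(100, int)]
print(f"permutation p = {permutation_p(y, groups, mean_diff):.3f}")
```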

Benjamin A. Kite, Terrence D. Jorgensen, Po-Yi Chen
Using Credible Intervals to Detect Differential Item Functioning in IRT Models

Differential item functioning (DIF) occurs when individuals from different groups with the same level of ability have different probabilities of answering an item correctly. In this paper, we develop a Bayesian approach to detect DIF based on the credible intervals within the framework of item response theory models. Our method performed well for both uniform and non-uniform DIF conditions in the two-parameter logistic model. The efficacy of the proposed approach is demonstrated through simulation studies and a real data application.
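A minimal sketch of the decision rule: given posterior draws of an item's difficulty parameter in the reference and focal groups (faked below with normal draws; in practice they would come from an MCMC sampler for the IRT model), flag DIF when the 95% credible interval of the difference excludes zero:

```python
# Credible-interval DIF check from (simulated) posterior draws.
import numpy as np

rng = np.random.default_rng(5)
b_ref = rng.normal(0.00, 0.08, 4000)   # posterior draws, reference group
b_foc = rng.normal(0.35, 0.09, 4000)   # posterior draws, focal group

diff = b_foc - b_ref
lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"95% CrI for b_focal - b_reference: [{lo:.2f}, {hi:.2f}]")
print("DIF flagged" if lo > 0 or hi < 0 else "no DIF flagged")
```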

Ya-Hui Su, Joyce Chang, Henghsiu Tsai
Bayesian Network for Modeling Uncertainty in Attribute Hierarchy

In the attribute hierarchy method, cognitive attributes are assumed to be organized hierarchically. Content specialists usually conduct a task analysis on a sample of items to specify the cognitive attributes required to answer the items correctly, and order these attributes to create an attribute hierarchy. However, the problem-solving performance of experts and novices is almost certain to differ. Additionally, experts’ knowledge is highly organized in deeply integrated schemas, while novices view domain knowledge and problem-solving knowledge separately. This may introduce uncertainty into the attribute hierarchy and lead to different attribute hierarchies for a test. Formally, a Bayesian network is a probabilistic graphical model that represents a set of random latent attributes or variables and their conditional dependencies via a directed acyclic graph. For example, a Bayesian network can be used to represent the probabilistic relationships between latent attributes in the attribute hierarchy. The purpose of this study is to apply Bayesian networks to modeling uncertainty in an attribute hierarchy. The Bayesian network created from the attribute hierarchy, which is regarded as a flexible higher-order model, is incorporated into three cognitive diagnostic models. The new model has the advantage of taking into account the subjectivity of the attribute hierarchy specified by experts together with the uncertainty of item responses. Fraction subtraction data were analyzed to evaluate the performance of the new model.

Lihong Song, Wenyi Wang, Haiqi Dai, Shuliang Ding
A Cognitive Diagnosis Method Based on Mahalanobis Distance

The primary purpose of cognitive diagnosis methods (CDMs) is to classify examinees into mutually exclusive categories. Although many CDMs already exist, researchers continue to propose improved ones. Among them, the generalized distance discrimination (GDD) and the Hamming distance discrimination (HDD) have received growing attention because they are simple, easy to use, and highly accurate classifiers. Accordingly, Mahalanobis distance discrimination (MDD), a generalized CDM of which GDD and HDD are special cases, is introduced. The Mahalanobis distance (MD) is employed in MDD to calculate the distance between an examinee’s observed response pattern (ORP) and each ideal response pattern (IRP); the Shannon entropy is specified as the covariance. According to the principle of minimum distance and the design of the test blueprint, IRPs can be bijectively mapped to knowledge states. Under the dichotomous model, the pattern match ratio and the average attribute match ratio are used as criteria for evaluating classification accuracy. A Monte Carlo simulation study shows that MDD performs better than GDD and HDD.
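A minimal sketch of the minimum-distance classification these methods share: build the ideal response pattern (IRP) for every knowledge state from the Q-matrix under a conjunctive (DINA-like) rule, then assign each examinee to the state whose IRP is closest. Hamming distance is used below for brevity, which corresponds to the HDD special case; MDD would replace it with a Mahalanobis distance whose covariance is built from Shannon entropy, as the abstract describes:

```python
# Minimum-distance classification of response patterns (Hamming version).
import itertools
import numpy as np

Q = np.array([[1, 0],
              [0, 1],
              [1, 1]])   # 3 items x 2 attributes

states = np.array(list(itertools.product([0, 1], repeat=Q.shape[1])))
# Conjunctive rule: an item is answered correctly iff all required
# attributes are mastered, so the IRP follows directly from the Q-matrix.
irps = np.all(states[:, None, :] >= Q[None, :, :], axis=2).astype(int)

def classify(orp):
    distances = np.sum(irps != orp, axis=1)   # Hamming distance to each IRP
    return states[np.argmin(distances)]

print(classify(np.array([1, 0, 0])))   # -> [1 0]: masters attribute 1 only
```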

Jianhua Xiong, Fen Luo, Shuliang Ding, Huiqiong Duan
A Joint Maximum Likelihood Estimation Approach to Cognitive Diagnosis Models

In this study, a simulation-based method for computing joint maximum likelihood estimates of cognitive diagnosis model parameters is proposed. The central theme of the approach is to reduce the complexity of models to focus on their most critical elements. In particular, an approach analogous to joint maximum likelihood estimation is taken, in which the latent attribute vectors are regarded as structural parameters rather than parameters to be removed by integration. With this approach, the joint distribution of the latent attributes does not have to be specified, which reduces the number of parameters in the model. A Markov chain Monte Carlo algorithm is used to simultaneously evaluate and optimize the likelihood function. This streamlined approach performed as well as more traditional methods for models such as the DINA, and affords the opportunity to fit more complicated models for which other methods may not be feasible.

Youn Seon Lim, Fritz Drasgow
An Exploratory Discrete Factor Loading Method for Q-Matrix Specification in Cognitive Diagnostic Models

The Q-matrix is usually unknown for many existing tests. If the Q-matrix is specified by subject matter experts but contains a large amount of misspecification, it will be difficult to recover a high-quality Q-matrix through a validation method, because the performance of the validation method relies on the quality of the provisional Q-matrix. In either of these situations, an exploratory technique is necessary. The purpose of this study is to explore a simple method for Q-matrix specification, called the discretized factor loading (DFL) method, in which an exploratory factor analysis, with latent attributes regarded as latent factors, is used to estimate a factor loading matrix; a discretization process is then applied to the factor loading matrix to obtain a binary Q-matrix. A series of simulation studies was conducted to investigate the performance of the DFL method under various conditions. The results showed that the DFL method can provide a high-quality provisional Q-matrix.
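A minimal sketch of the discretization step: starting from an estimated factor loading matrix (a toy matrix below; the chapter obtains it from an exploratory factor analysis with attributes as factors), set q_jk = 1 when item j loads saliently on attribute k. The salience rule used here (|loading| ≥ 0.3) is an assumption for illustration; the chapter's discretization rule may differ:

```python
# Discretizing a factor loading matrix into a binary Q-matrix (toy data).
import numpy as np

loadings = np.array([[0.72, 0.08],
                     [0.11, 0.65],
                     [0.55, 0.49],
                     [0.20, 0.12]])   # items x attributes

Q_hat = (np.abs(loadings) >= 0.3).astype(int)
print(Q_hat)
# [[1 0]
#  [0 1]
#  [1 1]
#  [0 0]]  <- item 4 loads on no attribute and would need expert review
```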

Wenyi Wang, Lihong Song, Shuliang Ding
Identifiability of the Latent Attribute Space and Conditions of Q-Matrix Completeness for Attribute Hierarchy Models

Educational researchers have argued that a realistic view of the role of attributes in cognitively diagnostic modeling should account for the possibility that attributes are not isolated entities, but interdependent in their effect on test performance. Different approaches have been discussed in the literature; among them the proposition to impose a hierarchical structure so that mastery of one or more attributes is a prerequisite of mastering one or more other attributes. A hierarchical organization of attributes constrains the latent attribute space such that several proficiency classes, as they exist if attributes are not hierarchically organized, are no longer defined, because the corresponding attribute combinations cannot occur with the given attribute hierarchy. Hence, the identification of the latent attribute space is often difficult—especially, if the number of attributes is large. As an additional complication, constructing a complete Q-matrix may not at all be straightforward if the attributes underlying the test items are supposed to have a hierarchical structure. In this article, the conditions of identifiability of the latent space if attributes are hierarchically organized and the conditions of completeness of the Q-matrix are studied.

Hans-Friedrich Köhn, Chia-Yi Chiu
Different Expressions of a Knowledge State and Their Applications

Based on the Augment algorithm, any column of the Q matrix can be expressed as a Boolean union of some columns of the reachability matrix R, but the expression is not unique. There are two different expressions for a column x of the reduced Q matrix: a redundant expression of x and a concise expression of x. When the test length is short, the redundant expression of a knowledge state can be used to simplify the proof of an important property of the reachability matrix R in the design of cognitive diagnostic tests, and it provides a novel method for specifying the Q matrix. This specification method can also be employed to deal with the polytomous Q matrix.

Shuliang Ding, Fen Luo, Wenyi Wang, Jianhua Xiong, Heiqiong Duan, Lihong Song
Accuracy and Reliability of Autoregressive Parameter Estimates: A Comparison Between Person-Specific and Multilevel Modeling Approaches

This simulation study compares the person-specific (PS) and multilevel modeling (MLM) approaches with respect to the accuracy and reliability of autoregressive (AR) parameter estimates when data are generated from a first-order AR model and the functional form of the analytic model is correctly specified. The influence of a variety of factors on accuracy and reliability is examined, including time series length, sample size, the distribution of the AR coefficients, and the variability of the AR coefficients. Neither sample size nor distribution has an effect on accuracy or reliability. MLM generally has better accuracy than PS at both the population level and the individual level. However, in MLM, individuals who deviate farther from the sample mean are modeled less accurately than individuals who are closer to the sample mean. The two approaches do not differ in the reliability of the AR estimates. For both approaches, higher variability in the AR coefficients is associated with higher reliability. Implications for modeling practice are discussed.
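A minimal sketch of the person-specific (PS) approach under the study's data-generating setup: each individual's AR(1) coefficient is estimated separately by regressing y_t on y_{t-1} (simple OLS here; the multilevel alternative instead pools individuals through random AR coefficients and requires specialized software):

```python
# Person-specific AR(1) estimation on simulated first-order AR data.
import numpy as np

def fit_ar1(y):
    """OLS estimate of the AR(1) coefficient: cov(y_t, y_{t-1}) / var(y_{t-1})."""
    x, z = y[:-1], y[1:]
    x = x - x.mean()
    return np.sum(x * (z - z.mean())) / np.sum(x * x)

rng = np.random.default_rng(11)
true_phi = rng.uniform(0.2, 0.6, 50)   # one AR coefficient per person
estimates = []
for phi in true_phi:
    y = np.zeros(60)                   # 60 time points per person
    for t in range(1, 60):
        y[t] = phi * y[t - 1] + rng.normal()
    estimates.append(fit_ar1(y))
print(f"mean bias of PS estimates: {np.mean(np.array(estimates) - true_phi):+.3f}")
```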

Siwei Liu
A Two-Factor State Theory

When studying longitudinal phenomena, the notions of traits and states offer a useful classification. Specifically, traits represent basic human characteristics that have a permanent or enduring property, while states are environmental or ephemeral characteristics that are more time-specific. Admittedly, research often focuses on traits and their relationships to other important variables. Moving in a different direction, this contribution focuses on the more ephemeral aspects of longitudinal variables, that is, states. A very practical justification for this direction comes from model fit indices. A probably more important rationale for expanding the state model is to obtain a more accurate reflection of the situation under study. To establish a common foundation, a longitudinal factor analytic model and a latent curve model are presented. Next, a statistical model of the ephemeral effects, or states, analogous to Spearman’s Two-Factor Theory, is given. Lastly, a substantive illustration demonstrates the value of this Two-Factor State Theory.

John Tisak, Guido Alessandri, Marie S. Tisak
SPARK: A New Clustering Algorithm for Obtaining Sparse and Interpretable Centroids

k-means clustering is one of the most popular procedures for multivariate analysis, in which observations are classified into a reduced number of clusters. The resulting centroid matrix is referred to in order to identify the variables that characterize each cluster, but between-cluster contrasts in the centroid matrix are not always clear and can thus be difficult to interpret. In this research, we address this interpretation problem and propose a new k-means clustering procedure, called SPARK, which produces a sparse and therefore interpretable centroid matrix. In SPARK, the sparseness of the centroid matrix is constrained, so it contains a number of exactly zero elements. Because of this, between-cluster contrasts are highlighted, allowing clusters to be interpreted more easily than with standard k-means clustering. A procedure for selecting the optimal sparsity of the centroid matrix with reduced computational load is also proposed. The behavior of the proposed procedure is evaluated with two real data examples, and the results indicate that SPARK performs well on real-world problems.
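A minimal sketch of the underlying idea: alternate ordinary k-means updates with a sparsification step that keeps only the q largest (in absolute value) entries of the centroid matrix and zeroes the rest. The hard-thresholding update used below is an assumption for illustration; the chapter derives its own constrained update and sparsity selection procedure:

```python
# Toy k-means variant with a hard-thresholded (sparse) centroid matrix.
import numpy as np

def sparse_kmeans(X, k, q, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):                     # ordinary centroid update
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
        cutoff = np.sort(np.abs(C).ravel())[-q]  # keep q largest entries
        C[np.abs(C) < cutoff] = 0.0
    return C, labels

rng = np.random.default_rng(1)
X = np.r_[rng.normal(0, 1, (100, 5)) + [3, 0, 0, 0, 0],
          rng.normal(0, 1, (100, 5)) + [0, 3, 0, 0, 0]]
C, _ = sparse_kmeans(X, k=2, q=2)
print(np.round(C, 2))   # sparse centroids highlight the separating variables
```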

Naoto Yamashita, Kohei Adachi
Metadata
Title
Quantitative Psychology
Editors
Prof. Marie Wiberg
Steven Culpepper
Rianne Janssen
Jorge González
Dylan Molenaar
Copyright Year
2018
Electronic ISBN
978-3-319-77249-3
Print ISBN
978-3-319-77248-6
DOI
https://doi.org/10.1007/978-3-319-77249-3
