Abstract
This study investigated concerns about potential racial/ethnic bias in the Desired Results Developmental Profile (DRDP), a widely used observational assessment in early childhood education (ECE). Specifically, we examined whether the learning progressions (LPs) underlying the DRDP items exhibit differential applicability to children from culturally and linguistically diverse (CLD) families—comparing groups of children identified as Latino/a, Black, or White. Using Rasch person fit analyses with a large public preschool sample (N = 80,058), we tested the hypothesis that greater positive misfit for Latino/a and Black children would indicate bias. Contrary to this hypothesis, our findings revealed comparable person fit distributions across the three racial/ethnic groups, suggesting no evidence of racial/ethnic bias in the DRDP. Additionally, the study identified key trends, including higher positive misfit among children in special education, likely reflecting greater intra-individual variability, and slightly lower misfit for children identified as female compared to males. Person fit improved over the course of the academic year, underscoring the potential influence of raters and contexts on consistency. Teacher-level variance emerged as the largest contributor to person fit variance, highlighting the need for ongoing professional development and support to ensure consistency. The results support the cultural sensitivity of the DRDP and emphasize the importance of continued research on rater effects and potential biases in early learning assessment. This study also demonstrates the promise of person fit metrics as a tool to enhance assessment validity, equity, and instructional planning in ECE contexts serving children from diverse families.
Observational early learning assessments, sometimes employed as kindergarten entry assessments (KEAs), are in widespread use in the United States, inside and outside of publicly funded early care and education (Ackerman, 2018; Goldstein & Flake, 2016). This category of assessments, which we label early learning and kindergarten entry assessments (EL-KEAs) hereafter, is designed to measure all children’s learning and development, including those from culturally and linguistically diverse (CLD) families and children with disabilities.
Currently, there is broad speculation that EL-KEAs could be biased against children from culturally and linguistically diverse families. A large early education policy coalition recently questioned whether EL-KEAs, which aim to measure children’s development along hypothetical developmental continua, disadvantage nonwhite children who often have a broader range of developmental experiences and trajectories (First 5 Center for Children’s Policy [First 5], 2022). Unfortunately, the coalition did not present an example of this bias or offer a well-developed hypothesis of how bias may occur. Indeed, we aim to fill this gap, in part, by operationalizing specific, testable hypotheses of potential assessment bias in this paper.
The EL-KEAs in question typically use an assessment technology called learning progressions (LPs; Harris et al., 2022; Mangione et al., 2019). LPs are descriptions of curricular sequences that delineate successive levels of progress in an area of learning or development. LPs are often associated with assessment rubrics that serve as the scoring model for the assessment items. Some investigators in the LPs literature, and the literature on learning trajectories, or LTs, as they are called in mathematics education, have posited that children with different sociocultural experiences may have different learning pathways that are not captured equally well by the LPs underlying the assessments (Harris et al., 2022). Unfortunately, such ideas are typically speculative, grounded in theory but lacking examples of bias in action. However, Kang and Furtak (2021) found that reframing LPs around sociocultural concepts led to important changes such as the consideration of different learning goals and different uses of the LPs in classrooms. Although not evidence of bias, Kang and Furtak’s findings showed how culture can influence LPs/LTs and their application.
Very little published research has examined the cross-cultural applicability of LPs within EL-KEAs. It is therefore difficult to evaluate their potential bias against children with diverse developmental experiences. We will call this cultural sensitivity and explore this definition throughout. As we explain below, common validation methods, such as differential item functioning (DIF), are probably insensitive to the bias in question. Given the paucity of research in this area and the widespread use of EL-KEAs with vulnerable populations of children, attention to cultural sensitivity deserves urgent focus.
In this paper, we used Rasch person fit analysis to examine the cultural sensitivity of one EL-KEA used widely in the United States, the Desired Results Developmental Profile, or DRDP, assessment (California Department of Education, 2015). Person fit analysis has been previously used to study the relationship between culture and assessment validity (e.g., Custers et al., 2000; Lamprianou & Boyle, 2004; Petridou & Williams, 2007; Şengül Avşar & Emons, 2021). We posited that the specific type of bias proposed by First 5 (2022) could produce higher average positive person misfit in children from CLD families.
The remainder of this introduction contains two main sections: First, we explain and contextualize the use of LPs within EL-KEAs. We connect early childhood LPs with the broader LP/LT literature and explain why current validation methods fall short. Second, we briefly describe person fit methods. We review relevant technical literature and articulate a hypothesis for the ways that cultural (in)sensitivity could manifest as person misfit.
1 Assessment-based LPs and EL-KEA systems
LPs play a crucial role in EL-KEAs, though research on them is limited. The three major EL-KEA systems—Desired Results Developmental Profile (DRDP), Teaching Strategies Gold (TS-GOLD; Heroman et al., 2010), and COR Advantage (HighScope, 2014)—incorporate LPs, but available documentation lacks depth. Mangione et al. (2019) provide a practitioner-focused overview of DRDP’s LPs, while the LPs in TS-GOLD and COR Advantage are primarily described in technical reports and marketing materials. These sources offer only surface-level explanations, leaving large gaps in reporting about their development and applications.
1.1 LPs in the DRDP
The LPs in EL-KEAs may be described as rubrics that operationalize children’s learning and development in an important sub-domain of functioning (Mangione et al., 2019). For example, Fig. 1 shows an LP for the DRDP. This LP contains a set of levels that describe what it means to achieve competence in communication and expressive language, in small and progressive steps. Each LP is an assessment item: The intention is that, during assessment, a teacher will use the LP as a rubric to assess a child’s learning and development. Note that the levels are ordered to reflect qualitatively distinct steps in growth, and the definitions focus on observable behaviors.
1.1.1 Connecting early childhood LPs with the broader LP literature
Research from K-12 LP/LT literature can help us better understand early childhood LPs. Duncan and Rivet’s (2018) framework suggests five key characteristics of LPs: (1) grain size of developmental levels, (2) construct scope, (3) how the LPs were created, (4) the types of levels, and (5) what progresses and how it occurs. For early childhood LPs, grain size tends to be broad, covering multiple years rather than narrower age spans. This approach promotes continuity, allowing assessments to span different developmental levels. It may limit sensitivity to finer distinctions in progress, which can affect how frequently teachers detect and respond to developmental changes. EL-KEA scope usually encompasses the five essential domains of readiness originally described by the National Education Goals Panel [NEGP] (1995), and further operationalized by the Head Start Early Learning Outcomes Framework (Office of Head Start, 2015). The early childhood LPs are typically created as they are in the science education tradition: by multidisciplinary teams who develop developmental sequences based on literature reviews and task analyses, using iterative refinement processes to reach consensus on measurement, often with a focus on cultural and linguistic sensitivity. The levels of most early childhood LPs follow a developmental trajectory, typically informed by established theories that shed light on what progresses and how it occurs: Cognitive development changes from simpler to more complex forms of logical thought (Piaget, 1941/1965). Environmental and cultural factors work in concert to shape development (Sarama & Clements, 2009). Children develop through scaffolded support from adults that is calibrated to their current needs (Vygotsky, 1978), a process that can be guided through child assessment (Griffin, 2007). An early curriculum can be effective by revisiting foundational skills repeatedly, with increasing depth (Bruner, 1960).
1.1.2 LPs and authentic child assessment
LPs are frequently associated with uncertainties regarding their conceptualization and appropriate applications, both within KEAs (Harvey & Ohle, 2018), and more broadly in educational contexts (Kubsch et al., 2022). Despite this ambiguity, the field has adapted to accommodate these challenges, conceptualizing LPs as frameworks that organize and bring coherence to complex and dynamic developmental processes. For example, the LPs embedded within EL-KEAs guide teachers in making structured developmental observations aligned with construct progressions (as in Fig. 1). Teachers generally regard this structured approach as beneficial (Little et al., 2020), distinguishing EL-KEAs from earlier observational assessments that lacked explicit observation guidelines (e.g., Meisels & Piker, 2001; Ready & Wright, 2011).
It is essential to recognize that a child’s progress along an LP or LT is neither expected to be linear nor strictly unidimensional (Wilson, 2009). Individual learners’ paths often deviate from hypothetical trajectories, as demonstrated in prior research (Duschl et al., 2011; Gotwals & Songer, 2010; Steedle & Shavelson, 2009). Furthermore, children’s capacities frequently exhibit variability both within and across tasks over time (Flavell, 1994). Nonetheless, scientific consensus affirms that LPs/LTs serve as valuable organizing frameworks to understand typical learning patterns and individual differences (National Research Council [NRC], 2008).
The inherent variability in young children’s daily performances emphasizes the importance of authentic formative assessment. A single, direct summative assessment conducted in a novel or artificial setting is unlikely to reliably capture a valid representation of a child’s developmental progress (NRC, 2008; Shepard et al., 1998). Factors such as rapport, motivation, and familiarity with assessment materials significantly influence children’s performance. Flexible assessment environments, which account for these factors, better accommodate developmental and cultural differences, allowing children to demonstrate their true knowledge and skills (NRC, 2001).
The use of LPs is anchored in a theoretically robust framework that organizes complex learning patterns, tracks typical developmental trajectories, and addresses individual variability. These tools empower well-supported educators to guide instruction effectively without imposing rigid developmental sequences (Heritage, 2008). The following section examines the evidence supporting their accuracy and practical application.
1.2 Validity of EL-KEAs for children from CLD families
In educational measurement, the current consensus is that an assessment’s accuracy and utility for a purpose should be substantiated by a validity argument (Kane, 2013), which integrates theory and evidence in support of an assessment’s interpretation and use(s) (American Educational Research Association et al., 2014). Although EL-KEAs are used for a variety of purposes in practice (Goldstein & Flake, 2016), their primary stated purpose is often to inform instruction as a formative assessment tool. The underlying premise is that assessments provide feedback to teachers (Sadler, 1989) enabling them to design learning environments that help children progress along developmental trajectories (Black & Wiliam, 1998). However, implementing formative assessment effectively is challenging in practice. Teachers often require more intensive professional development and sustained support than they currently receive (Holcomb et al., 2024). Furthermore, experimental research is needed to substantiate claims that teacher feedback informed by EL-KEAs leads to more effective instructional practices and improved child outcomes.
There is a core set of validation needs that applies to EL-KEAs’ common purposes: Evidence based on test content, alignment with educational standards, test reliability, and DIF analysis are necessary for most uses. These categories of evidence are also typical for large-scale summative assessments for older students (Schafer et al., 2009). Next, we briefly summarize the existing validity evidence for the three systems.
1.2.1 Test content
Evidence of test content is typically gathered during the test construction process, where the development team designs, evaluates, and revises the instrument, producing descriptions of content domains, developmental sequences, and justifications for design choices: For the DRDP, see Kriener-Althen et al. (2020) and WestEd (2018). These evaluations, informed by research literature, task analysis, and expert opinions, assessed the instruments’ content adequacy in terms of breadth, depth, and suitability of items, and have been documented in technical manuals, whitepapers, and peer-reviewed articles (Lambert et al., 2015; Wakabayashi et al., 2019).
1.2.2 Alignment
The goal of alignment is to ensure that assessment content is congruent with educational standards and learning goals (Bhola et al., 2003; Herman et al., 2007; Martone & Sireci, 2009), ultimately ensuring that items measure what children are expected to learn. The DRDP offers detailed documentation of its alignment with California’s early learning standards and school readiness concepts (Kriener-Althen et al., 2020). At minimum, other systems typically provide basic statements of alignment with developmental indicators or state guidelines (e.g., Wakabayashi et al., 2019).
1.2.3 Reliability
EL-KEAs tend to have extensive documentation of their psychometric reliability, showing strong internal consistency and person separation reliability coefficients (Draney et al., 2021; Lambert et al., 2015; Wakabayashi et al., 2019). While interrater reliability (IRR) is generally adequate (κ = 0.6 to 0.8; Kowalski et al., 2018; Chen-Gaddini et al., 2022a; Joseph et al., 2020), it remains a recognized challenge in observational assessments (Mashburn & Henry, 2005; Waterman et al., 2012) that requires rigorous, ongoing rater training (Cash et al., 2012).
1.2.4 External variables
EL-KEAs generally show strong convergent validity, with moderate correlations between EL-KEA scores and direct assessments of similar constructs, as well as evidence of increasing scores with age (Chen-Gaddini et al., 2022b; Kim et al., 2018; Russo et al., 2019; DCRG, 2018; Sussman et al., 2023). However, divergent validity tends to be weaker, as the expected low correlations between EL-KEAs and assessments of dissimilar constructs are often unclear, likely due to the global nature of early childhood development and imperfect construct alignment. The DRDP shows highly correlated latent variables across learning domains (around 0.9), somewhat higher than the correlations between academic domains observed in international assessments (Draney et al., 2021). The evidence suggests a need for further research to clarify these divergent validity issues (Chen-Gaddini et al., 2022b; Lambert et al., 2015; Wakabayashi et al., 2019).
1.2.5 Internal structure
Evidence for internal structure in EL-KEAs is robust, focusing on the alignment between theoretical and estimated models, typically using Rasch family models. Methods such as model fit, dimensionality analysis, and Wright Map analysis are commonly employed to assess internal structure, with all three EL-KEAs providing extensive documentation (DCRG, 2018; Lambert et al., 2015; Wakabayashi et al., 2019), and the DRDP leveraging its internal structure to represent children’s progress across multiple learning progressions with general construct levels (Sussman et al., 2023). Item invariance studies, including longitudinal DIF analyses, further support internal structure, with evidence suggesting that rater experience improves internal consistency (Lambert et al., 2015).
1.2.6 Fairness and consequences
Fairness and consequences of EL-KEAs have received limited attention, which is typical for large-scale assessments (Heubert & Hauser, 1999; Schafer et al., 2009), despite the potential for fairness issues in multicultural populations (Hambleton et al., 2004; Matsumoto & Van de Vijver, 2012). Although studies of DIF (Holland et al., 1993; Paek & Wilson, 2011; Van den Noortgate & De Boeck, 2005) have shown that DRDP (Draney et al., 2021) and TS-GOLD (Kim et al., 2018) function similarly across various demographic groups, this alone cannot justify fairness; additional research is needed, particularly regarding the consequences of using EL-KEAs for high-stakes decisions (Yun et al., 2021) and the impact of teacher professional development (Barghaus et al., 2023).
1.3 Person fit and cultural sensitivity
A core assumption of assessment validity is that assessments measure individuals from different groups with equal accuracy, typically evaluated through Rasch models (Bond et al., 2021; de Ayala, 2022) in EL-KEAs. Whereas DIF is a common method for assessing group fairness, it does not capture all aspects of cultural sensitivity, particularly in developmental assessments where DIF may be nonuniform (Millsap & Everson, 1993). Subgroup analysis and person fit analysis, which examine group-level measurement differences and individual-level fit, offer additional insights into fairness and cultural sensitivity (Aguinis et al., 2016; Meijer & Sijtsma, 2001; Roth et al., 2014). These methods are essential for identifying issues like implicit bias and ensuring valid cross-cultural comparisons (McQueen & Mendelovits, 2003; Schulz & Fraillon, 2011).
1.3.1 Person fit analysis
Person fit analysis is used to identify unusual or unexpected response patterns in individuals or groups, even in situations where the measurement model fits the overall data. It compares an individual’s response pattern to the expectations set by the model, based on person and item threshold locations, and highly improbable patterns are classified as misfitting. Misfit can be either negative (more consistent than expected) or positive (less consistent than expected). Excessive person misfit has been shown to potentially compromise the validity of an assessment by introducing biases in person or item parameters (Emons, 2009; Mousavi & Cui, 2020) or complicating the interpretation of individual scores (Walker, 2017). Person fit analysis has been applied to examine cultural and linguistic biases in large-scale assessments (Petridou & Williams, 2007; Şengül Avşar & Emons, 2021; Van der Flier, 1983). For example, Lamprianou and Boyle (2004) found that ethnic minority students who spoke English as a second language had higher levels of positive person misfit on a math test in England than their peers, suggesting potential bias in the assessment for this group.
1.3.2 Person fit statistics
Many different person fit statistics exist (Şengül Avşar, 2019), but only a handful are appropriate for use with the polytomous items common in EL-KEAs. The most widely applied statistics for such items are Wright and Masters’ (1982) parametric loss function statistics, called mean squares (MS), usually applied with the Rasch model. These statistics provide diagnostic insight into person fit by assessing the difference between an individual’s response pattern, modeled as a person response curve (PRC), and the modeled mean PRC (Ferrando, 2015; Reise, 2000; Turner & Engelhard, 2023; Walker et al., 2018). An individual PRC that is steeper than the mean indicates more consistency than expected (negative misfit), while a PRC that is flatter than the mean suggests less consistency (positive misfit). This is analogous to item misfit in Rasch modeling, where an item’s discrimination parameter is compared to the average across all items (Wu & Adams, 2013).
Unweighted MS, also called outfit, is the simple average of the squared standardized residuals and is often described as an “outlier sensitive” fit statistic that is more sensitive to unexpected ratings on items that are either very easy or very hard for a person. In contrast, weighted MS, also called infit, is an “information-weighted” statistic that is more sensitive to unexpected ratings on items that are roughly targeted to the person. Current literature recommends the use of weighted MS over the unweighted version, as it provides greater reliability (Müller, 2020). The present analysis focuses exclusively on weighted MS, but Supplement 1 provides a comparison of both weighted and unweighted MS results.
Equation 1 shows that, for person \(n\) responding to items indexed by \(i\), weighted MS (\({V}_{n}\)) is the information-weighted mean of the squared standardized residuals \({y}_{ni}\), with weights given by the information \({W}_{ni}\) (Wright & Masters, 1982). All other things being equal, \({W}_{ni}\) is larger for items located closer to person \(n\)’s latent ability on the measurement (logit) scale, so unexpected responses to well-targeted items contribute more to \({V}_{n}\).
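In the Wright and Masters (1982) form implied by this description, Eq. 1 can be written as

$$V_{n} = \frac{\sum_{i} W_{ni}\, y_{ni}^{2}}{\sum_{i} W_{ni}}.$$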
The expected value for MS statistics is 1.0. It is common to treat values above 1.3 as indicative of positive misfit, that is, unmodeled noise or other sources of variance that may degrade measurement (Wright & Linacre, 1994). In contrast, values below 0.7 indicate negative misfit, which is typically less concerning: negative misfit may be associated with local dependence or redundant items, but it is unlikely to degrade measurement. The t statistics historically used with MS are sensitive to sample size and are not recommended with large samples.
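To make these statistics concrete, the following minimal sketch (in R, the software used for the analyses reported below) computes both MS statistics for a single person; the category probabilities and ratings are hypothetical toy values, not DRDP parameters.

```r
# Weighted (infit) and unweighted (outfit) mean squares for one person
# under a polytomous Rasch model, following Wright and Masters (1982).
person_ms <- function(x, probs) {
  cats <- seq_len(ncol(probs)) - 1          # category scores 0, 1, 2, ...
  E  <- as.vector(probs %*% cats)           # expected score per item
  W  <- as.vector(probs %*% cats^2) - E^2   # model variance (information) per item
  z2 <- (x - E)^2 / W                       # squared standardized residuals
  c(outfit = mean(z2),                      # unweighted MS: plain average
    infit  = sum(W * z2) / sum(W))          # weighted MS: information-weighted mean
}

# Toy example: three 5-category items; the observed ratings swing widely
probs <- rbind(c(0.05, 0.15, 0.40, 0.30, 0.10),
               c(0.10, 0.20, 0.40, 0.20, 0.10),
               c(0.05, 0.25, 0.40, 0.25, 0.05))
fit <- person_ms(x = c(4, 0, 4), probs = probs)
ifelse(fit > 1.3, "positive misfit", ifelse(fit < 0.7, "negative misfit", "typical"))
```

For this erratic pattern, both statistics land well above 1.3, so the person would be flagged for positive misfit.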
Applied to the current assessment, Table 1 presents the weighted MS fit statistics (i.e., infit) for responses in the Social and Emotional Development (SED) domain of the DRDP. The table includes data from six example respondents, all with identical raw scores of 20 and a corresponding latent estimate of 0.0 logits. The results show a range of person fit: one set of scores demonstrates expected misfit (MS close to 1.0), two sets exhibit low MS (< 0.7), and three show high MS (> 1.3). The first row of the table displays the predicted scores for a respondent with a raw score of 20, as derived from the measurement model, providing context for the observed MS values.
Table 1
Misfitting and fitting response patterns for the social and emotional domain of the DRDP assessment
Respondent ID   Raw score   Weighted MS   SED1   SED2   SED3   SED4   SED5
Predicted       20          –             3.87   3.92   4.01   4.18   4.07
MS = 1 (predicted misfit)
  1001          20          0.97          5      4      4      3      4
Low MS (negative misfit)
  1002          20          0.02          4      4      4      4      4
  1003          20          0.51          3      4      4      5      4
High MS (positive misfit)
  1004          20          1.50          3      5      4      3      5
  1005          20          2.49          5      4      4      2      5
  1006          20          7.01          4      6      6      1      3
The first response pattern, under MS = 1, exhibits typical orderliness. Although there was some variation around the predicted scores (notably for SED1 and SED4, which deviate more from the model’s predictions than other items), such divergence is expected in a probabilistic model and in assessment of child development (e.g., accommodating horizontal decalage). The second set of scores, under “Low MS,” aligns more closely with the predicted values than expected. For respondent 1002, the observed scores match the predicted values almost exactly, yielding a very low weighted MS of 0.02. Respondent 1003 shows a two-point change that results in the smallest increase in weighted MS, rising from 0.02 to 0.51 but still well below 1.0. Notably, the average MS associated with all 2-point changes is 0.51.
The DRDP was designed to encourage a pattern of negative misfit, with items developed to work in lockstep to reflect a generalized developmental trajectory across items (Sussman et al., 2023). Consequently, a child rated as a 4 on one item is likely to be rated a 4 on others. This built-in consistency, though only a matter of degree, results in less variance than the model formally predicts, leading to average child misfit below 1.0. As we will discuss later, this lower misfit (less variation than expected) is not concerning and does not indicate a problem with the assessment.
The final set of response patterns, under “High MS,” exhibits larger discrepancies between observed and expected scores. Respondent 1004, with an MS of 1.5, demonstrates a response pattern that would be flagged as misfitting. Respondents 1005 and 1006 show even more pronounced variability, with 1006 displaying a response pattern in the top 2% of positive misfit. Positive misfit cases, like those of respondents 1005 and 1006, raise concerns about valid measurement, as they suggest highly unexpected responses on certain items, signaling potential issues with the assessment.
1.3.3 Relating person misfit with cultural sensitivity in the DRDP
In this study, the central hypothesis tested through person fit analysis is that a child’s cultural background and related experiences may influence their intra-individual development, specifically the consistency with which different skills develop. The DRDP’s measurement model can be viewed as a testable framework for child development. As Thurstone (1937) noted, the parameter estimates take on substantive meaning because they become an empirical representation of the underlying theory. Thus, the assessment contains a probabilistic model of how children are expected to progress across various developmental domains. Person fit analyses allow us to evaluate each child’s match with these expectations.
The present person fit analysis compared the degree to which development within the three groups matched the expectations set forth by the model. Practically speaking, a broader range of lived experiences may selectively support development in different areas, affecting the rate and sequence of learning, and manifesting as relative strengths or weaknesses in development (e.g., Saxe, 1988). A child exhibiting significant inconsistency across developmental areas relative to the model will show high misfit (i.e., MS ≫ 1.0), whereas low intra-individual variation relative to the model results in low misfit (i.e., MS ≪ 1.0).
Following Rupp’s (2013) recommendation that person fit studies include explicit, a priori hypotheses regarding expected fit and directionality, we hypothesized that children who are Latino/a or Black would exhibit greater average positive misfit compared to their White peers. Such a pattern would suggest that children from diverse backgrounds are not rated with the same level of consistency as White children, providing evidence that the learning progressions (LPs) may not fully reflect the developmental trajectories of children with a broader range of cultural and developmental experiences. This could indicate potential biases in the assessment (e.g., Walker, 2017).
This study addresses several calls in the literature: (a) from the early childhood education (ECE) field to assess the fairness of EL-KEAs, (b) from research on learning progressions to explore their applicability to diverse children, and (c) from person fit literature to use multilevel regression methods to better understand factors influencing person fit. This study directly addresses concerns raised by First 5 (2022) regarding the cultural relevance of EL-KEAs. Again, if an assessment fails to account for the varying rates and sequences of development shaped by diverse cultural experiences, it could compromise the fairness and accuracy of the score interpretations. Our study sought evidence for the cultural sensitivity of the DRDP, and to support or contradict claims that the assessment should be regarded as an equitable measure of child development for children from CLD families, with implications for similar EL-KEAs.
2 Methods
2.1 Overview
In this study, we employed a regression framework to examine person fit on the DRDP assessment among children identified as Latino/a, Black, and White. To account for contextual factors, we used multilevel modeling to control for background variables and unmeasured variance at the teacher and school levels. The results were interpreted to assess whether the developmental progressions underlying the DRDP assessment are equally applicable to these three racial/ethnic groups.
2.2 Data and participants
This study utilized data from DRDPtech, a state assessment database that was part of the California Desired Results Program. DRDPtech has since been replaced, but it provided comprehensive data for this analysis. The database contains DRDP scores and teacher-reported demographics for children in state-funded infant/toddler, preschool, or kindergarten programs in California from 2015 through 2018.
We selected participants using a stratified sampling process as follows: (1) Starting with the full database (1.25 million observations), we retained children aged 2–5 who completed the preschool version of the DRDP. (2) Observations with missing data were excluded to ensure consistent parameter estimation across groups. (3) To eliminate dependency, we randomly selected one assessment per child, removing repeated measures. (4) Given the small size of the group of children identified as Black, we sampled all identified Black children and randomly selected equal numbers of Latino/a and White children to create a balanced sample. (5) To estimate the complete measurement model, we added 15,000 infants/toddlers and 1200 kindergarteners with complete data, stratified equally across the three racial/ethnic groups. Our software and estimation require complete data for all categories, and using more than the minimum cases helps ensure model stability. (6) The final sample included 96,258 children nested in 4326 school sites. The median number of children per site was 13, ranging from 1 to 664. The misfit analysis focused on the preschool subsample (N = 80,058), as this group aligned with the primary research aim of evaluating person fit across racial/ethnic categories.
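As an illustration, steps 1 through 4 of this pipeline could be expressed as follows; the data frame and column names are hypothetical placeholders, not the actual DRDPtech schema.

```r
library(dplyr)

set.seed(2024)                                            # reproducible sampling
analytic <- drdp_all %>%
  filter(version == "preschool",
         between(age_months, 24, 71)) %>%                 # (1) preschool DRDP, ages 2-5
  filter(complete.cases(.)) %>%                           # (2) drop incomplete records
  group_by(child_id) %>%
  slice_sample(n = 1) %>%                                 # (3) one assessment per child
  ungroup()

n_black <- sum(analytic$race == "Black")                  # (4) keep all Black children and
balanced <- analytic %>%                                  #     sample equal numbers of
  filter(race %in% c("Black", "Latino/a", "White")) %>%   #     Latino/a and White children
  group_by(race) %>%
  slice_sample(n = n_black) %>%
  ungroup()
```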
2.3 Data availability statement
The data are owned by California and are not publicly available.
2.4 Participant demographics
We analyzed demographic variables available in DRDPtech that are commonly used to explain assessment variance in similar studies. These variables include teacher-reported information about children’s race/ethnicity, age, gender, special education status, the language(s) that the teacher speaks with the child, and the type of preschool setting. In addition, to control for the effects related to semester of assessment (raters become more experienced with the DRDP during the year), we recorded which semester the assessment was conducted in.
Children identified as both Latino/a and White were classified as Latino/a to maintain consistency with prior research practices and ensure balanced group sizes. Children identified as multiracial were excluded from the analysis.
Table 2 shows that racial/ethnic and age distributions were balanced across groups. Table 3 highlights additional demographic variables such as language match, special education status, and childcare settings. Children in special education comprised 6% of the analytic sample, which aligns with statewide numbers. As expected, about two-thirds of children identified as Latino/a were classified as multilingual learners (MLs), whereas far fewer ML classifications were found among children identified as Black or White.
Table 2
Children’s race/ethnicity and age
Race/ethnicity   n        M      SD    Min   Max
Latino/a         26,686   53.0   6.9   36    71
Black            26,686   52.5   7.1   36    71
White            26,686   53.0   6.9   36    71
Note. Age reported in months
Table 3
Crosstabulation of race/ethnicity with categorical variables
                      Latino/a           White              Black
                      n         %        n         %        n         %
Gender
  Female              13,418    50.3     12,993    48.7     13,607    51.0
  Male                13,249    49.6     13,682    51.3     13,064    49.0
  Missing             19        0.1      11        < 0.1    15        0.1
Special education
  General education   23,061    86.4     22,377    83.9     23,027    86.3
  Special education   1,342     5.0      1,864     7.0      1,155     4.3
  Missing             2,283     8.6      2,445     9.2      2,504     9.4
Multilingual learner
  Yes                 18,044    67.6     4,097     15.4     1,917     7.2
  No                  8,642     32.4     22,589    84.6     24,769    92.8
Language match
  Match               17,691    66.3     23,342    87.5     24,979    93.6
  No match            8,846     33.1     3,252     12.2     1,596     6.0
  Missing             149       0.6      92        0.3      111       0.4
Setting
  Center-based        2,264     8.5      1,894     7.1      3,040     11.4
  Family care         251       0.9      325       1.2      229       0.9
  Head Start          4,786     17.9     5,713     21.4     7,112     26.7
  State school        15,507    58.1     14,938    56.0     12,254    45.9
  Other setting       1,122     4.2      1,332     5.0      830       3.1
  Missing             2,756     10.3     2,484     9.3      3,221     12.1
Semester
  Fall                14,207    44.5     14,154    43.9     13,705    42.7
  Winter              14,767    47.2     15,172    48.7     14,533    46.2
  Spring              3,112     8.3      2,760     7.4      3,848     11.2
Language match, a constructed variable, indicates whether the teacher reported speaking the child’s home language. About 66% of children identified as Latino/a had a teacher who spoke their home language, compared to 88% of children identified as White and 94% of children identified as Black. These differences suggest potential variations in teacher–child communication patterns across groups.
Childcare setting distributions were generally balanced across racial/ethnic groups, further supporting the comparability of the sample across these categories. Semester was also balanced across racial/ethnic groups, as expected, indicating valid sampling. Spring DRDP assessments were optional, hence the smaller number of completed spring assessments relative to fall and winter.
2.5 Assessment
The outcomes in this study were assessment scores for the five major domains of the DRDP assessment. Children were rated by their teachers or childcare providers, and scores were generated using the operational DRDP assessment system methods (DCRG, 2018): In brief, items within each domain were scaled using the unidimensional Rasch PCM (Masters, 1982), incorporating a dimensional alignment technique (Feuerstahler & Wilson, 2021) to place the five domains on a common measurement scale. The Rasch models were estimated using the (unidimensional) random coefficients multinomial logit model (Adams et al., 1997). Person scaled scores for each dimension (also called domain) were generated using Warm’s weighted likelihood estimates (WLEs).
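A minimal sketch of this scaling step, using the TAM package employed for the IRT analyses (see Sect. 2.5.3); the operational calibration, including the dimensional alignment across domains, is more involved, and `sed_items` is a hypothetical response data frame for one domain.

```r
library(TAM)

# Calibrate one domain's items with the Rasch partial credit model and
# score children with Warm's weighted likelihood estimates (WLEs).
mod_sed <- tam.mml(resp = sed_items, irtmodel = "PCM")
wle_sed <- tam.wle(mod_sed)   # per-child WLE scaled scores, in logits
head(wle_sed$theta)
```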
Descriptive statistics for the DRDP scores, disaggregated by race/ethnicity, are provided in Table 4. The mean scores for children identified as Latino/a and Black were generally lower than those for children identified as White, consistent with prior research (e.g., Isaacs, 2012). The group-level mean difference of approximately 0.4 to 0.6 logits is unlikely to influence the person fit results (Paek & Wilson, 2011), as this difference is small relative to the range of the latent distribution, and our regression models account for age, which is strongly correlated with the latent trait.
Table 4
DRDP assessment scores disaggregated by race/ethnicity (in logits)
Domain    Race/ethnicity   M      SD     Min      Max
ATLREG    Latino/a         0.84   2.63   −10.13   6.34
          White            0.69   2.62   −10.13   6.34
          Black            1.25   2.55   −10.13   6.34
SED       Latino/a         1.03   2.50   −10.27   5.96
          White            1.49   2.42   −10.27   5.96
          Black            1.08   2.50   −10.27   5.96
LLD       Latino/a         0.79   2.36   −10.64   6.08
          White            1.44   2.25   −10.64   6.08
          Black            0.99   2.34   −10.64   6.08
COG       Latino/a         0.75   2.44   −10.34   6.31
          White            1.35   2.35   −10.34   6.31
          Black            0.79   2.44   −10.34   6.31
PDHLTH    Latino/a         0.89   2.39   −10.67   5.93
          White            1.29   2.25   −10.67   5.93
          Black            0.90   2.35   −10.67   5.93
Note. ATLREG = Approaches to Learning/Self-Regulation; SED = Social and Emotional Development; LLD = Language and Literacy Development; COG = Cognition (including math and science); PDHLTH = Physical Development/Health. n of each row = 26,686
2.5.1 Analysis plan
Our analysis employed weighted mean squares (MS) as the outcome in multilevel regression to examine racial/ethnic differences in person fit across the five major domains of the DRDP assessment.
2.5.2 Weighted MS
Weighted MS, often referred to as infit, has been widely used in the education and healthcare literatures, as well as in large-scale assessment systems such as NAEP and PISA. The strengths and limitations of this person fit statistic are well established (Li & Olejnik, 1997; Smith et al., 2008). Current guidelines recommend avoiding unweighted MS, and research suggests that different fit statistics for polytomous items perform similarly across different assessment conditions, with the choice of method having minimal impact on results (Şengül Avşar, 2019). Our comparison of weighted MS and unweighted MS indicated that the choice had no impact on the conclusions of this study (Again, Research Supplement 1 compares results from weighted and unweighted MS). Additionally, our exploratory analysis comparing weighted MS with the LPz statistic (Drasgow et al., 1985) revealed a strong correlation of 0.98 between the resulting person fit estimates. Based on these findings, we concluded that reporting weighted MS alone provided an efficient and rigorous approach for this study.
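Continuing the sketch above, person fit statistics can then be extracted for each child; this assumes TAM’s `tam.personfit()` interface, which reports both weighted and unweighted MS.

```r
# Weighted (infit) and unweighted (outfit) MS per person for the
# calibrated model; column names follow the TAM documentation.
pfit <- tam.personfit(mod_sed)
cor(pfit$infitPerson, pfit$outfitPerson)   # agreement between the two statistics
```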
2.5.3 Multilevel regression
Petridou and Williams (2007) applied a multilevel regression framework and found that a random intercept at the classroom level explained a nontrivial portion of the person fit variance, recommending further multilevel regression studies. Petridou and Williams, along with Reise (2000), employed a logistic framework with binary misfit cutoffs, whereas Cui and Mousavi (2015) used a linear model. Given that, in this study, the weighted MS distributions were similar across groups and lacked a clear misfit threshold, we chose a linear model. However, to cross-validate our findings, we also conducted logistic regression (i.e., an odds of MS > 1.3 outcome) and found no meaningful differences in the results. Additionally, we replicated the analysis using only participants with large positive misfit (MS > 1.3) and obtained qualitatively identical results to those using the full sample. Although person misfit statistics can be estimated for groups (Smith & Plackner, 2009), these are infrequently reported in the literature, so we conducted an individual-level analysis.
Statistical analyses were conducted in R (R Core Team, 2021), utilizing TAM (Robitzsch et al., 2024) for IRT analyses and weighted MS, lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) for regression, and ggeffects (Lüdecke, 2018) for marginal means.
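For example, the final model reported in the results could be fit with these packages as follows; the variable names (including `age_c`, the grand mean-centered age) are placeholders for the analytic data set.

```r
library(lme4)
library(lmerTest)    # adds p-values for fixed effects to summary()
library(ggeffects)

# Weighted MS regressed on child covariates, with random intercepts
# for teacher and school site (placeholder variable names).
m3 <- lmer(infit ~ race + age_c + iep + female + semester + setting +
             (1 | teacher) + (1 | site),
           data = dat)
summary(m3)
ggpredict(m3, terms = "race")   # marginal means by racial/ethnic group
```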
2.5.4 Effect sizes and interpretation of person fit results
Traditional t-tests of statistical significance, based on the chi-squared distribution, are sensitive to sample size and thus are not suitable for our large samples. Instead, person fit researchers have emphasized the use of effect sizes. To assess misfit, we applied typical cutoffs of less than 0.7 for negative misfit and greater than 1.3 for positive misfit (Wright & Linacre, 1994). We also developed a custom standardized effect size (ES) metric to interpret between-group differences. The range between the two cutoffs (0.7 and 1.3), a span of 0.6 points, defines the expected variability around the model’s expected value of 1.0 and serves as a useful standard for interpreting between-group differences. Given that the average of all two-point deviations produced a weighted MS of 0.51, we used the mean of 0.6 and 0.51 to define the denominator for our standardized ES as 0.555. This ES allows for interpreting between-group differences as a fraction of either the model’s expected variation or a typical two-point fluctuation. Differences smaller than 0.1 were considered negligible, while larger differences were deemed meaningful, though this value of 0.1 is somewhat arbitrary.
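In code, this ES reduces to a one-line scaling; the 0.51 value is the empirical mean MS of all two-point deviations noted above.

```r
# Custom effect size: between-group difference in weighted MS, scaled by
# the mean of the cutoff span (1.3 - 0.7 = 0.6) and the typical
# two-point fluctuation (0.51).
es_denom <- mean(c(1.3 - 0.7, 0.51))            # = 0.555
person_fit_es <- function(group_diff) group_diff / es_denom
person_fit_es(0.06)   # ~0.11, the largest race/ethnicity contrast reported below
```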
2.5.5 Organization of the results
The results are presented in two sections. The first section provides descriptive statistics prior to statistical adjustment, including distributions of weighted MS by race/ethnicity across the five DRDP domains. The second section presents multilevel regression results in two formats: first, marginal means that show the expected average fit for children categorized as Latino/a, Black, and White, controlling for other variables; and second, regression model coefficients that show the association between person fit and the research variables (fixed effects), as well as the unmodeled variance at the teacher and school levels (random effects).
3 Results
3.1 Person fit distributions among racial/ethnic groups
This section describes the bivariate distributions of person fit and race/ethnicity across each of the five DRDP domains. Table 5 presents the average weighted MS values for each domain and racial/ethnic category (prior to statistical adjustment). Additionally, the column labeled “Raw Deviations” shows the mean difference between predicted and observed item scores, akin to the comparison between respondents 1002 and 1003 in Table 1. For example, the first row of Table 5 indicates that for the ATLREG domain, children identified as Black had a mean weighted MS of 0.84, which corresponds to a 2.41-point raw difference from the “baseline” response vector with the lowest MS. Raw deviations are influenced by the number of items, so they must be interpreted within each domain. Whereas the Weighted MS and Raw Deviations columns were almost perfectly correlated, the latter offers insight into how increases in misfit manifested as changes in scores on the learning progressions.
Table 5
Distribution of raw score deviations and corresponding weighted MS statistics
Domain    Category    Weighted MS     Raw deviations
                      M      SD       M      SD
ATLREG    Black       0.84   1.04     2.41   2.06
          Latino/a    0.78   0.89     2.31   2.01
          White       0.82   0.94     2.30   2.02
          All         0.82   0.96     2.34   2.03
SED       Black       0.52   0.70     1.99   2.16
          Latino/a    0.52   0.70     2.00   2.14
          White       0.56   0.71     2.16   2.17
          All         0.53   0.71     2.05   2.16
LLD       Black       0.74   0.73     4.92   3.45
          Latino/a    0.74   0.72     4.93   3.39
          White       0.76   0.70     5.05   3.34
          All         0.75   0.72     4.97   3.39
COG       Black       0.71   0.69     4.62   3.38
          Latino/a    0.72   0.68     4.70   3.35
          White       0.72   0.64     4.77   3.25
          All         0.72   0.67     4.70   3.33
PDHLTH    Black       0.85   1.05     4.90   4.04
          Latino/a    0.78   0.83     4.68   3.64
          White       0.82   0.84     4.77   3.59
          All         0.82   0.91     4.78   3.76
Note. Each row n = 26,686, except for All (N = 80,058)
For three of the five domains (SED, LLD, and COG), children classified as White exhibited the same or higher average misfit than the other two racial/ethnic groups, which does not support the racial/ethnic bias hypothesis. However, conflicting results were found for ATLREG and PDHLTH, where children identified as Black had slightly higher average misfit compared to the other groups, suggesting the possibility of bias. Standard deviations were consistent across racial/ethnic groups within each domain, providing no evidence of bias. Histograms of the weighted MS statistics (not shown) revealed no subtle differences in distributions that could skew or hinder statistical comparisons, supporting the appropriateness of models for continuous data.
As anticipated, the average weighted MS values were generally low, with all averages in Table 5 falling below the expected value of 1.0. Some values were near the 0.7 threshold, and SED was slightly below this threshold at approximately 0.5. Low weighted MS is characteristic of the DRDP assessment items, which were designed and optimized to ensure within-person consistency across items. Analyses conducted on a subset of children considered misfitting (MS > 1.3) yielded similar results. To retain as much information as possible, the full sample was included in the analysis.
3.2 Multilevel regression
We employed multilevel regression to estimate average person fit for children classified as Latino/a, White, and Black, controlling for the research variables. Research Supplement 2 presents the results of our model building process: Model 1 contained only the variance components. Model 2 added fixed effects for race/ethnicity. Model 3, the final model, included complete fixed effects for all the research variables. Here, we focus on the final model that was applied to each of the five DRDP outcomes. This final model regressed weighted MS on race/ethnicity, grand mean-centered age, special education status, gender, semester, and setting type. ML status and language match were not statistically significant and were removed. Treating age as a continuous variable was more appropriate than as an indicator, as the increase in model fit using the latter was negligible. Interactions between race/ethnicity and setting or race/ethnicity and semester, though theoretically plausible, were not statistically significant and were omitted. Random intercepts for teachers and school sites were included, as likelihood ratio tests showed significant improvements in model fit. The final model is presented in Eq. 2:
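A reconstruction of Eq. 2, based on the covariates just listed (the published notation may differ slightly):

$$\begin{aligned} \text{MS}_{ijk} ={}& \beta_{0} + \beta_{1}\,\text{Latino/a}_{ijk} + \beta_{2}\,\text{White}_{ijk} + \beta_{3}\left(\text{Age}_{ijk}-\overline{\text{Age}}\right) + \beta_{4}\,\text{IEP}_{ijk} + \beta_{5}\,\text{Female}_{ijk} \\ &+ \beta_{6}\,\text{Winter}_{ijk} + \beta_{7}\,\text{Spring}_{ijk} + \sum_{m=8}^{12}\beta_{m}\,\text{Setting}_{m,ijk} + u_{jk} + v_{k} + \varepsilon_{ijk}, \end{aligned}$$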
where the subscripts represent child \(i\), teacher \(j\), and site \(k\). Black serves as the reference category for race/ethnicity, and Fall serves as the reference category for semester. The grand mean centering of age (\({\text{Age}}_{ijk}-\overline{\text{Age}}\)) adjusts the model intercept and random effects variances to the mean age (4.4 years). \({\beta }_{8}\) through \({\beta }_{12}\) are coefficients for the childcare-setting dummy variables, with missing data as the reference category. The model assumes that the two random intercepts are normally distributed with means of zero and variances \(\psi\) and \(\gamma\) given the covariates, that the random effects are uncorrelated, and that the level-1 errors are homoskedastic conditional on the random effects.
We present marginal effects first, followed by model diagnostics and regression coefficients.
3.2.1 Marginal person fit among racial/ethnic groups
To compare person fit across racial/ethnic groups, we estimated average person fit for children in each group while holding other variables constant (i.e., marginal effects). This allowed for direct comparisons between the three groups. We first examined person fit in the five DRDP domains and then evaluated racial/ethnic differences.
3.2.2 Differences across DRDP domains
Figure 2 contains a graph of estimated person fit for the five DRDP domains and three racial/ethnic groups. The solid horizontal line at 1.0 represents the model’s expected fit, whereas the dashed line at 0.7 marks the lower bound for negative misfit (i.e., less problematic misfit). The boundary for positive misfit, at 1.3, is beyond the range shown. All estimated means were below 1.0, indicating that average misfit was lower than expected. This result aligns with the DRDP’s design, which aims for within-person consistency across items within a domain. The mean weighted MS estimates were above baseline, so there was some inconsistency, though less than usual.
Fig. 2
Estimated person fit for the DRDP domains and racial/ethnic categories
These findings provide validity evidence for the DRDP’s internal structure, supporting claims that it captures children’s consistent development within each domain. However, a competing hypothesis is that ratings may be inaccurate, potentially influenced by construct-irrelevant factors, such as rater effects (e.g., a halo effect). We cannot rule out this possibility with the current data and address this limitation in the discussion.
3.2.3 Consistency of SED ratings
Our second observation was that the SED ratings exhibited greater consistency compared to the other domains. Notably, SED also crossed below the 0.7 threshold for negative misfit, suggesting that, in addition to its greater consistency relative to other domains, the SED ratings were more consistent than expected by the model. This finding may reflect intrinsic consistency among the items, a general developmental trend in SED, or potential halo effects or other construct-irrelevant influences. It could also indicate redundancy among the items, which might be addressed by reducing the number of items. However, we view this situation as less problematic than positive misfit, which reflects greater than expected inconsistency. Overall, response inconsistency does not appear to be a significant issue within any of the DRDP domains.
3.2.4 Racial/ethnic differences
Within each domain, person fit differences across racial/ethnic groups were minimal. Statistically, only three contrasts were significant at the p < 0.05 level: Latino/a (green marker) vs. Black (red marker) in ATLREG (p < 0.001), LLD (p < 0.05), and PDHLTH (p < 0.001). The remaining racial/ethnic comparisons within each domain were not statistically significant. Even for the significant contrasts, the average difference was 0.015, with a range of 0 to 0.058, resulting in a very small effect size (ES = 0.027, as defined in the methods). In conclusion, the racial/ethnic differences in person fit, as estimated by the multilevel models, were minimal.
3.2.5 Model fit and regression coefficients
Table 6 presents estimates of the variance explained by the model. The first row, labeled “Site,” represents the variance in person fit explained by the random intercept at the site level. This variance ranged from 4 to 8%, indicating that site-level factors accounted for a modest portion of the total variance. The second row, “Teacher,” shows that the teacher-level random intercept explained between 13 and 19% of the variance, the largest contribution of any structural component in the model. This variance may reflect unmeasured classroom variables, including a potential rater effect, where teachers’ idiosyncratic rating styles influenced their assessments. The “Error” row indicates that between 73 and 82% of the variance remained unexplained by the model. Next, intraclass correlations (ICCs) are also included to facilitate broader comparisons with the literature. ICCs at the teacher level ranged from 0.126 to 0.189, whereas ICCs at the site ranged from 0.034 to 0.074. Finally, the “Marginal R2” row shows that fixed effects accounted for only 1% of the variance in each model. This result is not surprising or inherently problematic, as one would expect person fit to be largely independent of demographic variables in a fair and well-designed assessment.
Table 7 presents the coefficients and associated statistics from the final regression models. The results were relatively consistent across the five DRDP domains, allowing for some general conclusions, tempered by the study’s limitations discussed later. Again, the small variance explained by the fixed effects highlights the substantial variability around the estimates and underscores the need for cautious interpretation.
Table 7
Regression coefficients for multilevel regression of person fit on research variables
                   ATLREG                 SED                    LLD                    COG                    PDHLTH
                   Est      SE            Est      SE            Est      SE            Est      SE            Est      SE
Fixed part
Race/ethnicity
  Latino/a         −0.059   0.010***      −0.006   0.007         −0.017   0.007**       −0.003   0.007         −0.037   0.009***
  White            −0.019   0.010          0.010   0.007         −0.003   0.007         −0.009   0.007         −0.010   0.009
Age                −0.006   0.001***      −0.004   4.0E−04***    −0.006   4.0E−04***    −0.004   3.7E−04***    −0.009   5.0E−04***
IEP                 0.136   0.015***       0.099   0.011***       0.119   0.011***       0.089   0.010***       0.126   0.013***
Female             −0.076   0.007***      −0.047   0.005***      −0.053   0.005***      −0.029   0.005***      −0.074   0.006***
Semester
  Winter           −0.107   0.014***      −0.069   0.010***      −0.072   0.010***      −0.069   0.009***      −0.134   0.013***
  Spring           −0.138   0.008***      −0.070   0.006***      −0.066   0.006***      −0.082   0.005***      −0.128   0.007***
Setting
  Center            0.003   0.021         −0.018   0.015         −0.019   0.015         −0.015   0.014         −0.023   0.019
  Family           −0.137   0.055**       −0.044   0.039         −0.016   0.040          0.036   0.041         −0.031   0.055
  Head Start       −0.036   0.018**       −0.032   0.013**        0.000   0.013         −0.013   0.013         −0.017   0.017
  Other            −0.022   0.026         −0.003   0.018         −0.011   0.019         −0.024   0.018          0.065   0.024**
  State school      0.017   0.015          0.015   0.011          0.037   0.011          0.038   0.010***       0.019   0.014
Intercept           0.966                  0.601                  0.806                  0.780                  0.938
Random part
  Teacher           0.343                  0.248                  0.268                  0.293                  0.392
  Site              0.217                  0.130                  0.142                  0.178                  0.249
  Error             0.878                  0.641                  0.643                  0.579                  0.787
N                  72,795                 72,795                 72,795                 72,795                 72,795
Note. *p < 0.05, **p < 0.01, ***p < 0.001
3.2.6 Racial/ethnic differences
The regression coefficients for the three racial/ethnic categories, which underpin the analysis in Fig. 2, indicate negligible differences in person fit associated with race/ethnicity. Specifically, differences between Black children (the reference group) and White children were not statistically significant across any DRDP domains. Although statistically significant differences were observed between Latino/a and Black children for three of the five domains, these differences were minor. The mean absolute discrepancy between these groups was 0.024 points, translating to an ES of 0.043, which we interpret as negligible. The largest effect was observed for the ATLREG domain, where Latino/a children were estimated to have 0.06 points lower weighted MS scores than Black children (p < 0.001), corresponding to an ES of 0.11—a small but potentially meaningful effect. However, this result is likely an outlier, given that it was 54% larger in magnitude than the next largest coefficient and was associated with the largest standard errors across domains. Overall, these analyses do not provide evidence of meaningful racial/ethnic bias in person fit.
3.2.7 Other child variables
Age was statistically significant in all models (p < 0.001). The small negative coefficient suggests that older preschoolers were rated slightly more consistently than younger ones, all else being equal. Special education status (i.e., IEP) also emerged as a significant predictor (p < 0.001) with a positive coefficient across all models, ranging from 0.09 to 0.14. This difference, averaging 0.1 points, translates to an ES of 0.18—a relatively small but practically significant finding. Thus, children in special education exhibited somewhat less within-domain consistency than their peers.
Gender differences also reached statistical significance, with females rated more consistently than males (p < 0.001). The effect ranged from 0.03 to 0.06 points, a modest difference that suggests females may exhibit slightly more consistent development or may be rated more consistently for other reasons.
3.2.8 Semester differences
Semester was a significant predictor across all five models, with person fit consistently higher in the fall compared to winter and spring. No difference was detected between the latter two semesters. The effect size, averaging 0.17, indicates a small but meaningful decrease in person fit after the fall. This systematic pattern, observed across all five domains, may reflect changes in child behavior, such as increased intra-individual consistency, or teacher behavior, such as more consistent ratings of children over time. This finding introduces a new aspect to the discussion of rater effects in EL-KEAs that is explored further in the discussion.
3.2.9 Setting differences
Most contrasts between childcare settings were not statistically significant. However, state schools tended to exhibit the highest person fit scores, whereas family care settings exhibited the lowest. Post hoc contrasts confirmed that some of these differences were significant at p < 0.05 or better. On average, the difference between state schools and family care across DRDP domains was 0.05 points (ES = 0.09), which is small and likely negligible. However, the ATLREG domain displayed a notably larger difference of 0.173 points (ES = 0.31). While potentially important, this finding should be interpreted cautiously due to the relatively large standard errors for ATLREG and the absence of consistent trends across domains.
3.2.10 Random effects
The bottom of Table 7 displays the random effects variance at each level of the model. As expected, the teacher-level variance exceeds the site-level variance, and the error variance eclipses both. Interested readers may refer to Research Supplement 2 to observe the small reduction in error variance afforded by more complex models, consistent with the small variance explained by the fixed effects discussed above.
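As a quick arithmetic illustration using the ATLREG column of Table 7 (values copied from the table; note that these conditional variance shares from the full regression model need not match the intraclass correlations reported in Section 4.3, which may derive from different model specifications):

```python
# Share of person fit variance at each level, ATLREG model (Table 7, random part).
teacher, site, error = 0.343, 0.217, 0.878
total = teacher + site + error  # 1.438
for name, var in [("teacher", teacher), ("site", site), ("error", error)]:
    print(f"{name}: {var / total:.1%}")
# teacher: 23.9%, site: 15.1%, error: 61.1% -> teacher > site, error dominates
```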
4 Discussion
4.1 Summary of key findings
This study addressed ongoing calls from the ECE field to investigate the cultural sensitivity of EL-KEAs and related concerns in the assessment-based LP/LT literature. Specifically, we examined whether the DRDP’s learning progressions are less applicable for children identified as Latino/a or Black, compared to those identified as White. Using person fit analyses, we evaluated the consistency of teachers’ ratings across the five essential domains of early learning and development measured by the DRDP assessment. A central hypothesis was that greater positive misfit for Latino/a and/or Black children compared to White children would support the concerns about racial/ethnic bias raised by the ECE community (i.e., First 5, 2022) and in the LP/LT literature.
However, our findings revealed no evidence of such disparities. Children from all three racial/ethnic groups demonstrated comparable consistency in their teachers’ ratings, contradicting the bias hypothesis. This outcome aligns with the DRDP’s emphasis on fair and equitable assessment practices (DCRG, 2018). The credibility of our results is bolstered by established theories (e.g., Custers et al., 2000; Lamprianou & Boyle, 2004; Van der Flier, 1983) and prior literature advocating for person fit analyses to detect potential racial/ethnic bias (e.g., Petridou & Williams, 2007; Şengül Avşar & Emons, 2021). The current study adds evidence supporting the fairness of the DRDP in capturing developmental progress across diverse populations.
Several additional findings warrant discussion. Children in special education exhibited higher positive misfit (i.e., less consistent ratings) than their peers in general education. This result aligns with expectations given the documented intra-individual variability in special education populations (Flanagan & McDonough, 2022), which can contribute to person misfit (Engelhard, 2009). If replicated, these findings could have important implications for both assessment and instruction. For example, if higher misfit in special education stems from flawed measurement processes, this points to the need for instrument revisions or additional training for special education teachers. Conversely, if these findings reflect genuine intra-individual variability, they underscore the importance of early formative assessment and differentiated instruction tailored to the unique developmental profiles of children in special education.
Additionally, person misfit was found to decrease after the fall semester, a novel and consistent finding observed across all DRDP domains. This decrease, although relatively small, was comparable in magnitude to the observed difference between general and special education groups, indicating practical significance. Several plausible explanations merit consideration: children's development could become more consistent due to shared environments and experiences (i.e., curriculum and instruction), or teachers could become more comfortable with the assessment process after the fall; both factors could contribute to changes in person fit. This finding supports the need for a longitudinal perspective on rater behavior within EL-KEA contexts, which we discuss further below.
Two additional findings are noteworthy. First, children identified as female were rated with slightly greater consistency than males. This result aligns with Rudner et al.’s (1995) findings in the NAEP assessment and Cui and Mousavi’s (2015) result from a ninth-grade large-scale mathematics achievement test but contrasts with Petridou and Williams’ (2007) study, which found no gender effects in primary mathematics assessments. Although the difference in consistency by gender is small and unlikely to have significant implications, it adds a new perspective to existing evidence that females are often assessed more favorably in early childhood contexts (e.g., Magnuson et al., 2016). Second, differences in person fit across educational settings were generally minimal, with a potential exception: person fit appeared lowest in family care settings and highest in state schools. This disparity could reflect unmeasured differences in child populations (children are not randomly assigned to settings) and variations in rater training. Anecdotal reports suggest that state school raters often receive more comprehensive DRDP training than those in family care, which may explain raters’ greater ability to perceive nuanced intrapersonal differences and thus higher person fit scores. Additional study is needed, but these findings highlight the value of person fit as a lens to study and understand raters, which has been underutilized to date.
4.2 Implications for LP-based assessment with children from CLD families
This study contributes new insights into the cultural sensitivity of assessment-based LPs for children from CLD backgrounds. Our analyses found no evidence that these children exhibited less consistent developmental trajectories than their peers when evaluated using a common measurement model. Whereas the current results should not be overgeneralized to suggest that racial/ethnic differences in developmental profiles are negligible or that the DRDP is entirely free from bias, they directly address specific concerns raised by ECE practitioners (First 5, 2022) and within the LP/LT literature (Harris et al., 2022). In this regard, our findings offer reassurance regarding the DRDP’s cultural sensitivity.
Additionally, this study provides a framework for future investigations into potential bias. Claims of cultural insensitivity in operational assessments must be substantiated with empirical evidence to identify where improvements are most needed, supporting the common goal of equitable assessment. For example, Wu et al. (2021) documented distinct learning trajectories among international student populations, and research on multilingual learners highlights divergent developmental patterns (Castro et al., 2021; Yow & Markman, 2011). These frameworks, however, remain underexplored in the context of EL-KEAs or LP-based assessments for early childhood populations.
It is also important to note that valid LP-based assessments do not require all children to follow identical developmental paths or progress at the same rate. Critical arguments pointing out that LPs do not accommodate all children’s developmental trajectories often overlook the probabilistic nature of statistical methods used in LP-based assessments, which account for individual differences. Future critiques should focus on the extent to which measurement models are effective in accommodating such variability, rather than on simplistic notions that LPs impose fixed or “orderly” developmental trajectories (e.g., Empson, 2011).
4.3 Implications for assessment with EL-KEAs: practice and research
The use of person fit metrics in EL-KEAs holds considerable promise for both assessment practice and future research. First, person fit can serve as a tool for identifying atypical or potentially invalid assessments. By flagging scores with high positive misfit, educators can conduct follow-up evaluations to uncover unique strengths or weaknesses in a child’s developmental profile, supporting more targeted and effective instructional planning and family engagement. In summative contexts, flagging misfitting scores can also enhance the reliability of assessments by prompting reflection on ratings and ensuring that accurate scores are used for decision-making.
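As an illustration of such flagging, the sketch below applies a single cutoff to a child's domain-level weighted MS scores. The cutoff of 1.33 and the score values are hypothetical; operational thresholds would need to be set against guidance such as Wright and Linacre (1994) and local norms.

```python
def flag_misfit(weighted_ms: dict[str, float], cutoff: float = 1.33) -> list[str]:
    """Return the DRDP domains where a child's weighted MS (person fit)
    shows high positive misfit, i.e., exceeds the chosen cutoff."""
    return [domain for domain, ms in weighted_ms.items() if ms > cutoff]

# Hypothetical child: consistent ratings in three domains, misfit in two.
child = {"ATLREG": 1.52, "SED": 0.97, "LLD": 1.05, "COG": 1.41, "PDHLTH": 0.88}
print(flag_misfit(child))  # ['ATLREG', 'COG'] -> candidates for follow-up review
```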
Second, person fit metrics can be a valuable tool for identifying teachers who may benefit from additional support or training. Teachers with extreme person misfit scores could be targeted for further professional development, such as tailored coaching or interviews, to improve their assessment accuracy and consistency. This logic applies to instructional and professional development contexts, where person fit offers practical insights that can inform direct action.
Finally, person fit provides a novel lens for examining rater effects in observational assessment. Rater effects research is strikingly underdeveloped (Carbonneau et al., 2020; Cash et al., 2012; Ready & Wright, 2011) and has not kept pace with the widespread implementation of EL-KEAs or the expansion of observational assessment to new developmental constructs (e.g., Garcia et al., 2019). Person fit metrics provide a rigorous way to examine the behavior and accuracy of raters independent of children's achievement or developmental levels, which could be a valuable way to avoid deficit-oriented interpretations of assessment data. Interestingly, our person fit results showed intraclass correlation coefficients between 12.4% and 18.9%, whereas rater effects for scaled scores tend to fall in the 30% to 50% range (e.g., Lambert et al., 2015; Mashburn & Henry, 2005; Sussman et al., 2023). Person fit thus appears to be less influenced by rater and context factors than scaled scores are. Comparing person fit metrics with scaled scores and other external variables could shed light on the nature of rater effects and deepen our understanding of their impact on assessment outcomes.
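Here the intraclass correlation is the usual variance-ratio form (a standard definition, stated for clarity rather than taken from the original):

$$\mathrm{ICC}_{\text{teacher}} = \frac{\sigma^2_{\text{teacher}}}{\sigma^2_{\text{teacher}} + \sigma^2_{\text{site}} + \sigma^2_{\text{error}}}$$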
An anonymous reviewer posed an important question: could what is often attributed to rater effects or bias instead stem from the diverse ways EL-KEAs are implemented and perceived in practice? For instance, consider Teacher A, who employs a simplified heuristic to streamline assessments, compared to Teacher B, who receives coaching support and views assessment as a critical tool for instructional planning. In this scenario, Teacher A’s scoring variability might be more constrained, whereas Teacher B’s scores might reflect a broader and more nuanced interpretation of student performance. This distinction underscores the need for assessment validation efforts to systematically investigate rater variance, addressing the specific factors that explain such variance. This could include variability in professional development, differences in instructional goals, and other contextual factors (Waterman et al., 2012). Addressing these complexities is crucial for refining the validity and utility of EL-KEAs, particularly in diverse educational settings.
In sum, increasing the emphasis on person fit within EL-KEA systems holds great potential for practice and research. In practice, its utility should be evaluated through pilot testing, where administrative, interpretive, and political challenges can be better understood and the influence of the metrics more thoroughly assessed. For research, person fit metrics may help address longstanding questions about rater reliability in early childhood assessment; person fit methodology aligns with established empirical frameworks for studying rater effects (Raudenbush et al., 2008). Understanding the root causes of rater effects, including how they influence person fit, could inform the design of more effective rater training programs to mitigate them (Hoyt & Kerns, 1999).
4.4 Limitations
This study, despite its methodological rigor, focused on a single aspect of potential assessment bias. Statistical methods like person fit analysis can generate evidence against specific biases, but they cannot conclusively establish the absence of all forms of bias. Accordingly, although we found no evidence of racial/ethnic bias in the DRDP, this finding does not definitively prove the instrument's fairness, and potential biases related to race/ethnicity cannot be entirely ruled out on the basis of this analysis alone.
Our findings suggest that rater effects were the largest source of variance, but the reasons behind these effects remain unclear. Interpreting person fit in observational assessments is inherently complex because (a) variance may originate from both examinees and raters, and (b) person fit results do not explain the reasons for misfit (Rupp, 2013). For example, large positive misfit can result from a careless observer rating an otherwise typical child inaccurately, or from a careful observer who recognizes a child's uncommon pattern of intrapersonal strengths and weaknesses. Further research focusing on the reasons behind misfit is necessary to address these unresolved questions.
Additionally, two specific limitations warrant mention. First, certain key variables, such as family socioeconomic status (SES), were not measured. Although the income-censored nature of the public preschool sample mitigates this concern, and SES has not, to our knowledge, been linked to person fit, it remains a potential confound. Second, this study focused exclusively on the weighted mean square (MS) statistic for person fit, which is widely regarded as the best option for polytomous data. Although we tested alternative metrics (see Supplement 2) and found negligible differences between methods, a comprehensive analysis using a broader range of person fit indicators could yield additional insights, particularly local fit metrics, which might generate more diagnostic and therefore more actionable information.
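For concreteness, the weighted MS (infit) statistic referenced throughout can be sketched for the simplest dichotomous Rasch case. The DRDP itself is polytomous, where expected scores and variances come from the partial credit model instead (Masters, 1982), so this is a simplified illustration rather than the study's implementation.

```python
import numpy as np

def person_infit(x: np.ndarray, theta: float, b: np.ndarray) -> float:
    """Weighted mean square (infit) for one person under a dichotomous Rasch
    model: squared score residuals summed over items, divided by the summed
    response variances. Values near 1 indicate good fit; values well above 1
    indicate positive misfit (unexpectedly inconsistent responses)."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))  # expected score on each item
    w = p * (1.0 - p)                       # variance of each response
    return float(np.sum((x - p) ** 2) / np.sum(w))

# A person of average ability who fails the easy items but passes the hard
# ones: an aberrant pattern that yields infit well above 1.
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # item difficulties
x = np.array([0, 0, 1, 1, 1], dtype=float)  # observed responses
print(round(person_infit(x, theta=0.0, b=b), 2))  # about 3.36
```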
5 Conclusion
This study addressed concerns regarding the cultural sensitivity of learning progressions (LPs) in early childhood education (ECE) by examining the applicability of the Desired Results Developmental Profile (DRDP) for children identified as Latino/a, Black, and White. Our analysis revealed no racial or ethnic differences in person fit, suggesting that the DRDP’s models of learning and development are equally relevant for children from culturally and linguistically diverse backgrounds. These findings directly address concerns voiced by the ECE community about potential racial/ethnic bias in LP-based assessments, contributing to the growing body of evidence supporting the validity and fairness of early learning and developmental assessments for diverse populations.
Additionally, the study highlighted important challenges in assessing special education populations, who exhibited higher misfit, likely reflecting intra-individual variability. It also revealed that assessments early in the year tended to show less consistency, underscoring the potential influence of rater familiarity and the need for further study.
Furthermore, the use of person fit metrics in EL-KEAs presents significant potential for enhancing both assessment practice and future research. These metrics may help identify misfit in individual assessments, guide targeted instructional planning, and inform professional development to reduce rater variability and improve assessment accuracy.
Overall, the findings suggest that person fit metrics offer a promising avenue for refining observational assessment in early childhood education, making assessments more reliable and relevant for diverse populations. Future research should explore how these metrics can be further integrated into practice and pilot-tested within EL-KEAs to optimize their impact on assessment validity, instructional quality, and teacher development.
Declarations
Conflict of interest
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Ackerman, D. J. (2018). Real world compromises: Policy and practice impacts of kindergarten entry assessment-related validity and reliability challenges (Research Report No. RR-18–13). Educational Testing Service. https://doi.org/10.1002/ets2.12201
Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement,21(1), 1–23.
Aguinis, H., Culpepper, S. A., & Pierce, C. A. (2016). Differential prediction generalization in college admissions testing. Journal of Educational Psychology,108(7), 1045–1059. https://doi.org/10.1037/edu0000104
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Barghaus, K. M., Dahlke, K., Fantuzzo, J. W., Howard, E. C., Tucker, N., Weinberg, E., Liu, F., Brumley, B., Williams, R., & Flanagan, K. (2023). Validation of the Pennsylvania kindergarten entry inventory: Examining neglected validities in large-scale, teacher-report assessment. Early Education and Development,34(4), 940–962. https://doi.org/10.1080/10409289.2022.2076049
Carbonneau, K. J., Van Orman, D. S. J., Lemberger-Truelove, M. E., & Atencio, D. J. (2020). Leveraging the power of observations: Locating the sources of error in the individualized classroom assessment scoring system. Early Education and Development,31(1), 84–99. https://doi.org/10.1080/10409289.2019.1617572
Cash, A. H., Hamre, B. K., Pianta, R. C., & Myers, S. S. (2012). Rater calibration when observational assessment occurs at large scale: Degree of calibration and characteristics of raters associated with calibration. Early Childhood Research Quarterly,27(3), 529–542. https://doi.org/10.1016/j.ecresq.2011.12.006
Castro, D. C., Gillanders, C., Prishker, N., & Rodriguez, R. (2021). A sociocultural, integrative, and interdisciplinary perspective on the development and education of young bilingual children with disabilities. In D. Castro & A. J. Artiles (Eds.), Language, learning, and disability in the education of young bilingual children (pp. 27–45). Multilingual Matters.
Chen-Gaddini, M., Sussman, J., Newton, E., Ruiz Jimenez, G. S., Kriener-Althen, K., Gochyyev, P., Draney, K., & Mangione, P. (2022a). DRDP technical report: Interrater reliability. WestEd.
Chen-Gaddini, M., Sussman, J., Newton, E., Ruiz Jimenez, G. S., Kriener-Althen, K., Gochyyev, P., Draney, K., & Mangione, P. (2022b). DRDP technical report: Validity in relation to external assessments of child development. WestEd.
Custers, J. W. H., Hoijtink, H., van der Net, J., & Helders, P. J. M. (2000). Cultural differences in functional status measurement: Analyses of person fit according to the Rasch model. Quality of Life Research,9(5), 571–578. https://doi.org/10.1023/A:1008949108089
de Ayala, R. J. (2022). The theory and practice of item response theory (2nd ed.). Guilford Publications.
Draney, K., Sussman, J., Kriener-Althen, K., Newton, E. K., Gochyyev, P., & Mangione, P. (2021). DRDP technical report for early infancy through kindergarten: Structural validity and reliability. California Department of Education.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology,38(1), 67–86. https://doi.org/10.1111/j.2044-8317.1985.tb00817.x
Duncan, R. G., & Rivet, A. E. (2018). Learning progressions. In F. Fischer, C. Hmelo-Silver, & S. R. Goldman (Eds.), International handbook of the learning sciences (pp. 422–432). Routledge.
Emons, W. H. M. (2009). Detection and diagnosis of person misfit from patterns of summed polytomous item scores. Applied Psychological Measurement,33(8), 599–619. https://doi.org/10.1177/0146621609334378
Empson, S. (2011). On the idea of learning trajectories: Promises and pitfalls. The Mathematics Enthusiast,8(3), 571–598.
Engelhard, G. (2009). Using item response theory and model–data fit to conceptualize differential item and person functioning for students with disabilities. Educational and Psychological Measurement,69(4), 585–602. https://doi.org/10.1177/0013164408323240
Ferrando, P. J. (2015). Assessing person fit in typical-response measures. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 128–155). Routledge.
Flanagan, D. P., & McDonough, E. M. (2022). Contemporary intellectual assessment: Theories, tests, and issues (4th ed.). Guilford Publications.
Flavell, J. H. (1994). Cognitive development: Past, present, and future. In R. D. Parke, P. A. Ornstein, J. J. Rieser, & C. Zahn-Waxler (Eds.), A century of developmental psychology (pp. 569–588). American Psychological Association.
Garcia, E. B., Sulik, M. J., & Obradović, J. (2019). Teachers’ perceptions of students’ executive functions: Disparities by gender, ethnicity, and ELL status. Journal of Educational Psychology,111(5), 918–931. https://doi.org/10.1037/edu0000308
Goldstein, J., & Flake, J. K. (2016). Towards a framework for the validation of early childhood assessment systems. Educational Assessment, Evaluation and Accountability,28(3), 273–293. https://doi.org/10.1007/s11092-015-9231-8
Gotwals, A. W., & Songer, N. B. (2010). Reasoning up and down a food chain: Using an assessment framework to investigate students’ middle knowledge. Science Education,94, 259–281.
Hambleton, R. K., Merenda, P. F., & Spielberger, C. D. (Eds.). (2004). Adapting educational and psychological tests for cross-cultural assessment. Psychology Press.
Harris, L. R., Adie, L., & Wyatt-Smith, C. (2022). Learning progression–based assessments: A systematic review of student and teacher uses. Review of Educational Research. https://doi.org/10.3102/00346543221081552
Harvey, H., & Ohle, K. (2018). What’s the purpose? Educators’ perceptions and use of a state-mandated kindergarten entry assessment. Education Policy Analysis Archives,26, 142–142. https://doi.org/10.14507/epaa.26.3877
Heritage, M. (2008). Learning progressions: Supporting instruction and formative assessment. Council of Chief School Officers.
Herman, J. L., Webb, N. M., & Zuniga, S. A. (2007). Measurement issues in the alignment of standards and assessments. Applied Measurement in Education,20(1), 101–126.
Heroman, C., Burts, D. C., Berke, K., & Bickart, T. S. (2010). Teaching strategies GOLD objectives for development & learning: Birth through kindergarten. Teaching Strategies.
Heubert, J. P., & Hauser, R. M. (1999). High stakes: Testing for tracking, promotion, and graduation. National Academy Press. https://eric.ed.gov/?id=ED439151. Accessed 31 Mar 2025.
HighScope. (2014). COR advantage. HighScope Educational Research Foundation.
Holcomb, T. S., Li, Z., Lambert, R., & Ferrara, A. (2024). Educator perspectives on a kindergarten entry assessment: Implementation experiences, support, and data utilization. Perspectives on Early Childhood Psychology and Education,8(1), 76–110. https://doi.org/10.58948/2834-8257.1055
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Psychology Press.
Hoyt, W. T., & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods,4, 403–424.
Isaacs, J. B. (2012). The school readiness of poor children. Brookings Institute.
Joseph, G., Soderberg, J. S., Stull, S., Cummings, K., McCutchen, D., & Han, R. J. (2020). Inter-rater reliability of Washington state’s kindergarten entry assessment. Early Education and Development,31(5), 764–777. https://doi.org/10.1080/10409289.2019.1674589
Kang, H., & Furtak, E. M. (2021). Learning theory, classroom assessment, and equity. Educational Measurement: Issues and Practice,40(3), 73–82. https://doi.org/10.1111/emip.12423
Kim, D.-H., Lambert, R. G., Durham, S., & Burts, D. C. (2018). Examining the validity of GOLD® with 4-year-old dual language learners. Early Education and Development,29(4), 477–493. https://doi.org/10.1080/10409289.2018.1460125
Kowalski, K., Brown, R. D., Pretti-Frontczak, K., Uchida, C., & Sacks, D. F. (2018). The accuracy of teachers’ judgments for assessing young children’s emerging literacy and math skills. Psychology in the Schools,55(9), 997–1012. https://doi.org/10.1002/pits.22152
Kriener-Althen, K., Newton, E. K., Draney, K., & Mangione, P. L. (2020). Measuring readiness for kindergarten using the desired results developmental profile. Early Education and Development,31(5), 739–763.
Kubsch, M., Czinczel, B., Lossjew, J., Wyrwich, T., Bednorz, D., Bernholt, S., Fiedler, D., Straub, S., Cress, U., Drachsler, H., Neumann, K., & Rummel, N. (2022). Toward learning progression analytics—developing learning environments for the automated analysis of learning using evidence centered design. Frontiers in Education,7. https://doi.org/10.3389/feduc.2022.981910
Lamprianou, I., & Boyle, B. (2004). Accuracy of measurement in the context of mathematics national curriculum tests in England for ethnic minority pupils and pupils who speak English as an additional language. Journal of Educational Measurement,41(3), 239–259. https://doi.org/10.1111/j.1745-3984.2004.tb01164.x
Little, M., Cohen-Vogel, L., Sadler, J., & Merrill, B. (2020). Moving kindergarten entry assessments from policy to practice: Evidence from North Carolina. Early Education and Development,31(5), 796–815. https://doi.org/10.1080/10409289.2020.1724600
Magnuson, K. A., Kelchen, R., Duncan, G. J., Schindler, H. S., Shager, H., & Yoshikawa, H. (2016). Do the effects of early childhood education programs differ by gender? A meta-analysis. Early Childhood Research Quarterly,36, 521–536. https://doi.org/10.1016/j.ecresq.2015.12.021
Mangione, P. L., Osborne, T., Mendenhall, H. (2019). What’s next? How learning progressions help teachers support children’s development and learning. Young Children, 74(3), 20–25. https://www.proquest.com/docview/2250997981. Accessed 31 Mar 2025.
Martone, A., & Sireci, S. G. (2009). Evaluating alignment between curriculum, assessment, and instruction. Review of Educational Research,79(4), 1332–1361.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika,47, 149–174.
Matsumoto, D., & Van de Vijver, F. J. (2012). Cross-cultural research methods. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology, Vol. 1. Foundations, planning, measures, and psychometrics (pp. 85–100). American Psychological Association. https://doi.org/10.1037/13619-006
Meisels, S. J., & Piker, R. A. (2001). An analysis of early literacy assessments used for instruction. CIERA Report. CIERA/University of Michigan. https://eric.ed.gov/?id=ED452514. Accessed 31 Mar 2025.
Mousavi, A., & Cui, Y. (2020). The effect of person misfit on item parameter estimation and classification accuracy: A simulation study. Education Sciences,10, 1–15. https://doi.org/10.3390/educsci10110324
National Education Goals Panel. (1995). The national education goals report: Building a nation of learners, 1995. US Government Printing Office. https://files.eric.ed.gov/fulltext/ED389097.pdf. Accessed 31 Mar 2025.
National Research Council [NRC]. (2001). Eager to learn: Educating our preschoolers. The National Academies Press.
National Research Council [NRC]. (2008). Early childhood assessment: Why, what, and how. The National Academies Press. https://doi.org/10.17226/12446
Paek, I., & Wilson, M. (2011). Formulating the Rasch differential item functioning model under the marginal maximum likelihood estimation context and its comparison with Mantel-Haenszel procedure in short test and small sample conditions. Educational and Psychological Measurement,71(6), 1023–1046. https://doi.org/10.1177/0013164411400734
Petridou, A., & Williams, J. (2007). Accounting for aberrant test response patterns using multilevel models. Journal of Educational Measurement,44(3), 227–247.
Piaget, J. (1941/1965). The child’s conception of number. W. W. Norton and Company.
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Raudenbush, S. W., Martinez, A., Bloom, H., Zhu, P., & Lin, F. (2008). An eight-step paradigm for studying the reliability of group-level measures. Working paper, University of Chicago.
Ready, D. D., & Wright, D. L. (2011). Accuracy and inaccuracy in teachers’ perceptions of young children’s cognitive abilities: The role of child background and classroom context. American Educational Research Journal,48(2), 335–360. https://doi.org/10.3102/0002831210374874CrossRef
Roth, P. L., Le, H., Oh, I.-S., Van Iddekinge, C. H., Buster, M. A., Robbins, S. B., & Campion, M. A. (2014). Differential validity for cognitive ability tests in employment and educational settings: Not much more than range restriction? Journal of Applied Psychology,99(1), 1–20. https://doi.org/10.1037/a0034377
Rudner, L. M., & National Center for Education Statistics (Eds.). (1995). Use of person-fit statistics in reporting and analyzing national assessment of educational progress results. U.S. Dept. of Education, Office of Educational Research and Improvement.
Rupp, A. A. (2013). A systematic review of the methodology for person fit research in item response theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling,55(1), 3–38.
Russo, J. M., Williford, A. P., Markowitz, A. J., Vitiello, V. E., & Bassok, D. (2019). Examining the validity of a widely-used school readiness assessment: Implications for teachers and early childhood programs. Early Childhood Research Quarterly,48, 14–25. https://doi.org/10.1016/j.ecresq.2019.02.003
Schafer, W. D., Wang, J., & Wang, V. (2009). Validity in action: State assessment validity evidence for compliance with NCLB. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions and applications (pp. 195–212). Information Age Publishing.
Şengül Avşar, A., & Emons, W. H. M. (2021). A cross-cultural comparison of non-cognitive outputs towards science between Turkish and Dutch students taking into account detected person misfit. Studies in Educational Evaluation,70, 101053. https://doi.org/10.1016/j.stueduc.2021.101053
Şengül Avşar, A. (2019). Comparison of person-fit statistics for polytomous items in different test conditions. Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi, 10(4), 348–364. https://doi.org/10.21031/epod.525647
Smith, R. M., & Plackner, C. (2009). The family approach to assessing fit in Rasch measurement. Journal of Applied Measurement,10(4), 424–437.
Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology,8(1), 1–11. https://doi.org/10.1186/1471-2288-8-33
Steedle, J. T., & Shavelson, R. J. (2009). Supporting valid interpretations of learning progression level diagnoses. Journal of Research in Science Teaching,46(6), 699–715. https://doi.org/10.1002/tea.20308
Sussman, J., Draney, K., & Wilson, M. (2023). Language and literacy trajectories for dual language learners (DLLs) with different home languages: Linguistic distance and implications for practice. Journal of Educational Psychology,115(6), 891–910.
Turner, K. T., & Engelhard Jr., G. (2023). Functional data analysis and person response functions. Measurement: Interdisciplinary Research and Perspectives, 21(3), 129–146. https://doi.org/10.1080/15366367.2022.2054130
Van den Noortgate, W., & De Boeck, P. (2005). Assessing and explaining differential item functioning using logistic mixed models. Journal of Educational and Behavioral Statistics,30(4), 443–464. https://doi.org/10.3102/10769986030004443
Van der Flier, H. (1983). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology,13(3), 267–298.
Vygotsky, L. S. (1978). Mind in society. Harvard University Press.
Wakabayashi, T., Claxton, J., & Smith, E. V., Jr. (2019). Validation of a revised observation-based assessment tool for children birth through kindergarten: The COR advantage. Journal of Psychoeducational Assessment,37(1), 69–90.
Walker, A. A., Jennings, J. K., & Engelhard, G., Jr. (2018). Using person response functions to investigate areas of person misfit related to item characteristics. Educational Assessment,23(1), 47–68. https://doi.org/10.1080/10627197.2017.1415143
Waterman, C., McDermott, P. A., Fantuzzo, J. W., & Gadsden, V. L. (2012). The matter of assessor variance in early childhood education—Or whose score is it anyway? Early Childhood Research Quarterly,27(1), 46–54. https://doi.org/10.1016/j.ecresq.2011.06.003
Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching,46(6), 716–730. https://doi.org/10.1002/tea.20318
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. MESA Press.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions,8(3), 370.
Wu, M., & Adams, R. J. (2013). Properties of Rasch residual fit statistics. Journal of Applied Measurement,14(4), 339–355.
Wu, X., Zhang, Y., Wu, R., & Chang, H.-H. (2021). A comparative study on cognitive diagnostic assessment of mathematical key competencies and learning trajectories: PISA data analysis based on 19,454 students from 8 countries. Current Psychology. https://doi.org/10.1007/s12144-020-01230-0
Yun, C., Melnick, H., & Wechsler, M. (2021). High-quality early childhood assessment: Learning from states’ use of kindergarten entry assessments. Learning Policy Institute.