
Cultural sensitivity of early childhood assessments based on learning progressions: a Rasch person fit analysis

  • Open Access
  • 14.04.2025


Abstract

This article examines the cultural sensitivity of early childhood assessments, particularly those based on learning progressions, which are in widespread use in the United States. It addresses the widespread speculation that these assessments, known as early learning and kindergarten entry assessments (EL-KEAs), may disadvantage children from culturally and linguistically diverse (CLD) families. The study focuses on the Desired Results Developmental Profile (DRDP), a prominent EL-KEA, and employs Rasch person fit analyses to examine the cultural sensitivity of this assessment. The research hypothesizes that children from CLD families could, on average, show higher rates of misfit, which would indicate potential bias. The article also discusses the limitations of current validation methods, such as differential item functioning (DIF), in detecting this type of bias. It provides a detailed overview of the assessment technology used in EL-KEAs, including learning progressions (LPs) and their role in measuring children's learning and development. The study underscores the need for urgent attention to cultural sensitivity in early childhood assessment, given the widespread use of EL-KEAs with vulnerable populations of children. The article also considers the potential implications of the findings for assessment practice and research, suggesting that person fit metrics could be a valuable tool for identifying atypical or potentially invalid assessments, guiding targeted instructional planning, and informing professional development to reduce assessment variability and improve assessment accuracy.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/s11092-025-09453-0.
The original version of this paper was updated due to changes in the affiliations of authors Karen Draney and Mark Wilson, changes in the text body as stated in the correction article, and supplementary materials that should be published separately.
A correction to this article is available online at https://doi.org/10.1007/s11092-025-09462-z.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Observational early learning assessments, sometimes employed as kindergarten entry assessments (KEAs), are in widespread use in the United States, inside and outside of publicly funded early care and education (Ackerman, 2018; Goldstein & Flake, 2016). This category of assessments, which we label early learning and kindergarten entry assessments (EL-KEAs) hereafter, is designed to measure all children’s learning and development, including those from culturally and linguistically diverse (CLD) families and children with disabilities.
Currently, there is broad speculation that EL-KEAs could be biased against children from culturally and linguistically diverse families. A large early education policy coalition recently questioned whether EL-KEAs, which aim to measure children’s development along hypothetical developmental continua, disadvantage nonwhite children who often have a broader range of developmental experiences and trajectories (First 5 Center for Children’s Policy [First 5], 2022). Unfortunately, the coalition did not present an example of this bias or offer a well-developed hypothesis of how bias may occur. Indeed, we aim to fill this gap, in part, by operationalizing specific, testable hypotheses of potential assessment bias in this paper.
The EL-KEAs in question typically use an assessment technology called learning progressions (LPs; Harris et al., 2022; Mangione et al., 2019). LPs are descriptions of curricular sequences that delineate successive levels of progress in an area of learning or development. LPs are often associated with assessment rubrics that serve as the scoring model for the assessment items. Some investigators in the LPs literature, and the literature on learning trajectories, or LTs, as they are called in mathematics education, have posited that children with different sociocultural experiences may have different learning pathways that are not captured equally well by the LPs underlying the assessments (Harris et al., 2022). Unfortunately, such ideas are typically speculative, grounded in theory but lacking examples of bias in action. However, Kang and Furtak (2021) found that reframing LPs around sociocultural concepts led to important changes such as the consideration of different learning goals and different uses of the LPs in classrooms. Although not evidence of bias, Kang and Furtak’s findings showed how culture can influence LPs/LTs and their application.
Very little published research has examined the cross-cultural applicability of LPs within EL-KEAs. It is therefore difficult to evaluate their potential bias against children with diverse developmental experiences. We will call this cultural sensitivity and explore this definition throughout. As we explain below, common validation methods, such as differential item functioning (DIF), are probably insensitive to the bias in question. Given the paucity of research in this area and the widespread use of EL-KEAs with vulnerable populations of children, attention to cultural sensitivity deserves urgent focus.
In this paper, we used Rasch person fit analysis to examine the cultural sensitivity of one EL-KEA used widely in the United States, the Desired Results Developmental Profile, or DRDP, assessment (California Department of Education, 2015). Person fit analysis has been previously used to study the relationship between culture and assessment validity (e.g., Custers et al., 2000; Lamprianou & Boyle, 2004; Petridou & Williams, 2007; Şengül Avşar & Emons, 2021). We posited that the specific type of bias proposed by First 5 (2022) could produce higher average positive person misfit in children from CLD families.
The remainder of this introduction contains two main sections: First, we explain and contextualize the use of LPs within EL-KEAs. We connect early childhood LPs with the broader LP/LT literature and explain why current validation methods fall short. Second, we briefly describe person fit methods. We review relevant technical literature and articulate a hypothesis for the ways that cultural (in)sensitivity could manifest as person misfit.

1 Assessment-based LPs and EL-KEA systems

LPs play a crucial role in EL-KEAs, though research on them is limited. The three major EL-KEA systems—Desired Results Developmental Profile (DRDP), Teaching Strategies Gold (TS-GOLD; Heroman et al., 2010), and COR Advantage (HighScope, 2014)—incorporate LPs, but available documentation lacks depth. Mangione et al. (2019) provide a practitioner-focused overview of DRDP’s LPs, while the LPs in TS-GOLD and COR Advantage are primarily described in technical reports and marketing materials. These sources offer only surface-level explanations, leaving large gaps in reporting about their development and applications.

1.1 LPs in the DRDP

The LPs in EL-KEAs may be described as rubrics that operationalize children’s learning and development in an important sub-domain of functioning (Mangione et al., 2019). For example, Fig. 1 shows an LP for the DRDP. This LP contains a set of levels that describe what it means to achieve competence in communication and expressive language, in small and progressive steps. Each LP is an assessment item: The intention is that, during assessment, a teacher will use the LP as a rubric to assess a child’s learning and development. Note that the levels are ordered to reflect qualitatively distinct steps in growth and definitions focus on observable behaviors.
Fig. 1
A sample DRDP item rubric

1.1.1 Connecting early childhood LPs with the broader LP literature

Research from K-12 LP/LT literature can help us better understand early childhood LPs. Duncan and Rivet’s (2018) framework suggests five key characteristics of LPs: (1) grain size of developmental levels, (2) construct scope, (3) how the LPs were created, (4) the types of levels, and (5) what progresses and how it occurs. For early childhood LPs, grain size tends to be broad, covering multiple years rather than narrower age spans. This approach promotes continuity, allowing assessments to span different developmental levels. It may limit sensitivity to finer distinctions in progress, which can affect how frequently teachers detect and respond to developmental changes. EL-KEA scope usually encompasses the five essential domains of readiness originally described by the National Education Goals Panel [NEGP] (1995), and further operationalized by the Head Start Early Learning Outcomes Framework (Office of Head Start, 2015). The early childhood LPs are typically created as they are in the science education tradition: by multidisciplinary teams who develop developmental sequences based on literature reviews and task analyses, using iterative refinement processes to reach consensus on measurement, often with a focus on cultural and linguistic sensitivity. The levels of most early childhood LPs follow a developmental trajectory, typically informed by established theories that shed light on what progresses and how it occurs: Cognitive development changes from simpler to more complex forms of logical thought (Piaget, 1941/1965). Environmental and cultural factors work in concert to shape development (Sarama & Clements, 2009). Children develop through scaffolded support from adults that is calibrated to their current needs (Vygotsky, 1978), a process that can be guided through child assessment (Griffin, 2007). An early curriculum can be effective by revisiting foundational skills repeatedly, with increasing depth (Bruner, 1960).

1.1.2 LPs and authentic child assessment

LPs are frequently associated with uncertainties regarding their conceptualization and appropriate applications, both within KEAs (Harvey & Ohle, 2018), and more broadly in educational contexts (Kubsch et al., 2022). Despite this ambiguity, the field has adapted to accommodate these challenges, conceptualizing LPs as frameworks that organize and bring coherence to complex and dynamic developmental processes. For example, the LPs embedded within EL-KEAs guide teachers in making structured developmental observations aligned with construct progressions (as in Fig. 1). Teachers generally regard this structured approach as beneficial (Little et al., 2020), distinguishing EL-KEAs from earlier observational assessments that lacked explicit observation guidelines (i.e., Meisels & Piker, 2001; Ready & Wright, 2011).
It is essential to recognize that a child’s progress along an LP or LT is neither expected to be linear nor strictly unidimensional (Wilson, 2009). Individual learners’ paths often deviate from hypothetical trajectories, as demonstrated in prior research (Duschl et al., 2011; Gotwals & Songer, 2010; Steedle & Shavelson, 2009). Furthermore, children’s capacities frequently exhibit variability both within and across tasks over time (Flavell, 1994). Nonetheless, scientific consensus affirms that LPs/LTs serve as valuable organizing frameworks to understand typical learning patterns and individual differences (National Research Council [NRC], 2008).
The inherent variability in young children’s daily performances emphasizes the importance of authentic formative assessment. A single, direct summative assessment conducted in a novel or artificial setting is unlikely to reliably capture a valid representation of a child’s developmental progress (NRC, 2008; Shepard et al., 1998). Factors such as rapport, motivation, and familiarity with assessment materials significantly influence children’s performance. Flexible assessment environments, which account for these factors, better accommodate developmental and cultural differences, allowing children to demonstrate their true knowledge and skills (NRC, 2001).
The use of LPs is anchored in a theoretically robust framework that organizes complex learning patterns, tracks typical developmental trajectories, and addresses individual variability. These tools empower well-supported educators to guide instruction effectively without imposing rigid developmental sequences (Heritage, 2008). The following section examines the evidence supporting their accuracy and practical application.

1.2 Validity of EL-KEAs for children from CLD families

In educational measurement, the current consensus is that an assessment’s accuracy and utility for a purpose should be substantiated by a validity argument (Kane, 2013), which integrates theory and evidence in support of an assessment’s interpretation and use(s) (American Educational Research Association et al., 2014). Although EL-KEAs are used for a variety of purposes in practice (Goldstein & Flake, 2016), their primary stated purpose is often to inform instruction as a formative assessment tool. The underlying premise is that assessments provide feedback to teachers (Sadler, 1989), enabling them to design learning environments that help children progress along developmental trajectories (Black & Wiliam, 1998). However, implementing formative assessment effectively is challenging in practice. Teachers often require more intensive professional development and sustained support than they currently receive (Holcomb et al., 2024). Furthermore, experimental research is needed to substantiate claims that teacher feedback informed by EL-KEAs leads to more effective instructional practices and improved child outcomes.
There is a core set of validation needs that applies to EL-KEAs’ common purposes: Evidence based on test content, alignment with educational standards, test reliability, and DIF analysis are necessary for most uses. These categories of evidence are also typical for large-scale summative assessments for older students (Schafer et al., 2009). Next, we briefly summarize the existing validity evidence for three systems.

1.2.1 Test content

Evidence of test content is typically gathered during the test construction process, where the development team designs, evaluates, and revises the instrument, producing descriptions of content domains, developmental sequences, and justifications for design choices: For the DRDP, see Kriener-Althen et al. (2020) and WestEd (2018). These evaluations, informed by research literature, task analysis, and expert opinions, assessed the instruments’ content adequacy in terms of breadth, depth, and suitability of items, and have been documented in technical manuals, whitepapers, and peer-reviewed articles (Lambert et al., 2015; Wakabayashi et al., 2019).

1.2.2 Alignment

The goal of alignment is to ensure that assessment content is congruent with educational standards and learning goals (Bhola et al., 2003; Herman et al., 2007; Martone & Sireci, 2009), ultimately ensuring that items measure what children are expected to learn. The DRDP offers detailed documentation of its alignment with California’s early learning standards and school readiness concepts (Kriener-Althen et al., 2020). At minimum, other systems typically provide basic statements of alignment with developmental indicators or state guidelines (e.g., Wakabayashi et al., 2019).

1.2.3 Reliability

EL-KEAs tend to have extensive documentation of their psychometric reliability, showing strong internal consistency and person separation reliability coefficients (Draney et al., 2021; Lambert et al., 2015; Wakabayashi et al., 2019). While interrater reliability (IRR) is generally adequate (κ = 0.6 to 0.8; Kowalski et al., 2018; Chen-Gaddini et al., 2022a; Joseph et al., 2020), it remains a recognized challenge in observational assessments (Mashburn & Henry, 2005; Waterman et al., 2012) that requires rigorous, ongoing rater training (Cash et al., 2012).

1.2.4 External variables

EL-KEAs generally show strong convergent validity, with moderate correlations between EL-KEA scores and direct assessments of similar constructs, as well as evidence of increasing scores with age (Chen-Gaddini et al., 2022b; Kim et al., 2018; Russo et al., 2019; DCRG, 2018; Sussman et al., 2023). However, divergent validity tends to be weaker, as the expected low correlations between EL-KEAs and assessments of dissimilar constructs are often unclear, likely due to the global nature of early childhood development and imperfect construct alignment. The DRDP shows highly correlated latent variables across learning domains (around 0.9), somewhat higher than the correlations between academic domains observed in international assessments (Draney et al., 2021). The evidence suggests a need for further research to clarify these divergent validity issues (Chen-Gaddini et al., 2022b; Lambert et al., 2015; Wakabayashi et al., 2019).

1.2.5 Internal structure

Evidence for internal structure in EL-KEAs is robust, focusing on the alignment between theoretical and estimated models, typically using Rasch family models. Methods such as model fit, dimensionality analysis, and Wright Map analysis are commonly employed to assess internal structure, with all three EL-KEAs providing extensive documentation (DCRG, 2018; Lambert et al., 2015; Wakabayashi et al., 2019), and the DRDP leveraging its internal structure to represent children’s progress across multiple learning progressions with general construct levels (Sussman et al., 2023). Item invariance studies, including longitudinal DIF analyses, further support internal structure, with evidence suggesting that rater experience improves internal consistency (Lambert et al., 2015).

1.2.6 Fairness and consequences

Fairness and consequences of EL-KEAs have received limited attention, which is typical for large-scale assessments (Heubert & Hauser, 1999; Schafer et al., 2009), despite the potential for fairness issues in multicultural populations (Hambleton et al., 2004; Matsumoto & Van de Vijver, 2012). Although studies of DIF (Holland et al., 1993; Paek & Wilson, 2011; Van den Noortgate & De Boeck, 2005) have shown that DRDP (Draney et al., 2021) and TS-GOLD (Kim et al., 2018) function similarly across various demographic groups, this alone cannot justify fairness; additional research is needed, particularly regarding the consequences of using EL-KEAs for high-stakes decisions (Yun et al., 2021) and the impact of teacher professional development (Barghaus et al., 2023).

1.3 Person fit and cultural sensitivity

A core assumption of assessment validity is that assessments measure individuals from different groups with equal accuracy, typically evaluated through Rasch models (Bond et al., 2021; de Ayala, 2022) in EL-KEAs. Whereas DIF is a common method for assessing group fairness, it does not capture all aspects of cultural sensitivity, particularly in developmental assessments where DIF may be nonuniform (Millsap & Everson, 1993). Subgroup analysis and person fit analysis, which examine group-level measurement differences and individual-level fit, offer additional insights into fairness and cultural sensitivity (Aguinis et al., 2016; Meijer & Sijtsma, 2001; Roth et al., 2014). These methods are essential for identifying issues like implicit bias and ensuring valid cross-cultural comparisons (McQueen & Mendelovits, 2003; Schulz & Fraillon, 2011).

1.3.1 Person fit analysis

Person fit analysis is used to identify unusual or unexpected response patterns in individuals or groups, even in situations where the measurement model fits the overall data. It compares an individual’s response pattern to the expectations set by the model, based on person and item threshold locations, and highly improbable patterns are classified as misfitting. Misfit can be either negative (more consistent than expected) or positive (less consistent than expected). Excessive person misfit has been shown to potentially compromise the validity of an assessment by introducing biases in person or item parameters (Emons, 2009; Mousavi & Cui, 2020) or complicating the interpretation of individual scores (Walker, 2017). Person fit analysis has been applied to examine cultural and linguistic biases in large-scale assessments (Petridou & Williams, 2007; Şengül Avşar & Emons, 2021; Van der Flier, 1983). For example, Lamprianou and Boyle (2004) found that ethnic minority students who spoke English as a second language had higher levels of positive person misfit on a math test in England than their peers, suggesting potential bias in the assessment for this group.

1.3.2 Person fit statistics

Many different person fit statistics exist (Şengül Avşar, 2019), but only a handful are appropriate for use with the polytomous items common in EL-KEAs. The most widely applied statistics for such items are Wright and Masters’ (1982) parametric loss function statistics, called mean squares (MS), usually applied with the Rasch model. These statistics provide diagnostic insight into person fit by assessing the difference between an individual’s response pattern, modeled as a person response curve (PRC), and the modeled mean PRC (Ferrando, 2015; Reise, 2000; Turner & Engelhard, 2023; Walker et al., 2018). An individual PRC that is steeper than the mean indicates more consistency than expected (negative misfit), while a PRC that is flatter than the mean suggests less consistency (positive misfit). This is analogous to item misfit in Rasch modeling, where an item’s discrimination parameter is compared to the average across all items (Wu & Adams, 2013).
Unweighted MS, also called outfit, is the simple average of the squared standardized residuals and is often described as an “outlier sensitive” fit statistic that is more sensitive to unexpected ratings on items that are either very easy or very hard for a person. In contrast, weighted MS, also called infit, is an “information-weighted” statistic that is more sensitive to unexpected ratings on items that are roughly targeted to the person. Current literature recommends the use of weighted MS over the unweighted version, as it provides greater reliability (Müller, 2020). The present analysis focuses exclusively on weighted MS, but Supplement 1 provides a comparison of both weighted and unweighted MS results.
Equation 1 shows that, for person \(n\) responding to items indexed by \(i\), the weighted MS \({V}_{n}\) is the information-weighted mean of the squared residuals \({y}_{ni}\), with weights given by the information \({W}_{ni}\) (Wright & Masters, 1982). All other things being equal, \({W}_{ni}\) is greater for items located closer to person \(n\)'s latent ability on the measurement (logit) scale, so unexpected responses to well-targeted items contribute more to \({V}_{n}\).
$${V}_{n}=\sum_{i=1}^{L}{{W}_{ni} z}_{ni}^{2} / \sum_{i=1}^{L}{W}_{ni}= \sum_{i=1}^{L}{y}_{ni}^{2} / \sum_{i=1}^{L}{W}_{ni},$$
(1)
The expected value for MS statistics is 1.0. It is common to treat values above 1.3 as indicative of positive misfit, that is, unmodeled noise or other sources of variance that may degrade measurement (Wright & Linacre, 1994). In contrast, values below 0.7 are considered negative misfit, which is typically less concerning. Negative misfit may be associated with local dependence or redundant items, but it is unlikely to degrade measurement. The t statistics historically used with MS are sensitive to sample size and are not recommended with large samples.
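Equation 1 can be made concrete with a short sketch. The snippet below assumes a partial credit model with hypothetical step difficulties and a known person location; all item values and names are illustrative, not taken from the DRDP.

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Category probabilities for one partial-credit item.

    deltas are the step difficulties; categories run 0..len(deltas)."""
    cum = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    e = np.exp(cum - cum.max())  # subtract max for numerical stability
    return e / e.sum()

def weighted_ms(theta, responses, item_deltas):
    """Weighted mean square (infit): V_n = sum_i y_ni^2 / sum_i W_ni,
    with score residual y_ni = x_ni - E_ni and information W_ni."""
    num = den = 0.0
    for x, deltas in zip(responses, item_deltas):
        p = pcm_probs(theta, deltas)
        cats = np.arange(len(p))
        e_ni = (cats * p).sum()                  # expected rating E_ni
        w_ni = (((cats - e_ni) ** 2) * p).sum()  # score variance = information W_ni
        num += (x - e_ni) ** 2                   # squared residual y_ni^2
        den += w_ni
    return num / den

# Five hypothetical items scored 0..3, all with the same step difficulties
items = [[-1.0, 0.0, 1.0]] * 5
consistent = [2, 2, 2, 2, 2]   # stays near expectation on every item
erratic = [0, 3, 3, 1, 3]      # same raw score (10), but erratic across items
print(weighted_ms(0.5, consistent, items))  # well below 0.7: negative misfit
print(weighted_ms(0.5, erratic, items))     # above 1.3: positive misfit
```

Because both response vectors have the same raw score, the two patterns yield the same latent estimate; only the person fit statistic distinguishes them.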
Applied to the current assessment, Table 1 presents the weighted MS fit statistics (i.e., infit) for responses in the Social and Emotional Development (SED) domain of the DRDP. The table includes data from six example respondents, all with identical raw scores of 20 and a corresponding latent estimate of 0.0 logits. The results show a range of person fit: one set of scores demonstrates expected misfit (MS close to 1.0), two sets exhibit low MS (< 0.7), and three show high MS (> 1.3). The first row of the table displays the predicted scores for a respondent with a raw score of 20, as derived from the measurement model, providing context for the observed MS values.
Table 1
Misfitting and fitting response patterns for the social and emotional domain of the DRDP assessment

Respondent ID | Raw score | Weighted MS | SED1 | SED2 | SED3 | SED4 | SED5
Predicted     | 20        | -           | 3.87 | 3.92 | 4.01 | 4.18 | 4.07

MS = 1 (predicted misfit):
1001          | 20        | 0.97        | 5    | 4    | 4    | 3    | 4

Low MS (negative misfit):
1002          | 20        | 0.02        | 4    | 4    | 4    | 4    | 4
1003          | 20        | 0.51        | 3    | 4    | 4    | 5    | 4

High MS (positive misfit):
1004          | 20        | 1.50        | 3    | 5    | 4    | 3    | 5
1005          | 20        | 2.49        | 5    | 4    | 4    | 2    | 5
1006          | 20        | 7.01        | 4    | 6    | 6    | 1    | 3
The first response pattern, under MS = 1, exhibits typical orderliness. Although there was some variation around the predicted scores (notably for SED1 and SED4, which deviate more from the model’s predictions than other items), such divergence is expected in a probabilistic model and in assessment of child development (e.g., accommodating horizontal décalage). The second set of scores, under “Low MS,” aligns more closely with the predicted values than expected. For respondent 1002, the observed scores match the predicted values almost exactly, yielding a very low weighted MS of 0.02. Respondent 1003 shows a two-point change that results in the smallest increase in weighted MS, rising from 0.02 to 0.51 but still well below 1.0. Notably, the average MS associated with all two-point changes is 0.51.
The DRDP was designed to encourage a pattern of negative misfit, with items developed to work in lockstep to reflect a generalized developmental trajectory across items (Sussman et al., 2023). Consequently, a child rated as a 4 on one item is likely to be rated a 4 on others. This built-in consistency, though only a matter of degree, results in less variance than the model formally predicts, leading to average child misfit below 1.0. As we will discuss later, this lower misfit (less variation than expected) is not concerning and does not indicate a problem with the assessment.
The final set of response patterns, under “High MS,” exhibits larger discrepancies between observed and expected scores. Respondent 1004, with an MS of 1.5, demonstrates a response pattern that would be flagged as misfitting. Respondents 1005 and 1006 show even more pronounced variability, with 1006 displaying a response pattern in the top 2% of positive misfit. Positive misfit cases, like those of respondents 1005 and 1006, raise concerns about valid measurement, as they suggest highly unexpected responses on certain items, signaling potential issues with the assessment.
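The cutoff logic used in this walkthrough can be expressed as a small helper. The 0.7/1.3 cutpoints are the rule of thumb cited above (Wright & Linacre, 1994); applied programs may reasonably choose different thresholds.

```python
def classify_misfit(ms):
    """Label a weighted MS value using the 0.7 / 1.3 rule of thumb."""
    if ms > 1.3:
        return "positive misfit"   # less consistent than the model expects
    if ms < 0.7:
        return "negative misfit"   # more consistent than the model expects
    return "expected fit"

# Weighted MS values for respondents 1001-1006 in Table 1
for rid, ms in [(1001, 0.97), (1002, 0.02), (1003, 0.51),
                (1004, 1.50), (1005, 2.49), (1006, 7.01)]:
    print(rid, classify_misfit(ms))
```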

1.3.3 Relating person misfit with cultural sensitivity in the DRDP

In this study, the central hypothesis tested through person fit analysis is that a child’s cultural background and related experiences may influence their intra-individual development, specifically the consistency with which different skills develop. The DRDP’s measurement model can be viewed as a testable framework for child development. As Thurstone (1937) noted, the parameter estimates take on substantive meaning because they become an empirical representation of the underlying theory. Thus, the assessment contains a probabilistic model of how children are expected to progress across various developmental domains. Person fit analyses allow us to evaluate each child’s match with these expectations.
The present person fit analysis compared the degree to which development within the three groups matched the expectations set forth by the model. Practically speaking, a broader range of lived experiences may selectively support development in different areas, affecting the rate and sequence of learning, and manifesting as relative strengths or weaknesses in development (e.g., Saxe, 1988). A child exhibiting significant inconsistency across developmental areas relative to the model will show high misfit (i.e., MS > > 1.0), whereas low intra-individual variation relative to the model results in low misfit (i.e., MS < < 1.0).
Following Rupp’s (2013) recommendation that person fit studies include explicit, a priori hypotheses regarding expected fit and directionality, we hypothesized that children who are Latino/a or Black would exhibit greater average positive misfit compared to their White peers. Such a pattern would suggest that children from diverse backgrounds are not rated with the same level of consistency as White children, providing evidence that the learning progressions (LPs) may not fully reflect the developmental trajectories of children with a broader range of cultural and developmental experiences. This could indicate potential biases in the assessment (e.g., Walker, 2017).
This study addresses several calls in the literature: (a) from the early childhood education (ECE) field to assess the fairness of EL-KEAs, (b) from research on learning progressions to explore their applicability to diverse children, and (c) from person fit literature to use multilevel regression methods to better understand factors influencing person fit. This study directly addresses concerns raised by First 5 (2022) regarding the cultural relevance of EL-KEAs. Again, if an assessment fails to account for the varying rates and sequences of development shaped by diverse cultural experiences, it could compromise the fairness and accuracy of the score interpretations. Our study sought evidence for the cultural sensitivity of the DRDP, and to support or contradict claims that the assessment should be regarded as an equitable measure of child development for children from CLD families, with implications for similar EL-KEAs.

2 Methods

2.1 Overview

In this study, we employed a regression framework to examine person fit on the DRDP assessment among children identified as Latino/a, Black, and White. To account for contextual factors, we used multilevel modeling to control for background variables and unmeasured variance at the teacher and school levels. The results were interpreted to assess whether the developmental progressions underlying the DRDP assessment are equally applicable to these three racial/ethnic groups.
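As a hedged illustration of this kind of specification, the sketch below fits a random-intercept model to synthetic data with statsmodels. The variable names, formula, and library choice are ours, not a description of the authors' actual software or data, and the synthetic outcome contains no built-in group effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_sites, per_site = 40, 25
site = np.repeat(np.arange(n_sites), per_site)
site_effect = rng.normal(0.0, 0.1, n_sites)[site]  # unmeasured site-level variance
group = rng.choice(["Black", "Latino", "White"], size=site.size)
age = rng.uniform(36, 71, size=site.size)          # age in months (hypothetical)

# Synthetic person fit (infit) outcomes, centered near the expected value of 1.0
infit = 1.0 + site_effect + rng.normal(0.0, 0.3, site.size)
df = pd.DataFrame({"infit": infit, "group": group,
                   "age": age - age.mean(), "site": site})

# Random intercept for site; fixed effects for race/ethnicity (White as
# reference) and age, mirroring a multilevel control structure
model = smf.mixedlm("infit ~ C(group, Treatment('White')) + age",
                    df, groups=df["site"])
result = model.fit()
print(result.params.round(3))
```

Under the study's hypothesis, positive fixed-effect coefficients for the Latino/a and Black indicators would signal higher average misfit for those groups after accounting for site-level clustering.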

2.2 Data and participants

This study utilized data from DRDPtech, a state assessment database that was part of the California Desired Results Program. DRDPtech has since been replaced, but it provided comprehensive data for this analysis. The database contains DRDP scores and teacher-reported demographics for children in state-funded infant/toddler, preschool, or kindergarten programs in California from 2015 through 2018.
We selected participants using a stratified sampling process as follows: (1) Starting with the full database (1.25 million observations), we retained children aged 2–5 who completed the preschool version of the DRDP. (2) Observations with missing data were excluded to ensure consistent parameter estimation across groups. (3) To eliminate dependency, we randomly selected one assessment per child, removing repeated measures. (4) Given the small size of the group of children identified as Black, we sampled all identified Black children and randomly selected equal numbers of Latino/a and White children to create a balanced sample. (5) To estimate the complete measurement model, we added 15,000 infants/toddlers and 1200 kindergarteners with complete data, stratified equally across the three racial/ethnic groups. Our software and estimation require complete data for all categories, and using more than the minimum cases helps ensure model stability. (6) The final sample included 96,258 children nested in 4326 school sites. The median number of children per site was 13, ranging from 1 to 664. The misfit analysis focused on the preschool subsample (N = 80,058), as this group aligned with the primary research aim of evaluating person fit across racial/ethnic categories.
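Steps (2) through (4) of the selection process can be sketched as follows; this is a toy stand-in for the DRDPtech extract, with hypothetical column names and randomly generated records.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the assessment database; column names are hypothetical
n = 5000
df = pd.DataFrame({
    "child_id": rng.integers(0, 2000, n),   # repeated measures per child
    "race": rng.choice(["Latino", "White", "Black"], n, p=[0.5, 0.35, 0.15]),
    "complete": rng.random(n) < 0.9,        # flag for complete records
})

# (2) exclude observations with missing data
df = df[df["complete"]]
# (3) keep one randomly chosen assessment per child to remove dependency
df = df.sample(frac=1.0, random_state=0).drop_duplicates("child_id")
# (4) balance groups on the size of the smallest racial/ethnic group
n_min = df["race"].value_counts().min()
balanced = df.groupby("race").sample(n=n_min, random_state=0)
print(balanced["race"].value_counts().to_dict())  # equal n per group
```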

2.3 Data availability statement

The data are owned by California and are not publicly available.

2.4 Participant demographics

We analyzed demographic variables available in DRDPtech that are commonly used to explain assessment variance in similar studies. These variables include teacher-reported information about children’s race/ethnicity, age, gender, special education status, the language(s) that the teacher speaks with the child, and the type of preschool setting. In addition, to control for the effects related to semester of assessment (raters become more experienced with the DRDP during the year), we recorded which semester the assessment was conducted in.
Children identified as both Latino/a and White were classified as Latino/a to maintain consistency with prior research practices and ensure balanced group sizes. Children identified as multiracial were excluded from the analysis.
Table 2 shows that racial/ethnic and age distributions were balanced across groups. Table 3 highlights additional demographic variables such as language match, special education status, and childcare settings. Children in special education comprised 6% of the analytic sample, which aligns with statewide numbers. As expected, about two-thirds of children identified as Latino/a were classified as multilingual learners (MLs), whereas far fewer ML classifications were found among children identified as Black or White.
Table 2
Children’s race/ethnicity and age (age in months)

| Race/ethnicity | n | M | SD | Min | Max |
|---|---|---|---|---|---|
| Latino/a | 26,686 | 53.0 | 6.9 | 36 | 71 |
| Black | 26,686 | 52.5 | 7.1 | 36 | 71 |
| White | 26,686 | 53.0 | 6.9 | 36 | 71 |
Table 3
Crosstabulation of race/ethnicity with categorical variables

| | Latino/a n | Latino/a % | White n | White % | Black n | Black % |
|---|---|---|---|---|---|---|
| Gender | | | | | | |
| Female | 13,418 | 50.3 | 12,993 | 48.7 | 13,607 | 51.0 |
| Male | 13,249 | 49.6 | 13,682 | 51.3 | 13,064 | 49.0 |
| Missing | 19 | 0.1 | 11 | <0.1 | 15 | 0.1 |
| Special education | | | | | | |
| General education | 23,061 | 86.4 | 22,377 | 83.9 | 23,027 | 86.3 |
| Special education | 1,342 | 5.0 | 1,864 | 7.0 | 1,155 | 4.3 |
| Missing | 2,283 | 8.6 | 2,445 | 9.2 | 2,504 | 9.4 |
| Multilingual learner | | | | | | |
| Yes | 18,044 | 67.6 | 4,097 | 15.4 | 1,917 | 7.2 |
| No | 8,642 | 32.4 | 22,589 | 84.6 | 24,769 | 92.8 |
| Language match | | | | | | |
| Match | 17,691 | 66.3 | 23,342 | 87.5 | 24,979 | 93.6 |
| No match | 8,846 | 33.1 | 3,252 | 12.2 | 1,596 | 6.0 |
| Missing | 149 | 0.6 | 92 | 0.3 | 111 | 0.4 |
| Setting | | | | | | |
| Center-based | 2,264 | 8.5 | 1,894 | 7.1 | 3,040 | 11.4 |
| Family care | 251 | 0.9 | 325 | 1.2 | 229 | 0.9 |
| Head start | 4,786 | 17.9 | 5,713 | 21.4 | 7,112 | 26.7 |
| State school | 15,507 | 58.1 | 14,938 | 56.0 | 12,254 | 45.9 |
| Other setting | 1,122 | 4.2 | 1,332 | 5.0 | 830 | 3.1 |
| Missing | 2,756 | 10.3 | 2,484 | 9.3 | 3,221 | 12.1 |
| Semester | | | | | | |
| Fall | 14,207 | 44.5 | 14,154 | 43.9 | 13,705 | 42.7 |
| Winter | 14,767 | 47.2 | 15,172 | 48.7 | 14,533 | 46.2 |
| Spring | 3,112 | 8.3 | 2,760 | 7.4 | 3,848 | 11.2 |
Language match, a constructed variable, indicates whether the teacher reported speaking the child’s home language. About 66% of children identified as Latino/a had a teacher who spoke their home language, compared to 88% for children identified as White and 94% for children identified as Black. These differences suggest potential variations in teacher–child communication patterns across groups.
Childcare setting distributions were generally balanced across racial/ethnic groups, further supporting the comparability of the sample across these categories. Semester was also balanced across racial/ethnic groups, consistent with the sampling design. Spring DRDP assessments were optional, hence the smaller number of completed spring assessments relative to fall and winter.

2.5 Assessment

The outcomes in this study were assessment scores for the five major domains of the DRDP assessment. Children were rated by their teachers or childcare providers, and scores were generated using the operational DRDP assessment system methods (DCRG, 2018): In brief, items within each domain were scaled using the unidimensional Rasch partial credit model (PCM; Masters, 1982), incorporating a dimensional alignment technique (Feuerstahler & Wilson, 2021) to place the five domains on a common measurement scale. The Rasch models were estimated using the (unidimensional) random coefficients multinomial logit model (Adams et al., 1997). Person scaled scores for each dimension (also called domain) were generated using Warm’s weighted likelihood estimates.
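The operational scaling was done in R; as a language-agnostic illustration, a minimal Python sketch of the PCM category probabilities (Masters, 1982) is shown below. It covers only the core model, omitting the dimensional alignment and the weighted likelihood scoring used operationally:

```python
import math

def pcm_probs(theta, deltas):
    """Category probabilities under the partial credit model.

    `deltas` are the step difficulties for one polytomous item; category k
    has log-odds sum_{j<=k} (theta - delta_j), with category 0 as baseline.
    """
    logits = [0.0]
    for d in deltas:
        logits.append(logits[-1] + theta - d)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(theta, deltas):
    """Model-expected item score E[X | theta], used later for residuals."""
    return sum(k * p for k, p in enumerate(pcm_probs(theta, deltas)))
```

Under the PCM, the probability of each rating level depends on the distance between the child’s location and the item’s step difficulties, which is what allows domains to be expressed on a common logit scale.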
Descriptive statistics for the DRDP scores, disaggregated by race/ethnicity, are provided in Table 4. The mean scores for children identified as Latino/a and Black were generally lower than those for children identified as White, consistent with prior research (e.g., Isaacs, 2012). The group-level mean difference of approximately 0.4 to 0.6 logits is unlikely to influence the person fit results (Paek & Wilson, 2011), as this difference is small relative to the range of the latent distribution, and our regression models account for age, which is strongly correlated with the latent trait.
Table 4
DRDP assessment scores disaggregated by race/ethnicity (in logits)

| Domain | Race/ethnicity | M | SD | Min | Max |
|---|---|---|---|---|---|
| ATLREG | Latino/a | 0.84 | 2.63 | −10.13 | 6.34 |
| | White | 0.69 | 2.62 | −10.13 | 6.34 |
| | Black | 1.25 | 2.55 | −10.13 | 6.34 |
| SED | Latino/a | 1.03 | 2.50 | −10.27 | 5.96 |
| | White | 1.49 | 2.42 | −10.27 | 5.96 |
| | Black | 1.08 | 2.50 | −10.27 | 5.96 |
| LLD | Latino/a | 0.79 | 2.36 | −10.64 | 6.08 |
| | White | 1.44 | 2.25 | −10.64 | 6.08 |
| | Black | 0.99 | 2.34 | −10.64 | 6.08 |
| COG | Latino/a | 0.75 | 2.44 | −10.34 | 6.31 |
| | White | 1.35 | 2.35 | −10.34 | 6.31 |
| | Black | 0.79 | 2.44 | −10.34 | 6.31 |
| PDHLTH | Latino/a | 0.89 | 2.39 | −10.67 | 5.93 |
| | White | 1.29 | 2.25 | −10.67 | 5.93 |
| | Black | 0.90 | 2.35 | −10.67 | 5.93 |

Note. ATLREG Approaches to Learning/Self-Regulation, SED Social and Emotional Development, LLD Language and Literacy Development, COG Cognition (including math and science), PDHLTH Physical Development/Health. n of each row = 26,686

2.5.1 Analysis plan

Our analysis employed weighted mean squares (MS) as the outcome in multilevel regression to examine racial/ethnic differences in person fit across the five major domains of the DRDP assessment.

2.5.2 Weighted MS

Weighted MS, often referred to as infit, has been widely used in the education and healthcare literatures, as well as in large-scale assessment systems such as NAEP and PISA. The strengths and limitations of this person fit statistic are well established (Li & Olejnik, 1997; Smith et al., 2008). Current guidelines recommend avoiding unweighted MS, and research suggests that different fit statistics for polytomous items perform similarly across assessment conditions, with the choice of method having minimal impact on results (Şengül Avşar, 2019). Our own comparison of weighted and unweighted MS indicated that the choice had no impact on the conclusions of this study (Research Supplement 1 compares the two sets of results). Additionally, an exploratory analysis comparing weighted MS with the lz statistic (Drasgow et al., 1985) revealed a strong correlation of 0.98 between the resulting person fit estimates. Based on these findings, we concluded that reporting weighted MS alone provided an efficient and rigorous approach for this study.
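Weighted MS (infit) is the sum of a respondent’s squared score residuals divided by the sum of the model variances of their item scores. The sketch below follows that standard Rasch definition; it is an illustration, not the exact TAM implementation:

```python
def weighted_ms(observed, expected, variances):
    """Infit (weighted mean square) for one respondent.

    `observed` are the item scores, `expected` and `variances` are the
    model-implied means and variances of those scores given the
    respondent's estimated location.
    """
    sq_resid = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return sq_resid / sum(variances)
```

A value of 1.0 means the residuals are exactly as large as the model expects; values below 1.0 indicate overly consistent (negative misfit) response patterns, and values above 1.0 indicate unexpectedly inconsistent (positive misfit) patterns.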

2.5.3 Multilevel regression

Petridou and Williams (2007) applied a multilevel regression framework and found that a random intercept at the classroom level explained a nontrivial portion of the person fit variance, recommending further multilevel regression studies. Petridou and Williams, like Reise (2000), employed a logistic framework with binary misfit cutoffs, whereas Cui and Mousavi (2015) used a linear model. Because the weighted MS distributions in this study were similar across groups and lacked a clear misfit threshold, we chose a linear model. To cross-validate our findings, however, we also conducted logistic regression (with the odds of MS > 1.3 as the outcome) and found no meaningful differences in the results. Additionally, we replicated the analysis using only participants with large positive misfit (MS > 1.3) and obtained qualitatively identical results to those using the full sample. Although person misfit statistics can be estimated for groups (Smith & Plackner, 2009), these are infrequently reported in the literature, so we conducted an individual-level analysis.
Statistical analyses were conducted in R (R Core Team, 2021), utilizing TAM (Robitzsch et al., 2024) for IRT analyses and weighted MS, lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) for regression, and ggeffects (Lüdecke, 2018) for marginal means.

2.5.4 Effect sizes and interpretation of person fit results

Traditional significance tests for person fit statistics, which are based on the chi-squared distribution, are sensitive to sample size and thus not suitable for our large samples. Instead, person fit researchers have emphasized the use of effect sizes. To assess misfit, we applied the typical cutoffs of less than 0.7 for negative misfit and greater than 1.3 for positive misfit (Wright & Linacre, 1994). We also developed a custom standardized effect size (ES) metric to interpret between-group differences. The range between the two cutoffs (0.7 and 1.3), a span of 0.6 points, defines the expected variability around the model expectation and serves as a useful standard for interpreting between-group differences. Given that the average of all two-point deviations produced a weighted MS of 0.51, we used the mean of 0.6 and 0.51, or 0.555, as the denominator of our standardized ES. This ES expresses between-group differences as a fraction of either the model’s expected variation or a typical two-point fluctuation. Differences smaller than 0.1 were considered negligible, while larger differences were deemed meaningful, though this 0.1 threshold is somewhat arbitrary.
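The ES arithmetic described above can be made concrete; the sketch below is a direct transcription of the values given in the text:

```python
# Denominator: mean of the cutoff span (1.3 - 0.7 = 0.6) and the average
# weighted MS produced by two-point deviations (0.51), as defined in the text.
ES_DENOMINATOR = (0.6 + 0.51) / 2  # 0.555

def standardized_es(group_diff):
    """Between-group difference in weighted MS expressed as a fraction of
    typical fit variability; |ES| < 0.1 is treated as negligible."""
    return group_diff / ES_DENOMINATOR
```

For example, a between-group difference of 0.015 weighted MS points corresponds to an ES of about 0.027, well under the 0.1 negligibility threshold.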

2.5.5 Organization of the results

The results are presented in two sections. The first section provides descriptive statistics prior to statistical adjustment, including distributions of weighted MS by race/ethnicity across the five DRDP domains. The second section presents multilevel regression results in two formats: first, marginal means that show the expected average fit for children categorized as Latino/a, Black, and White, controlling for other variables; and second, regression model coefficients that show the association between person fit and the research variables (fixed effects), as well as the unmodeled variance at the teacher and school levels (random effects).

3 Results

3.1 Person fit distributions among racial/ethnic groups

This section describes the bivariate distributions of person fit and race/ethnicity across each of the five DRDP domains. Table 5 presents the average weighted MS values for each domain and racial/ethnic category (prior to statistical adjustment). Additionally, the column labeled “Raw deviations” shows the mean difference between predicted and observed item scores, akin to the comparison between respondents 1002 and 1003 in Table 1. For example, the first row of Table 5 indicates that for the ATLREG domain, children identified as Black had a mean weighted MS of 0.84, which corresponds to a 2.41-point raw difference from the “baseline” response vector with the lowest MS. Raw deviations are influenced by the number of items, so they must be interpreted within each domain. Although the weighted MS and raw deviations columns were almost perfectly correlated, the latter offers insight into how increases in misfit manifested as changes in scores on the learning progressions.
Table 5
Distribution of raw score deviations and corresponding weighted MS statistics

| Domain | Category | Weighted MS M | Weighted MS SD | Raw deviations M | Raw deviations SD |
|---|---|---|---|---|---|
| ATLREG | Black | 0.84 | 1.04 | 2.41 | 2.06 |
| | Latino/a | 0.78 | 0.89 | 2.31 | 2.01 |
| | White | 0.82 | 0.94 | 2.30 | 2.02 |
| | All | 0.82 | 0.96 | 2.34 | 2.03 |
| SED | Black | 0.52 | 0.70 | 1.99 | 2.16 |
| | Latino/a | 0.52 | 0.70 | 2.00 | 2.14 |
| | White | 0.56 | 0.71 | 2.16 | 2.17 |
| | All | 0.53 | 0.71 | 2.05 | 2.16 |
| LLD | Black | 0.74 | 0.73 | 4.92 | 3.45 |
| | Latino/a | 0.74 | 0.72 | 4.93 | 3.39 |
| | White | 0.76 | 0.70 | 5.05 | 3.34 |
| | All | 0.75 | 0.72 | 4.97 | 3.39 |
| COG | Black | 0.71 | 0.69 | 4.62 | 3.38 |
| | Latino/a | 0.72 | 0.68 | 4.70 | 3.35 |
| | White | 0.72 | 0.64 | 4.77 | 3.25 |
| | All | 0.72 | 0.67 | 4.70 | 3.33 |
| PDHLTH | Black | 0.85 | 1.05 | 4.90 | 4.04 |
| | Latino/a | 0.78 | 0.83 | 4.68 | 3.64 |
| | White | 0.82 | 0.84 | 4.77 | 3.59 |
| | All | 0.82 | 0.91 | 4.78 | 3.76 |

Note. Each row n = 26,686, except for All (N = 80,058)
For three of the five domains (SED, LLD, and COG), children classified as White exhibited the same or higher average misfit than the other two racial/ethnic groups, which does not support the racial/ethnic bias hypothesis. However, conflicting results were found for ATLREG and PDHLTH, where children identified as Black had slightly higher average misfit than the other groups, suggesting the possibility of bias. Standard deviations were consistent across racial/ethnic groups within each domain, providing no evidence of differential variability. Histograms of the weighted MS statistics (not shown) revealed no distributional differences that could distort or hinder statistical comparisons, supporting the use of models for continuous data.
As anticipated, the average weighted MS values were generally low, with all averages in Table 5 falling below the expected value of 1.0. Some values were near the 0.7 threshold, and SED was slightly below this threshold at approximately 0.5. Low weighted MS is characteristic of the DRDP assessment items, which were designed and optimized to ensure within-person consistency across items. Analyses conducted on a subset of children considered misfitting (MS > 1.3) yielded similar results. To retain as much information as possible, the full sample was included in the analysis.

3.2 Multilevel regression

We employed multilevel regression to estimate average person fit for children classified as Latino/a, White, and Black, controlling for the research variables. Research Supplement 2 presents the results of our model building process: Model 1 contained only the variance components. Model 2 added fixed effects for race/ethnicity. Model 3, the final model, included complete fixed effects for all the research variables. Here, we focus on the final model that was applied to each of the five DRDP outcomes. This final model regressed weighted MS on race/ethnicity, grand mean-centered age, special education status, gender, semester, and setting type. ML status and language match were not statistically significant and were removed. Treating age as a continuous variable was more appropriate than as an indicator, as the increase in model fit using the latter was negligible. Interactions between race/ethnicity and setting or race/ethnicity and semester, though theoretically plausible, were not statistically significant and were omitted. Random intercepts for teachers and school sites were included, as likelihood ratio tests showed significant improvements in model fit. The final model is presented in Eq. 2:
$$\text{Weighted MS}_{ijk} = \alpha_0 + \beta_1\,\text{Latino}_{ijk} + \beta_2\,\text{White}_{ijk} + \beta_3\,(\text{Age}_{ijk} - \text{Age}_{\cdots}) + \beta_4\,\text{Special Education}_{ijk} + \beta_5\,\text{Female}_{ijk} + \beta_6\,\text{Winter}_{ijk} + \beta_7\,\text{Spring}_{ijk} + \beta_8\,\text{Center Care}_{k} + \beta_9\,\text{Family Care}_{k} + \beta_{10}\,\text{Head Start}_{k} + \beta_{11}\,\text{State School}_{k} + \beta_{12}\,\text{Other Setting}_{k} + \zeta_j + \zeta_k + \epsilon_{ijk}$$
(2)
where the subscripts represent child \(i\), teacher \(j\), and site \(k\). Black served as the reference category for race/ethnicity, and Fall is the reference category for semester. The grand mean centering of age (\({\text{Age}}_{ijk}-{\text{Age}}_{...}\)) adjusts the model intercept and random effects variances to the mean age (4.4 years). \({\beta }_{8}\) through \({\beta }_{12}\) are coefficients for dummy-coded childcare settings, with missing data as the reference category. The model assumes that, given the covariates, the two random intercepts are normally distributed with means of zero and variances \(\psi\) and \(\gamma\), that the random effects are uncorrelated, and that the level-1 errors are homoskedastic conditional on the random effects.
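To see how the fixed effects in Eq. 2 combine, the sketch below computes an illustrative fitted value from the ATLREG coefficients reported in Table 7. The child profile chosen here (a Latino/a girl of mean age, in general education, assessed in winter at a state school) is hypothetical:

```python
def predict_weighted_ms(intercept, effects):
    """Fixed-effects prediction from Eq. 2: the intercept plus the sum of
    the coefficients for the active dummy codes and centered terms, with
    the random intercepts set to their mean of zero."""
    return intercept + sum(effects)

# ATLREG coefficients from Table 7: Latino/a -0.059, Female -0.076,
# Winter -0.107, State school 0.017; age at the grand mean contributes 0.
example = predict_weighted_ms(0.966, [-0.059, -0.076, -0.107, 0.017])
```

The resulting fitted value (about 0.74) sits below the model expectation of 1.0, matching the pattern of generally low average misfit reported in the results.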
We present marginal effects first, followed by model diagnostics and regression coefficients.

3.2.1 Marginal person fit among racial/ethnic groups

To compare person fit across racial/ethnic groups, we estimated average person fit for children in each group while holding other variables constant (i.e., marginal effects). This allowed for direct comparisons between the three groups. We first examined person fit in the five DRDP domains and then evaluated racial/ethnic differences.

3.2.2 Differences across DRDP domains

Figure 2 contains a graph of estimated person fit for the five DRDP domains and three racial/ethnic groups. The solid horizontal line at 1.0 represents the model’s expected fit, whereas the dashed line at 0.7 marks the lower bound for negative misfit (i.e., less problematic misfit). The boundary for positive misfit, at 1.3, is beyond the range shown. All estimated means were below 1.0, indicating that average misfit was lower than expected. This result aligns with the DRDP’s design, which aims for within-person consistency across items within a domain. Even so, the mean weighted MS estimates remained well above zero, so some within-person inconsistency was present, though less than the model expects.
Fig. 2
Estimated person fit for the DRDP domains and racial/ethnic categories
These findings provide validity evidence for the DRDP’s internal structure, supporting claims that it captures children’s consistent development within each domain. However, a competing hypothesis is that ratings may be inaccurate, potentially influenced by construct-irrelevant factors, such as rater effects (e.g., a halo effect). We cannot rule out this possibility with the current data and address this limitation in the discussion.

3.2.3 Consistency of SED ratings

Our second observation was that the SED ratings exhibited greater consistency compared to the other domains. Notably, SED also crossed below the 0.7 threshold for negative misfit, suggesting that, in addition to its greater consistency relative to other domains, the SED ratings were more consistent than expected by the model. This finding may reflect intrinsic consistency among the items, a general developmental trend in SED, or potential halo effects or other construct-irrelevant influences. It could also indicate redundancy among the items, which might be addressed by reducing the number of items. However, we view this situation as less problematic than positive misfit, which reflects greater than expected inconsistency. Overall, response inconsistency does not appear to be a significant issue within any of the DRDP domains.

3.2.4 Racial/ethnic differences

Within each domain, person fit differences across racial/ethnic groups were minimal. Statistically, only three contrasts were significant at the p < 0.05 level: Latino/a (green marker) vs. Black (red marker) in ATLREG (p < 0.001), LLD (p < 0.05), and PDHLTH (p < 0.001). The remaining racial/ethnic comparisons within each domain were not statistically significant. Even for the significant contrasts, the average difference was 0.015, with a range of 0 to 0.058, resulting in a very small effect size (ES = 0.027, as defined in the methods). In conclusion, the racial/ethnic differences in person fit, as estimated by the multilevel models, were minimal.

3.2.5 Model fit and regression coefficients

Table 6 presents estimates of the variance explained by the model. The first row, labeled “Site,” represents the variance in person fit explained by the random intercept at the site level. This variance ranged from 4 to 8%, indicating that site-level factors accounted for a modest portion of the total variance. The second row, “Teacher,” shows that the teacher-level random intercept explained between 13 and 19% of the variance, the largest contribution of any structural component in the model. This variance may reflect unmeasured classroom variables, including a potential rater effect, where teachers’ idiosyncratic rating styles influenced their assessments. The “Error” row indicates that between 73 and 82% of the variance remained unexplained by the model. Next, intraclass correlations (ICCs) are also included to facilitate broader comparisons with the literature. ICCs at the teacher level ranged from 0.126 to 0.189, whereas ICCs at the site ranged from 0.034 to 0.074. Finally, the “Marginal R2” row shows that fixed effects accounted for only 1% of the variance in each model. This result is not surprising or inherently problematic, as one would expect person fit to be largely independent of demographic variables in a fair and well-designed assessment.
Table 6
Variance explained by regression in Eq. 2

| | ATLREG | SED | LLD | COG | PDHLTH |
|---|---|---|---|---|---|
| Random part: R2 | | | | | |
| Site | 0.05 | 0.04 | 0.04 | 0.07 | 0.08 |
| Teacher | 0.13 | 0.13 | 0.14 | 0.19 | 0.18 |
| Error | 0.81 | 0.82 | 0.81 | 0.73 | 0.73 |
| Random part: ICC | | | | | |
| Site | 0.050 | 0.034 | 0.040 | 0.070 | 0.074 |
| Teacher | 0.126 | 0.126 | 0.143 | 0.189 | 0.184 |
| Fixed part | | | | | |
| Marginal R2 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
Table 7 presents the coefficients and associated statistics from the final regression models. The results were relatively consistent across the five DRDP domains, allowing for some general conclusions, tempered by the study’s limitations discussed later. Again, the small variance explained by the fixed effects highlights the substantial variability around the estimates and underscores the need for cautious interpretation.
Table 7
Regression coefficients for multilevel regression of person fit on research variables

| | ATLREG Est | ATLREG SE | SED Est | SED SE | LLD Est | LLD SE | COG Est | COG SE | PDHLTH Est | PDHLTH SE |
|---|---|---|---|---|---|---|---|---|---|---|
| Fixed part | | | | | | | | | | |
| Race/ethnicity | | | | | | | | | | |
| Latino/a | −0.059 | 0.010*** | −0.006 | 0.007 | −0.017 | 0.007** | −0.003 | 0.007 | −0.037 | 0.009*** |
| White | −0.019 | 0.010 | 0.010 | 0.007 | −0.003 | 0.007 | −0.009 | 0.007 | −0.010 | 0.009 |
| Age | −0.006 | 0.001*** | −0.004 | 4.0E−04*** | −0.006 | 4.0E−04*** | −0.004 | 3.7E−04*** | −0.009 | 5.0E−04*** |
| IEP | 0.136 | 0.015*** | 0.099 | 0.011*** | 0.119 | 0.011*** | 0.089 | 0.010*** | 0.126 | 0.013*** |
| Female | −0.076 | 0.007*** | −0.047 | 0.005*** | −0.053 | 0.005*** | −0.029 | 0.005*** | −0.074 | 0.006*** |
| Semester | | | | | | | | | | |
| Winter | −0.107 | 0.014*** | −0.069 | 0.010*** | −0.072 | 0.010*** | −0.069 | 0.009*** | −0.134 | 0.013*** |
| Spring | −0.138 | 0.008*** | −0.070 | 0.006*** | −0.066 | 0.006*** | −0.082 | 0.005*** | −0.128 | 0.007*** |
| Setting | | | | | | | | | | |
| Center | 0.003 | 0.021 | −0.018 | 0.015 | −0.019 | 0.015 | −0.015 | 0.014 | −0.023 | 0.019 |
| Family | −0.137 | 0.055** | −0.044 | 0.039 | −0.016 | 0.040 | 0.036 | 0.041 | −0.031 | 0.055 |
| Head start | −0.036 | 0.018** | −0.032 | 0.013** | 0.000 | 0.013 | −0.013 | 0.013 | −0.017 | 0.017 |
| Other | −0.022 | 0.026 | −0.003 | 0.018 | −0.011 | 0.019 | −0.024 | 0.018 | 0.065 | 0.024** |
| State school | 0.017 | 0.015 | 0.015 | 0.011 | 0.037 | 0.011 | 0.038 | 0.010*** | 0.019 | 0.014 |
| Intercept | 0.966 | | 0.601 | | 0.806 | | 0.780 | | 0.938 | |
| Random part (SD) | | | | | | | | | | |
| Teacher | 0.343 | | 0.248 | | 0.268 | | 0.293 | | 0.392 | |
| Site | 0.217 | | 0.130 | | 0.142 | | 0.178 | | 0.249 | |
| Error | 0.878 | | 0.641 | | 0.643 | | 0.579 | | 0.787 | |
| N | 72,795 | | 72,795 | | 72,795 | | 72,795 | | 72,795 | |

Note. *p < 0.05, **p < 0.01, ***p < 0.001

3.2.6 Racial/ethnic differences

The regression coefficients for the three racial/ethnic categories, which underpin the analysis in Fig. 2, indicate negligible differences in person fit associated with race/ethnicity. Specifically, differences between Black children (the reference group) and White children were not statistically significant across any DRDP domains. Although statistically significant differences were observed between Latino/a and Black children for three of the five domains, these differences were minor. The mean absolute discrepancy between these groups was 0.024 points, translating to an ES of 0.043, which we interpret as negligible. The largest effect was observed for the ATLREG domain, where Latino/a children were estimated to have 0.06 points lower weighted MS scores than Black children (p < 0.001), corresponding to an ES of 0.11—a small but potentially meaningful effect. However, this result is likely an outlier, given that it was 54% larger in magnitude than the next largest coefficient and was associated with the largest standard errors across domains. Overall, these analyses do not provide evidence of meaningful racial/ethnic bias in person fit.

3.2.7 Other child variables

Age was statistically significant in all models (p < 0.001). The small negative coefficient suggests that older preschoolers were rated slightly more consistently than younger ones, all else being equal. Special education status (i.e., IEP) also emerged as a significant predictor (p < 0.001) with a positive coefficient across all models, ranging from 0.09 to 0.14. This difference, averaging 0.1 points, translates to an ES of 0.18—a relatively small but practically significant finding. Thus, children in special education exhibited somewhat less within-domain consistency than their peers.
Gender differences also reached statistical significance, with females rated more consistently than males (p < 0.001). The effect ranged from 0.03 to 0.06 points, a modest difference that suggests females may exhibit slightly more consistent development or may be rated more consistently for other reasons.

3.2.8 Semester differences

Semester was a significant predictor across all five models, with person fit consistently higher in the fall compared to winter and spring. No difference was detected between the latter two semesters. The effect size, averaging 0.17, indicates a small but meaningful decrease in person fit after the fall. This systematic pattern, observed across all five domains, may reflect changes in child behavior, such as increased intra-individual consistency, or teacher behavior, such as more consistent ratings of children over time. This finding introduces a new aspect to the discussion of rater effects in EL-KEAs that is explored further in the discussion.

3.2.9 Setting differences

Most contrasts between childcare settings were not statistically significant. However, state schools tended to exhibit the highest person fit scores, whereas family care settings exhibited the lowest. Post hoc contrasts confirmed that some of these differences were significant at p < 0.05 or better. On average, the difference between state schools and family care across DRDP domains was 0.05 points (ES = 0.09), which is small and likely negligible. However, the ATLREG domain displayed a notably larger difference of 0.173 points (ES = 0.31). While potentially important, this finding should be interpreted cautiously due to the relatively large standard errors for ATLREG and the absence of consistent trends across domains.

3.2.10 Random effects

The bottom of Table 7 displays the random-effect standard deviations at each level of the model. As expected, the teacher-level component is larger than the site-level component, and the residual (error) component eclipses both. Interested readers may refer to Research Supplement 2 to observe the small reduction in error variance afforded by more complex models, in line with the small variance explained by the fixed effects discussed above.

4 Discussion

4.1 Summary of key findings

This study addressed ongoing calls from the ECE field to investigate the cultural sensitivity of EL-KEAs and related concerns in the assessment-based LP/LT literature. Specifically, we examined whether the DRDP’s learning progressions are less applicable for children identified as Latino/a or Black, compared to those identified as White. Using person fit analyses, we evaluated the consistency of teachers’ ratings across the five essential domains of early learning and development measured by the DRDP assessment. A central hypothesis was that greater positive misfit for Latino/a and/or Black children compared to White children would support the concerns about racial/ethnic bias raised by the ECE community (i.e., First 5, 2022) and in the LP/LT literature.
However, our findings revealed no evidence of such disparities. Children from all three racial/ethnic groups demonstrated comparable consistency in their teachers’ ratings, contradicting the bias hypothesis. This outcome aligns with the DRDP’s emphasis on fair and equitable assessment practices (DCRG, 2018). The credibility of our results is bolstered by established theories (e.g., Custers et al., 2000; Lamprianou & Boyle, 2004; Van der Flier, 1983) and prior literature advocating for person fit analyses to detect potential racial/ethnic bias (e.g., Petridou & Williams, 2007; Şengül Avşar & Emons, 2021). The current study adds evidence supporting the fairness of the DRDP in capturing developmental progress across diverse populations.
Several additional findings warrant discussion. Children in special education exhibited higher positive misfit (i.e., less consistent ratings) than their peers in general education. This result aligns with expectations given the documented intra-individual variability in special education populations (Flanagan & McDonough, 2022), which can contribute to person misfit (Engelhard, 2009). If replicated, these findings could have important implications for both assessment and instruction. For example, if higher misfit in special education stems from flawed measurement processes, this points to the need for instrument revisions or additional training for special education teachers. Conversely, if these findings reflect genuine intra-individual variability, they underscore the importance of early formative assessment and differentiated instruction tailored to the unique developmental profiles of children in special education.
Additionally, person misfit was found to decrease after the fall semester, a novel and consistent finding observed across all DRDP domains. This decrease, although relatively small, was comparable in magnitude to the observed difference between general and special education groups, indicating practical significance. Several plausible explanations merit consideration: children’s development could become more consistent due to shared environments and experiences (i.e., curriculum and instruction), or teachers could become more comfortable with the assessment process after the fall. Both factors could contribute to changes in person fit. This finding supports the need for a longitudinal perspective on rater behavior within EL-KEA contexts, which we discuss further below.
Two additional findings are noteworthy. First, children identified as female were rated with slightly greater consistency than males. This result aligns with Rudner et al.’s (1995) findings in the NAEP assessment and Cui and Mousavi’s (2015) result from a ninth-grade large-scale mathematics achievement test but contrasts with Petridou and Williams’ (2007) study, which found no gender effects in primary mathematics assessments. Although the difference in consistency by gender is small and unlikely to have significant implications, it adds a new perspective to existing evidence that females are often assessed more favorably in early childhood contexts (e.g., Magnuson et al., 2016). Second, differences in person fit across educational settings were generally minimal, with a potential exception: person fit appeared lowest in family care settings and highest in state schools. This disparity could reflect unmeasured differences in child populations (children are not randomly assigned to settings) and variations in rater training. Anecdotal reports suggest that state school raters often receive more comprehensive DRDP training than those in family care, which may explain raters’ greater ability to perceive nuanced intrapersonal differences and thus higher person fit scores. Additional study is needed, but these findings highlight the value of person fit as a lens to study and understand raters, which has been underutilized to date.

4.2 Implications for LP-based assessment with children from CLD families

This study contributes new insights into the cultural sensitivity of assessment-based LPs for children from CLD backgrounds. Our analyses found no evidence that these children exhibited less consistent developmental trajectories than their peers when evaluated using a common measurement model. Whereas the current results should not be overgeneralized to suggest that racial/ethnic differences in developmental profiles are negligible or that the DRDP is entirely free from bias, they directly address specific concerns raised by ECE practitioners (First 5, 2022) and within the LP/LT literature (Harris et al., 2022). In this regard, our findings offer reassurance regarding the DRDP’s cultural sensitivity.
Additionally, this study provides a framework for future investigations into potential bias. Claims of cultural insensitivity in operational assessments must be substantiated with empirical evidence to identify where improvements are most needed, supporting the common goal of equitable assessment. For example, Wu et al. (2021) documented distinct learning trajectories among international student populations, and research on multilingual learners highlights divergent developmental patterns (Castro et al., 2021; Yow & Markman, 2011). These lines of research, however, remain underexplored in the context of EL-KEAs or LP-based assessments for early childhood populations.
It is also important to note that valid LP-based assessments do not require all children to follow identical developmental paths or progress at the same rate. Critical arguments pointing out that LPs do not accommodate all children’s developmental trajectories often overlook the probabilistic nature of statistical methods used in LP-based assessments, which account for individual differences. Future critiques should focus on the extent to which measurement models are effective in accommodating such variability, rather than on simplistic notions that LPs impose fixed or “orderly” developmental trajectories (e.g., Empson, 2011).
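To make this probabilistic point concrete, here is a minimal sketch in Python (the paper's analyses used R/TAM; this is an illustration, not the operational code) of category probabilities under Masters' (1982) partial credit model, the family of models underlying polytomous LP-based assessments such as the DRDP. The person location `theta` and step difficulties are invented values.

```python
import math

def pcm_probs(theta, deltas):
    """Category probabilities under Masters' (1982) partial credit model.

    theta: person location (logits)
    deltas: step difficulties delta_1..delta_m for an item with m+1 categories
    """
    # Cumulative numerators: category k gets exp(sum_{j<=k} (theta - delta_j)),
    # with the empty sum for category 0 equal to exp(0) = 1.
    nums = [1.0]
    cum = 0.0
    for d in deltas:
        cum += theta - d
        nums.append(math.exp(cum))
    total = sum(nums)
    return [n / total for n in nums]

# A child located between the second and third steps of a 4-category item
# (hypothetical values):
probs = pcm_probs(0.5, [-1.0, 0.0, 1.5])
print([round(p, 3) for p in probs])
```

Because every category retains nonzero probability at any ability level, the model expects variation around the modal developmental path rather than enforcing a single fixed sequence.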

4.3 Implications for assessment with EL-KEAs: practice and research

The use of person fit metrics in EL-KEAs holds considerable promise for both assessment practice and future research. First, person fit can serve as a tool for identifying atypical or potentially invalid assessments. By flagging scores with high positive misfit, educators can conduct follow-up evaluations to uncover unique strengths or weaknesses in a child’s developmental profile, supporting more targeted and effective instructional planning and family engagement. In summative contexts, flagging misfitting scores can also enhance the reliability of assessments by prompting reflection on ratings and ensuring that accurate scores are used for decision-making.
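As a sketch of how such flagging could work, the following computes the weighted mean-square (infit) person fit statistic for the simple dichotomous Rasch case; the DRDP's operational statistic is the polytomous analogue, and the ability, difficulty, and response values below are hypothetical.

```python
import math

def infit_ms(theta, difficulties, responses):
    """Weighted mean-square (infit) person fit for dichotomous Rasch data.

    theta: person ability estimate (logits)
    difficulties: item difficulty estimates (logits)
    responses: observed 0/1 scores, in the same order as difficulties
    """
    sq_resid = 0.0  # summed squared residuals
    info = 0.0      # summed response variances (item information)
    for b, x in zip(difficulties, responses):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))  # P(success) under Rasch
        sq_resid += (x - p) ** 2
        info += p * (1.0 - p)
    return sq_resid / info

# A consistent pattern (passes easy items, fails hard ones)...
consistent = infit_ms(0.0, [-2, -1, 0, 1, 2], [1, 1, 1, 0, 0])
# ...versus an aberrant pattern (fails easy items, passes hard ones)
aberrant = infit_ms(0.0, [-2, -1, 0, 1, 2], [0, 0, 1, 1, 1])
print(consistent, aberrant)  # the aberrant pattern yields a much larger MS
```

In practice, a rule-of-thumb cutoff such as MS above roughly 1.3 is often used to flag positive misfit for follow-up, though appropriate thresholds vary by context and sample size.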
Second, person fit metrics can be a valuable tool for identifying teachers who may benefit from additional support or training. Teachers with extreme person misfit scores could be targeted for further professional development, such as tailored coaching or interviews, to improve their assessment accuracy and consistency. This logic applies to instructional and professional development contexts, where person fit offers practical insights that can inform direct action.
Finally, person fit provides a novel lens for examining rater effects in observational assessment. Rater effects research is strikingly underdeveloped (Carbonneau et al., 2020; Cash et al., 2012; Ready & Wright, 2011), having not kept pace with the widespread implementation of EL-KEAs and the expansion of observational assessment to new developmental constructs (e.g., Garcia et al., 2019). Person fit metrics provide a rigorous way to examine the behavior and accuracy of raters independent of children’s achievement or developmental levels, which could be a valuable way to avoid deficit-oriented interpretations of assessment data. Interestingly, our person fit results showed intraclass correlation coefficients between 12.4% and 18.9%, whereas rater effects for scaled scores tend to fall in the 30% to 50% range (e.g., Lambert et al., 2015; Mashburn & Henry, 2005; Sussman et al., 2023). Person fit thus appears to be less influenced by external factors than scaled scores are. Comparing person fit metrics with scaled scores and other external variables could shed light on the nature of rater effects and deepen our understanding of their impact on assessment outcomes.
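For readers unfamiliar with how such intraclass correlations are obtained, here is a minimal one-way random-effects sketch in Python (children nested in raters; the simulated variance components are invented for illustration and do not reproduce the study's estimates):

```python
import random
import statistics

def icc1(groups):
    """One-way random-effects ICC(1): the share of score variance
    attributable to the grouping factor (here, raters), for equal-sized groups."""
    n = len(groups)     # number of raters
    k = len(groups[0])  # children rated per rater
    means = [statistics.mean(g) for g in groups]
    grand = statistics.mean(means)
    # Between- and within-group mean squares from the one-way ANOVA table
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Simulated scores: rater-effect SD 0.7, child-level SD 1.0 (true ICC ~ 0.33)
random.seed(1)
raters = [[random.gauss(mu, 1.0) for _ in range(20)]
          for mu in [random.gauss(0.0, 0.7) for _ in range(50)]]
print(round(icc1(raters), 3))
```

Larger between-rater variance yields a larger ICC; the comparatively low ICCs observed for person fit thus indicate weaker clustering by rater than is typical for scaled scores.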
An anonymous reviewer posed an important question: could what is often attributed to rater effects or bias instead stem from the diverse ways EL-KEAs are implemented and perceived in practice? For instance, consider Teacher A, who employs a simplified heuristic to streamline assessments, compared to Teacher B, who receives coaching support and views assessment as a critical tool for instructional planning. In this scenario, Teacher A’s scoring variability might be more constrained, whereas Teacher B’s scores might reflect a broader and more nuanced interpretation of student performance. This distinction underscores the need for assessment validation efforts to systematically investigate rater variance, addressing the specific factors that explain such variance. This could include variability in professional development, differences in instructional goals, and other contextual factors (Waterman et al., 2012). Addressing these complexities is crucial for refining the validity and utility of EL-KEAs, particularly in diverse educational settings.
In sum, increasing the emphasis on person fit within EL-KEA systems holds great potential for practice and research. In practice, its utility should be evaluated through pilot testing, where administrative, interpretive, and political challenges can be better understood and the influence of the metrics more thoroughly assessed. For research, person fit metrics may help address longstanding questions about rater reliability in early childhood assessment, and person fit methodology aligns with established empirical frameworks for studying rater effects (Raudenbush et al., 2008). Understanding the root causes of rater effects, including how they influence person fit, could inform the design of more effective rater training programs to mitigate them (Hoyt & Kerns, 1999).

4.4 Limitations

This study—despite its methodological rigor—focused on a single aspect of potential assessment bias. Statistical methods like person fit can generate evidence against specific biases but cannot conclusively prove the absence of all forms of bias. Although we found no evidence of racial/ethnic bias in the DRDP, this does not definitively prove the instrument’s fairness. To guard against misinterpretation, it is important to reiterate that potential biases related to race/ethnicity cannot be entirely ruled out based on this analysis alone.
Our findings suggest that rater effects were the largest source of variance, but the reasons behind these effects remain unclear. Interpreting person fit in observational assessments is inherently complex because (a) variance may originate from both examinees and raters, and (b) person fit results do not explain the reasons for misfit (Rupp, 2013). For example, large positive misfit can result from a careless observer who rates an otherwise typical child inaccurately, or from a careful observer who recognizes a child’s uncommon pattern of intrapersonal strengths and weaknesses. Further research focusing on the reasons behind misfit is necessary to address these unresolved questions.
Additionally, two specific limitations warrant mention. First, certain key variables, such as family socioeconomic status (SES), were not measured. Although the income-censored nature of the public preschool sample mitigates this concern and, to the best of our knowledge, SES has not been linked to person fit, SES remains a potentially confounding variable. Second, this study focused exclusively on the weighted mean square (MS) statistic for person fit, which is widely regarded as the best option for polytomous data. Although we tested alternative metrics (e.g., see Supplement 2) and found negligible differences between methods, a comprehensive analysis using a broader range of person fit indicators could yield additional insights, particularly local fit metrics, which might generate more diagnostic and thus more useful information.

5 Conclusion

This study addressed concerns regarding the cultural sensitivity of learning progressions (LPs) in early childhood education (ECE) by examining the applicability of the Desired Results Developmental Profile (DRDP) for children identified as Latino/a, Black, and White. Our analysis revealed no racial or ethnic differences in person fit, suggesting that the DRDP’s models of learning and development are equally relevant for children from culturally and linguistically diverse backgrounds. These findings directly address concerns voiced by the ECE community about potential racial/ethnic bias in LP-based assessments, contributing to the growing body of evidence supporting the validity and fairness of early learning and developmental assessments for diverse populations.
Additionally, the study highlighted important challenges in assessing special education populations, who exhibited higher misfit due to intra-individual variability. It also revealed that assessments early in the year tended to show less consistency, underscoring the potential influence of rater familiarity and the need for further study.
Furthermore, the use of person fit metrics in EL-KEAs presents significant potential for enhancing both assessment practice and future research. These metrics may help identify misfit in individual assessments, guide targeted instructional planning, and inform professional development to reduce rater variability and improve assessment accuracy.
Overall, the findings suggest that person fit metrics offer a promising avenue for refining observational assessment in early childhood education, making assessments more reliable and relevant for diverse populations. Future research should explore how these metrics can be further integrated into practice and pilot-tested within EL-KEAs to optimize their impact on assessment validity, instructional quality, and teacher development.

Declarations

Conflict of interest

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Title
Cultural sensitivity of early childhood assessments based on learning progressions: a Rasch person fit analysis
Authors
Joshua Sussman
Karen Draney
Mark Wilson
Publication date
14.04.2025
Publisher
Springer Netherlands
Published in
Educational Assessment, Evaluation and Accountability / Issue 2/2025
Print ISSN: 1874-8597
Electronic ISSN: 1874-8600
DOI
https://doi.org/10.1007/s11092-025-09453-0

Supplementary Information

Below is the link to the electronic supplementary material.
Zurück zum Zitat Ackerman, D. J. (2018). Real world compromises: Policy and practice impacts of kindergarten entry assessment-related validity and reliability challenges (Research Report No. RR-18–13). Educational Testing Service. https://​doi.​org/​10.​1002/​ets2.​12201
Zurück zum Zitat Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1–23.CrossRef
Zurück zum Zitat Aguinis, H., Culpepper, S. A., & Pierce, C. A. (2016). Differential prediction generalization in college admissions testing. Journal of Educational Psychology, 108(7), 1045–1059. https://​doi.​org/​10.​1037/​edu0000104CrossRef
Zurück zum Zitat American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Zurück zum Zitat Barghaus, K. M., Dahlke, K., Fantuzzo, J. W., Howard, E. C., Tucker, N., Weinberg, E., Liu, F., Brumley, B., Williams, R., & Flanagan, K. (2023). Validation of the Pennsylvania kindergarten entry inventory: Examining neglected validities in large-scale, teacher-report assessment. Early Education and Development, 34(4), 940–962. https://​doi.​org/​10.​1080/​10409289.​2022.​2076049CrossRef
Zurück zum Zitat Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://​doi.​org/​10.​18637/​jss.​v067.​i01CrossRef
Zurück zum Zitat Bhola, D. S., Impara, J. C., & Buckendahl, C. W. (2003). Aligning tests with states’ content standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3), 21–29. https://​doi.​org/​10.​1111/​j.​1745-3992.​2003.​tb00134.​xCrossRef
Zurück zum Zitat Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7–74. https://​doi.​org/​10.​1080/​0969595980050102​CrossRef
Zurück zum Zitat Bond, T., Yan, Z., & Heene, M. (2021). Applying the Rasch model: Fundamental measurement in the human sciences (4th ed.). Routledge.
Zurück zum Zitat Bruner, J. (1960). The process of education. Harvard University Press.CrossRef
Zurück zum Zitat California Department of Education. (2015). Desired results developmental profile—2015 assessment. Author. https://​www.​desiredresults.​us/​desired-results-system/​drdp-instrument-and-forms. Accessed 31 Mar 2025.
Zurück zum Zitat Carbonneau, K. J., Van Orman, D. S. J., Lemberger-Truelove, M. E., & Atencio, D. J. (2020). Leveraging the power of observations: Locating the sources of error in the individualized classroom assessment scoring system. Early Education and Development, 31(1), 84–99. https://​doi.​org/​10.​1080/​10409289.​2019.​1617572CrossRef
Zurück zum Zitat Cash, A. H., Hamre, B. K., Pianta, R. C., & Myers, S. S. (2012). Rater calibration when observational assessment occurs at large scale: Degree of calibration and characteristics of raters associated with calibration. Early Childhood Research Quarterly, 27(3), 529–542. https://​doi.​org/​10.​1016/​j.​ecresq.​2011.​12.​006CrossRef
Zurück zum Zitat Castro, D. C., Gillanders, C., Prishker, N., & Rodriguez, R. (2021). A sociocultural, integrative, and interdisciplinary perspective on the development and education of young bilingual children with disabilities. In D. Castro & A. J. Artiles (Eds.), Language, learning, and disability in the education of young bilingual children (pp. 27–45). Multilingual Matters.
Zurück zum Zitat Chen-Gaddini, M.; Sussman J., Newton, E., Ruiz Jimenez, G. S., Kriener-Althen, K., Gochyyev, P., Draney, K., & Mangione, P. (2022a). DRDP technical report: Interrater reliability. WestEd.
Zurück zum Zitat Chen-Gaddini, M.; Sussman J., Newton, E., Ruiz Jimenez, G. S., Kriener-Althen, K., Gochyyev, P., Draney, K., & Mangione, P. (2022b). DRDP technical report: Validity in relation to external assessments of child development. WestEd.
Zurück zum Zitat Cui, Y., & Mousavi, A. (2015). Explore the usefulness of person-fit analysis on large-scale assessment. International Journal of Testing, 15(1), 23–49. https://​doi.​org/​10.​1080/​15305058.​2014.​977444CrossRef
Zurück zum Zitat Custers, J. W. H., Hoijtink, H., van der Net, J., & Helders, P. J. M. (2000). Cultural differences in functional status measurement: Analyses of person fit according to the Rasch model. Quality of Life Research, 9(5), 571–578. https://​doi.​org/​10.​1023/​A:​1008949108089CrossRef
Zurück zum Zitat de Ayala, R. J. (2022). The theory and practice of item response theory (Second Edition). Guilford Publications.
Zurück zum Zitat Draney, K., Sussman, J., Kriener-Althen, K., Newton, E. K., Gochyyev, P., & Mangione, P. (2021). DRDP technical report for early infancy through kindergarten: Structural validity and reliability. California Department of Education.
Zurück zum Zitat Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal ofMathematical and Statistical Psychology, 38(1), 67–86. https://​doi.​org/​10.​1111/​j.​2044-8317.​1985.​tb00817.​xCrossRef
Zurück zum Zitat DRDP Collaborative Research Group [DCRG]. (2018). Technical report for the desired results developmental profile (2015). California department of education. https://​www.​desiredresults.​us/​sites/​default/​files/​docs/​resources/​research/​DRDP2015_​Technical%20​Report_​20180920_​clean508_​0.​pdf. Accessed 31 Mar 2025.
Zurück zum Zitat Duncan, R. G., & Rivet, A. E. (2018). Learning progressions. In F. Fischer, C. Hmelo-Silver, & S. R. Goldman (Eds.), International handbook of the learning sciences (pp. 422–432). Routledge.CrossRef
Zurück zum Zitat Duschl, R., Maeng, S., & Sezen, A. (2011). Learning progressions and teaching sequences: A review and analysis. Studies in Science Education, 47(2), 123–182. https://​doi.​org/​10.​1080/​03057267.​2011.​604476CrossRef
Zurück zum Zitat Emons, W. H. M. (2009). Detection and diagnosis of person misfit from patterns of summed polytomous item scores. Applied Psychological Measurement, 33(8), 599–619. https://​doi.​org/​10.​1177/​0146621609334378​CrossRef
Zurück zum Zitat Empson, S. (2011). On the idea of learning trajectories: Promises and pitfalls. The Mathematics Enthusiast, 8(3), 571–598.CrossRef
Zurück zum Zitat Engelhard, G. (2009). Using item response theory and model—data fit to conceptualize differential item and person functioning for students with disabilities. Educational andPsychological Measurement, 69(4), 585–602. https://​doi.​org/​10.​1177/​0013164408323240​CrossRef
Zurück zum Zitat Ferrando, P. J. (2015). Assessing person fit in typical-response measures. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 128–155). Routledge.
Zurück zum Zitat Feuerstahler, L., & Wilson, M. (2021). Scale alignment in the between-item multidimensional partial credit model. Applied Psychological Measurement, 45(4), 268–282. https://​doi.​org/​10.​1177/​0146621621101310​3CrossRef
Zurück zum Zitat First 5 Center for Children’s Policy [First 5]. (2022). Improving racial equity in kindergarten readiness inventory efforts. Author. https://​first5center.​org/​publications/​improving-racial-equity-in-kindergarten-readiness-inventory-efforts. Accessed 31 Mar 2025.
Zurück zum Zitat Flanagan, D. P., & McDonough, E. M. (2022). Contemporary intellectual assessment: Theories, tests, and issues (4th Ed.) Guilford Publications
Zurück zum Zitat Flavell, J. H. (1994). Cognitive development: Past, present, and future. In R. D. Parke, P. A. Ornstein, J. J. Rieser, & C. Zahn-Waxler (Eds.), A century of developmental psychology (pp. 569–588). American Psychological Association.CrossRef
Zurück zum Zitat Garcia, E. B., Sulik, M. J., & Obradović, J. (2019). Teachers’ perceptions of students’ executive functions: Disparities by gender, ethnicity, and ELL status. Journal of Educational Psychology, 111(5), 918–931. https://​doi.​org/​10.​1037/​edu0000308CrossRef
Zurück zum Zitat Goldstein, J., & Flake, J. K. (2016). Towards a framework for the validation of early childhood assessment systems. Educational Assessment, Evaluation and Accountability, 28(3), 273–293. https://​doi.​org/​10.​1007/​s11092-015-9231-8CrossRef
Zurück zum Zitat Gotwals, A. W., & Songer, N. B. (2010). Reasoning up and down a food chain: Using an assessment framework to investigate students’ middle knowledge. Science Education, 94, 259–281.CrossRef
Zurück zum Zitat Griffin, P. (2007). The comfort of competence and the uncertainty of assessment. Studies in Educational Evaluation, 33(1), 87–99. https://​doi.​org/​10.​1016/​j.​stueduc.​2007.​01.​007CrossRef
Zurück zum Zitat Hambleton, R. K., Merenda, P. F., & Spielberger, C. D. (Eds.). (2004). Adapting educational and psychological tests for cross-cultural assessment. Psychology Press.
Zurück zum Zitat Harris, L. R., Adie, L., & Wyatt-Smith, C. (2022). Learning progression–based assessments: A systematic review of student and teacher uses. Review of Educational Research. https://​doi.​org/​10.​3102/​0034654322108155​2CrossRef
Zurück zum Zitat Harvey, H., & Ohle, K. (2018). What’s the purpose? Educators’ perceptions and use of a state-mandated kindergarten entry assessment. Education Policy Analysis Archives, 26, 142–142. https://​doi.​org/​10.​14507/​epaa.​26.​3877CrossRef
Zurück zum Zitat Heritage, M. (2008). Learning progressions: Supporting instruction and formative assessment. Council of Chief School Officers.
Zurück zum Zitat Herman, J. L., Webb, N. M., & Zuniga, S. A. (2007). Measurement issues in the alignment of standards and assessments. Applied Measurement in Education, 20(1), 101–126.
Zurück zum Zitat Heroman, C., Burts, D. C., Berke, K., & Bickart, T. S. (2010). Teaching strategies GOLD objectives for development & learning: Birth through kindergarten. Teaching Strategies.
Zurück zum Zitat Heubert, J. P., & Hauser, R. M. (1999). High stakes: Testing for tracking, promotion, and graduation. National Academy Press. https://​eric.​ed.​gov/​?​id=​ED439151. Accessed 31 Mar 2025.
Zurück zum Zitat HighScope. (2014). COR advantage. HighScope Educational Research Foundation.
Zurück zum Zitat Holcomb, T. S., Li, Z., Lambert, R., & Ferrara, A. (2024). Educator perspectives on a kindergarten entry assessment: Implementation experiences, support, and data utilization. Perspectives on early childhood psychology and education, 8(1), 76–110. https://​doi.​org/​10.​58948/​2834-8257.​1055CrossRef
Zurück zum Zitat Holland, P. W., Wainer, H., Service, E. T. (1993). Differential item functioning. Psychology Press.
Zurück zum Zitat Hoyt, W. T., & Kems, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4, 403–424.CrossRef
Zurück zum Zitat Isaacs, J. B. (2012). The school readiness of poor children. Brookings Institute.
Zurück zum Zitat Joseph, G., Soderberg, J. S., Stull, S., Cummings, K., McCutchen, D., & Han, R. J. (2020). Inter-rater reliability of Washington state’s kindergarten entry assessment. Early Education and Development, 31(5), 764–777. https://​doi.​org/​10.​1080/​10409289.​2019.​1674589CrossRef
Zurück zum Zitat Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73. https://​doi.​org/​10.​1111/​jedm.​12000CrossRef
Zurück zum Zitat Kang, H., & Furtak, E. M. (2021). Learning theory, classroom assessment, and equity. Educational Measurement: Issues and Practice, 40(3), 73–82. https://​doi.​org/​10.​1111/​emip.​12423CrossRef
Zurück zum Zitat Kim, D.-H., Lambert, R. G., Durham, S., & Burts, D. C. (2018). Examining the validity of GOLD® with 4-year-old dual language learners. Early Education and Development, 29(4), 477–493. https://​doi.​org/​10.​1080/​10409289.​2018.​1460125CrossRef
Zurück zum Zitat Kowalski, K., Brown, R. D., Pretti-Frontczak, K., Uchida, C., & Sacks, D. F. (2018). The accuracy of teachers’ judgments for assessing young children’s emerging literacy and math skills. Psychology in the Schools, 55(9), 997–1012. https://​doi.​org/​10.​1002/​pits.​22152CrossRef
Zurück zum Zitat Kriener-Althen, K., Newton, E. K., Draney, K., & Mangione, P. L. (2020). Measuring readiness for kindergarten using the desired results developmental profile. Early Education and Development, 31(5), 739–763.CrossRef
Zurück zum Zitat Kubsch, M., Czinczel, B., Lossjew, J., Wyrwich, T., Bednorz, D., Bernholt, S., Fiedler, D., Straub, S., Cress, U., Drachsler, H., Neumann, K., & Rummel, N. (2022). Toward learning progression analytics—developing learning environments for the automated analysis of learning using evidence centered design. Frontiers in Education, 7,. https://​doi.​org/​10.​3389/​feduc.​2022.​981910
Zurück zum Zitat Lambert, R. G., Kim, D.-H., & Burts, D. C. (2015). The measurement properties of the teaching strategies GOLD® assessment system. Early Childhood Research Quarterly, 33, 49–63. https://​doi.​org/​10.​1016/​j.​ecresq.​2015.​05.​004CrossRef
Zurück zum Zitat Lamprianou, I., & Boyle, B. (2004). Accuracy of measurement in the context of mathematics national curriculum tests in England for ethnic minority pupils and pupils who speak English as an additional language. Journal of Educational Measurement, 41(3), 239–259. https://​doi.​org/​10.​1111/​j.​1745-3984.​2004.​tb01164.​xCrossRef
Zurück zum Zitat Li, M. F., & Olejnik, S. (1997). The power of Rasch person–fit statistics in detecting unusual response patterns. Applied Psychological Measurement, 21(3), 215–231. https://​doi.​org/​10.​1177/​0146621697021300​2CrossRef
Zurück zum Zitat Little, M., Cohen-Vogel, L., Sadler, J., & Merrill, B. (2020). Moving kindergarten entry assessments from policy to practice: Evidence from North Carolina. Early Education and Development, 31(5), 796–815. https://​doi.​org/​10.​1080/​10409289.​2020.​1724600CrossRef
Zurück zum Zitat Lüdecke, D. (2018). ggeffects: Tidy data frames of marginal effects from regression models. Journal of Open Source Software, 3(26), 1–5. https://​joss.​theoj.​org/​papers/​10.​21105/​joss.​00772
Zurück zum Zitat Magnuson, K. A., Kelchen, R., Duncan, G. J., Schindler, H. S., Shager, H., & Yoshikawa, H. (2016). Do the effects of early childhood education programs differ by gender? A meta-analysis. Early Childhood Research Quarterly, 36, 521–536. https://​doi.​org/​10.​1016/​j.​ecresq.​2015.​12.​021CrossRef
Zurück zum Zitat Mangione, P. L., Osborne, T., Mendenhall, H. (2019). What’s next? How learning progressions help teachers support children’s development and learning. Young Children, 74(3), 20–25. https://​www.​proquest.​com/​docview/​2250997981. Accessed 31 Mar 2025.
Zurück zum Zitat Martone, A., & Sireci, S. G. (2009). Evaluating alignment between curriculum, assessment, and instruction. Review of Educational Research, 79(4), 1332–1361.CrossRef
Zurück zum Zitat Mashburn, A. J., & Henry, G. T. (2005). Assessing school readiness: Validity and bias in preschool and kindergarten teachers’ ratings. Educational Measurement: Issues and Practice, 23(4), 16–30. https://​doi.​org/​10.​1111/​j.​1745-3992.​2004.​tb00165.​xCrossRef
Zurück zum Zitat Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.CrossRef
Zurück zum Zitat Matsumoto, D., spsampsps Van de Vijver, F. J. (2012). Cross-cultural research methods. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, spsampsps K. J. Sher (Eds.), APA handbook of research methods in psychology, Vol. 1. Foundations, planning, measures, and psychometrics (pp. 85–100). American Psychological Association. https://​doi.​org/​10.​1037/​13619-006
Zurück zum Zitat McQueen, J., & Mendelovits, J. (2003). PISA reading: Cultural equivalence in a cross- cultural study. Language Testing, 20(2), 208–224. https://​doi.​org/​10.​1191/​0265532203lt253o​aCrossRef
Zurück zum Zitat Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25(2), 107–135. https://​doi.​org/​10.​1177/​0146621012203195​7CrossRef
Zurück zum Zitat Meisels, S. J., & Piker, R. A. (2001). An analysis of early literacy assessments used for instruction. CIERA Report. CIERA/University of Michigan. https://​eric.​ed.​gov/​?​id=​ED452514. Accessed 31 Mar 2025.
Zurück zum Zitat Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17(4), 297–334. https://​doi.​org/​10.​1177/​0146621693017004​01CrossRef
Zurück zum Zitat Mousavi, A., & Cui, Y. (2020). The effect of person misfit on item parameter estimation and classification accuracy: A simulation study. Education Sciences, 10, 1–15. https://​doi.​org/​10.​3390/​educsci10110324CrossRef
Zurück zum Zitat Müller, M. (2020). Item fit statistics for Rasch analysis: Can we trust them? Journal of Statistical Distributions and Applications, 7(1), 5. https://​doi.​org/​10.​1186/​s40488-020-00108-7CrossRef
Zurück zum Zitat National Education Goals Panel. (1995). The national education goals report: Building a nation of learners, 1995. US Government Printing Office. https://​files.​eric.​ed.​gov/​fulltext/​ED389097.​pdf. Accessed 31 Mar 2025.
Zurück zum Zitat National Research Council [NRC]. (2001). Eager to learn: Educating our preschoolers. The National Academies Press.
Zurück zum Zitat National Research Council [NRC]. (2008). Early childhood assessment: Why, what, and how. The National Academies Press. https://​doi.​org/​10.​17226/​12446
Zurück zum Zitat Office of Head Start. (2015). Head Start early learning outcomes framework: ages birth to five. https://​eclkc.​ohs.​acf.​hhs.​gov/​sites/​default/​files/​pdf/​elof-ohs-framework.​pdf
Zurück zum Zitat Paek, I., & Wilson, M. (2011). Formulating the Rasch differential item functioning model under the marginal maximum likelihood estimation context and its comparison with Mantel-Haenszel procedure in short test and small sample conditions. Educational and Psychological Measurement, 71(6), 1023–1046. https://​doi.​org/​10.​1177/​0013164411400734​CrossRef
Zurück zum Zitat Petridou, A., & Williams, J. (2007). Accounting for aberrant test response patterns using multilevel models. Journal of Educational Measurement, 44(3), 227–247.CrossRef
Zurück zum Zitat Piaget, J. (1941/1965). The child’s conception of number. W. W. Norton and Company.
Zurück zum Zitat R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://​www.​R-project.​org/​
Zurück zum Zitat Raudenbush, S. W., Martinez, A., Bloom, H., Zhu, P., & Lin, F. (2008). An eight-step paradigm for studying the reliability of group-level measures. Working paper, University of Chicago.
Zurück zum Zitat Ready, D. D., & Wright, D. L. (2011). Accuracy and inaccuracy in teachers’ perceptions of young children’s cognitive abilities: The role of child background and classroom context. American Educational Research Journal, 48(2), 335–360. https://​doi.​org/​10.​3102/​0002831210374874​CrossRef
Zurück zum Zitat Reise, S. P. (2000). Using multilevel logistic regression to evaluate person-fit in IRT models. Multivariate Behavioral Research, 35(4), 543–568. https://​doi.​org/​10.​1207/​S15327906MBR3504​_​06CrossRef
Zurück zum Zitat Robitzsch, A., Kiefer, T., & Wu, M. (2024). TAM: Test Analysis Modules. R package version 4.2-21, https://​CRAN.​R-project.​org/​package=​TAM
Zurück zum Zitat Roth, P. L., Le, H., Oh, I.-S., Van Iddekinge, C. H., Buster, M. A., Robbins, S. B., & Campion, M. A. (2014). Differential validity for cognitive ability tests in employment and educational settings: Not much more than range restriction? Journal of Applied Psychology, 99(1), 1–20. https://​doi.​org/​10.​1037/​a0034377CrossRef
Zurück zum Zitat Rudner, L. M., & National Center for Education Statistics (Eds.). (1995). Use of person-fit statistics in reporting and analyzing national assessment of educational progress results. U.S. Dept. of Education, Office of Educational Research and Improvement.
Rupp, A. A. (2013). A systematic review of the methodology for person fit research in item response theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling, 55(1), 3–38.
Russo, J. M., Williford, A. P., Markowitz, A. J., Vitiello, V. E., & Bassok, D. (2019). Examining the validity of a widely-used school readiness assessment: Implications for teachers and early childhood programs. Early Childhood Research Quarterly, 48, 14–25. https://doi.org/10.1016/j.ecresq.2019.02.003
Sadler, D. R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18(2), 119–144. https://doi.org/10.1007/BF00117714
Sarama, J., & Clements, D. H. (2009). Early childhood mathematics education research: Learning trajectories for young children. Routledge.
Saxe, G. B. (1988). Candy selling and math learning. Educational Researcher, 17(6), 14–21. https://doi.org/10.3102/0013189X017006014
Schafer, W. D., Wang, J., & Wang, V. (2009). Validity in action: State assessment validity evidence for compliance with NCLB. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions and applications (pp. 195–212). Information Age Publishing.
Schulz, W., & Fraillon, J. (2011). The analysis of measurement equivalence in international studies using the Rasch model. Educational Research and Evaluation, 17(6), 447–464. https://doi.org/10.1080/13803611.2011.630559
Şengül Avşar, A. (2019). Comparison of person-fit statistics for polytomous items in different test conditions. Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi, 10(4), 348–364. https://doi.org/10.21031/epod.525647
Şengül Avşar, A., & Emons, W. H. M. (2021). A cross-cultural comparison of non-cognitive outputs towards science between Turkish and Dutch students taking into account detected person misfit. Studies in Educational Evaluation, 70, 101053. https://doi.org/10.1016/j.stueduc.2021.101053
Shepard, L., Kagan, S. L., & Wurtz, E. (1998). Principles and recommendations for early childhood assessments. National Academies Press. https://govinfo.library.unt.edu/negp/reports/prinrec.pdf. Accessed 31 Mar 2025.
Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology, 8(1), 1–11. https://doi.org/10.1186/1471-2288-8-33
Smith, R. M., & Plackner, C. (2009). The family approach to assessing fit in Rasch measurement. Journal of Applied Measurement, 10(4), 424–437.
Steedle, J. T., & Shavelson, R. J. (2009). Supporting valid interpretations of learning progression level diagnoses. Journal of Research in Science Teaching, 46(6), 699–715. https://doi.org/10.1002/tea.20308
Sussman, J., Draney, K., & Wilson, M. (2023). Language and literacy trajectories for dual language learners (DLLs) with different home languages: Linguistic distance and implications for practice. Journal of Educational Psychology, 115(6), 891–910.
Thurstone, L. L. (1937). Psychology as a quantitative rational science. Science, 85(2201), 227–232. https://doi.org/10.1126/science.85.2201.227
Turner, K. T., & Engelhard, G., Jr. (2023). Functional data analysis and person response functions. Measurement: Interdisciplinary Research and Perspectives, 21(3), 129–146. https://doi.org/10.1080/15366367.2022.2054130
Van den Noortgate, W., & De Boeck, P. (2005). Assessing and explaining differential item functioning using logistic mixed models. Journal of Educational and Behavioral Statistics, 30(4), 443–464. https://doi.org/10.3102/10769986030004443
Van der Flier, H. (1983). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology, 13(3), 267–298.
Vygotsky, L. S. (1978). Mind in society. Harvard University Press.
Wakabayashi, T., Claxton, J., & Smith, E. V., Jr. (2019). Validation of a revised observation-based assessment tool for children birth through kindergarten: The COR advantage. Journal of Psychoeducational Assessment, 37(1), 69–90.
Walker, A. A. (2017). Why education practitioners and stakeholders should care about person fit in educational assessments. Harvard Educational Review, 87(3), 426–444. https://doi.org/10.17763/1943-5045-87.3.426
Walker, A. A., Jennings, J. K., & Engelhard, G., Jr. (2018). Using person response functions to investigate areas of person misfit related to item characteristics. Educational Assessment, 23(1), 47–68. https://doi.org/10.1080/10627197.2017.1415143
Waterman, C., McDermott, P. A., Fantuzzo, J. W., & Gadsden, V. L. (2012). The matter of assessor variance in early childhood education—Or whose score is it anyway? Early Childhood Research Quarterly, 27(1), 46–54. https://doi.org/10.1016/j.ecresq.2011.06.003
WestEd. (2018). DRDP (2015) research summaries. Report prepared for the California Department of Education. https://www.desiredresults.us/research-summaries-drdp-2015-domain
Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46(6), 716–730. https://doi.org/10.1002/tea.20318
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. MESA Press.
Wu, M., & Adams, R. J. (2013). Properties of Rasch residual fit statistics. Journal of Applied Measurement, 14(4), 339–355.
Wu, X., Zhang, Y., Wu, R., & Chang, H.-H. (2021). A comparative study on cognitive diagnostic assessment of mathematical key competencies and learning trajectories: PISA data analysis based on 19,454 students from 8 countries. Current Psychology. https://doi.org/10.1007/s12144-020-01230-0
Yow, W. Q., & Markman, E. M. (2011). Young bilingual children’s heightened sensitivity to referential cues. Journal of Cognition and Development, 12(1), 12–31. https://doi.org/10.1080/15248372.2011.539524
Yun, C., Melnick, H., & Wechsler, M. (2021). High-quality early childhood assessment: Learning from states’ use of kindergarten entry assessments. Learning Policy Institute.