
Learning and Instruction

Volume 28, December 2013, Pages 1-11

Construct validity of student perceptions of instructional quality is high, but not perfect: Dimensionality and generalizability of domain-independent assessments

https://doi.org/10.1016/j.learninstruc.2013.03.003

Highlights

  • Multidimensionality of students' instructional quality ratings was confirmed.

  • Dimensionality was identical across two subjects (English and German lessons).

  • The postulated model was supported by a two-level confirmatory factor analysis.

  • Measurement properties of some scales were comparable across subjects and classes.

  • The use of ratings for teacher comparisons is more restricted for other scales.

Abstract

In educational research, student ratings are frequently used to assess aspects of instructional quality. The present study investigates two key aspects of the construct validity of student ratings in domain-independent assessments: (1) the dimensionality of the ratings and (2) their generalizability across classes and two subjects (English and German lessons). A large, representative sample of N = 6909 ninth grade students from 280 classes was used, and a structural model postulating five central dimensions of instructional quality (structure, classroom management, understandableness, motivation, student involvement) was tested by means of a two-level confirmatory factor analysis. The five-factor model adequately described the structure of the student ratings. In terms of generalizability, the assumption of equal measurement properties of student ratings across classes held for some of the investigated dimensions (structure and classroom management), but not for others (understandableness, motivation, and student involvement). Finally, possible explanations for these differences are discussed.

Introduction

The quantity and quality of instruction are important components in all comprehensive frameworks of school effectiveness. In their influential meta-analysis, Wang, Haertel, and Walberg (1993) demonstrated that the quality of teachers' instruction is a very important precondition of students' learning, and recent studies emphasize that the impact of teachers' instruction might be even stronger relative to other determinants of the academic learning process than had previously been assumed (Hattie, 2009; Helmke, 2010; Scheerens & Bosker, 1997; Seidel & Shavelson, 2007).

One widely used approach for assessing instructional quality is student questionnaires in which students are asked to rate several aspects of teachers' instructional quality. From a conceptual point of view, the use of student ratings has the advantage that ratings of the same environment can be obtained from many or even all students in each class. Furthermore, these student perceptions are typically based on a large amount of experience within the particular class or school (e.g., den Brok, Brekelmans, & Wubbels, 2006). From a more practical point of view, student questionnaires are a relatively easily implemented and low-cost means of obtaining data from large groups of respondents, and are thus often the preferred method for assessing instructional quality in educational research (Kunter et al., 2008; Mainhard, Brekelmans, & Wubbels, 2011; Sälzer, Trautwein, Lüdtke, & Stamm, 2012).

However, because student ratings of teachers' instructional quality are an increasingly accepted assessment method, validity issues concerning these student ratings become all the more critical (see Marsh et al., 2012). Unfortunately, only limited empirical evidence is available with regard to the key aspects of construct validity in students' instructional quality ratings. In the current context, two questions are important: (1) Can students really differentiate between different dimensions of instructional quality? (2) Can student ratings of instructional quality be generalized across different subjects and different classes?

Addressing these questions concerning student ratings is complicated by the multilevel structure inherent in student ratings of instructional quality. As students are nested within classes, student ratings of instructional quality need to be analyzed at two distinct levels (Cronbach, 1976; Lüdtke, Robitzsch, Trautwein, & Kunter, 2009). The first level is that of individual students (also called the student level or within level); the second level is that of classes or schools (often referred to as the class level or between level).
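To make the nesting explicit, the two-level structure can be written in the standard variance-decomposition form used in the multilevel literature (a textbook formulation, not an equation taken from the article): the rating of student $i$ in class $j$ splits into a class component and an individual deviation, and the covariance matrix of the items splits accordingly:

$$ y_{ij} = \mu + u_j + \varepsilon_{ij}, \qquad \boldsymbol{\Sigma}_{\mathrm{total}} = \boldsymbol{\Sigma}_{\mathrm{between}} + \boldsymbol{\Sigma}_{\mathrm{within}}, $$

where $u_j$ captures class-level (shared) variation and $\varepsilon_{ij}$ captures student-level (idiosyncratic) variation. Analyses at the two levels operate on the two components of this decomposition.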

In the present study, we use multilevel confirmatory factor analyses to analyze students' perceptions of five central dimensions of instructional quality for two subjects (English and German) in a large sample of students in secondary education. In doing so, the aim of the article is twofold. First, based on a strong data set and a solid instrument, our analyses shed light on the prospects and limitations of student ratings of instructional quality in general. Second, because questions concerning the dimensionality and generalizability of student ratings of instructional quality arise in many research projects (and need to be tested for the specific instruments used in these studies), the present article describes in some detail the steps involved in addressing these validity issues.

For the past 35 years, research has focused on the characteristics and practices of teachers who appear to be successful in their teaching. From a conceptual point of view, teachers' instructional practices refer to variables that are located at the class level. An extensive and complex literature exists concerning the organization of classes and what teachers do in their classes (Hattie, 2009; Helmke, 2010; Muijs & Reynolds, 2007). Research has shown that close monitoring, adequate pacing, and classroom management, as well as clarity of presentation, comprehensiveness of instruction, and a good classroom climate, reflect the key aspects of teachers' instructional quality (Hattie, 2009; Muijs & Reynolds, 2007). Prominent theoretical models of instructional quality that summarize its most important aspects have been developed by Klieme, Schümer, and Knoll (2001) and Pianta, La Paro, and Hamre (2008). Both models deal with domain-independent instructional quality and consistently organize teachers' instructional behavior into three major domains: classroom organization (which includes the effective treatment of interruptions), student orientation (including a supportive climate and individualized instruction), and cognitive activation (including the use of deep content, higher-order thinking tasks, and other demanding activities).

Empirical studies assessing different dimensions of teaching quality typically draw on three data sources (Anderson, 1982): external observers, teachers, and/or students. Although it has been argued that each of these perspectives has its specific strengths (Clausen, 2002; Turner & Meyer, 2000), student ratings have the conceptual advantage that they are typically based on a large amount of experience within the particular class, which should contribute to a more accurate picture of teachers' instructional practices. Here, all students within the same class rate the same teacher's instructional practices, and their shared perceptions reflect the degree of agreement with regard to those practices. However, the crucial question remains as to whether students are actually able to differentiate between the different dimensions of instructional quality at the class level.

In particular in the field of higher education, there is a longstanding debate as to whether students perceive teaching behavior as uni- or multidimensional (e.g., D'Apollonia & Abrami, 1997; Marsh & Roche, 1997). Surprisingly, this debate has been far less intense in the field of middle and secondary schooling. Researchers and practitioners in this field (Helmke, 2010; Klieme et al., 2001) agree that teaching is a complex activity consisting of multiple dimensions (e.g., clarity, teachers' interactions with students, organization, enthusiasm). At the same time, the extent to which students can describe teachers' instructional practices in a differentiated way is only partly understood and therefore reflects an important empirical question.

To address the question of dimensionality, confirmatory factor analysis is one of the dominant statistical approaches (see Marsh et al., 2012). In confirmatory factor analysis, the fit of a hypothesized measurement model—in which associations between latent variables and their indicators are specified—is tested. Thus, factor analysis is also “intimately involved with questions of validity” (Nunnally, 1978, pp. 112–113). The process of determining whether the a priori factor structure fits the data (based on factor loadings and other parameter estimates as well as goodness-of-fit indices) answers the question asked by construct validity: Do the measures adequately assess the construct they purport to assess (see Cronbach & Meehl, 1955; Messick, 1995; Nunnally, 1978)? Factor analytic approaches are also a widely used means of establishing the construct validity of student ratings; in the field of university research, the studies of Marsh and Roche (1997) and Abrami, D'Apollonia, and Cohen (1990) are classic examples. In this line of research, single-level psychometric analyses (i.e., analyses based on scores aggregated at the class level) are typically used to determine the dimensionality of student ratings (Marsh et al., 2012). However, and most important in the present context, such single-level analyses do not take into account that the factor structures and their psychometric properties (e.g., reliability) at the student level and the class level may not be the same. This point was also made clear in the classic 1976 article by Cronbach, who noted that studying individual differences in the perceptions of different students within the same class might be interesting but is unrelated to constructs at the class level. To date, few studies have investigated student ratings of instructional quality in secondary education at the student level and the class level simultaneously within a factor analytic framework (e.g., Kunter et al., 2008). Kunter et al. (2008) used a sample of ninth grade students to demonstrate the empirical separability of three central theoretical dimensions of perceived instructional quality in math lessons—namely monitoring, cognitive challenge, and support for students—which also correspond to the three basic dimensions of instructional quality identified by Klieme et al. (2001) and Pianta et al. (2008). The goodness-of-fit indices (e.g., χ2, CFI, RMSEA, and SRMR) showed that the empirical data were in line with the three-factor model at both levels of analysis. The intercorrelations of the factors were well below 1.0, indicating the empirical separability of the factors (0.66 ≤ r ≤ 0.70 at the classroom level). In addition, student ratings correlated with corresponding teacher self-ratings (0.21 ≤ r ≤ 0.31), but they showed consistently lower associations with non-corresponding teacher ratings (−0.08 ≤ r ≤ 0.23). These results support the notion that student ratings of instructional quality can have high construct validity at both the individual and class levels. In the present article, we argue that the consideration of factor analytic results at both levels of analysis will also provide further insight into a second important aspect of construct validity: the generalizability of student ratings across different contexts.
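For reference, the hypothesized measurement model of a conventional (single-level) CFA and its model-implied covariance structure take the standard textbook form (not the article's own notation; $\boldsymbol{\Lambda}$ denotes the factor loadings, $\boldsymbol{\Psi}$ the factor covariances, and $\boldsymbol{\Theta}$ the residual covariances):

$$ \mathbf{y} = \boldsymbol{\nu} + \boldsymbol{\Lambda}\boldsymbol{\eta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\Sigma}(\boldsymbol{\theta}) = \boldsymbol{\Lambda}\boldsymbol{\Psi}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta}. $$

Goodness-of-fit indices such as $\chi^2$, CFI, RMSEA, and SRMR quantify the discrepancy between the sample covariance matrix $\mathbf{S}$ and the model-implied matrix $\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})$.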

However, the separability of instructional quality dimensions as shown in Kunter et al. (2008) is only a first necessary condition for establishing the construct validity of student ratings. As already made clear by Cronbach and Meehl (1955), the construct validity of a measure is established by demonstrating its place in a nomological net of consistent, related empirical findings. The question as to whether the meaning and interpretation of measures and their use remain the same—are generalizable—across persons, groups, and contexts constitutes a further important issue (Cronbach & Meehl, 1955; Messick, 1995).

Psychological research often compares psychological variables in different contexts (e.g., grade-level cohorts, school tracks, or gender subgroups). These comparisons typically assume that the instrument measures the same psychological construct in all of these different contexts. Despite its appeal, this assumption is often not justified and needs to be tested (Bejar & Doyle, 1981). Testing for the generalizability of measures has thus become an important issue in recent years (Cheung & Rensvold, 1999; Marsh et al., 2010; Nitsche, Dickhäuser, Fasching, & Dresel, 2011).

The generalizability of measurement properties is also of considerable practical and theoretical interest in the field of student ratings of instructional quality. Practically, student ratings are often interpreted comparatively or normatively, and a meaningful interpretation of the ratings requires that they measure the same thing in the different contexts in which they are compared (Bejar & Doyle, 1981; Clausen, 2002; Klieme & Rakoczy, 2008). Theoretically, the generalizability has implications for the study of differences across different school subjects (e.g., English, mathematics, sciences) and for the study of differences across different classes. The generalizability across different school subjects refers to the question of whether instructional quality constructs such as classroom organization or teachers' support are the same and allow a meaningful comparison across different subjects (Klieme & Rakoczy, 2008).

The same is true for the comparability of students' instructional quality ratings within one and the same subject across different classes. From a practical point of view, assessments of instructional quality are sometimes used for ranking teachers or schools and for interventions (e.g., to justify the necessity of training for specific teachers). Obviously, one prerequisite for such uses is the generalizability of student ratings of instructional quality across all of the surveyed classes (Bejar & Doyle, 1981).

Using student ratings of teachers' instructional quality has become increasingly accepted in research on learning and instruction. However, the question arises as to whether student ratings can be viewed as valid measures of instructional quality. In the present study, two key aspects of validity (i.e., the dimensionality and generalizability of student ratings) are addressed. The data were derived from a large-scale educational assessment in Germany (the DESI study) in which students rated the quality of their L1 (German) and L2 (English) instruction on a questionnaire covering five theoretical dimensions (i.e., structure, motivation, understandableness, student involvement, and classroom management). All scales dealt with domain-independent instructional quality, covering two of the three basic dimensions of instructional quality (Klieme et al., 2001; Pianta et al., 2008): classroom organization (structure and classroom management) and student orientation (motivation, student involvement, and understandableness).

Given the hierarchical data structure, we used a multilevel confirmatory factor analysis (MCFA) to investigate student ratings of instructional quality. An MCFA not only appropriately accounts for the nested data structure (i.e., students within classes) but also has the potential to provide new insights into key validity aspects such as the dimensionality of students' ratings and their generalizability across different subjects and classes.
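Combining the two formulations above, an MCFA specifies a separate measurement model for each component of the decomposed covariance structure (standard multilevel SEM notation, not drawn from the article):

$$ \boldsymbol{\Sigma}_{\mathrm{within}} = \boldsymbol{\Lambda}_W \boldsymbol{\Psi}_W \boldsymbol{\Lambda}_W^{\top} + \boldsymbol{\Theta}_W, \qquad \boldsymbol{\Sigma}_{\mathrm{between}} = \boldsymbol{\Lambda}_B \boldsymbol{\Psi}_B \boldsymbol{\Lambda}_B^{\top} + \boldsymbol{\Theta}_B, $$

so that the postulated five-factor structure can be tested, and may differ, at the student level and the class level.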

We first expected that students would be able to differentiate between the above-mentioned central theoretical dimensions of instructional quality in German and English lessons. We hypothesized that a five-factor measurement model (i.e., structure, motivation, understandableness, student involvement, and classroom management) would adequately describe the student ratings of instructional quality at the student and class levels (hypothesis 1). Second, with regard to the generalizability of student ratings, we investigated whether comparable inferences about the meaning of student ratings could be made across school classes. Although it is commonly assumed that this is the case, there is not much empirical support for this assumption. Even if measurement invariance is found for some constructs, one always has to examine whether this assumption is met for other constructs as well. Specifically, we hypothesized that the measurement properties of the instrument could—at least for some of the constructs—be influenced by idiosyncratic interpretations of the items by the specific group of raters (i.e., the school class) who judge the instructional quality in their classroom (hypothesis 2). This means that systematic differences between student ratings across classrooms may not be completely attributable to true differences in teachers' instructional quality but might also be related to systematic student differences between classrooms (e.g., the proportion of boys, levels of abilities and competencies, or the proportion of students from minority groups). In such a case, the assumption of measurement invariance would not hold, which would make the meaningful interpretation of student ratings across classrooms more difficult. In particular, constructs such as understandableness, motivation, and student involvement might be more affected by classroom composition (e.g., prior achievement), while indicators of structure and classroom management should be less “sensitive” to such rater characteristics.

Third, as in the case of the generalizability across classes, we investigated measurement invariance across English and German lessons. Again, we expected that the assumption of measurement invariance could be violated for those constructs that might be particularly affected by inter-individual differences across students (e.g., understandableness, motivation, and student involvement), at the within-class level (hypothesis 3a), the between-class level (hypothesis 3b), and both levels combined (hypothesis 3c).

Fourth, the generalizability of the measurement properties of the instrument across classes and subjects was investigated simultaneously (hypothesis 4). We assumed that in this combined model—which provides a rigorous test of the general applicability of the instrument for each single construct in the assessment of teaching quality—all constructs would show psychometric properties similar to those of the classroom model (hypothesis 2) and the subject model (hypothesis 3).

Section snippets

Sample

This study is based on data from a large, nationally representative educational assessment (DESI, Deutsch Englisch Schülerleistungen International; German English Student Proficiency International) conducted by the German Institute for International Educational Research (DIPF) in Frankfurt am Main, Germany. It took place in the 2003/2004 school year. A total of 10,632 ninth grade students from 219 schools were assessed at the end of the school year.

In Germany, a system of stable class …

Statistical modeling

A key procedure for collecting evidence for the comparability of measurement properties is referred to as testing for factorial invariance within multiple-group factor analysis (e.g., Nitsche et al., 2011). This involves estimating a factor model simultaneously in various contexts (e.g., grade-level cohorts, school tracks, or gender subgroups). Further restrictions are then placed on the multiple-group solution to determine whether an equivalent measurement model holds in all of these contexts. …
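Expressed formally, the nested sequence of restrictions runs as follows (a standard presentation of invariance testing, with $g$ indexing the groups, e.g., subjects or classes; not notation from the article):

$$ \text{configural: } \boldsymbol{\Lambda}^{(g)},\ \boldsymbol{\nu}^{(g)} \text{ free}; \qquad \text{metric: } \boldsymbol{\Lambda}^{(g)} = \boldsymbol{\Lambda}; \qquad \text{scalar: } \boldsymbol{\Lambda}^{(g)} = \boldsymbol{\Lambda},\ \boldsymbol{\nu}^{(g)} = \boldsymbol{\nu}. $$

Each more restricted model is compared with its predecessor, for instance via a (scaled) $\Delta\chi^2$ difference test or changes in descriptive fit indices; a substantial deterioration in fit indicates that the equality constraints, and hence measurement invariance, do not hold.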

Students' shared perceptions of teachers' instructional quality: intraclass correlations

A necessary precondition for MCFA is sufficient variability of the indicators at both levels, particularly the class level. Inspection of the intraclass correlations (ICCs; i.e., the proportion of between-level variance in the total variance of a given variable; Table 1) showed that a rather large share of the variance in the indicators was due to shared perceptions, ranging from 10% to 27% in the two subjects (German and English). This is in line with previous research (Kunter et al., 2008; …)
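For a single indicator, the ICC used here is simply the between-class share of the total variance, $\mathrm{ICC} = \sigma^2_B / (\sigma^2_B + \sigma^2_W)$. As an illustration only, such an ICC can be estimated from a random-intercept model; the following sketch uses simulated data and hypothetical variable names (rating, class_id), not the DESI analysis code:

```python
# Minimal sketch: estimate the ICC of a single rating item from a
# random-intercept model. Data, column names, and the target ICC of
# 0.20 are simulated/hypothetical, not taken from the DESI study.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# 280 classes of 25 students; between-class SD 0.5, within-class SD 1.0,
# so the true ICC is 0.25 / (0.25 + 1.0) = 0.20.
n_classes, n_students = 280, 25
class_effects = rng.normal(0.0, 0.5, size=n_classes)
df = pd.DataFrame({
    "class_id": np.repeat(np.arange(n_classes), n_students),
    "rating": (np.repeat(class_effects, n_students)
               + rng.normal(0.0, 1.0, size=n_classes * n_students)),
})

# Random-intercept model: rating_ij = mu + u_j + e_ij
result = smf.mixedlm("rating ~ 1", df, groups=df["class_id"]).fit()

var_between = result.cov_re.iloc[0, 0]  # variance of class intercepts u_j
var_within = result.scale               # residual (student-level) variance
print(f"estimated ICC = {var_between / (var_between + var_within):.3f}")
```

An ICC in the 0.10–0.27 range reported above would indicate, in the same way, that a non-trivial share of the rating variance is shared within classes.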

Discussion

The present study investigated the dimensionality of ratings at the student and class levels and their generalizability across classes and two school subjects as two key aspects of construct validity in student ratings. The results supported the ability of students to describe their teachers' instructional practices in terms of five theoretically driven constructs of instruction in two subjects, as expected (hypothesis 1). In terms of generalizability, the assumption of equal measurement …

References (51)

  • K.A. Bollen (1989). Structural equations with latent variables.

  • N.M. Bradburn et al. (1987). Answering autobiographical questions: the impact of memory and inference on surveys. Science.

  • P. den Brok et al. (2006). Multilevel issues in research using students' perceptions of learning environments: the case of the questionnaire on teacher interaction. Learning Environments Research.

  • F.B. Bryant et al. (2012). Principles and practice of scaled difference chi-square testing. Structural Equation Modeling: A Multidisciplinary Journal.

  • M. Clausen (2002). Qualität von Unterricht: Eine Frage der Perspektive?

  • L.J. Cronbach (1976). Research on classrooms and schools: Formulation of questions, design and analysis.

  • L.J. Cronbach et al. (1955). Construct validity in psychological tests. Psychological Bulletin.

  • S. D'Apollonia et al. (1997). Navigating student ratings of instruction. American Psychologist.

  • S.A. Fisicaro et al. (1990). Implications of three causal models for the measurement of halo error. Applied Psychological Measurement.

  • J. Hattie (2009). Visible learning: A synthesis of meta-analyses relating to achievement.

  • A. Helmke (2010). Unterrichtsqualität und Lehrerprofessionalität: Diagnose, Evaluation und Verbesserung des Unterrichts.

  • J.L. Horn et al. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research.

  • J. Hox (2010). Multilevel analysis: Techniques and applications.

  • J. Kennedy et al. (1976). Overcoming some impediments to the study of teacher effectiveness. Journal of Teacher Education.

  • E. Klieme et al. Alltagspraxis, Qualität und Wirksamkeit des Deutschunterrichts.