1 Introduction
2 A conceptualization of teaching and its measurement
Type of rater error | Definition | Complications arising from this rater error |
---|---|---|
Shared bias | All raters systematically make the same error when scoring. For example, confusion during training may lead all raters to misunderstand some rubric dimensions. Alternatively, shared prior experiences in classrooms may lead to scores based on local understandings of teaching quality rather than the understanding of teaching quality embedded in the rubric | • This error can systematically bias estimates of the reliability and concurrent validity of observation scores • This error is invisible to inter-rater agreement statistics |
Rater bias | One rater systematically makes the same error when scoring specific types of instruction. For example, a misunderstanding of the rubric may lead a rater to systematically inflate scores for instruction containing independent seatwork | • This error can systematically bias estimates of the reliability and concurrent validity of observation scores • Unless it is specifically tested for across the features of instruction that trigger it, rater bias is likely to be misinterpreted as random error |
Rater random error | One rater systematically makes the same error when scoring all instances of instruction (i.e., rater leniency/severity). For example, a misunderstanding of the thresholds between score categories may lead a rater to systematically score all instruction higher than it deserves | • Assuming raters are randomly assigned, this error introduces random error into observation scores |
Random error | One rater non-systematically makes errors when scoring. For example, a rater may mishear or not hear something said by a student. Here, non-systematic implies that raters may not make the same mistake if they were to score the lesson again, whereas systematic implies that raters are likely to make the same mistake in every viewing of the lesson | • Reduces statistical power to study teaching quality |
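To make these distinctions concrete, the sketch below simulates how each error type enters an observation score under random rater assignment. It is illustrative only: the design, variable names, and parameter values (e.g., the seatwork flag and the -0.3 shared bias) are hypothetical choices, not quantities from this study.

```python
import numpy as np

# Minimal simulation of the four rater-error types in the table above.
# All parameter values are hypothetical illustrations.
rng = np.random.default_rng(0)
n_teachers, n_lessons, n_raters = 50, 4, 10

true_quality = rng.normal(0.0, 0.7, n_teachers)   # the signal the rubric targets
shared_bias = -0.3                                # all raters misread a dimension the same way
leniency = rng.normal(0.0, 0.25, n_raters)        # rater random error (leniency/severity)
seatwork_bias = rng.normal(0.0, 0.3, n_raters)    # rater bias tied to one type of instruction

scores = np.empty((n_teachers, n_lessons))
for t in range(n_teachers):
    for l in range(n_lessons):
        r = rng.integers(n_raters)            # random rater assignment
        seatwork = rng.random() < 0.4         # does this lesson contain independent seatwork?
        scores[t, l] = (
            true_quality[t]
            + shared_bias                             # identical for every rater and lesson
            + leniency[r]                             # systematic per rater, all lessons
            + (seatwork_bias[r] if seatwork else 0)   # systematic per rater, specific lessons
            + rng.normal(0.0, 0.4)                    # non-systematic (random) error
        )

# Because the shared bias shifts all raters together, two raters viewing the
# same lesson still agree: inter-rater agreement statistics cannot detect it.
print(f"mean score {scores.mean():.2f} vs mean true quality {true_quality.mean():.2f}")
```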
2.1 Different uses of scores involve different interpretations of what scores represent
2.2 Observation scores as representations of specific understandings of teaching quality
2.3 Definitions
3 Multiple uses of observation scores
3.1 Communicating a standard of practice and providing feedback relative to this standard
Specified use | Sources of signal | Sources of error | Sources of bias | Notes on use |
---|---|---|---|---|
Communicating a standard of practice and providing feedback relative to this standard | The variation associated with the quality of enacted instruction as interpreted through the ObsSys’s lens | The variation associated with non-systematic rater error | The variation associated with systematic rater errors | We assume that the goal is to provide feedback on the ObsSys’s operationalized understanding of teaching quality |
Identifying teachers for PD | The variation associated with teachers’ knowledge or skills as interpreted through the ObsSys’s lens | The variation associated with the instructional context and non-systematic rater error | The variation associated with teachers’ choice of instructional practice and systematic rater errors | We assume PD is focused on building teachers’ capacity to enact instruction |
Ensuring equitable access to teaching quality | The variation associated with average enacted instruction at the student sub-group level as interpreted through the ObsSys’s lens, including variation driven by differences in the enacted curriculum across sub-groups | The variation associated with (1) biases in sampling the instructional context; (2) sampling of lessons, classrooms, and teachers in general; and (3) non-systematic rater error | The variation associated with systematic rater errors | |
Making employment decisions | The variation associated with teachers’ capacities as interpreted through the ObsSys’s lens | The variation associated with (1) the instructional context and (2) non-systematic rater error | The variation determined by (1) the students being taught and (2) systematic rater errors | We assume a system-level effort to identify the teachers with the highest overall level of capacity (e.g., skills and knowledge) |
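A useful way to read each row is through a generalizability-theory lens (anticipating Section 4.2): averaging over more sampled lessons or raters shrinks the error column but not the bias column. One way to formalize this, under an assumed teacher-by-lesson-by-rater design with $n_{\ell}$ lessons and $n_{r}$ raters per teacher (notation ours, not the paper's), is a decision-study coefficient of the form:

$$
E\rho^{2} = \frac{\sigma^{2}_{\mathrm{signal}}}{\sigma^{2}_{\mathrm{signal}} + \frac{\sigma^{2}_{\mathrm{lesson}}}{n_{\ell}} + \frac{\sigma^{2}_{\mathrm{rater}}}{n_{r}} + \frac{\sigma^{2}_{\mathrm{residual}}}{n_{\ell}\, n_{r}} + \sigma^{2}_{\mathrm{bias}}}
$$

The bias term enters with no sample-size divisor, which is the formal sense in which systematic rater errors cannot be averaged away by scoring more lessons or adding raters.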
3.2 Identifying teachers for professional development (PD)
3.3 Ensuring equitable access
3.4 Employment decisions: hiring, bonuses, tenure, etc.
3.5 Section summary
4 Methods
4.1 Understanding Teacher Quality (UTQ)
4.2 Augmented generalizability theory (GT) models
5 Results
5.1 Variance decomposition of scores
Obs. system | Facet (type of rater error) | Variance: full data | Variance: calibration data | % instructional context | % only student | % only teacher | % both student and teacher | % residual (not associated with any factor) |
---|---|---|---|---|---|---|---|---|
CLASS | School | 0.032 | | 13% | 41% | 0% | 19% | 28% |
CLASS | Teacher | 0.045 | | 18% | 5% | 2% | 2% | 73% |
CLASS | Classroom | 0.007 | | 0% | 0% | 14% | 0% | 86% |
CLASS | Lesson | 0.041 | | 39% | 2% | 0% | 0% | 59% |
CLASS | Rater (rater random error, or leniency) | 0.061 | 0.015 | 11% | 0% | 0% | 0% | 89% |
CLASS | Residual (random error) | 0.158 | 0.022 | 0% | 0% | 0% | 0% | 100% |
CLASS | Rater shared bias (lesson and segment facets) | 0.042 | | | | | | |
CLASS | Rater-by-lesson (rater bias) | 0.081 | | | | | | |
FFT | School | 0.011 | | 10% | 40% | 0% | 20% | 30% |
FFT | Teacher | 0.018 | | 11% | 11% | 0% | 5% | 74% |
FFT | Classroom | 0.002 | | 100% | 0% | 0% | 0% | 0% |
FFT | Lesson | 0.008 | | 25% | 0% | 0% | 0% | 75% |
FFT | Rater (rater random error, or leniency) | 0.012 | 0.006 | 0% | 0% | 0% | 0% | 100% |
FFT | Residual (random error) | 0.061 | 0.040 | 0% | 0% | 0% | 0% | 100% |
FFT | Rater shared bias (lesson and segment facets) | 0.062 | | | | | | |

Note: The five percentage columns decompose each facet's variance in the full-data model across the proxy measures.
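As a quick check on relative magnitudes, the snippet below sums the CLASS full-data variance components from the table and expresses each facet as a share of the total. This assumes the listed facets exhaust the decomposition; the input values are copied directly from the table, and the aggregation is ours.

```python
# Shares of total CLASS score variance implied by the "full data" column
# above (assumes the listed facets exhaust the decomposition).
class_var = {
    "school": 0.032,
    "teacher": 0.045,
    "classroom": 0.007,
    "lesson": 0.041,
    "rater (leniency)": 0.061,
    "residual (random error)": 0.158,
    "rater shared bias": 0.042,
    "rater-by-lesson (rater bias)": 0.081,
}
total = sum(class_var.values())  # 0.467
for facet, v in class_var.items():
    print(f"{facet:32s} {v / total:6.1%}")

# All rater-related components combined (leniency + shared bias + rater bias)
# come to 0.184, roughly 39% of the total, versus about 10% for the teacher facet.
rater_related = (class_var["rater (leniency)"]
                 + class_var["rater shared bias"]
                 + class_var["rater-by-lesson (rater bias)"])
print(f"all rater-related components: {rater_related / total:.1%}")
```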