Seven years ago, the Every Student Succeeds Act (ESSA) went into effect, giving states more flexibility to create individual education accountability plans that address their specific needs. ESSA replaced the No Child Left Behind Act, which took effect in 2002 and required states to implement educational accountability systems. Accountability systems are intended to “close achievement gaps, increase equity, improve the quality of instruction, and increase outcomes for all students,” according to the U.S. Department of Education. It is time to evaluate progress toward those goals and to examine the effectiveness of state accountability systems.
When it comes to evaluating and validating assessments, including those used within statewide education accountability systems, the measurement community does many things well: We scrutinize the development of test blueprints and test items. We examine how effectively items function individually and how they behave as a group to generate estimates of students’ ability in reading, math, science, and other content areas. We evaluate test alignment to standards, and how accessible tests are for all students. We compare results of the assessments across student groups to seek out and address potential biases.
Despite these laudable efforts, gaps remain. Most critically, few resources are devoted to identifying data-driven ways to ensure the correct schools and districts are identified by the accountability system, and even fewer resources are spent determining whether the accountability system is improving students’ educational outcomes. However, this type of outcomes-focused research is exactly what is needed to support well-designed interventions to assist students in the identified schools and districts.
Ultimately, the purpose of state accountability systems is to identify schools and districts where assistance is most needed so that the state can allocate very targeted—and often very limited—resources. Recent state evaluations conducted by HumRRO can help identify best practices that should be used when gauging the effectiveness of an accountability system.
Write laws with evaluation and measurement in mind.
Through legislation, states establish their own requirements for accountability systems. When effective, these requirements include metrics, methods, and even goals. For example, some states require that the accountability system account for student-level academic growth. They may specify the components that are included in the accountability computation and the weights associated with each component. States also include rules regarding participation, reporting, or exceptions for small or nontraditional schools. States establish guidelines for determining which schools and districts receive additional assistance, and they define the nature of the assistance provided. When legislation is written with this in mind, ideally with the help of measurement experts, determining whether an accountability system meets its legislated goals becomes clear and transparent.
Thoroughly define key terms.
Legislation is often written using such terms as accurate, valid, reliable, diagnostic, informative, precise, predictive, on grade level, and college/career ready. One key challenge for an evaluation of the system is to define the terms from the legislation in ways that can be investigated and/or rated. The legislation should define these terms in the context of assessment and accountability. Established parameters or acceptability thresholds for specific indicators should be informed by the uses of and inferences drawn from the scores.
Evaluators should work with the state department of education and other agencies to define context. Assessment systems are designed to meet legal requirements, but they do so in varied ways. For example, if the law requires a measure of “college/career readiness,” a state may meet that mandate by using a college entrance exam. The same requirement in another state may include a different indicator of workforce readiness, such as Career and Technical Education (CTE) certification, Advanced Placement (AP) or dual credit course completion, or an acceptable score on the Armed Services Vocational Aptitude Battery (ASVAB).
Any evaluation of the adherence of an accountability system to the associated state statute should begin with a clear understanding of the components used to generate each mandated indicator, along with how those indicators fit within the broader context of the accountability system. State education agencies implement the law within an assessment and accountability system and can help interpret it for evaluators.
Evaluate whether the correct schools and districts are identified.
A well-functioning accountability system is able to identify the schools and districts experiencing the greatest need, but defining “greatest need” is a vital task that is often oversimplified. Most accountability systems use a formula that combines multiple performance indicators to generate an accountability “index” score. Schools and districts with high scores are praised, while those with the lowest scores are targeted for intervention.
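The weighted-index computation described above can be sketched in a few lines. This is a minimal illustration; the indicator names, scores, and weights below are hypothetical and do not reflect any particular state's formula.

```python
# Sketch of an accountability "index": a weighted combination of
# performance indicators. Names and weights are illustrative only.

def accountability_index(indicators: dict[str, float],
                         weights: dict[str, float]) -> float:
    """Combine 0-100 indicator scores into a single weighted index."""
    total_weight = sum(weights.values())
    return sum(indicators[name] * w for name, w in weights.items()) / total_weight

# Hypothetical school with three indicators (0-100 scale)
school = {"achievement": 62.0, "growth": 71.5, "graduation": 88.0}
weights = {"achievement": 0.4, "growth": 0.4, "graduation": 0.2}

print(round(accountability_index(school, weights), 1))  # 71.0
```

Schools are then rank-ordered on this single number, which is why choices about which components enter the formula, and at what weights, matter so much for who gets identified.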
In our experience, however, most if not all schools and districts could benefit from some targeted assistance. Even in the best of circumstances, schools and districts operate on limited budgets, stretching their capacity to dedicate time and resources to the academic struggles of each student. For schools and districts with students who may also struggle with non-academic challenges impacting their academic performance, resources may be stretched even thinner.
If every school/district could benefit from additional resources, then how does a state decide who gets what and how much? The answer depends on the theory-of-action for the accountability system and the resources available.
Attend to positive and negative incentives in the accountability system.
Accountability systems often function as both “carrot” and “stick,” using a combination of positive and negative reinforcement to promote performance. High-performing schools/districts are clearly labeled and celebrated as such. Low-performing schools/districts are also labeled, though, which can be viewed as punitive by education staff. Assistance may well bring additional funding and opportunity to the school/district, but it can come with restrictions on how those resources may be used, and schools/districts can lose some of their autonomy in the process.
Emphasizing the “carrot” aspect of accountability might cause a state to severely limit the number of schools/districts identified for assistance, because doing so would mean that few schools suffer the stigma of being labeled as “low-performing,” and the state can divert more of its limited resources to a small number of identified schools/districts. This may lessen the sting of the low-performance label.
Conversely, a state might identify many schools/districts for assistance, counting on the stigma of the label to motivate educators (and perhaps students) to improve. If many are identified, however, the resources that can be diverted to any one school/district will be limited, and the assistance may not have as large an impact.
Attend to school/district size.
Sample size can impact classification accuracy. Accountability systems rely on aggregated student-level test scores, and score aggregations tend to be more stable and accurate when they are generated from larger groups of students. A smaller school’s scores, therefore, will tend to be more volatile than a larger school’s scores. The error associated with aggregation is larger in smaller schools, with scores more susceptible to variability caused by even one or two very high- or very low-performing students moving from grade to grade. This makes it more likely that a small school could be incorrectly classified, for example, and labeled low performing when it is meeting standards. This type of misclassification matters because it could divert assistance from where it is most needed.
When small schools are identified as needing assistance, the volatility of their scores also makes it more likely that such scores will show improvement in the next year. Measurement phenomena like regression to the mean become more important factors to consider for small schools. Accountability systems should recognize the limitations of their measurement models and make adjustments or accommodations to address those limitations wherever feasible. For example, to account for the volatility of data in small schools, Colorado aggregates results across three years.
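The size effect described above can be demonstrated with a simple simulation. The numbers here are illustrative, not real data: every student's score is drawn from the same distribution, so any year-to-year movement in a school's mean is pure sampling noise. The simulation shows that the mean from a 25-student school swings far more than the mean from a 400-student school, and that pooling three years of results damps the swing.

```python
import random

random.seed(0)

# Illustrative simulation: how much a school's mean score varies by
# chance alone, as a function of school size and years aggregated.
# Student scores are drawn from one distribution (mean 70, sd 15).

def school_year_mean(n_students: int) -> float:
    return sum(random.gauss(70, 15) for _ in range(n_students)) / n_students

def spread(n_students: int, years: int, trials: int = 2000) -> float:
    """Standard deviation of the aggregated school mean across simulated trials."""
    means = [sum(school_year_mean(n_students) for _ in range(years)) / years
             for _ in range(trials)]
    mu = sum(means) / trials
    return (sum((m - mu) ** 2 for m in means) / trials) ** 0.5

print(f"small school (25 students), 1 year:   {spread(25, 1):.2f}")
print(f"small school (25 students), 3 years:  {spread(25, 3):.2f}")
print(f"large school (400 students), 1 year:  {spread(400, 1):.2f}")
```

Because the standard error of a mean shrinks with the square root of the number of scores aggregated, the three-year pooled mean for a small school behaves roughly like a single-year mean from a school three times its size, which is the logic behind Colorado's approach.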
Attend to fairness and equity in the evaluation of the accountability system.
It may not be surprising to learn that schools serving larger percentages of traditionally underrepresented students tend to perform worse on accountability measures than their counterparts.
An effective accountability system should do more than simply reflect the demographics of the students schools serve. For example, accountability systems may have specific incentives for reducing traditional performance “gaps.” These incentives may take multiple forms and can include “bonus points” for reducing specific identified gaps; they may also introduce penalties when those gaps increase. Such systems add complexity to the accountability formula and may not be perceived as fair by schools serving very different distributions of students.
States may also use multiple measures, specifically including those that show less differential impact by student groups. Differential impact is typically operationalized using effect sizes, which describe the differences between groups in standard deviation units. Effect sizes are common in meta-analyses where multiple measures must be compared but can be challenging to integrate into accountability systems. Most accountability indexes do not account for the error associated with each measure included in the index. Measures with more error have larger standard deviations, which shrink the computed effect size; as a result, a less reliable measure can appear to show less differential impact than a more reliable one.
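The arithmetic behind that last point is straightforward. An effect size of the Cohen's-d type divides the raw gap between two groups by the (pooled) standard deviation, so the same raw gap looks smaller on a noisier measure. The scores below are hypothetical:

```python
# Effect size in standard deviation units (Cohen's-d-style gap).
# Measurement error inflates the standard deviation in the
# denominator, so the same raw gap yields a smaller effect size.

def effect_size(mean_a: float, mean_b: float, pooled_sd: float) -> float:
    return (mean_a - mean_b) / pooled_sd

# Identical 6-point raw gap between two student groups:
print(effect_size(76.0, 70.0, 12.0))  # reliable measure, sd = 12 -> 0.5
print(effect_size(76.0, 70.0, 20.0))  # noisier measure, sd = 20 -> 0.3
```

This is why a naive comparison of effect sizes across measures can reward unreliability: the noisier instrument "shows" the smaller gap even though the underlying raw gap is identical.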
Ultimately, an accountability system should monitor achievement by student group and address performance gaps wherever they occur. This means that even among higher-performing schools or districts, persistent or increasing gaps are a cause for concern and should trigger scrutiny and/or intervention. Efforts to improve fairness and equity should be evaluated in terms of student performance as well as through non-academic measures (e.g., safety, learning environment).
Demonstrate that the accountability system leads to improvements for students and educators.
Finally, when a school/district receives assistance from the state through the accountability system, it is vital to evaluate the effectiveness of that assistance. Most states evaluate whether students’ test scores improve in the year after receiving assistance, but test scores alone, especially when limited to short-term gains, tell an incomplete story. State assistance should have long-term impact, allowing struggling schools/districts to make lasting improvements rather than cycling between improvement and decline without making meaningful changes for students.
One key to lasting improvement is to change the behaviors of educators in the school/district. It is unrealistic to expect lasting changes while conducting business as usual. Providing students with a test-taking strategy session is unlikely to lead to lasting improvements in student performance. Assistance that leads to more effective instruction, improved fairness and equity, better access to services inside and outside school buildings, or other systemic changes is more likely to have a lasting impact on schools/districts.
Focus on Outcomes
Ultimately, the evaluation of an accountability system must address the intended purposes and intended outcomes of that system. An evaluation must address aspects of assessment quality. However, while a valid and reliable test is necessary, it is only one component of an effective accountability system.
Addressing the ways test scores are used and combined with other indicators is vital. Understanding how the components of the accountability system are intended to function, defining proximal and distal goals and intended outcomes, and establishing reasonable indicators to monitor progress toward those goals are vital for a fair and comprehensive evaluation.