Improving Standards-Based Educational Testing: Classifying Students Based on Test Scores


The Challenge

Standards-based testing, which assigns students to a small number of discrete performance categories, has become an important mode of communicating student assessment results for state school accountability programs. Under the federal No Child Left Behind Act, students’ categorical scores are used to assess schools’ performance with “proficiency” being the key category. As a result, the analysis of potential student classification errors is important both for students and schools.
Test theory is based on the concept of a true score for each examinee, defined as the expected (average) score across an infinite number of repeated administrations of the same (or a similar) test. In most cases, we have only a score from a single administration of the test in question. The difference between this single observed score and the underlying true score is measurement error. We are concerned not just with the size of these errors, but with their impact on classifying students into performance categories. Classification accuracy can be defined as the likelihood that the classification based on a student’s observed score matches the classification based on that student’s true score.
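The definitions above can be illustrated with a small simulation. The Python sketch below is not HumRRO's procedure. It assumes normally distributed true scores and measurement errors, and the mean (152), standard deviation (10), and standard error of measurement (4) are hypothetical values chosen only for illustration. The cut points 140, 150, and 164 are borrowed from the example later in this note.

```python
import random
from bisect import bisect_right

# Cut points from the example in this note; all other numbers are assumptions.
CUTS = [140, 150, 164]


def category(score, cuts=CUTS):
    """Return the 0-based performance category for a score.

    A score exactly at a cut point is placed in the higher category.
    """
    return bisect_right(cuts, score)


def simulate_accuracy(n_students=10_000, true_mean=152.0, true_sd=10.0,
                      sem=4.0, seed=0):
    """Estimate the share of students whose observed-score category
    matches their true-score category (hypothetical parameters)."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_students):
        true_score = rng.gauss(true_mean, true_sd)
        observed = true_score + rng.gauss(0.0, sem)  # add measurement error
        if category(observed) == category(true_score):
            matches += 1
    return matches / n_students


print("Estimated classification accuracy:", simulate_accuracy())
```

With no measurement error (sem=0.0) every student is classified correctly; as the assumed error grows, more observed scores land on the wrong side of a cut point.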

What We Do

Teachers and parents presented with a student’s observed test score may wonder how likely it is that the student’s true score falls in the same performance category as the observed score. To address this, HumRRO has introduced a somewhat different perspective on classification accuracy.
 
Our perspective focuses on accuracy as a function of particular observed scores rather than as a function of particular true scores. Our question is: “How likely is a student’s unknowable true score to be in the same category as the student’s observed score?” This perspective is important because it expresses error in a meaningful way for individual students. We will refer to our perspective as a question about observed score classification accuracy.

What We Have Found

An informative way of looking at test results is to examine, for each observed score, the probability that the true score is in the same performance category. The figure below is one illustration. In this example, there are four performance categories, separated by the dips in observed score classification accuracy. Cut points between performance categories are 140, 150, and 164.
 
As shown in the figure, observed score classification accuracy is no better than 50% for observed scores at any cut point, meaning that half of the time the true score is on the other side of the cut point. Notice also that accuracy for observed scores in the second performance category is never better than about 75%, whereas accuracy for observed scores in the third category can exceed 90%. Two factors account for this difference. First, measurement error happens to be slightly smaller in the third category than in the second, because measurement error varies across the scale depending on the difficulties of the test questions. Second, and more importantly, the third performance category is wider than the second, providing more room for the true score to fall in the same category as the observed score.
[Figure: Observed score classification accuracy chart]
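The pattern described above (accuracy of about 50% at each cut point and higher peaks in wider categories) can be reproduced with a rough calculation. The following Python sketch is an illustration under stated assumptions, not the method behind the figure: it treats the true score as normally distributed around the observed score with a constant standard error of measurement of 4 points. That value is hypothetical, and the note itself says that measurement error varies along the scale. Holding it constant isolates the effect of category width.

```python
from bisect import bisect_right
from statistics import NormalDist

CUTS = [140, 150, 164]  # cut points from the example in this note
SEM = 4.0               # assumed standard error of measurement (hypothetical)


def observed_score_accuracy(observed, cuts=CUTS, sem=SEM):
    """P(true score is in the same category as the observed score),
    assuming true score ~ Normal(observed, sem)."""
    dist = NormalDist(mu=observed, sigma=sem)
    k = bisect_right(cuts, observed)  # category index of the observed score
    lower = dist.cdf(cuts[k - 1]) if k > 0 else 0.0
    upper = dist.cdf(cuts[k]) if k < len(cuts) else 1.0
    return upper - lower


# Midpoint of category 2 (145), a cut point (150), midpoint of category 3 (157)
for score in (145, 150, 157):
    print(score, round(observed_score_accuracy(score), 3))
```

Under these assumptions, accuracy at the midpoint of the 10-point-wide second category is about 79%, while the midpoint of the 14-point-wide third category reaches about 92%, and accuracy at the 150 cut point stays just under 50%.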
Making Better Tests: What HumRRO’s Approach Can Do
HumRRO’s view of student classification accuracy can help policy makers understand that the decision points for important decisions, such as passing a high school exit exam, can be adjusted to reduce damaging errors. In our example, 150 is the proficiency standard. If 150 is also used as the passing decision point, 50% of the students with an observed score of 150 will be classified incorrectly. If inappropriately failing students is a major concern, then a lower decision point, such as 147, might be used. This would reduce the maximum error rate to 23% (the probability that a student with an observed score of 147 has a true score at or above 150).
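The effect of moving the decision point can be sketched numerically. The Python example below is a minimal illustration, assuming the true score is normally distributed around the observed score with a standard error of measurement of 4 points. That value is an assumption, not a figure reported in this note; it was chosen because it gives a probability close to the 23% quoted above.

```python
from statistics import NormalDist

STANDARD = 150  # proficiency standard from the example in this note
SEM = 4.0       # assumed standard error of measurement (hypothetical)


def prob_true_at_or_above(observed, standard=STANDARD, sem=SEM):
    """P(true score >= standard | observed score),
    assuming true score ~ Normal(observed, sem)."""
    return 1.0 - NormalDist(mu=observed, sigma=sem).cdf(standard)


# Chance that a student scoring exactly at each candidate decision point
# has a true score that meets the proficiency standard.
for decision_point in (150, 149, 148, 147):
    p = prob_true_at_or_above(decision_point)
    print(decision_point, round(p, 3))
```

Under this assumption the probability falls from 50% at a decision point of 150 to roughly 23% at 147. A different standard error would shift these values.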
In addition, HumRRO is developing an index that describes the steepness and depth of the “valleys.” All tests will have low accuracy at the cut points; better tests will have narrower valleys, with accuracy increasing more rapidly as observed scores get farther from the cut. Having these data will result in better-informed evaluations of standards-based tests and their implications for classifying students into performance categories.
 
For more information, contact
Dr. Gene Hoffman (502-339-9331, ext. 105),
www.HumRRO.org
RESEARCH NOTES are intended to update our friends and colleagues about recent work performed by the Human Resources Research Organization (HumRRO).
 
Established in 1951, HumRRO is an independent, nonprofit corporation dedicated to the development and application of state-of-the-art scientific principles and technologies to solve the real-world challenges facing private and public sector organizations and educational institutions. Our professional staff is composed of psychologists with diverse expertise in strategic human resource management, personnel selection and classification, performance assessment and compensation, training and instructional design, educational research and evaluation, survey design and analysis, credentialing, and program and policy analysis. Our client base includes the military, government agencies, private industry, and professional associations. We are proud of the reputation we have for providing responsive, high-quality, and cost-effective services.
 
Further information regarding HumRRO can be obtained by calling Dr. Bill Strickland, President and CEO, at 703-549-3611.