State education assessment programs are required to conduct alignment studies and include the results in their U.S. Department of Education peer review submissions. These time-intensive validity investigations typically rely on human judgments about how well test items match a content standard and/or a level of cognitive complexity, or how one set of content standards maps to another.
Broadly speaking, this involves comparing one set of text statements, such as item stems or tasks, to another set of text statements, such as content standards. Content experts evaluate these separate pieces of text and use their expert judgment to determine the degree of similarity, or alignment, between them.
The task of comparing a large number of text statements can quickly become overwhelming. Imagine, for example, mapping the National Assessment of Educational Progress (NAEP) content standards to a particular state's content standards. The number of comparisons would easily reach many thousands. Or consider evaluating an item bank of thousands of items against a set of content standards; again, this would involve thousands of comparisons. The cognitive load placed on content experts often leads to fatigue, raising a clear risk of judgment errors.
Fortunately, advances in natural language processing (NLP) offer a means to bring efficiencies and cost savings to traditional, highly labor-intensive alignment processes. One such advance is the growing set of Large Language Models (LLMs), such as ChatGPT and many open-source alternatives, that can classify, label, or compare text statements quickly and efficiently.
We do not propose that NLP algorithms perform the alignment in place of content experts; judgments like these should always involve human experts. However, NLP-based methods can identify the text statements that most closely resemble each other, so that content experts review a smaller set of candidate standards rather than the entire collection. The idea is to use NLP to make big problems smaller.
Over the past five years, HumRRO experts have shared research and insights aimed at leveraging NLP methods to bring new levels of efficiency to what have traditionally been human-driven processes.
Our experts are now applying similar NLP concepts to challenges in education, such as alignment, standard setting, and identifying item enemies. Below, we briefly describe how alignment and standard setting methods benefit from NLP techniques.
Leveraging NLP in Educational Test Alignment Processes
Large Language Models are increasingly capturing the attention and imaginations of social scientists, and we are finding new ways almost daily to apply them. Here are a few ways HumRRO can leverage the power of NLP and LLMs in the education context, specifically in the area of alignment:
Creating crosswalks between different content standards. States often map one set of content standards to another set of standards. For example, a 2016 National Center for Education Statistics report maps Next Generation Science Standards to the NAEP science framework standards. In this study, the research team began with a content mapping activity, which involved forming subsets of standards that could then be evaluated by content experts.
In this context, NLP tools can organize text statements into groups based on their similarity using text classification methods, or rank order text pairs by a computed similarity index. As in the NCES content mapping activity, the result would be a subset of text pairs for experts to evaluate, significantly reducing the human level of effort and potentially the overall cost of the task.
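To make the ranking idea concrete, here is a minimal sketch of scoring and rank ordering cross-set text pairs. It uses a plain bag-of-words cosine similarity rather than an LLM embedding, and the standards text (and the S1/S2, T1/T2 labels) are invented for illustration only:

```python
# Sketch: rank pairs of standards from two sets by text similarity,
# using a simple bag-of-words cosine measure (stdlib only).
import math
import re
from collections import Counter

def vectorize(text):
    """Tokenize into lowercase word counts (a bag-of-words vector)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical standards text, for illustration only.
source_standards = {
    "S1": "Analyze energy flow through an ecosystem",
    "S2": "Solve linear equations with one variable",
}
target_standards = {
    "T1": "Explain how energy flow occurs in an ecosystem",
    "T2": "Solve problems using linear equations and graphs",
}

# Score every cross-set pair, then rank from most to least similar.
pairs = [
    (s_id, t_id, cosine(vectorize(s), vectorize(t)))
    for s_id, s in source_standards.items()
    for t_id, t in target_standards.items()
]
pairs.sort(key=lambda p: p[2], reverse=True)
for s_id, t_id, score in pairs:
    print(f"{s_id} <-> {t_id}: {score:.2f}")
```

In practice, a production system would likely replace the bag-of-words vectors with contextual embeddings, but the workflow is the same: score all pairs, then hand experts only the highest-ranked candidates.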
Comparing test items to state content standards. State assessment programs must demonstrate in their peer review submissions that their test items match the state’s content standards. This task typically involves content experts reviewing items (either on a test form or in the entire bank) and evaluating the task demand against the state content standards. Experts make decisions on which standard the test item is best aligned with.
The challenge here is that content experts must review items against hundreds, if not thousands, of content standards. Very often, though, it is information contained directly within the item text itself that drives which standard the item seems best aligned with. Rather than having human content experts compare every item to every standard, text similarity methods can match each test item with the subset of standards it most likely aligns with, thus reducing the search space. This would bring substantial time and cost efficiencies to the alignment process.
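The search-space reduction described above amounts to a top-k retrieval step: score every standard against the item text and keep only the few highest-scoring candidates for expert review. The sketch below assumes a simple bag-of-words cosine similarity, and the item stem and standard codes (M.3.1 through M.3.5) are hypothetical:

```python
# Sketch: narrow an item-to-standards comparison to a top-k
# candidate list for expert review (stdlib only).
import math
import re
from collections import Counter

def vectorize(text):
    """Tokenize into lowercase word counts (a bag-of-words vector)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k_standards(item_stem, standards, k=3):
    """Return the k standards most similar to the item stem."""
    item_vec = vectorize(item_stem)
    scored = sorted(
        ((code, cosine(item_vec, vectorize(text)))
         for code, text in standards.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:k]

# Hypothetical standards and item stem, for illustration only.
standards = {
    "M.3.1": "Solve two-step word problems using addition and subtraction",
    "M.3.2": "Represent fractions on a number line",
    "M.3.3": "Tell and write time to the nearest minute",
    "M.3.4": "Compare two fractions with the same denominator",
    "M.3.5": "Measure areas by counting unit squares",
}
item_stem = "Which point on the number line represents the fraction three fourths?"
for code, score in top_k_standards(item_stem, standards, k=2):
    print(f"{code}: {score:.2f}")
```

With a real item bank, experts would see only the handful of surviving candidates per item rather than the full list of standards, which is where the time savings come from.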
This is another area where text similarity approaches can assist test developers. As in the examples above, NLP-based methods can compare the language contained in test items against state standards or achievement level descriptors. This would significantly reduce the human level of effort and allow test developers to focus on a subset of standards instead of comparing items to an entire set.
The Bottom Line
These examples illustrate areas in traditional alignment processes where NLP methods can provide efficiencies and, in turn, lower costs for state assessment programs. That said, we are not advocating that such methods wholly replace human judgment in alignment studies. Instead, we advocate for using NLP and other AI methods in combination with human judgment to create more efficient—and potentially, more effective and accurate—processes than can be accomplished by human experts alone.