Assessment

Unveiling Scoring Processes: Dissecting the Differences Between LLMs and Human Graders in Automatic Scoring

Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI’s scoring process mirrors that of humans or whether it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs use to score students’ written responses to science tasks and examines their alignment with human scores.

Author/Presenter

Xuansheng Wu

Padmaja Pravin Saraf

Gyeonggeon Lee

Ehsan Latif

Ninghao Liu

Xiaoming Zhai

Lead Organization(s)
Year
2025
Short Description

Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI’s scoring process mirrors that of humans or whether it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs use to score students’ written responses to science tasks and examines their alignment with human scores. We also examine whether enhancing this alignment can improve scoring accuracy.

Characterizing Teacher Knowledge Tests and Their Use in the Mathematics Education Literature

We present findings from an analysis of tests of teacher mathematical knowledge identified over a 20-year period of mathematics education literature. This analysis is part of a larger project aimed at developing a repository of instruments and their associated validity evidence for use in mathematics education. We report on how these tests are discussed in the literature, with a focus on validity arguments and evidence. A key finding is that these tests are often presented in ways that do not support their use by the mathematics education community.

Author/Presenter

Pavneet Kaur Bharaj

Michele Carney

Heather Howell

Wendy M. Smith

James Smith

Year
2025
Short Description

We present findings from an analysis of tests of teacher mathematical knowledge identified over a 20-year period of mathematics education literature, and report on how these tests are discussed in the literature, with a focus on validity arguments and evidence.

NLP-Enabled Automated Assessment of Scientific Explanations: Towards Eliminating Linguistic Discrimination

As the use of artificial intelligence (AI) has increased, concerns about AI bias and discrimination have been growing. This paper discusses an application called PyrEval in which natural language processing (NLP) was used to automate assessment and provide feedback on middle school science writing without linguistic discrimination. Linguistic discrimination in this study was operationalized as unfair assessment of scientific essays based on writing features that are not considered normative, such as subject-verb disagreement.

Author/Presenter

ChanMin Kim

Rebecca J. Passonneau

Eunseo Lee

Mahsa Sheikhi Karizaki

Dana Gnesdilow

Sadhana Puntambekar

Year
2025
Short Description

As the use of artificial intelligence (AI) has increased, concerns about AI bias and discrimination have been growing. This paper discusses an application called PyrEval in which natural language processing (NLP) was used to automate assessment and provide feedback on middle school science writing without linguistic discrimination.

A Usability Analysis and Consequences of Testing Exploration of the Problem-Solving Measures–Computer-Adaptive Test

Testing is a part of education around the world; however, there are concerns that the consequences of testing are underexplored within current educational scholarship. Moreover, usability studies are rare within education. One aim of the present study was to explore the usability of a mathematics problem-solving test, the Problem Solving Measures–Computer-Adaptive Test (PSM-CAT), designed for students in grades six through eight (ages 11–14).

Author/Presenter

Sophie Grace King

Jonathan David Bostic

Toni A. May

Gregory E. Stone

Year
2025
Short Description

Testing is a part of education around the world; however, there are concerns that the consequences of testing are underexplored within current educational scholarship. Moreover, usability studies are rare within education. One aim of the present study was to explore the usability of a mathematics problem-solving test, the Problem Solving Measures–Computer-Adaptive Test (PSM-CAT), designed for students in grades six through eight (ages 11–14). The second aim of this mixed-methods research was to unpack consequences-of-testing validity evidence related to the results and test interpretations, leveraging the voices of participants.

Expanding Uses of the STEM Observation Protocol (STEM-OP): Secondary Science Teachers’ Reflections on Integrated STEM Practice

There are few guidelines on how to implement integrated STEM education in the K-12 science classroom. It is important that teachers have opportunities to reflect on integrated STEM instruction as they implement it so that they may further develop their practice. This research aimed to understand how the STEM Observation Protocol (STEM-OP) may be used as a tool for teachers to reflect on their integrated STEM practice.

Author/Presenter

Emily Dare

Joshua Ellis

Christopher Irwin

Lead Organization(s)
Year
2025
Short Description

There are few guidelines on how to implement integrated STEM education in the K-12 science classroom. It is important that teachers have opportunities to reflect on integrated STEM instruction as they implement it so that they may further develop their practice. This research aimed to understand how the STEM Observation Protocol (STEM-OP) may be used as a tool for teachers to reflect on their integrated STEM practice. This exploratory case study was designed to better understand secondary science teachers’ reflections on the STEM-OP by addressing the following research questions: 1) What are secondary science teachers’ reflections on integrated STEM practices as measured by the STEM-OP? and 2) In what ways do secondary science teachers envision using the STEM-OP as a tool in their practice?