Engaging students in scientific modeling practice is critical for developing their competence in using scientific knowledge to explain phenomena and design solutions. Student-drawn models are frequently used to assess students' proficiency in scientific modeling. However, scoring student-drawn models is time-consuming and requires technical expertise. The recently released GPT-4V(ision) provides a unique opportunity to facilitate the automatic scoring of scientific models with its image classification capability. To utilize GPT-4V for automatic scoring, we developed Notation-Enhanced Rubric Instruction for Few-Shot Learning (NERIF), a method that employs instructional notes and scoring rubrics. We randomly sampled a balanced dataset (N = 900) of models drawn by middle school students for six science modeling tasks. Each model was classified by GPT-4V into one of three categories: "Beginning," "Developing," or "Proficient." GPT-4V's classifications were compared with human experts' consensus labels. Results show that GPT-4V's scoring accuracy averaged 0.51 (SD = 0.037). Specifically, average scoring accuracy was 0.64 for the Beginning category, 0.62 for the Developing category, and 0.26 for the Proficient category, indicating that the more complex student-drawn models become, the more challenging they are to score. A qualitative analysis further revealed how GPT-4V (a) retrieves information from image input, (b) identifies characteristics of student-drawn models and describes them, and (c) refers to the scoring rubric and instructional notes to assign scores. This study suggests that utilizing GPT-4V with the NERIF method for the automatic scoring of student-drawn models in science education is promising, though improving scoring accuracy remains an important challenge.
Lee, G., & Zhai, X. (2025). NERIF: GPT-4V for automatic scoring of drawn models. Journal of Science Education and Technology. https://doi.org/10.1007/s10956-025-10262-9
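To make the scoring workflow the abstract describes more concrete, below is a minimal Python sketch of sending one student-drawn model image to GPT-4V together with a rubric-style prompt and receiving a category label back. This is an illustration only, not the authors' NERIF implementation: the model alias `gpt-4-vision-preview`, the placeholder rubric text, and the function name `score_drawn_model` are all assumptions; the actual notation-enhanced rubric and instructional notes are defined in the paper itself.

```python
# Hypothetical sketch of scoring one student-drawn model with GPT-4V.
# Assumes the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY in the
# environment. The rubric text is a placeholder, NOT the NERIF prompt.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC_PROMPT = (
    "You are scoring a middle school student's drawn scientific model. "
    "Using the rubric below, assign exactly one label: "
    "Beginning, Developing, or Proficient.\n\n"
    "<scoring rubric and instructional notes go here>"
)

def score_drawn_model(image_path: str) -> str:
    """Send one model drawing plus the rubric to GPT-4V; return its reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4V alias at the time of the study
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": RUBRIC_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content

print(score_drawn_model("student_model_001.png"))
```

In a study like this one, the returned label for each of the sampled drawings would then be compared against the human experts' consensus label to compute the per-category accuracies reported in the abstract.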