Can Generative AI and ChatGPT Outperform Humans on Cognitive-Demanding Problem-Solving Tasks in Science?

This study examines the assumption that generative artificial intelligence (GAI) tools can overcome the cognitive intensity that humans experience when solving problems. We assess the performance of ChatGPT and GPT-4 on NAEP science assessments and compare it with student performance according to the cognitive demands of the items. Fifty-four tasks from the 2019 NAEP science assessment were coded by content experts using a two-dimensional cognitive load framework comprising task cognitive complexity and dimensionality. ChatGPT and GPT-4 each answered the questions, and their responses were scored using the scoring keys provided by NAEP. The analysis drew on the available data: the average ability score of students who answered each item correctly and the percentage of students who responded to each item. The results showed that both ChatGPT and GPT-4 consistently outperformed most students who answered each item in the NAEP science assessments. As the cognitive demand of the NAEP science tasks increased, statistically higher average student ability scores were required to answer the questions correctly. This pattern held for students in Grades 4, 8, and 12. However, ChatGPT and GPT-4 were not statistically sensitive to the increase in cognitive demand of the tasks, except at Grade 4. This is the first study to compare cutting-edge GAI with K-12 students in science problem-solving, and its findings imply that educational objectives need to change to prepare students to work competently with GAI tools such as ChatGPT and GPT-4 in the future. Education ought to emphasize the cultivation of advanced cognitive skills rather than depending solely on tasks that demand cognitive intensity. This approach would foster critical thinking, analytical skills, and the application of knowledge in novel contexts among students. Furthermore, the findings suggest that researchers should innovate assessment practices by moving away from cognitively intensive tasks toward tasks that draw on creativity and analytical skills, in order to more effectively avoid the negative effects of GAI on testing.

Zhai, X., Nyaaba, M., & Ma, W. (2024). Can generative AI and ChatGPT outperform humans on cognitive-demanding problem-solving tasks in science? Science & Education.