
Open Access 05-05-2025

Human vs. Machine Marking: A Comparative Study of Chemistry Assessments

Authors: Abejide Ade-Ibijola, Ijeoma Joy Chikezie, Solomon Sunday Oyelere

Published in: Journal of Science Education and Technology


Abstract

The article explores the evolving landscape of educational assessment, focusing on the integration of artificial intelligence (AI) in marking chemistry assessments. It begins by discussing the traditional methods of human expert marking, which, while nuanced and contextually sensitive, are fraught with inconsistencies and biases. The study then delves into the advantages of AI-driven marking, highlighting its potential for efficiency, consistency, and timely feedback. A comparative analysis is presented, examining the correlation between scores assigned by human experts and those generated by ChatGPT, an advanced AI model. The findings reveal that while AI marking shows promise, it is not without limitations, particularly in evaluating complex, open-ended responses. The article also discusses the theoretical framework underpinning the study, Vygotsky’s constructivist learning theory, and its implications for AI-integrated education. It concludes by advocating for a hybrid approach that leverages the strengths of both human and machine marking to enhance assessment practices and support student learning. The study provides a detailed examination of the accuracy and reliability of automated grading systems, offering valuable insights into the future of educational assessment.
Notes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Evolving artificial intelligence (AI) technology, particularly in assessment, has opened avenues for innovation, giving educators more time for teaching and student interaction while students receive timely feedback on assessment. Assessment is an integral component of teaching and learning in a chemistry class. It contributes to estimating students' understanding of chemical concepts and their ability to apply them in practical situations (Stowe & Cooper, 2019). Assessments in secondary school chemistry classes range from traditional pen-and-paper tests (including multiple-choice questions, short/free-text questions, and essay questions) to laboratory reports and problem-solving exercises. Traditionally, students' responses to chemistry questions are marked manually by human experts who can understand the nuances and complexities of human creative responses to questions (McNulty, 2023). However, human expert marking is fraught with individual biases, marking inconsistencies, large volumes of scripts, and time constraints, all of which can influence the assessment process and the quality of feedback given to students (Girgla et al., 2021). In contrast, AI-driven marking can reduce human error and bias and affords a reliable and efficient way of evaluating student responses to essay-type questions (Menya, 2018; Pearson & Penna, 2023). Girgla et al. (2021:8) reiterated that "digital assessments can improve efficiency in marking, moderating, and storing information, enabling teachers to use their resources better. They provide opportunities to assess complex knowledge and reasoning that may not be possible to assess through traditional, paper-based methods."

Over the years, machine marking has been limited to multiple-choice questions (MCQs) that narrowly restrict students' responses. Automated or machine marking using various AI tools has since gone beyond MCQs and now extends to short answer questions, problem-solving tasks, reports of laboratory procedures, and more. Machine marking is increasingly recognized by educators in various disciplines, including chemistry education (Shin & Gierl, 2020; Kazmi, 2022; Zhang et al., 2022; Hollis-Sando et al., 2023). This growing enthusiasm is due to the increasing student population and the availability of AI tools that automate the marking of numerous students' scripts, saving time, reducing bias, promoting objectivity, and providing timely feedback (Menya, 2018; Hollis-Sando et al., 2023). Menya (2018) explored the accuracy levels of past exam papers marked by human instructors and developed an improved algorithm-based solution for efficient marking automation. The study revealed that the algorithm-based model improved marking accuracy by 16 percentage points, from 73 to 89%; when marking answers to short answer questions (SAQs), the model achieved 99.9% accuracy. Although these studies applied various AI tools to the marking of SAQs in other disciplines, the present study applies machine marking to chemistry SAQs.
Whereas the works reviewed so far were conducted in higher education and mostly outside Nigeria, the objective of the present study is to carry out a comparative analysis of human expert marking and machine marking of chemistry SAQs at NINLAN Demonstration Secondary School (NDSS), in a Nigerian context. The study also compares the accuracy of human marking with that of machine marking. The study addressed the following research questions:

Research Questions

i. What is the relationship between the scores assigned to the chemistry SAQs by human experts A and B?
ii. What is the relationship between the scores assigned to the chemistry SAQs by human expert A and ChatGPT?
iii. What is the relationship between the scores assigned to the chemistry SAQs by human expert B and ChatGPT?
To answer the research questions, SAQs in chemistry were administered to Senior Secondary Two (SS2) students of NDSS, Aba, Nigeria. Their responses were given to two human experts, A and B, for marking. The two experts' markings were correlated with each other, and each expert's marking was correlated with the marking generated by ChatGPT. There are several studies comparing human expert marking and automated essay scoring (AES) using AI models (e.g., ANNs, CNNs, RNNs), mainly in higher education, in other disciplines, and outside the Nigerian context (Bridgeman et al., 2012; Taghipour & Ng, 2016; Abbaspour et al., 2020; Ramesh & Sanampudi, 2022; Zhang et al., 2022). Other studies explored automated SAQ marking with AI models and ChatGPT (Hollis-Sando et al., 2023; Li et al., 2024; Morjaria et al., 2024). Based on the literature available to the authors, this appears to be the first Nigerian study offering a comparative analysis of human marking and machine (ChatGPT) marking of SAQs in secondary school chemistry.

Literature Review

Theoretical Framework

Lev Vygotsky’s constructivist learning theory served as the foundation for this study. Vygotsky (1978) asserted that knowledge is developed through social interactions and that learning occurs within the Zone of Proximal Development (ZPD) through scaffolding. This study, which compares human and machine marking of chemistry assessments, aligns closely with constructivist learning theory, emphasizing the significance of active engagement and knowledge construction by learners (Blikstein & Worsley, 2016; Siemens & Long, 2011). Constructivist learning theory also underscores the value of personalized learning environments, acknowledging the diversity of learners and their distinct cognitive processes. It advocates for adapting content and instructional strategies to meet individual student needs (Dede, 2010; Russell & Norvig, 2010). Machine learning algorithms can analyze student data to identify strengths, weaknesses, and learning preferences (Jackson, 2024). AI-driven assessment tools function as virtual scaffolds by offering real-time feedback, hints, and guidance tailored to a student’s current level of understanding. Vygotsky highlighted the role of social interaction in learning. In this context, AI technologies can facilitate active learning by providing interactive, problem-solving scenarios, immediate feedback, adaptive assessments, and scaffolded learning experiences. These tools enable students to actively engage in the learning process, constructing their understanding through exploration and collaboration (Anderson et al., 1995, cited in Jackson, 2024). Additionally, AI can enhance social interactions by incorporating intelligent agents such as chatbots and virtual tutors that engage students in meaningful discussions, guide them through problem-solving tasks, and encourage collaborative knowledge construction (D’Mello & Graesser, 2014). Overall, Vygotsky’s constructivist learning theory provides a solid framework for integrating AI into education, particularly in assessment. AI-driven tools can strengthen formative assessment by identifying students’ ZPD, offering scaffolding, and fostering interactive learning experiences.

Concept of Marking and Human Expert Marking in Assessment

Assessment plays a crucial role in education, with marking as a critical component. Marking is central to teachers' involvement in the assessment process, since written responses are an essential means of providing feedback to students and assisting teachers in evaluating students' comprehension (Elliott et al., 2016). The process of marking is essential for accuracy in assessing student learning, providing feedback, and guiding instructional decisions. Marking is defined as the process of assessing the quality of students' written work by assigning values for correct responses (Kumari, 2022). Williamson (2018) sees marking as a process in which assessors assign numerical scores to candidates' responses, guided by specific marking criteria. Marking provides both teachers and students with an opportunity to assess and evaluate academic progress in a structured, supportive, and personalized manner (Huddersfield Grammar School, 2016). In this study, marking is defined as a measurement process of assigning numerical values to students' written responses to questions using a marking guide, which provides the correct answers and descriptors that clarify how the marks should be distributed. Marking serves several purposes (Brookhart, 2013), including providing feedback to students on their academic achievement and progress, challenging students' critical thinking to promote learning, serving as a basis for evaluating students' attainment of learning objectives, and informing decisions about academic progression such as promotion, graduation, or certification; at the same time, it has the potential to be hugely time-consuming and laborious.
Marking is traditionally done manually by human experts. Human expert marking involves the evaluation of student work by teachers or other qualified educational professionals, who provide feedback on student learning outcomes and inform instructional decisions (Elliott et al., 2016). Its role includes ensuring that assessments are fair, reliable, and valid by taking into account the learning context, student entry behavior, and potential mitigating circumstances; interpreting complex responses; recognizing subtle distinctions in student understanding; and using explicit feedback to guide student improvement and learning (Ramesh & Sanampudi, 2022; Black & Williams, 1998, cited in Rawi, 2023). Teachers use their professional judgment to evaluate student performance against established criteria, which helps in understanding students' strengths and areas for improvement. Human expert marking relies on the judgment of trained educators to mark students' work, providing detailed and contextually sensitive assessments (Sadler, 2012). The Chartered Institute of Educational Assessors (2019) asserted that human expert markers typically consider students' accuracy in demonstrating a correct understanding of the concepts and skills being assessed, their application of knowledge and skills to solve problems or complete tasks, the clarity, conciseness, and organization of their work, and their ability to analyze information critically and draw well-supported conclusions.
In all, effective assessment still requires human expertise to provide insightful feedback, judge complex tasks, and ensure fair and equitable outcomes for all students. Despite these advantages of human expert marking, major challenges and inefficiencies confront the marking process (Menya, 2018). Human marking is fraught with marker subjectivity, marking inconsistencies, individual biases, time constraints, and scalability issues (Kazmi, 2022; Ramesh & Sanampudi, 2022). These factors can influence the assessment process and the quality of feedback given to students; hence the need for machine marking to overcome these challenges.

Machine Marking

Machine marking, also known as automated marking or AI-driven assessment, involves using machines built on machine learning (ML) technology to mark students' assessments (Pearson & Penna, 2023). Machine learning is a subset of AI in which computers learn and act like humans. It uses data and algorithms to mimic human perception, decision-making, and the way that humans learn, gradually improving its accuracy in completing a task (Jimenez & Boser, 2021). In other words, ML enables machines to automatically learn from large datasets (such as student responses) and past experiences, recognizing patterns to make predictions with minimal human involvement. Automated marking systems employ ML techniques such as supervised machine learning (SML) and unsupervised machine learning (UML) (McNulty, 2023). SML involves training algorithms on labeled datasets, allowing them to classify data or make predictions accurately; the model adjusts itself as new input data are introduced until it finds the best fit. In contrast, UML employs ML algorithms to examine and categorize unlabeled datasets, detecting patterns or clusters without the need for labeled data or human guidance. These approaches enable AI-driven automated marking tools to enhance their accuracy and provide more precise feedback, as the systems can learn and improve over time, especially when processing large, complex datasets (Jimenez & Boser, 2021; McNulty, 2023). Although AI based on machine learning is still in its early stages, it has already demonstrated remarkable effectiveness in handling complex tasks that are not rule-based, such as evaluating students' written answers or analyzing extensive, intricate datasets (Jimenez & Boser, 2021).
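To make the SML idea concrete, the sketch below trains a simple text classifier that maps short chemistry responses to marks. It is a minimal illustration, not any of the systems cited above: the library (scikit-learn), the features (TF-IDF), the classifier (logistic regression), and the toy labeled responses are all assumptions introduced for this example.

```python
# Minimal supervised-learning sketch for short-answer scoring.
# All data and modelling choices here are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled dataset: (student response, mark awarded by a human expert).
responses = [
    "Activation energy is the minimum energy needed for a reaction to occur",
    "It is the energy of activation",
    "Electronegativity is the tendency of an atom to attract shared electrons",
    "Electronegativity means an element is negative",
]
marks = [2, 1, 2, 0]

# Pipeline: turn text into TF-IDF features, then learn a mapping from text to marks.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(responses, marks)

# Predict a mark for a new, unseen response.
print(model.predict(["The energy barrier reactant particles must overcome"]))
```

A working system would of course require far more labeled responses and careful validation against human markers before its predictions could be trusted.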
There are key benefits of machine marking. First, machine marking is efficient, marking large volumes of assessments rapidly, which significantly saves time and can lead to substantial cost reductions for institutions (Pearson & Penna, 2023; Slimi, 2023). Second, Ramaswami et al. (2023) contended that machine marking applies consistent criteria across all assessments, ensuring uniformity in marking, and highlighted that learner-facing dashboards and automated data storytelling tools can provide consistent feedback, reducing the subjectivity associated with human marking. Third, sophisticated automated grading systems can deliver comprehensive and actionable feedback by utilizing natural language processing (NLP) and other AI techniques to examine student responses. They generate feedback that not only helps students recognize their errors but also provides suggestions for improvement, thereby enhancing the effectiveness of student learning support (Paiva et al., 2022). Despite these advantages, there are challenges associated with the implementation of machine marking. One major concern is the accuracy of these systems in understanding and evaluating complex, open-ended responses. While automated systems perform well with structured SAQs and objective questions, their ability to accurately mark essays, creative writing, or nuanced answers is still developing, and they cannot yet provide fully personalized feedback (McNulty, 2023). Boud and Dawson (2021) pointed out that teacher feedback literacy is crucial in effectively integrating automated tools, as these tools can sometimes misinterpret student responses, leading to incorrect marking. Machine marking algorithms may also exhibit bias and inequity depending on how they were trained (McNulty, 2023). Acceptance of and trust in machine marking systems to mark assessments accurately and fairly may vary among stakeholders; students need assurance that their work is being evaluated fairly and that the feedback they receive is meaningful (McNulty, 2023; Pearson & Penna, 2023).
Both human expert marking and machine marking offer unique strengths and drawbacks in assessment: whereas human markers provide nuanced and contextually sensitive evaluations, machine marking systems offer efficiency, consistency, and scalability advantages. By considering the complementary nature of these approaches to marking and addressing the challenges related to validity, reliability, and fairness, educators can leverage both human expertise and technological innovations to enhance assessment practices and support student learning.

Machine Marking in Chemistry Education

Chemistry education is an aspect of STEM education that studies the teaching and learning of chemistry (National Academic Press, 2012). Assessment is essential to effective teaching and learning of chemistry, which in turn involves presenting students with tasks and giving feedback. With the increasing population of students in secondary schools, it becomes imperative to introduce machine marking to speed up the marking process and offer immediate feedback. At the earlier stages of large-scale assessment, machine (automated) marking was applied primarily to objective-type questions. However, AI-driven models in NLP and ML have since taken center stage, as SAQs, problem-solving tasks, and essays can now be understood and evaluated seamlessly by automated marking systems (Zhai, 2021).
Several studies have explored different machine marking models in different disciplines, including chemistry, in higher education; the current study is situated in secondary education. Zhang et al. (2022) worked on auto-grading of offline handwritten organic cyclic compound structures. The researchers highlighted that achieving automatic grading of handwritten chemistry assignments requires solving the problem of recognizing chemical notation. To address this, they proposed a method based on component detection, which includes detecting components, recognizing text components, and interpreting structures. The model first identifies the components of offline handwritten organic cyclic compound structures as objects and detects them using the deep learning detector YOLOv5. This is followed by an enhanced attention-based encoder-decoder model for text recognition and a comprehensive algorithm for interpreting the spatial structure of the handwritten content. Similarly, Rahaman and Mahmud (2022) proposed a deep-learning framework for the automated grading of handwritten essay scripts. This architecture combines a CNN with a BiLSTM, enabling it to recognize and grade handwritten answers with the same accuracy as a human expert. Zhai (2021) presented ML-based science assessments as cutting-edge technologies increasingly involved in innovative assessment in science education, covering construct, functionality, and automaticity. Zhai argued that automated assessment enables the evaluation of intricate, varied, and structural elements; it facilitates the elicitation of performance, provides methods for gathering, observing, and interpreting evidence, and ultimately aids in prompt and complex decision-making and action. Reina et al. (2024) developed PLATA, an online automated platform for chemistry undergraduates that allows students to engage in problem-solving practice. "PLATA is programmed to randomly modify not only numeric values but also chemical substances, chemical reactions, and equilibrium constants that are related with one another and cannot vary independently" (Reina et al., 2024:1033). PLATA offers several advantages, which include, but are not limited to, accepting answers whose numerical values represent chemical quantities as correct only when presented with proper units, offering detailed feedback for each attempt, and enabling students to identify the origins of their errors. Moore et al. (2022) assessed the quality and cognitive level of student-generated online short answer questions (SAQs) in college-level chemistry using both human and automated methods. The findings revealed that experts rated 32% of the student-generated questions as high quality and 23% as evaluating higher cognitive processes. However, automatic evaluation with the GPT-3 model was considered suboptimal because it overestimated the quality of student-generated SAQs and misclassified the cognitive processes of Bloom's taxonomy. These studies are relevant to the current study, which explored chemistry test items comprising SAQs.
In another recent study, Hollis-Sando et al. (2023) investigated medical students' views on computer-based grading and evaluated the accuracy of deep learning (DL), a subset of machine learning, in assessing medical short answer questions (SAQs). After students completed an online survey to gather their opinions on computer-based marking, the responses were initially graded by human evaluators, followed by a deep learning analysis utilizing convolutional neural networks, a class of deep learning algorithms used for image recognition, object detection, and image segmentation. The study found that automated marking of 1-mark SAQs achieved consistently high classification accuracy, with a mean accuracy of 0.98. However, for 2-mark and 3-mark SAQs, which required multi-class classification, the accuracy was lower, with mean accuracies of 0.65 and 0.59, respectively. The use of deep learning in Hollis-Sando et al.'s study demonstrates the potential of advanced machine learning techniques for improving automated marking precision. This is relevant to the current study, as it suggests that more sophisticated machine learning models might be able to overcome some of the limitations detected in ChatGPT. In addition, Hollis-Sando et al.'s recommendation to combine human and machine marking aligns with the potential benefits of a hybrid approach to assessment, as posited in the current study.
Two separate studies found that machine-generated scores generally align with human scores just as closely as scores from one human expert align with those from another human expert when using e-rater and ChatGPT, respectively (Bridgeman et al., 2012; Morjaria et al., 2024). Likewise, Pearson and Penna (2023) employed the NUMBAS e-assessment tool to explore best practices for designing and marking longer questions in surveying modules taken by engineering students at Newcastle University. Their investigation focused on two methods for automated marking of extended computational questions: awarding follow-through marks and breaking down a question into interim steps. The study found that follow-through marks should constitute 25% or 50% of the total available marks to ensure a normal distribution. Additionally, breaking longer calculation questions into too many parts resulted in unnaturally high marks due to excessive guidance provided to students.
From the foregoing review, automated or machine marking has been applied across disciplines for the marking of SAQs, mathematical tasks, and essays using various AI-driven models. Studies that explored machine marking of handwritten responses first subjected the scripts to text recognition before the actual marking was performed. The studies available in the literature to the authors were all conducted in higher education and outside the Nigerian context. The current study aimed to compare machine marking and human expert marking of SAQs in secondary school chemistry.

Marking Framework

Figure 1 presents the marking framework, showing the paths along which machine and human marking are compared.
Fig. 1 Marking framework
Figure 1 compares human expert marking and machine marking of SAQs in chemistry. In comparative marking, according to Saunders and Topping (2023), human experts mark assessments guided by a marking rubric, and comparative analysis of the markers' submissions can yield quality feedback and capture the nuances in the markings more effectively. The current study compared human experts' marking of an assessment task with ChatGPT's marking of the same task using the same marking guide, a method of assigning marks for correct responses. "A marking guide provides broad outlines for success and allocates a range of marks for each component" (Federation University Australia, 2024:1). The marking framework as designed involves three steps. The first is the preparation of the marking guide, indicating where and how marks will be awarded. Second, the responses to the chemistry SAQs and the marking guide are given to human experts for marking; similarly, ChatGPT is prompted to perform the same task. The last stage is comparing the feedback from human marking with the feedback from machine marking using ChatGPT by correlating the scores.

Methods

The study adopted a comparative research design aimed at exploring the similarities and differences between human experts' and machine markings of short answer chemistry questions. The participants comprised 30 Senior Secondary Two (SS2) students randomly drawn from a population of 98 students offering chemistry at NINLAN Demonstration Secondary School, Abia State, Nigeria. This sample size, 31% of the population, was chosen to ensure a manageable and representative sample for the study. Two chemistry experts were also involved. The instrument for data collection comprised three SAQs adopted from a National Examination Council Senior School Certificate Examination (NECOSSCE) past question paper, as follows: 1. What is activation energy? (2 marks). 2. What is electronegativity? (2 marks). 3. Describe the process involved in the production of ammonia gas in the laboratory (6 marks). Students were expected to make two key points in defining each of the concepts in items 1 and 2, with 1 mark awarded per point to a maximum of 2 marks in each case. For item 3, six key procedures involved in the laboratory production of ammonia gas were expected. Thirty students sat the test. The responses were duplicated, and two sets of 30 scripts each were provided to two human chemistry experts for marking according to the marking scheme. The machine marking was done using ChatGPT: the chemistry SAQs, the marking guide indicating the points constituting correct answers, and the students' responses were fed into ChatGPT, which was prompted to mark the responses to each question based on the marking guide. The ChatGPT-generated scores and the human expert-assigned scores were collated for further analysis.
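As an illustration of the machine marking step, the sketch below shows one way a marking prompt could be submitted to ChatGPT programmatically. The paper describes prompting ChatGPT with the questions, marking guide, and responses; the use of the OpenAI Python SDK, the model name, the prompt wording, and the helper function here are illustrative assumptions rather than the authors' exact procedure.

```python
# Hypothetical sketch of prompting ChatGPT to mark one SAQ response.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def mark_response(question: str, marking_guide: str, response: str, max_marks: int) -> str:
    # Build a prompt containing the question, the marking guide, and the response.
    prompt = (
        f"Question: {question}\n"
        f"Marking guide: {marking_guide}\n"
        f"Student response: {response}\n"
        f"Award a mark from 0 to {max_marks} strictly according to the marking guide. "
        f"Reply with the mark only."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

# Example call for item 1 of the test (marking-guide wording is illustrative).
print(mark_response(
    "What is activation energy?",
    "1 mark each for: minimum energy required; needed for a chemical reaction to occur.",
    "It is the least energy reactant particles need before they can react.",
    max_marks=2,
))
```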
Pearson product-moment correlation (PPMC) was used to assess the relationship between the scores generated by ChatGPT and those assigned by each human expert, as well as the correlation between the two sets of human expert-assigned scores, all at a 0.05 alpha level. To determine the extent of the difference between human expert-assigned scores and ChatGPT-generated scores, the mean and standard deviation were calculated.
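For readers wishing to replicate the analysis, the following sketch computes the descriptive statistics and the Pearson correlation for two sets of total scores. The score values shown are placeholders, not the study's data; the SciPy and statistics functions used are standard.

```python
# Sketch of the analysis step: means, SDs, and Pearson r for two sets of scores.
# The scores below are placeholders, not the study's data.
from statistics import mean, stdev
from scipy.stats import pearsonr

expert_a = [6, 7, 5, 8, 6, 4, 7, 6, 5, 6, 7, 8, 5, 6, 7,
            6, 5, 4, 8, 7, 6, 5, 6, 7, 6, 5, 8, 6, 7, 5]
chatgpt  = [5, 6, 5, 7, 5, 4, 6, 5, 4, 5, 6, 7, 4, 5, 6,
            5, 4, 4, 7, 6, 5, 4, 5, 6, 5, 4, 7, 5, 6, 4]

r, p = pearsonr(expert_a, chatgpt)  # Pearson product-moment correlation
print(f"Expert A: mean = {mean(expert_a):.2f}, SD = {stdev(expert_a):.2f}")
print(f"ChatGPT:  mean = {mean(chatgpt):.2f}, SD = {stdev(chatgpt):.2f}")
print(f"r = {r:.2f}, p = {p:.3f}")  # compared against the 0.05 alpha level
```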

Results

Research question 1: What is the relationship between the scores assigned to the chemistry SAQs by human experts A and B?
Research question one was answered by calculating the mean and standard deviation of scores assigned to the SAQs by the two human experts. The strength of the relationship between the two sets of scores was computed. The result is presented in Table 1.
Table 1 Descriptive statistics and correlation between the scores assigned by human experts A and B

Variables        N    x̄      SD     r      p-value
Human expert A   30   6.17   1.53   0.75   0.000
Human expert B   30   5.23   1.07
Table 1 indicates that the mean scores and standard deviations for human experts A and B were (x̄ = 6.17, SD = 1.53) and (x̄ = 5.23, SD = 1.07), respectively. The scores revealed a significant correlation between human experts A and B (r = 0.75, p < 0.05), indicating a strong positive relationship between the two sets of scores assigned by the human experts.
Research question 2: What is the relationship between the scores assigned to the chemistry SAQs by human expert A and ChatGPT?
Research question two was answered by comparing the mean and standard deviation of scores assigned to chemistry SAQs by human expert A and ChatGPT. The strength of the relationship between the human marking and ChatGPT was computed. The result is presented in Table 2.
Table 2 Descriptive statistics and correlation between the scores assigned by human expert A and ChatGPT

Variables        N    x̄      SD     r      p-value
Human expert A   30   6.17   1.53   0.56   0.001
ChatGPT          30   5.13   1.14
Table 2 shows that the mean scores and standard deviations of human expert A's scores and ChatGPT's scores were (x̄ = 6.17, SD = 1.53) and (x̄ = 5.13, SD = 1.14), respectively. The correlation between human expert A and ChatGPT returned a coefficient of r = 0.56, indicating a statistically significant moderate positive relationship between the two sets of scores.
Research question 3: What is the relationship between the scores assigned to the chemistry SAQs by human expert B and ChatGPT?
Research question three was answered by comparing the mean and standard deviation of scores assigned to the chemistry SAQs by human expert B and ChatGPT. The strength of the relationship between the human marking and ChatGPT was computed. The result is presented in Table 3.
Table 3 Descriptive statistics and correlation between the scores assigned by human expert B and ChatGPT

Variables        N    x̄      SD     r      p-value
Human expert B   30   5.23   1.07   0.57   0.001
ChatGPT          30   5.13   1.14
As can be seen in Table 3, the mean scores and standard deviations of human expert B-assigned scores and ChatGPT-assigned scores were (x̄ = 5.23, SD = 1.07) and (x̄ = 5.13, SD = 1.14), respectively. The correlation between human expert B and ChatGPT yielded a coefficient of r = 0.57, showing a statistically significant moderate positive relationship between the scores assigned by human expert B and ChatGPT.

Discussion

The purpose of the study was to compare human expert marking and machine marking of secondary school chemistry SAQs in Nigeria. The comparative analyses correlated the scores given by the two human experts with each other and each expert's scores with those assigned by ChatGPT, as presented in Tables 1, 2, and 3, respectively. The analyses were based on the total scores assigned to the three chemistry SAQs. The correlation between the two human experts' marking of the chemistry SAQs is substantial (r = 0.75), whereas the correlations between each human expert's marking and ChatGPT's marking were consistently lower (r = 0.56 and 0.57 for human experts A and B, respectively).
These findings indicate that while human experts show agreement in their scoring, ChatGPT's performance is less consistent with human marking standards. Furthermore, most responses (56–74%) showed a difference of at least one point between the human experts' and ChatGPT's scores, while differences of two or three points were less frequent (26–34%), indicating that although ChatGPT may not always match the specific score given by a human expert, its scores generally fall within an admissible range. Overall, ChatGPT tended to assign scores closer to those of human expert B than to those of human expert A. This discrepancy could be attributed to several contextual factors unique to Nigeria, such as variations in marking experience and exposure to automated assessment tools. The findings are consistent with earlier studies supporting the idea that differences in the scores assigned by human expert markers can be due to the subjectivity, individual biases, and marking inconsistencies that beset human marking and can affect the assessment process and the quality of feedback given to students (Girgla et al., 2021; Ramaswami et al., 2023; Reina et al., 2024). In contrast, Moore et al. (2022) reported that the students' responses evaluated by experts were of high quality. Admittedly, students' responses to chemistry questions, which contain symbols, formulae, equations, and structures, are marked manually by human experts who can understand the nuances and complexities of human creative responses to questions (McNulty, 2023). Moreover, the responses to the SAQs administered were handwritten, and the legibility of the students' handwriting can account for differences in the total scores awarded by individual human expert markers (Rahaman & Mahmud, 2022; Zhang et al., 2022). In Nigeria, where handwritten examinations are the norm and access to digital tools is limited, this factor is particularly relevant.
The findings also revealed that the human expert-ChatGPT correlations are consistently lower than the human expert-human expert correlation. Furthermore, the ChatGPT-assigned scores were more similar to the scores assigned by human expert B than to those assigned by human expert A. This suggests that, in predicting the score a human would assign, a machine-assigned score may be as informative as a score assigned by another human (Bridgeman et al., 2012). The mean difference between human expert A and ChatGPT (x̄ = 6.17 and x̄ = 5.13, respectively) indicates that human expert A gave higher scores than ChatGPT. In contrast, two independent studies found that machine scores generally correlate as highly with human scores as scores from one human expert correlate with scores from another, using e-rater and ChatGPT, respectively (Bridgeman et al., 2012; Morjaria et al., 2024). Despite the inconsistency of the current finding with these earlier studies, it is worth noting that machine marking offers a reliable and efficient way of assessing student responses not only to SAQs but also to essay-type questions (Menya, 2018; Pearson & Penna, 2023; Girgla et al., 2021). It also applies consistent criteria, hence ensuring uniformity in marking (Ramaswami et al., 2023). However, a blend of machine and human expert marking has been advocated as an effective approach for assessing students' responses to SAQs (Hollis-Sando et al., 2023).
The item-by-item analysis of the human experts' and machine marking of the chemistry SAQs in Table 4 indicates mean score differences for items 1, 2, and 3. For item 1, ChatGPT and human expert A differ by 0.40, and ChatGPT and human expert B by 0.14. Similarly, for item 2, ChatGPT and human expert A differ by 0.43, and ChatGPT and human expert B by 0.30. Finally, for item 3, ChatGPT and human expert A differ by 0.23, and ChatGPT and human expert B by 0.40. The observed trend supports the Hollis-Sando et al. (2023) study, which found that automated marking of 1-mark SAQs consistently demonstrated high accuracy, whereas accuracy was lower for more complex items worth 2 marks and above.
Table 4 Item-by-item analysis of SAQs

Item  N    ChatGPT x̄/SD   Human A x̄/SD   r      ChatGPT x̄/SD   Human B x̄/SD   r      Human A x̄/SD   Human B x̄/SD   r
1     30   1.43/0.50      1.83/0.38      0.21   1.43/0.50      1.57/0.77      0.41   1.83/0.38      1.57/0.77      0.67
2     30   1.47/0.63      1.90/0.31      0.07   1.47/0.63      1.77/0.57      0.03   1.90/0.31      1.77/0.57      0.86
3     30   2.20/0.92      2.43/1.45      0.78   2.20/0.92      1.80/1.03      0.77   2.43/1.45      1.80/1.03      0.87
For the majority of responses (66–74%), there was at least a one-point difference between the scores assigned by ChatGPT and those given by the human experts, whereas larger discrepancies of more than one point were less common (26–34%). This suggests that while ChatGPT's scores may not always align exactly with those of human experts, they generally fall within an acceptable range. However, the analysis of individual items revealed that machine marking may struggle with more complex questions, as evidenced by larger discrepancies in scores for items with higher point values. This finding corroborates Morjaria et al. (2024), who evaluated ChatGPT's effectiveness in grading short-answer questions in an undergraduate medical program and concluded that ChatGPT is a useful, though not flawless, tool for assisting human assessors, performing similarly to a single expert assessor. The finding differs slightly from previous reports; for instance, Menya (2018) asserted that the algorithm-based model achieved 99.9% accuracy when marking responses to SAQs.
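The per-item comparisons above can be reproduced from the raw scores with a few lines of code. The sketch below computes the mean difference and the share of scripts falling within one point of the human mark for a single item; the helper function and the example marks are illustrative placeholders, not the study's data.

```python
# Sketch of the item-level comparison between human and ChatGPT marks.
# The marks below are placeholders, not the study's data.
from statistics import mean

def score_gap_summary(human_marks, chatgpt_marks):
    # Absolute per-script gap between the two markers.
    gaps = [abs(h - c) for h, c in zip(human_marks, chatgpt_marks)]
    return {
        "mean_difference": round(mean(human_marks) - mean(chatgpt_marks), 2),
        "within_one_point": sum(g <= 1 for g in gaps) / len(gaps),
        "two_points_or_more": sum(g >= 2 for g in gaps) / len(gaps),
    }

# Placeholder marks for item 3 (out of 6) for a handful of scripts.
human_a_item3 = [3, 2, 4, 1, 2, 3]
chatgpt_item3 = [2, 2, 3, 1, 3, 1]
print(score_gap_summary(human_a_item3, chatgpt_item3))
```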

Conclusion

In conclusion, while machine marking offers significant potential for improving the efficiency and consistency of assessment, it is important to recognize its limitations and to carefully consider its implementation in educational settings. The lower correlation between ChatGPT and the human experts could also help identify markers who are careless in their marking. Further, it suggests that AI models need further adaptation to align with the specific marking criteria used in Nigerian secondary schools. This could involve training the model on a dataset of Nigerian student responses to better capture local language use, common errors, and response styles. ChatGPT's ability to consistently provide scores within an acceptable range suggests it could be a useful tool to support human assessors, particularly in large-scale assessments where grading consistency is crucial. However, its role should be complementary rather than a full replacement for human markers, especially for more complex or context-dependent responses.

Limitations

A major limitation of this study is ChatGPT's potential difficulty in marking more complex responses to open-ended questions, as was the case with the third question. The training data used to develop ChatGPT may contain biases, which could lead to biased scoring outcomes. If, for instance, the training data are predominantly from a certain demographic or educational background, ChatGPT may be more likely to favor responses that align with those patterns. This suggests that more sophisticated machine learning models might be able to overcome some of the limitations observed in ChatGPT. The lower accuracy seen in the AI-driven evaluation of the responses can be attributed to the fact that SAQs often require subjective judgment, such as evaluating the quality of arguments, the depth of understanding, or the creativity of a response, and ChatGPT may struggle to assess these subjective elements accurately. Another limitation of the study is the small sample size. A larger sample size could have led to higher accuracy in marking responses, especially for question 3, which carries a total of 6 marks, as an increase in sample size would have provided more data for the algorithms to learn from.

Future Research

Future research focused on refining or enhancing AI models to better understand and evaluate responses to more complex SAQs, as well as other types of chemistry questions such as essay questions or problem-solving tasks, may be valuable. Future studies could also explore the integration of AI with human marking in other educational assessment settings, such as formative assessment or adaptive learning systems, and investigate the long-term impact of such tools on teaching and learning outcomes.

Declarations

Ethical Approval

This research was carried out in full compliance with ethical standards. It involved human participants, and informed consent was obtained from all participants before their involvement in the study. They were informed of the nature and objectives of the study and were assured of confidentiality and anonymity, which were upheld throughout.
All the authors read and approved the final manuscript and agreed to publish it in the Journal of Science Education and Technology.

Competing Interests

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature
Abbaspour, S., Fotouhi, F., Sedaghatbaf, A., Vahabi, M., & Linden, M. (2020). A comparative analysis of hybrid deep learning models for human activity recognition. Sensors, 20, 1–14. https://www.mdpi.com. Accessed 26 May 2024.
Elliott, V., Baird, J. A., Hopfenbeck, T. N., Ingram, J., Thompson, I., Usher, N., Zantout, M., Richardson, J., & Coleman, R. (2016). A marked improvement? A review of the evidence on written marking. Education Endowment Foundation. https://www.angelsolutions.co.uk. Accessed 30 May 2025.
Girgla, A., Good, L., Krstic, S., McGinley, B., Richardson, S., Sneidze-Gregory, S., & Star, J. (2021). Developing a teacher assessment literacy and design competence framework. Australian Council for Educational Research. http://ibo.org. Accessed 6 June 2024.
Hollis-Sando, L., Pugh, C., Franke, K., Zerner, T., Tan, Y., Carneiro, G., van den Hengel, A., Symonds, I., Duggan, P., & Bacchi, S. (2023). Deep learning in the marking of medical student short answer question examinations: Student perceptions and pilot accuracy assessment. Focus on Health Professional Education: A Multi-Professional Journal, 24(1), 38–48. https://doi.org/10.11157/fohpe.v24i1.531
Kazmi, S. M. (2022). Integrating natural language processing techniques to enhance automated essay evaluation. Master's thesis in Applied Computer Science, Ostfold University College. https://hiof.brage.unit.no. Accessed 11 May 2024.
Li, K., Yang, Q., & Yang, X. (2024). Can autograding of student-generated questions quality by ChatGPT match human experts? IEEE Transactions on Learning Technologies, 17, 1600–1610. https://dl.acm.org. Accessed 30 May 2024.
Moore, S., Nguyen, H. A., Bier, N., Domadia, T., & Stamper, J. (2022). Assessing the quality of student-generated short answer questions using GPT-3. In I. Hilliger, P. J. Muñoz-Merino, T. De Laet, A. Ortega-Arranz, & T. Farrell (Eds.), Educating for a new future: Making sense of technology-enhanced learning adoption. EC-TEL 2022. Lecture Notes in Computer Science (pp. 243–257). Springer. https://doi.org/10.1007/978-3-031-16290-9_18
Morjaria, L., Burns, L., Bracken, K., Levinson, A. J., Ngo, Q. A., Mark Lee, M., & Sibbald, M. (2024). Examining the efficacy of ChatGPT in marking short-answer assessments in an undergraduate medical program. International Medical Education, 3(1), 32–43. https://www.mdpi.com. Accessed 2 September 2024.
Paiva, J. C., Leal, J. P., & Figueira, A. (2022). Automated assessment in computer science education: A state-of-the-art review. ACM Transactions on Computing Education, 22(3), 1–40. https://dl.acm.org. Accessed 26 May 2024.
Rahaman, M. A., & Mahmud, H. (2022). Automated evaluation of handwritten answer script using deep learning approach. Transactions on Machine Learning and Artificial Intelligence, 10(4), 1–16. https://www.researchgate.net. Accessed 26 May 2024.
Ramaswami, G., Susnjak, T., & Mathrani, A. (2023). Effectiveness of learning analytics dashboard for increasing student engagement levels. Journal of Learning Analytics, 10(3), 115–134. https://www.learning.analytics.info. Accessed 6 June 2024.
Rawi, R. (2023). Embracing the shift: Heritage language teachers' perspectives on accepting assessment for learning as education reform. Open Journal of Social Sciences, 11, 128–151. https://www.scrip.org. Accessed 30 May 2024.
Reina, M., Guzmán-López, E. G., Guzmán-López, C., Hernández-Garciadiego, C., Olvera-León, M. A., Garcia-Carrillo, M. A., Tafoya-Rodríguez, M. A., Ugalde-Saldívar, V. M., Guerrero-Ríos, I., Gasque, L., del Campo, J. M., Franco-Bodek, D., Bernal-Pérez, R., Medeiros, M., Marín-Becerra, A., García-Ortega, H., Gracia-Mora, J., & Reina, A. (2024). PLATA: Design of an online platform for chemistry undergraduate fully automated assignments. Journal of Chemical Education, 101(3), 1024–1035. https://pubs.acs.org. Accessed 20 May 2024.
Shin, J., & Gierl, M. J. (2020). More efficient processes for creating automated essay scoring frameworks: A demonstration of two algorithms. Language Testing, 38(2), 247–272. https://www.researchgate.net. Accessed 30 May 2024.
Slimi, Z. (2023). The impact of artificial intelligence on higher education: An empirical study. European Journal of Educational Sciences, 10(1), 17–33. https://www.researchgate.net. Accessed 6 June 2024.
Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. In J. Su, K. Duh, & X. Carreras (Eds.), Empirical Methods in Natural Language Processing: Conference of the Association for Computational Linguistics (pp. 1882–1891). Association for Computational Linguistics. https://aclanthology.org. Accessed 29 May 2024.
Zhai, X. (2021). Practice and theories: How can machine learning assist in innovative assessment practices in science education? Journal of Science Education and Technology, 30(2), 139–149. https://www.link.springer.com. Accessed 30 May 2024.
Metadata
Title
Human vs. Machine Marking: A Comparative Study of Chemistry Assessments
Authors
Abejide Ade-Ibijola
Ijeoma Joy Chikezie
Solomon Sunday Oyelere
Publication date
05-05-2025
Publisher
Springer Netherlands
Published in
Journal of Science Education and Technology
Print ISSN: 1059-0145
Electronic ISSN: 1573-1839
DOI
https://doi.org/10.1007/s10956-025-10223-2
